[ { "paper_id": "1704.00864.json", "table_id": "table_1", "table_content": "\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.3251372(2) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.130434(1) & 2.01947(4) & 0.965189(7) & 0.081007(1) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.2149799(6) & -3.3543(9) & 0.37471(1) & 0.020123(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.229077(4) & 5.0832(8) & 0.09126(8) & 0.001271(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246350(3) & -0.2958(3) & 0.56074(2) & 0.051639(4) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.306440(9) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132458(3) & 2.02541(7) & 0.93538(2) & 0.077262(4) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.216705(6) & -3.794(1) & 0.41146(2) & 0.024459(2) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.230621(2) & 5.533(1) & 0.07042(8) & 0.000762(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246520(2) & -0.6235(7) & 0.693170(7) & 0.078966(2) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.30168(3) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132943(7) & 2.0188(1) & 0.92658(4) & 0.076093(7) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.217616(7) & -3.696(2) & 0.3984(1) & 0.02303(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.231229(2) & 6.211(2) & 0.1083(2) & 0.001809(6) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.242846(9) & -1.998(2) & 0.6201(5) & 0.06224(9) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the LiH molecule at an internuclear distance of $1.5957$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers (which happen to all be ${}^1\\Sigma^+$ states). 
$n=0$ refers to the ground state, $n \\geq 1$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}. In the small aug-cc-pVDZ basis, all results were verified against exact FCI results obtained from PySCF (not shown here).}\n\\label{tab:lih}\n\\end{center}\n\\end{table*}", "caption": "Final converged estimates for the LiH molecule at an internuclear distance of $1.5957$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers (which happen to all be ${}^1\\Sigma^+$ states). $n=0$ refers to the ground state, $n \\geq 1$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}. In the small aug-cc-pVDZ basis, all results were verified against exact FCI results obtained from PySCF (not shown here).", "label": "tab:lih", "section_info": "5 Results\n\\section{Results}\n\\label{sec:results}\n\nAs an initial test of these ideas, we consider the calculation of dipole moments, transition dipole moments, and oscillator strengths for low-lying states of small diatomic molecules. These quantities are of great importance for understanding various properties of molecular systems. The oscillator strength in particular is required to explain optical spectra, as it determines the probabilities of absorption and emission of photons coupling different electronic states. 
Nonetheless, dipole moments are challenging to calculate accurately, even for small molecules, because they are very sensitive to the quality of the wave function and single-particle basis set used, generally requiring many diffuse orbitals for an accurate description, with far greater basis set sensitivity than the energy\\cite{Green1974}.\n\nWe therefore begin by considering the LiH and BH molecules in the aug-cc-pVDZ, aug-cc-pVTZ and aug-cc-pVQZ basis sets, containing $32$, $69$ and $126$ spatial orbitals, respectively. The aug-cc-pVQZ basis 2-RDM was unobtainable in the previous RDM implementation, despite the small molecular size. We then consider the MgO molecule in an aug-cc-pVDZ basis set. We note that while the calculation of dipole moments only requires the 1-RDM, for these calculations we obtain the 1-RDM by contracting the 2-RDM, which we also use to calculate the energy using the estimator\n\\begin{equation}\n(E_{\\textrm{RDM}})_n = \\frac{ \\textrm{Tr} \\big[ \\hat{H} \\; \\hat{\\Gamma}^n \\big] }{ \\textrm{Tr} \\big[ \\hat{\\Gamma}^n \\big] }.\n\\label{eq:rdm_energy}\n\\end{equation}\nTherefore, the following is a good test of the newly-introduced ideas, as well as providing further insight into the effect of the initiator adaptation for different estimators and excited states.\n\nThe dipole moment for the state $|\\Phi^n\\ket$ is defined by\n\\begin{equation}\n\\bs{\\mu}_{n} = \\sum_{pq} \\gamma_{p,q}^{n} \\bra p | \\hat{\\bs{r}} | q \\ket,\n\\end{equation}\nwhile a transition dipole moment, $\\bs{t}_{nm}$, is defined by Eq.~(\\ref{eq:trans_dip_mom}), and the corresponding oscillator strength by\n\\begin{equation}\nf_{nm} = \\frac{2}{3} \\Delta E_{nm} |\\bs{t}_{nm}|^2,\n\\end{equation}\nfor an energy gap of $\\Delta E_{nm}$ between states $|\\Phi^n\\ket$ and $|\\Phi^m\\ket$.\n\n\\begin{figure*}[t!]\n\\includegraphics{lih.eps}\n\\caption{Initiator error convergence for the five lowest energy states of LiH in an aug-cc-pVQZ basis, at an internuclear distance of 
$1.5957$\\AA~as the number of walkers in each distribution is increased. Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\\textrm{RDM}}$. $N_w$ denotes the number of walkers for \\emph{each} state and replica sampled. Results were averaged over 5 independent simulations to obtain error bars.}\n\\label{fig:lih_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.3251372(2) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.130434(1) & 2.01947(4) & 0.965189(7) & 0.081007(1) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.2149799(6) & -3.3543(9) & 0.37471(1) & 0.020123(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.229077(4) & 5.0832(8) & 0.09126(8) & 0.001271(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246350(3) & -0.2958(3) & 0.56074(2) & 0.051639(4) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.306440(9) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132458(3) & 2.02541(7) & 0.93538(2) & 0.077262(4) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.216705(6) & -3.794(1) & 0.41146(2) & 0.024459(2) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.230621(2) & 5.533(1) & 0.07042(8) & 0.000762(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246520(2) & -0.6235(7) & 0.693170(7) & 0.078966(2) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.30168(3) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132943(7) & 2.0188(1) & 0.92658(4) & 0.076093(7) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.217616(7) & 
-3.696(2) & 0.3984(1) & 0.02303(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.231229(2) & 6.211(2) & 0.1083(2) & 0.001809(6) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.242846(9) & -1.998(2) & 0.6201(5) & 0.06224(9) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the LiH molecule at an internuclear distance of $1.5957$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers (which happen to all be ${}^1\\Sigma^+$ states). $n=0$ refers to the ground state, $n \\geq 1$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}. In the small aug-cc-pVDZ basis, all results were verified against exact FCI results obtained from PySCF (not shown here).}\n\\label{tab:lih}\n\\end{center}\n\\end{table*}\n\nFor all simulations, the initial restricted Hartree--Fock (RHF) calculation was performed by PySCF\\cite{pyscf}. Integrals from PySCF were then passed to our FCIQMC program, \\url{NECI}, for the main calculation, which output one- and two-body density matrices. These were then contracted with integrals from PySCF to calculate final dipole moment estimates. Energy estimates were calculated on-the-fly in \\url{NECI}.\n\nThe five lowest energy states were calculated for LiH and BH, and the four lowest states of MgO, considering only states with $M_S=0$ and using the $A_1$ irreducible representation (irrep) of the $C_{2v}$ point group. Also, time-reversal symmetrized functions\\cite{Smeyers1973} were used as the many-particle basis states, therefore restricting the total spin quantum number, $S$, to be even, and thus removing triplet states. 
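The contraction pipeline just described (2-RDM to 1-RDM to dipole expectation value) can be sketched in a few lines of numpy. This is a minimal illustration only: the array names, and the index convention assumed for the spin-free 2-RDM, are hypothetical and will differ from the actual \\url{NECI} output format.

```python
import numpy as np

def dipole_from_rdm2(gamma2, dip_ints, n_elec):
    """Sketch: 1-RDM from a spin-free 2-RDM, then the dipole expectation.

    Assumes the (hypothetical) index convention gamma2[p, r, q, s];
    real codes differ, so this is illustrative only.
    """
    # gamma1[p, q] = 1/(N - 1) * sum_r gamma2[p, r, q, r]
    gamma1 = np.einsum('prqr->pq', gamma2) / (n_elec - 1)
    # mu = sum_pq gamma1[p, q] <p| r |q>, one value per Cartesian component
    return np.einsum('pq,xpq->x', gamma1, dip_ints)
```

The same final contraction would serve for transition dipole moments, with a transition 1-RDM between two states in place of a state's own 1-RDM.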
In all cases, the FCIQMC simulation time step was varied in the initial iterations so as to prevent ``bloom'' events, where many walkers can be created in a single spawning event (which often leads to large initiator error).\n\nWe also note that in generating excitations for the walker spawning step, we use an approach that greatly improves efficiency compared to the uniform sampling used in early FCIQMC results\\cite{Booth2009}. In this approach, the pair of orbital labels from which electrons are excited, $(i,j)$, is chosen uniformly, while the orbitals excited to, $(a,b)$, are selected with probabilities drawn from a Cauchy--Schwarz distribution, namely $p(ab|ij) \\propto \\sqrt{\\langle ia|ia \\rangle \\langle jb|jb \\rangle}$.\\cite{Smart_unpublished} Another approach to select connections efficiently was considered by Holmes \\emph{et al.}\\cite{Holmes2016}, but not used here.\n\nAll simulations used the semi-stochastic adaptation to reduce stochastic errors\\cite{Petruzielo2012, Blunt2015}. For the LiH molecule the deterministic space consisted of all configurations up to and including double excitations from the Hartree--Fock determinant. For the BH and MgO molecules the deterministic space was formed from the $10^4$ most populated configurations across all wave functions sampled, once the simulations were deemed to have largely converged, using the approach described in Ref.~(\\onlinecite{Blunt2015}).\n\n\\subsection{LiH}\n\nSimulations on LiH were performed using between $1.25 \\times 10^4$ and $10^6$ walkers per simulation (i.e., for each state and replica sampled), in order to converge initiator error for all states. Density matrices were typically averaged over $10^5$ iterations, once convergence was deemed to have been reached for all states and all estimators. 
These entire simulations were then repeated five times with different initial RNG seeds, and the results averaged in order to calculate error estimates.\n\nFigure~\\ref{fig:lih_init} shows initiator convergence for LiH in the aug-cc-pVQZ basis set, for the lowest five energy eigenstates, and for four different estimators: dipole moments, transition dipole moments, and energies calculated from both the RDM-based energy estimator, Eq.~(\\ref{eq:rdm_energy}), and from a trial wave function-projected estimator:\n\\begin{equation}\n(E_{\\textrm{Trial}})_n = \\frac{ \\bra \\Psi_{\\textrm{Trial}}^n | \\hat{H} | \\Psi^n \\ket }{ \\bra \\Psi_{\\textrm{Trial}}^n | \\Psi^n \\ket }.\n\\label{eq:trial_energy}\n\\end{equation}\nHere, $| \\Psi_{\\textrm{Trial}}^n \\ket$ is a trial wave function designed to have a large overlap with the exact state $| \\Phi^n \\ket$. We have discussed the use of such trial wave function estimators in excited-state FCIQMC in Ref.~(\\onlinecite{Blunt2015_3}). To generate $| \\Psi_{\\textrm{Trial}}^n \\ket$, we calculate the configuration interaction singles and doubles (CISD) wave functions for the lowest fifteen energy states. Then, once convergence of all FCIQMC simulations is deemed to have been reached, we assign each simulation one trial wave function by choosing the CISD solution with the largest overlap in each case. The reason for obtaining more CISD solutions than FCIQMC simulations is that CISD solutions can have a different energy ordering to FCI solutions. Averaging of each $E_{\\textrm{Trial}}$ estimate was performed from roughly the same point that RDM sampling began, and so both RDM and trial energy estimates are obtained from a similar number of iterations, usually $10^5$.\n\nThe initiator-FCIQMC estimates in Figure~\\ref{fig:lih_init} are all plotted relative to their values at the largest walker population considered, $N_{w}=10^6$. 
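In CI-vector form, the projected estimator above is just a ratio of dot products. A minimal numpy sketch, with hypothetical dense arrays `h` (Hamiltonian), `c` (sampled wave function) and `c_trial` standing in for the sparse FCIQMC objects:

```python
import numpy as np

def trial_energy(h, c, c_trial):
    # (E_Trial)_n = <Psi_trial | H | Psi> / <Psi_trial | Psi>
    return (c_trial @ (h @ c)) / (c_trial @ c)
```

If `c` were an exact eigenvector, any trial vector with nonzero overlap would return the exact eigenvalue; the quality of `c_trial` only affects how stochastic and initiator error propagate into the estimate.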
Here, convergence has been largely reached in all cases, and so the figures effectively plot initiator error against walker population. Reassuringly, initiator error in energy estimates is remarkably small for both estimators and for all states. Indeed, the largest error at the smallest walker population tested is less than $\\sim 0.5$ m$E_\\textrm{h}$ for $E_{\\textrm{Trial}}$.\n\nInterestingly, initiator error in $E_{\\textrm{RDM}}$ is much smaller than in $E_{\\textrm{Trial}}$. This is a trend that we have often observed, although exceptions do occur (and in the limit of an exact $| \\Psi_{\\textrm{Trial}}^n \\ket$, the initiator error is zero). Initiator error in the $E_{\\textrm{RDM}}$ energies is variational in all cases within stochastic errors, while this is not strictly enforced (though common) for $E_{\\textrm{Trial}}$. For RDM-based energy estimates, this variationality is effectively ensured by the Hylleraas--Undheim--McDonald theorem\\cite{Hylleraas1930, McDonald1933}, which is expected to approximately hold for FCIQMC-sampled wave functions. Initiator error is larger for excited states, as previously observed\\cite{Blunt2015_3}. This is expected due to the more multi-configurational nature of excited states. It remains to be seen whether orbital optimization can increase this rate of convergence for excited states. Stochastic errors, however, are larger in the RDM-based energy estimates, which is expected because two uncorrelated simulations (from the two replicas) contribute to this quantity. However, error bars are extremely small in all cases here, always being smaller than $10^{-2}$ m$E_{\\textrm{h}}$.\n\n\\begin{figure*}[t!]\n\\includegraphics{bh.eps}\n\\caption{Initiator convergence for the five lowest energy states of BH in an aug-cc-pVTZ basis, at an internuclear distance of $1.2324$\\AA. 
Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\\textrm{RDM}}$. $N_w$ denotes the number of walkers for \\emph{each} state and replica sampled. Results were averaged over 5 independent simulations to obtain error bars.}\n\\label{fig:bh_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.528082(7) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.216230(3) & -0.18983(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23727(1) & -1.4146(5) & 0.93478(3) & 0.13822(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.257587(4) & -0.3219(3) & 0.2102(1) & 0.007590(9) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.282665(1) & 3.5459(1) & 0.44725(4) & 0.037696(7) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54561(2) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.211482(6) & -0.19271(7) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.238668(8) & -1.2943(5) & 0.88508(5) & 0.12464(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.253574(4) & -0.4973(6) & 0.1454(2) & 0.00358(1) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.283481(2) & 3.4088(2) & 0.35740(7) & 0.024141(9) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54914(6) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.21059(2) & -0.1968(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23876(3) & -1.268(3) & 0.8704(3) & 0.1206(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.25261(3) & -0.504(3) & 0.139(1) & 0.00327(7) 
\\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.28289(1) & 3.2889(9) & 0.3138(1) & 0.01857(2) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the BH molecule at an internuclear distance of $1.2324$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers. $n=0$ refers to the ground state, $n \\geq 1$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}.}\n\\label{tab:bh}\n\\end{center}\n\\end{table*}\n\nThe calculation of dipole moments provides a more interesting test, due to their greater dependence on more highly-excited determinants and diffuse single-particle orbitals. The relative initiator error is much larger, particularly for certain excited states (i.e. $\\mu_2$ and $\\mu_3$). The transition dipole moments considered involve transitions from the ground ($n=0$) state to excited ($n \\geq 1$) states. Because they always involve the ground state, it is to be expected that they have smaller relative initiator and stochastic error, compared to the corresponding non-transition dipole moment (i.e. $t_{0n}$ compared to $\\mu_n$). This expectation is borne out in the results, with initiator and stochastic error in $t_{0n}$ often being $\\sim 5$ times smaller than for $\\mu_n$. For the calculation of dipole moments from FCIQMC-sampled RDMs, relative stochastic errors are clearly much larger than for energies, and so the use of the semi-stochastic adaptation is of great importance here, whereas its use can be somewhat unnecessary in small ground-state energy calculations.\n\nClearly, the accurate calculation of dipole moments is more challenging than that of energies, requiring larger walker populations to obtain similar relative errors. 
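As a quick arithmetic check, the tabulated oscillator strengths follow directly from the gap and transition dipole moment via $f_{0n} = \\frac{2}{3} \\Delta E_{0n} |t_{0n}|^2$ (all in atomic units); for example, for the $n=1$ state of LiH in the aug-cc-pVDZ basis:

```python
def oscillator_strength(delta_e, t):
    # f_0n = (2/3) * DeltaE_0n * |t_0n|^2, in atomic units
    return (2.0 / 3.0) * delta_e * abs(t) ** 2

# Values from Table tab:lih (LiH, aug-cc-pVDZ, n = 1)
f01 = oscillator_strength(0.130434, 0.965189)  # ~0.081007, matching the table
```

The difficulty discussed here lies in converging the inputs $\\Delta E_{0n}$ and $t_{0n}$ from the sampled RDMs, not in this final step.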
However, this is not uniquely a feature of the initiator approximation in FCIQMC, but is equally true in other approximate methods, where properties such as the dipole moment are far more sensitive to the basis set and quality of the wave function than ground-state energetics. That we are able to observe systematic convergence of these quantities, with respect to a single simulation parameter, is reassuring.\n\nTable~\\ref{tab:lih} gives final results for the aug-cc-pV$X$Z basis sets, with $X=2,3,4$. Results in the small $X=2$ basis were fully converged at the smallest walker populations considered, $N_w = 1.25 \\times 10^4$, as confirmed by comparison to FCI results from the PySCF program. As expected, dipole moments vary quite substantially with basis set, particularly for the second, third and fourth excited states, demonstrating the importance of large basis sets with diffuse functions. Errors in parentheses denote stochastic error bars, not initiator error, which is larger. However, given the careful convergence of initiator error, as shown in Figure~\\ref{fig:lih_init}, we expect dipole moments to be converged to around $10^{-3} \\; e a_0$ in most cases, and energies to be converged \\emph{substantially} beyond chemical accuracy.\n\n\\subsection{BH}\n\nFigure~\\ref{fig:bh_init} shows results for BH in the aug-cc-pVTZ basis set and at an internuclear distance of $1.2324$\\AA, demonstrating similar initiator convergence plots to those in Figure~\\ref{fig:lih_init}. Here, results used between $1.25 \\times 10^4$ and $2 \\times 10^6$ walkers per simulation. RDM estimators and $E_{\\textrm{Trial}}$ were averaged over $5 \\times 10^4$ iterations, once convergence was achieved for all states and estimators. 
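A cheap way to generate trial wave functions for such projected estimators is to diagonalize $\\hat{H}$ within a small subspace of important configurations and keep the lowest eigenvectors. A minimal numpy sketch, with a hypothetical dense subspace Hamiltonian `h_sub`:

```python
import numpy as np

def trial_states(h_sub, n_states):
    """Lowest eigenpairs of the Hamiltonian projected into a small
    subspace (e.g. spanned by the most populated configurations)."""
    evals, evecs = np.linalg.eigh(h_sub)  # eigenvalues in ascending order
    return evals[:n_states], evecs[:, :n_states]
```

Each returned column then serves as one $| \\Psi_{\\textrm{Trial}}^n \\ket$, padded with zeros outside the subspace.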
Here, instead of using CISD solutions as trial wave functions for $E_{\\textrm{Trial}}$, a slightly different approach was used: a ``trial space'' was defined as consisting of the $2 \\times 10^3$ most populated configurations across all simulations, once convergence had been approximately reached. Trial wave functions were then obtained as the eigenstates of $\\hat{H}$ within this subspace. This is similar to the approach used to generate the deterministic space, as described above\\cite{Blunt2015}, and allows important basis states to be selected, while keeping the calculation of each $| \\Psi_{\\textrm{Trial}}^n \\ket$ inexpensive.\n\nThe results show the same features as observed for LiH. Initiator error in the energy estimates is extremely small in all cases, particularly for estimates obtained from contraction of the RDM, and initiator convergence always occurs variationally. Stochastic error bars are larger for $E_{\\textrm{RDM}}$, as well as for excited states, but always extremely small. For dipole moments, similar trends also occur. Initiator and stochastic relative errors for the dipole moment are very small for the ground and first excited states ($\\mu_0$ and $\\mu_1$) and for the corresponding transition dipole moment ($t_{01}$) even at small walker populations. However, results for higher excited states contain larger errors, although we once again observe that errors in $t_{0n}$ are smaller than errors in $\\mu_n$ for each $n$, presumably because of the involvement of the ground state, which is well converged at lower walker populations, in each of the transition dipole moments considered.\n\nTable~\\ref{tab:bh} shows final results in aug-cc-pV$X$Z basis sets, for $X=2,3,4$. Results for $X=2$ used $2 \\times 10^5$ walkers per simulation, while results for $X=3$ and $X=4$ used $2 \\times 10^6$ walkers per simulation. The expected strong dependence of dipole moments on the basis set is once again observed. 
This is particularly true for the second, third and fourth excited states ($n=2,3,4$). We note that these three states also contained the largest initiator error at small walker populations, as seen in Figure~\\ref{fig:bh_init}. This is probably not a coincidence, since the initiator approximation will inevitably result in a poorer description of highly excited regions of the wave function, presumably including excitations into high-energy diffuse functions, which appear important for an accurate calculation of dipole moments for these particular states. Despite the larger initiator error compared to energy estimates, the space is still substantially undersampled here: $2 \\times 10^6$ walkers are used for a space of size $\\sim 7 \\times 10^9$ in the aug-cc-pVQZ basis, even for this small molecule, and the benefits of Monte Carlo sampling typically increase with system size.\n\n\\subsection{MgO}\n\n\\begin{figure*}[t!]\n\\includegraphics{mgo.eps}\n\\caption{Initiator convergence for dipole moments (left) and energies (right), for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Energies are calculated from both RDM ($E_{\\textrm{RDM}}$) and trial wave function ($E_{\\textrm{Trial}}$) based estimates, and become equal to good accuracy at large walker number, $N_w$. Dipole moments appear mostly converged at $N_w=3.2 \\times 10^7$, except for $\\mu_1$. 
Error bars are only available for $N_w < 10^6$, but are small by this point and should only decrease in magnitude for larger walker populations.}\n\\label{fig:mgo_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}c|ccc|ccc@{}}\n\\hline\n\\hline\nState $n$ & \\multicolumn{3}{c|}{ Energy/$E_{\\textrm{h}}$ } & \\multicolumn{3}{c}{ Dipole moment ($\\mu_n$) /$ea_0$ } \\\\\n\\hline\n & CCSD & CCSDT & FCIQMC & CCSD & CCSDT & FCIQMC \\\\\n\\hline\n0 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.632 & -274.651 & -274.654 & 2.590 & 2.398 & 2.382 \\\\\n1 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.531 & -274.559 & -274.564 & 1.811 & 2.008 & 2.289 \\\\\n2 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.480 & -274.514 & -274.517 & 0.297 & 0.847 & 1.154 \\\\\n3 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.440 & -274.478 & -274.480 & -0.366 & 0.529 & 1.198 \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Energies and dipole moments for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Error bars on FCIQMC results are not given, but are smaller than the order to which results are presented. FCIQMC energies are taken from the RDM-based estimates, $E_{\\textrm{RDM}}$. CCSD and CCSDT values were obtained from NWChem\\cite{NWChem}.}\n\\label{tab:mgo}\n\\end{center}\n\\end{table*}\n\nTo study a more challenging problem, we consider the calculation of energies and dipole moments for the MgO molecule, at its ground state equilibrium separation of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. Thus, a total of 16 electrons are correlated in $48$ spatial orbitals. 
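The size of the resulting determinant space can be estimated by hand: 16 correlated electrons with $M_S=0$ means 8 spin-up and 8 spin-down electrons in 48 spatial orbitals, with the point-group and time-reversal symmetries then cutting this down further (the factor of 8 below, 4 irreps of $C_{2v}$ times 2 for time reversal, is only a rough illustrative estimate):

```python
from math import comb

n_orb, n_alpha, n_beta = 48, 8, 8
# All M_S = 0 determinants: choose the up and down occupations independently
n_det = comb(n_orb, n_alpha) * comb(n_orb, n_beta)
# Rough symmetry reduction: 4 irreps of C2v, times 2 for time-reversal
# symmetrization (an order-of-magnitude estimate, not an exact count)
n_est = n_det / 8
```

This gives $n_{\\textrm{est}} \\approx 1.8 \\times 10^{16}$, consistent with the space size quoted for these calculations.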
Enforcing $M_S=0$, using the $A_1$ irrep of the $C_{2v}$ point group, and working with time-reversal symmetrized functions\\cite{Smeyers1973} (to enforce $S=\\textrm{even}$), results in a space size of roughly $1.8 \\times 10^{16}$ basis functions. This is a large space, particularly given the challenges of converging initiator error in excited-state dipole moments, as seen already.\n\nFigure~\\ref{fig:mgo_init} presents initiator convergence for walker populations (per state and per replica), $N_w$, ranging from $2.5 \\times 10^4$ to $3.2 \\times 10^7$. The ground state and first three excited states are calculated. For $N_w \\le 4 \\times 10^5$, error bars are calculated by averaging over 5 repeated calculations with varying RNG seeds. Due to the expensive nature of these calculations, repeats were not performed for $N_w > 4 \\times 10^5$, and so error bars were not obtained. However, these error bars should only decrease with increasing $N_w$, and are already small at $N_w = 4 \\times 10^5$. Therefore, at the largest walker populations considered, stochastic error should be much smaller than initiator error.\n\nInitiator profiles of both $E_{\\textrm{RDM}}$ and $E_{\\textrm{Trial}}$ estimators are presented in Figure~\\ref{fig:mgo_init}. At convergence, these should clearly become equal. By $N_w = 3.2 \\times 10^7$, this is the case to much better than $1$ m$E_\\textrm{h}$ accuracy. As previously found, convergence is monotonic in all cases and $E_{\\textrm{RDM}}$ usually results in smaller initiator error.\n\nConvergence of dipole moments is also shown. Here, relative initiator error is once again larger than for energies, and convergence is non-monotonic. Because of this non-monotonic behavior, combined with the challenging nature of the system, our confidence in the accurate convergence of these values is somewhat less than for the LiH and BH results. We cannot rule out the possibility of sudden further convergence at higher $N_w$ values. 
However we believe any significant deviations unlikely, although it is clear that $\\mu_1$ in particular is not fully converged on the scale shown.\n\nTable~\\ref{tab:mgo} presents FCIQMC energies and dipole moments, using $N_w = 3.2 \\times 10^7$, and with energies taken from the $E_{\\textrm{RDM}}$ estimator. For comparison, coupled cluster results are shown, using both singles and doubles (CCSD) and singles, doubles and triples (CCSDT). These values were calculated using NWChem package\\cite{NWChem}, with the equation-of-motion (EOM-CCSD and EOM-CCSDT) variants used for excited states. As expected, energies obtained from CCSDT are accurate compared to FCIQMC values, even for excited states. Meanwhile, dipole moments show greater differences, particularly for the $n=3$ state. For this state, EOM-CCSD and EOM-CCSDT values also greatly differ, with a flipped dipole moment resulting from EOM-CCSD. These results are consistent with those observed in FCIQMC in regions of large initiator error, that the relative error in dipole moments is much greater than in energies. We again expect that this is primarily due to the increased dependence on highly-excited determinants, and such configurations have particularly large amplitudes in excited states. CCSD and CCSDT appear to be unable to describe the wave function with sufficient accuracy in this region of configuration space, for this system, and for these challenging states.\n\n5.1 LiH\n\\subsection{LiH}\n\nSimulations on LiH were performed using between $1.25 \\times 10^4$ and $10^6$ walkers per simulation (i.e., for each state and replica sampled), in order to converge initiator error for all states. Density matrices were typically averaged over $10^5$ iterations, once convergence was deemed to have been reached for all states and all estimators. 
These entire simulations were then repeated five times with different initial RNG seeds, and the results averaged in order to calculate error estimates.\n\nFigure~\\ref{fig:lih_init} shows initiator convergence for LiH in the aug-cc-pVQZ basis set, for the lowest five energy eigenstates, and for four different estimators: dipole moments, transition dipoles moments, and energies calculated from both the RDM-based energy estimator, Eq.~(\\ref{eq:rdm_energy}), and from a trial wave function-projected estimator:\n\\begin{equation}\n(E_{\\textrm{Trial}})_n = \\frac{ \\bra \\Psi_{\\textrm{Trial}}^n | \\hat{H} | \\Psi^n \\ket }{ \\bra \\Psi_{\\textrm{Trial}}^n | \\Psi^n \\ket }.\n\\label{eq:trial_energy}\n\\end{equation}\nHere, $| \\Psi_{\\textrm{Trial}}^n \\ket$ is a trial wave function designed to have a large overlap with the exact state $| \\Phi^n \\ket$. We have discussed the use of such trial wave function estimators in excited-state FCIQMC in Ref.~(\\onlinecite{Blunt2015_3}). To generate $| \\Psi_{\\textrm{Trial}}^n \\ket$, we calculate the configuration interaction singles and doubles (CISD) wave functions for the lowest fifteen energy states. Then, once convergence of all FCIQMC simulations is deemed to have been reached, we assign each simulation one trial wave function by choosing the CISD solution with the largest overlap in each case. The reason for obtaining more CISD solutions than FCIQMC simulations is that CISD solutions can have a different energy ordering to FCI solutions. Averaging of each $E_{\\textrm{Trial}}$ estimate was performed from roughly the same point that RDM sampling began, and so both RDM and trial energy estimates are obtained from a similar number of iterations, usually $10^5$.\n\nThe initiator-FCIQMC estimates in Figure~\\ref{fig:lih_init} are all plotted relative to their values at the largest walker population considered, $N_{w}=10^6$. 
Here, convergence has been largely reached in all cases, and so the figures effectively plot initiator error against walker population. Reassuringly, initiator error in energy estimates is incredibly small for both estimators and for all states. Indeed, the largest error at the smallest walker population tested is less than $\\sim 0.5$ m$E_\\textrm{h}$ for $E_{\\textrm{Trial}}$.\n\nInterestingly, initiator error in $E_{\\textrm{RDM}}$ is much smaller than in $E_{\\textrm{Trial}}$. This is a trend that we have often observed, although exceptions do occur (and in the limit of an exact $| \\Psi_{\\textrm{Trial}}^n \\ket$, the initiator error is zero). Initiator error in the $E_{\\textrm{RDM}}$ energies are variational in all cases within stochastic errors, while it is not strictly enforced (though common) for this to also be the case for $E_{\\textrm{Trial}}$. For RDM-based energy estimates, this variationality is effectively ensured by the Hylleraas-Undheim-McDonald theorem\\cite{Hylleraas1930, McDonald1933}, which is expected to approximately hold for FCIQMC-sampled wave functions. Initiator error is larger for excited states, as previously observed\\cite{Blunt2015_3}. This is expected due to the more multi-configurational nature of excited states. It remains to be seen whether orbital optimization can increase this rate of convergence for excited states. Random errors however are larger in the RDM-based energy estimates, which is expected due to the fact that two uncorrelated simulations (from the two replicas) contribute to this quantity. However, error bars are extremely small in all cases here, always being smaller than $10^{-2}$ m$E_{\\textrm{h}}$.\n\n\\begin{figure*}[t!]\n\\includegraphics{bh.eps}\n\\caption{Initiator convergence for the five lowest energy states of BH in an aug-cc-pVTZ basis, at an internuclear distance of $1.2324$\\AA. 
Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\\textrm{RDM}}$. $N_w$ denotes the number of walkers for \\emph{each} state and replica sampled. Error bars were obtained by averaging over 5 independent simulations.}\n\\label{fig:bh_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.528082(7) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.216230(3) & -0.18983(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23727(1) & -1.4146(5) & 0.93478(3) & 0.13822(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.257587(4) & -0.3219(3) & 0.2102(1) & 0.007590(9) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.282665(1) & 3.5459(1) & 0.44725(4) & 0.037696(7) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54561(2) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.211482(6) & -0.19271(7) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.238668(8) & -1.2943(5) & 0.88508(5) & 0.12464(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.253574(4) & -0.4973(6) & 0.1454(2) & 0.00358(1) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.283481(2) & 3.4088(2) & 0.35740(7) & 0.024141(9) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54914(6) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.21059(2) & -0.1968(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23876(3) & -1.268(3) & 0.8704(3) & 0.1206(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.25261(3) & -0.504(3) & 0.139(1) & 0.00327(7) 
\\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.28289(1) & 3.2889(9) & 0.3138(1) & 0.01857(2) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the BH molecule at an internuclear distance of $1.2324$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers. $n=0$ refers to the ground state, $n>0$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}.}\n\\label{tab:bh}\n\\end{center}\n\\end{table*}\n\nThe calculation of dipole moments provides a more interesting test, due to their greater dependence on more highly-excited determinants and diffuse single-particle orbitals. The relative initiator error is much larger, particularly for certain excited states (e.g. $\\mu_2$ and $\\mu_3$). The transition dipole moments considered involve transitions from the ground ($n=0$) state to excited ($n>0$) states. Because they always involve the ground state, it is to be expected that they have smaller relative initiator and stochastic error, compared to the corresponding non-transition dipole moment (i.e. $t_{0n}$ compared to $\\mu_n$). This expectation is borne out in the results, with initiator and stochastic error in $t_{0n}$ often being $\\sim 5$ times smaller than for $\\mu_n$. For the calculation of dipole moments from FCIQMC-sampled RDMs, relative stochastic errors are clearly much larger than for energies, and so the use of the semi-stochastic adaptation is of great importance here, whereas its use can be somewhat unnecessary in small ground-state energy calculations.\n\nClearly, the accurate calculation of dipole moments is more challenging than energies, requiring larger walker populations to obtain similar relative errors. 
However, this is not uniquely a feature of the initiator approximation in FCIQMC, but is equally true in other approximate methods, where properties such as the dipole moment are far more sensitive to the basis set and quality of the wave function than ground state energetics. That we are able to observe systematic convergence of these quantities, with respect to a single simulation parameter, is reassuring.\n\nTable~\\ref{tab:lih} gives final results for the aug-cc-pV$X$Z basis sets, with $X=2,3,4$. Results in the small $X=2$ basis were fully converged at the smallest walker populations considered, $N_w = 1.25 \\times 10^4$, as confirmed by comparison to FCI results from the PySCF program. As expected, dipole moments vary quite substantially with basis set, particularly for the second, third and fourth excited states, demonstrating the importance of large basis sets with diffuse functions. Errors in parentheses denote stochastic error bars, not initiator error, which is larger. However, given the careful convergence of initiator error, as shown in Figure~\\ref{fig:lih_init}, we expect dipole moments to be converged to around $10^{-3}e a_0$ in most cases, and energies to be converged \\emph{substantially} beyond chemical accuracy.\n\n", "Descriptive_question1": "What is the dipole moment value for the ground state (n=0) of LiH in the aug-cc-pVQZ basis set in table_1?", "Descriptive_question2": "What is the energy gap value for state n=1 of LiH in the aug-cc-pVDZ basis set in table_1?", "Reasoning_question1": "How does the dipole moment of LiH change across different basis sets for the third excited state (n=3) in table_1, and what might this indicate about the basis set sensitivity?", "Reasoning_question2": "Comparing the oscillator strengths for transitions from the ground state to excited states in the aug-cc-pVTZ basis set in table_1, which excited state transition is the most probable and why?", "Descriptive_answer1": "-2.30168", "Descriptive_answer2": "0.130434", 
"Reasoning_answer1": "Examining the dipole moment for the third excited state (n=3) of LiH across different basis sets in table_1, we observe a significant variation. In aug-cc-pVDZ, the dipole moment is 5.0832; in aug-cc-pVTZ, it increases to 5.533; and in aug-cc-pVQZ, it further rises to 6.211. This progression indicates a consistent increase in the dipole moment value as the basis set size and quality improve from double-zeta (pVDZ) to triple-zeta (pVTZ) to quadruple-zeta (pVQZ). The increase suggests that larger basis sets, which include more diffuse orbitals, provide a better description of the electron distribution, particularly for excited states where electron delocalization might be more pronounced. This trend highlights the high sensitivity of dipole moments to the basis set, as noted in the context, because dipole moments depend heavily on the quality of the wave function and the inclusion of diffuse functions to capture the spatial extent of the electronic structure accurately. Therefore, this variation underscores the importance of using larger, more complete basis sets for precise calculations of dipole moments in excited states.", "Reasoning_answer2": "To determine the most probable transition from the ground state to an excited state in the aug-cc-pVTZ basis set for LiH in table_1, we compare the oscillator strengths (f_0n), which indicate the probability of absorption or emission between states. The oscillator strengths for transitions from the ground state (n=0) are as follows: for n=1, f_01 = 0.077262; for n=2, f_02 = 0.024459; for n=3, f_03 = 0.000762; and for n=4, f_04 = 0.078966. Among these, the highest oscillator strength is for the transition to state n=4 at 0.078966, closely followed by the transition to state n=1 at 0.077262. Although the values are very close, the transition to state n=4 is marginally higher, indicating it as the most probable. 
This probability is influenced by both the energy gap (ΔE_0n) and the transition dipole moment (t_0n), as the oscillator strength is proportional to the energy gap multiplied by the square of the transition dipole moment, f_0n = (2/3) ΔE_0n |t_0n|^2. For n=4, ΔE_04 = 0.246520 and t_04 = 0.693170, while for n=1, ΔE_01 = 0.132458 and t_01 = 0.93538. Although the transition dipole moment is larger for n=1, the energy gap for n=4 is nearly twice as large, and this larger gap more than compensates for the smaller transition dipole moment, yielding a slightly higher oscillator strength for n=4 and making it the most probable transition." }, { "paper_id": "1704.00864.json", "table_id": "table_2", "table_content": "\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.528082(7) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.216230(3) & -0.18983(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23727(1) & -1.4146(5) & 0.93478(3) & 0.13822(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.257587(4) & -0.3219(3) & 0.2102(1) & 0.007590(9) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.282665(1) & 3.5459(1) & 0.44725(4) & 0.037696(7) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54561(2) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.211482(6) & -0.19271(7) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.238668(8) & -1.2943(5) & 0.88508(5) & 0.12464(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.253574(4) & -0.4973(6) & 0.1454(2) & 0.00358(1) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.283481(2) & 3.4088(2) & 0.35740(7) & 0.024141(9) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54914(6) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.21059(2) & -0.1968(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23876(3) & -1.268(3) & 
0.8704(3) & 0.1206(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.25261(3) & -0.504(3) & 0.139(1) & 0.00327(7) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.28289(1) & 3.2889(9) & 0.3138(1) & 0.01857(2) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the BH molecule at an internuclear distance of $1.2324$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers. $n=0$ refers to the ground state, $n>0$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}.}\n\\label{tab:bh}\n\\end{center}\n\\end{table*}", "caption": "Final converged estimates for the BH molecule at an internuclear distance of $1.2324$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers. $n=0$ refers to the ground state, $n>0$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}.", "label": "tab:bh", "section_info": "5 Results\n\\section{Results}\n\\label{sec:results}\n\nAs an initial test of these ideas, we consider the calculation of dipole moments, transition dipole moments, and oscillator strengths for low-lying states of small diatomic molecules. These quantities are of great importance for understanding various properties of molecular systems. The oscillator strength in particular is required to explain optical spectra, as it determines the probabilities of absorption and emission of photons coupling different electronic states. 
Nonetheless, dipole moments are challenging to calculate accurately, even for small molecules, because they are very sensitive to the quality of the wave function and single-particle basis set used, generally requiring many diffuse orbitals for an accurate description, with far greater basis set sensitivity than the energy\\cite{Green1974}.\n\nWe therefore begin by considering the LiH and BH molecules in aug-cc-pVDZ, aug-cc-pVTZ and aug-cc-pVQZ, containing $32$, $69$ and $126$ spatial orbitals respectively. The aug-cc-pVQZ basis 2-RDM was unobtainable in the previous RDM implementation, despite the small molecular size. We then consider the MgO molecule in an aug-cc-pVDZ basis set. We note that while the calculation of dipole moments only requires the 1-RDM, for these calculations we obtain the 1-RDM by contracting the 2-RDM, which we also use to calculate the energy using the estimator\n\\begin{equation}\n(E_{\\textrm{RDM}})_n = \\frac{ \\textrm{Tr} \\big[ \\hat{H} \\; \\hat{\\Gamma}^n \\big] }{ \\textrm{Tr} \\big[ \\hat{\\Gamma}^n \\big] }.\n\\label{eq:rdm_energy}\n\\end{equation}\nTherefore, the following is a good test of the newly-introduced ideas, as well as providing further insight into the effect of the initiator adaptation for different estimators and excited states.\n\nThe dipole moment for the state $|\\Phi^n\\ket$ is defined by\n\\begin{equation}\n\\bs{\\mu}_{n} = \\sum_{pq} \\gamma_{p,q}^{n} \\bra p | \\hat{\\bs{r}} | q \\ket,\n\\end{equation}\nwhile a transition dipole moment, $\\bs{t}_{nm}$, is defined by Eq.~(\\ref{eq:trans_dip_mom}), and the corresponding oscillator strength by\n\\begin{equation}\nf_{nm} = \\frac{2}{3} \\Delta E_{nm} |\\bs{t}_{nm}|^2,\n\\end{equation}\nfor an energy gap of $\\Delta E_{nm}$ between states $|\\Phi^n\\ket$ and $|\\Phi^m\\ket$.\n\n\\begin{figure*}[t!]\n\\includegraphics{lih.eps}\n\\caption{Initiator error convergence for the five lowest energy states of LiH in an aug-cc-pVQZ basis, at an internuclear distance of 
$1.5957$\\AA~as the number of walkers in each distribution is increased. Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\\textrm{RDM}}$. $N_w$ denotes the number of walkers for \\emph{each} state and replica sampled. Error bars were obtained by averaging over 5 independent simulations.}\n\\label{fig:lih_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.3251372(2) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.130434(1) & 2.01947(4) & 0.965189(7) & 0.081007(1) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.2149799(6) & -3.3543(9) & 0.37471(1) & 0.020123(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.229077(4) & 5.0832(8) & 0.09126(8) & 0.001271(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246350(3) & -0.2958(3) & 0.56074(2) & 0.051639(4) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.306440(9) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132458(3) & 2.02541(7) & 0.93538(2) & 0.077262(4) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.216705(6) & -3.794(1) & 0.41146(2) & 0.024459(2) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.230621(2) & 5.533(1) & 0.07042(8) & 0.000762(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246520(2) & -0.6235(7) & 0.693170(7) & 0.078966(2) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.30168(3) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132943(7) & 2.0188(1) & 0.92658(4) & 0.076093(7) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.217616(7) & 
-3.696(2) & 0.3984(1) & 0.02303(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.231229(2) & 6.211(2) & 0.1083(2) & 0.001809(6) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.242846(9) & -1.998(2) & 0.6201(5) & 0.06224(9) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the LiH molecule at an internuclear distance of $1.5957$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers (which happen to all be ${}^1\\Sigma^+$ states). $n=0$ refers to the ground state, $n>0$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}. In the small aug-cc-pVDZ basis, all results were verified against exact FCI results obtained from PySCF (not shown here).}\n\\label{tab:lih}\n\\end{center}\n\\end{table*}\n\nFor all simulations, the initial restricted Hartree--Fock (RHF) calculation was performed by PySCF\\cite{pyscf}. Integrals from PySCF were then passed to our FCIQMC program, \\url{NECI}, for the main calculation, which output one- and two-body density matrices. These were then contracted with integrals from PySCF to calculate final dipole moment estimates. Energy estimates were calculated on-the-fly in \\url{NECI}.\n\nThe five lowest energy states were calculated for LiH and BH, and the four lowest states of MgO, considering only states with $M_s=0$ and using the $A_1$ irreducible representation (irrep) of the $C_{2v}$ point group. Also, time-reversal symmetrized functions\\cite{Smeyers1973} were used as the many-particle basis states, therefore restricting the total spin quantum number, $S$, to be even, and thus removing triplet states. 
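The oscillator-strength relation defined earlier, $f_{nm} = \frac{2}{3} \Delta E_{nm} |\bs{t}_{nm}|^2$, can be verified directly against the tabulated values. The following is a minimal sketch (not part of the original workflow), using the LiH aug-cc-pVTZ entries from Table~\ref{tab:lih}, all in atomic units:

```python
# Check of the oscillator-strength relation f_0n = (2/3) * dE_0n * t_0n**2,
# using LiH aug-cc-pVTZ values from Table tab:lih (atomic units).
def oscillator_strength(delta_e, t):
    """Oscillator strength for energy gap delta_e (E_h) and transition dipole t (e*a_0)."""
    return (2.0 / 3.0) * delta_e * t ** 2

# n = 4 state: dE_04 = 0.246520 E_h, t_04 = 0.693170 e*a_0
f_04 = oscillator_strength(0.246520, 0.693170)
print(round(f_04, 6))  # 0.078966, matching the tabulated f_04 = 0.078966(2)
```

The same check reproduces the other tabulated $f_{0n}$ values to within their quoted stochastic errors.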
In all cases, the FCIQMC simulation time step was varied in the initial iterations so as to prevent ``bloom'' events, where many walkers can be created in a single spawning event (which often leads to large initiator error).\n\nWe also note that in generating excitations for the walker spawning step, we use an approach that greatly improves efficiency compared to the uniform sampling used in early FCIQMC results\\cite{Booth2009}. In this approach, the pair of orbital labels from which electrons are excited, $(i,j)$, is chosen uniformly, while the orbitals excited to, $(a,b)$, are selected with probabilities drawn from a Cauchy--Schwarz distribution, namely $p(ab|ij) \\propto \\sqrt{\\langle ia|ia \\rangle \\langle jb|jb \\rangle}$.\\cite{Smart_unpublished} Another approach to select connections efficiently was considered by Holmes \\emph{et al.}\\cite{Holmes2016}, but not used here.\n\nAll simulations used the semi-stochastic adaptation to reduce stochastic errors\\cite{Petruzielo2012, Blunt2015}. For the LiH molecule the deterministic space consisted of all configurations up to and including double excitations from the Hartree--Fock determinant. For the BH and MgO molecules the deterministic space was formed from the $10^4$ most populated configurations across all wave functions sampled, once the simulations were deemed to have largely converged, using the approach described in Ref.~(\\onlinecite{Blunt2015}).\n\n\\subsection{LiH}\n\nSimulations on LiH were performed using between $1.25 \\times 10^4$ and $10^6$ walkers per simulation (i.e., for each state and replica sampled), in order to converge initiator error for all states. Density matrices were typically averaged over $10^5$ iterations, once convergence was deemed to have been reached for all states and all estimators. 
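The Cauchy--Schwarz-weighted excitation generation described above can be illustrated with a toy sketch. Note this is purely schematic: the exchange-type values standing in for $\langle ia|ia \rangle$ and $\langle jb|jb \rangle$ are random placeholders, not real two-electron integrals.

```python
import random

# Toy sketch of Cauchy-Schwarz-weighted selection of the target orbital pair (a, b),
# with p(ab|ij) proportional to sqrt(<ia|ia> <jb|jb>). The "integral" values below
# are random placeholders, not real molecular integrals.
random.seed(0)
n_virt = 6
exch_i = [random.random() for _ in range(n_virt)]  # stands in for <ia|ia>
exch_j = [random.random() for _ in range(n_virt)]  # stands in for <jb|jb>

weights = {(a, b): (exch_i[a] * exch_j[b]) ** 0.5
           for a in range(n_virt) for b in range(n_virt) if a != b}
norm = sum(weights.values())
probs = {ab: w / norm for ab, w in weights.items()}

# Sample one (a, b) pair; the generation probability p(ab|ij) is also needed
# to unbias the spawning amplitude in an actual FCIQMC step.
pairs = list(probs)
a, b = random.choices(pairs, weights=[probs[p] for p in pairs], k=1)[0]
print(a, b, probs[(a, b)])
```

Pairs connected by large exchange-type integrals are thus sampled more often, concentrating spawning attempts on the strongest Hamiltonian connections.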
These entire simulations were then repeated five times with different initial RNG seeds, and the results averaged in order to calculate error estimates.\n\nFigure~\\ref{fig:lih_init} shows initiator convergence for LiH in the aug-cc-pVQZ basis set, for the lowest five energy eigenstates, and for four different estimators: dipole moments, transition dipole moments, and energies calculated from both the RDM-based energy estimator, Eq.~(\\ref{eq:rdm_energy}), and from a trial wave function-projected estimator:\n\\begin{equation}\n(E_{\\textrm{Trial}})_n = \\frac{ \\bra \\Psi_{\\textrm{Trial}}^n | \\hat{H} | \\Psi^n \\ket }{ \\bra \\Psi_{\\textrm{Trial}}^n | \\Psi^n \\ket }.\n\\label{eq:trial_energy}\n\\end{equation}\nHere, $| \\Psi_{\\textrm{Trial}}^n \\ket$ is a trial wave function designed to have a large overlap with the exact state $| \\Phi^n \\ket$. We have discussed the use of such trial wave function estimators in excited-state FCIQMC in Ref.~(\\onlinecite{Blunt2015_3}). To generate $| \\Psi_{\\textrm{Trial}}^n \\ket$, we calculate the configuration interaction singles and doubles (CISD) wave functions for the lowest fifteen energy states. Then, once convergence of all FCIQMC simulations is deemed to have been reached, we assign each simulation one trial wave function by choosing the CISD solution with the largest overlap in each case. The reason for obtaining more CISD solutions than FCIQMC simulations is that CISD solutions can have a different energy ordering to FCI solutions. Averaging of each $E_{\\textrm{Trial}}$ estimate was performed from roughly the same point that RDM sampling began, and so both RDM and trial energy estimates are obtained from a similar number of iterations, usually $10^5$.\n\nThe initiator-FCIQMC estimates in Figure~\\ref{fig:lih_init} are all plotted relative to their values at the largest walker population considered, $N_{w}=10^6$. 
Here, convergence has been largely reached in all cases, and so the figures effectively plot initiator error against walker population. Reassuringly, initiator error in energy estimates is incredibly small for both estimators and for all states. Indeed, the largest error at the smallest walker population tested is less than $\\sim 0.5$ m$E_\\textrm{h}$ for $E_{\\textrm{Trial}}$.\n\nInterestingly, initiator error in $E_{\\textrm{RDM}}$ is much smaller than in $E_{\\textrm{Trial}}$. This is a trend that we have often observed, although exceptions do occur (and in the limit of an exact $| \\Psi_{\\textrm{Trial}}^n \\ket$, the initiator error is zero). Initiator error in the $E_{\\textrm{RDM}}$ energies is variational in all cases within stochastic errors, while this is not strictly enforced (though common) for $E_{\\textrm{Trial}}$. For RDM-based energy estimates, this variationality is effectively ensured by the Hylleraas--Undheim--McDonald theorem\\cite{Hylleraas1930, McDonald1933}, which is expected to approximately hold for FCIQMC-sampled wave functions. Initiator error is larger for excited states, as previously observed\\cite{Blunt2015_3}. This is expected due to the more multi-configurational nature of excited states. It remains to be seen whether orbital optimization can increase this rate of convergence for excited states. Random errors, however, are larger in the RDM-based energy estimates, which is expected because two uncorrelated simulations (from the two replicas) contribute to this quantity. However, error bars are extremely small in all cases here, always being smaller than $10^{-2}$ m$E_{\\textrm{h}}$.\n\n\\begin{figure*}[t!]\n\\includegraphics{bh.eps}\n\\caption{Initiator convergence for the five lowest energy states of BH in an aug-cc-pVTZ basis, at an internuclear distance of $1.2324$\\AA. 
Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\\textrm{RDM}}$. $N_w$ denotes the number of walkers for \\emph{each} state and replica sampled. Error bars were obtained by averaging over 5 independent simulations.}\n\\label{fig:bh_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.528082(7) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.216230(3) & -0.18983(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23727(1) & -1.4146(5) & 0.93478(3) & 0.13822(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.257587(4) & -0.3219(3) & 0.2102(1) & 0.007590(9) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.282665(1) & 3.5459(1) & 0.44725(4) & 0.037696(7) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54561(2) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.211482(6) & -0.19271(7) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.238668(8) & -1.2943(5) & 0.88508(5) & 0.12464(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.253574(4) & -0.4973(6) & 0.1454(2) & 0.00358(1) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.283481(2) & 3.4088(2) & 0.35740(7) & 0.024141(9) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54914(6) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.21059(2) & -0.1968(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23876(3) & -1.268(3) & 0.8704(3) & 0.1206(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.25261(3) & -0.504(3) & 0.139(1) & 0.00327(7) 
\\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.28289(1) & 3.2889(9) & 0.3138(1) & 0.01857(2) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the BH molecule at an internuclear distance of $1.2324$\\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\\textrm{even}$ quantum numbers. $n=0$ refers to the ground state, $n>0$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\\cite{pyscf}.}\n\\label{tab:bh}\n\\end{center}\n\\end{table*}\n\nThe calculation of dipole moments provides a more interesting test, due to their greater dependence on more highly-excited determinants and diffuse single-particle orbitals. The relative initiator error is much larger, particularly for certain excited states (e.g. $\\mu_2$ and $\\mu_3$). The transition dipole moments considered involve transitions from the ground ($n=0$) state to excited ($n>0$) states. Because they always involve the ground state, it is to be expected that they have smaller relative initiator and stochastic error, compared to the corresponding non-transition dipole moment (i.e. $t_{0n}$ compared to $\\mu_n$). This expectation is borne out in the results, with initiator and stochastic error in $t_{0n}$ often being $\\sim 5$ times smaller than for $\\mu_n$. For the calculation of dipole moments from FCIQMC-sampled RDMs, relative stochastic errors are clearly much larger than for energies, and so the use of the semi-stochastic adaptation is of great importance here, whereas its use can be somewhat unnecessary in small ground-state energy calculations.\n\nClearly, the accurate calculation of dipole moments is more challenging than energies, requiring larger walker populations to obtain similar relative errors. 
However, this is not uniquely a feature of the initiator approximation in FCIQMC, but is equally true in other approximate methods, where properties such as the dipole moment are far more sensitive to the basis set and quality of the wave function than ground state energetics. That we are able to observe systematic convergence of these quantities, with respect to a single simulation parameter, is reassuring.\n\nTable~\\ref{tab:lih} gives final results for the aug-cc-pV$X$Z basis sets, with $X=2,3,4$. Results in the small $X=2$ basis were fully converged at the smallest walker populations considered, $N_w = 1.25 \\times 10^4$, as confirmed by comparison to FCI results from the PySCF program. As expected, dipole moments vary quite substantially with basis set, particularly for the second, third and fourth excited states, demonstrating the importance of large basis sets with diffuse functions. Errors in parentheses denote stochastic error bars, not initiator error, which is larger. However, given the careful convergence of initiator error, as shown in Figure~\\ref{fig:lih_init}, we expect dipole moments to be converged to around $10^{-3}e a_0$ in most cases, and energies to be converged \\emph{substantially} beyond chemical accuracy.\n\n\\subsection{BH}\n\nFigure~\\ref{fig:bh_init} shows results for BH in the aug-cc-pVTZ basis set and at an internuclear distance of $1.2324$\\AA, demonstrating similar initiator convergence plots to those in Figure~\\ref{fig:lih_init}. Here, results used between $1.25 \\times 10^4$ and $2 \\times 10^6$ walkers per simulation. RDM estimators and $E_{\\textrm{Trial}}$ were averaged over $5 \\times 10^4$ iterations, once convergence was achieved for all states and estimators. 
Here, instead of using CISD solutions as trial wave functions for $E_{\\textrm{Trial}}$, a slightly different approach was used: a ``trial space'' was defined as consisting of the $2 \\times 10^3$ most populated configurations across all simulations, once convergence had been approximately reached. Trial wave functions were then obtained as the eigenstates of $\\hat{H}$ within this subspace. This is similar to the approach used to generate the deterministic space, as described above\\cite{Blunt2015}, and allows important basis states to be picked while keeping the calculation of each $| \\Psi_{\\textrm{Trial}}^n \\ket$ inexpensive.\n\nResults contain the same features as observed for LiH. Initiator error in the energy estimates is extremely small in all cases, particularly for estimates obtained from contraction of the RDM, and initiator convergence always occurs variationally. Stochastic error bars are larger for $E_{\\textrm{RDM}}$, as well as for excited states, but remain extremely small. For dipole moments, similar trends also occur. Initiator and stochastic relative errors for the dipole moment are very small for the ground and first excited states ($\\mu_0$ and $\\mu_1$) and for the corresponding transition dipole moment ($t_{01}$) even at small walker populations. However, results for higher excited states contain larger errors, although we once again observe that errors in $t_{0n}$ are smaller than errors in $\\mu_n$ for each $n$, presumably because each transition dipole moment involves the ground state, which is well converged at lower walker populations.\n\nTable~\\ref{tab:bh} shows final results in aug-cc-pV$X$Z basis sets, for $X=2,3,4$. Results for $X=2$ used $2 \\times 10^5$ walkers per simulation, while those for $X=3$ and $X=4$ used $2 \\times 10^6$ walkers per simulation. The expected strong dependence of dipole moments on the basis set is once again observed. 
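The ``trial space'' construction described above amounts to a small subspace eigenproblem: project $\hat{H}$ onto the most populated configurations and diagonalize. A schematic sketch, with a random symmetric matrix and random populations standing in for the true Hamiltonian and FCIQMC walker data, might look like:

```python
import numpy as np

# Sketch of the trial-space construction: project H into the subspace of the
# most populated configurations and diagonalize. The Hamiltonian and walker
# populations here are random stand-ins, not real FCIQMC data.
rng = np.random.default_rng(1)
n_full, n_trial = 50, 8

h_full = rng.normal(size=(n_full, n_full))
h_full = 0.5 * (h_full + h_full.T)               # symmetric "Hamiltonian"
populations = np.abs(rng.normal(size=n_full))    # toy walker populations

keep = np.argsort(populations)[-n_trial:]        # most populated configurations
h_sub = h_full[np.ix_(keep, keep)]               # projected Hamiltonian

energies, trial_vectors = np.linalg.eigh(h_sub)  # inexpensive subspace eigenproblem
# Column k of trial_vectors defines the trial wave function for state k,
# expanded in the kept configurations.
print(energies)
```

Because the subspace is small, the diagonalization is cheap relative to the FCIQMC simulation itself, while still capturing the dominant configurations of each state.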
This is particularly true for the second, third and fourth excited states ($n=2,3,4$). We note that these three states also contained the largest initiator error at small walker populations, as seen in Figure~\\ref{fig:bh_init}. This is probably not a coincidence, since the initiator approximation will inevitably result in a poorer description of highly excited regions of the wave function, presumably including excitations into high-energy diffuse functions, which appear important for accurate calculation of dipole moments for these particular states. Despite the larger initiator error than for energy estimates, the space is still substantially undersampled here: $2 \\times 10^6$ walkers are used for a space of size $\\sim 7 \\times 10^9$ in the aug-cc-pVQZ basis, even for this small molecule, and the benefits of Monte Carlo sampling typically increase with system size.\n\n\\subsection{MgO}\n\n\\begin{figure*}[t!]\n\\includegraphics{mgo.eps}\n\\caption{Initiator convergence for dipole moments (left) and energies (right), for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Energies are calculated from both RDM ($E_{\\textrm{RDM}}$) and trial wave function ($E_{\\textrm{Trial}}$) based estimates, and become equal to good accuracy at large walker number, $N_w$. Dipole moments appear mostly converged at $N_w=3.2 \\times 10^7$, except for $\\mu_1$. 
Error bars are only available for $N_w < 10^6$, but are small by this point and should only decrease in magnitude for larger walker populations.}\n\\label{fig:mgo_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}c|ccc|ccc@{}}\n\\hline\n\\hline\nState $n$ & \\multicolumn{3}{c|}{ Energy/$E_{\\textrm{h}}$ } & \\multicolumn{3}{c}{ Dipole moment ($\\mu_n$) /$ea_0$ } \\\\\n\\hline\n & CCSD & CCSDT & FCIQMC & CCSD & CCSDT & FCIQMC \\\\\n\\hline\n0 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.632 & -274.651 & -274.654 & 2.590 & 2.398 & 2.382 \\\\\n1 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.531 & -274.559 & -274.564 & 1.811 & 2.008 & 2.289 \\\\\n2 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.480 & -274.514 & -274.517 & 0.297 & 0.847 & 1.154 \\\\\n3 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.440 & -274.478 & -274.480 & -0.366 & 0.529 & 1.198 \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Energies and dipole moments for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Error bars on FCIQMC results are not given, but are smaller than the order to which results are presented. FCIQMC energies are taken from the RDM-based estimates, $E_{\\textrm{RDM}}$. CCSD and CCSDT values were obtained from NWChem\\cite{NWChem}.}\n\\label{tab:mgo}\n\\end{center}\n\\end{table*}\n\nTo study a more challenging problem, we consider the calculation of energies and dipole moments for the MgO molecule, at its ground state equilibrium separation of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. Thus, a total of 16 electrons are correlated in $48$ spatial orbitals. 
Enforcing $M_S=0$, using the $A_1$ irrep of the $C_{2v}$ point group, and working with time-reversal symmetrized functions\cite{Smeyers1973} (to enforce $S=\textrm{even}$), results in a space size of roughly $1.8 \times 10^{16}$ basis functions. This is a large space, particularly given the challenges of converging initiator error in excited-state dipole moments, as seen already.

Figure~\ref{fig:mgo_init} presents initiator convergence for walker populations (per state and per replica), $N_w$, ranging from $2.5 \times 10^4$ to $3.2 \times 10^7$. The ground state and first three excited states are calculated. For $N_w \le 4 \times 10^5$, error bars are calculated by averaging over 5 repeated calculations with varying RNG seeds. Due to the expensive nature of these calculations, repeats were not performed for $N_w > 4 \times 10^5$, and so error bars were not obtained. However, these error bars should only decrease with increasing $N_w$, and are already small at $N_w = 4 \times 10^5$. Therefore, at the largest walker populations considered, stochastic error should be much smaller than initiator error.

Initiator profiles of both $E_{\textrm{RDM}}$ and $E_{\textrm{Trial}}$ estimators are presented in Figure~\ref{fig:mgo_init}. At convergence, these should clearly become equal. By $N_w = 3.2 \times 10^7$, this is the case to much better than $1$m$E_\textrm{h}$ accuracy. As previously found, convergence is monotonic in all cases and $E_{\textrm{RDM}}$ usually results in smaller initiator error.

Convergence of dipole moments is also shown. Here, relative initiator error is once again larger than for energies, and convergence is non-monotonic. Because of this non-monotonic behavior, combined with the challenging nature of the system, our confidence in the accurate convergence of these values is somewhat less than for LiH and BH results. We cannot rule out the possibility of sudden further convergence at higher $N_w$ values.
However, we believe any significant deviations are unlikely, although it is clear that $\mu_1$ in particular is not fully converged on the scale shown.

Table~\ref{tab:mgo} presents FCIQMC energies and dipole moments, using $N_w = 3.2 \times 10^7$, and with energies taken from the $E_{\textrm{RDM}}$ estimator. For comparison, coupled cluster results are shown, using both singles and doubles (CCSD) and singles, doubles and triples (CCSDT). These values were calculated using the NWChem package\cite{NWChem}, with the equation-of-motion (EOM-CCSD and EOM-CCSDT) variants used for excited states. As expected, energies obtained from CCSDT are accurate compared to FCIQMC values, even for excited states. Meanwhile, dipole moments show greater differences, particularly for the $n=3$ state. For this state, the EOM-CCSD and EOM-CCSDT values also differ greatly, with EOM-CCSD even predicting a dipole moment of the opposite sign. These results are consistent with the behavior observed in FCIQMC in regions of large initiator error: the relative error in dipole moments is much greater than in energies. We again expect that this is primarily due to the increased dependence on highly-excited determinants, since such configurations have particularly large amplitudes in excited states. CCSD and CCSDT appear to be unable to describe the wave function with sufficient accuracy in this region of configuration space, for this system, and for these challenging states.

\subsection{BH}

Figure~\ref{fig:bh_init} shows results for BH in the aug-cc-pVTZ basis set and at an internuclear distance of $1.2324$\AA, demonstrating similar initiator convergence plots to those in Figure~\ref{fig:lih_init}. Here, results used between $1.25 \times 10^4$ and $2 \times 10^6$ walkers per simulation. RDM estimators and $E_{\textrm{Trial}}$ were averaged over $5 \times 10^4$ iterations, once convergence was achieved for all states and estimators.
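The $E_{\textrm{RDM}}$ values used throughout are simple trace-ratio estimates, Eq.~(\ref{eq:rdm_energy}). As a minimal sketch of that estimator (with a small random symmetric matrix standing in for $\hat{H}$, not data from these calculations), one can verify that the trace ratio recovers the exact eigenvalue whenever the sampled density matrix corresponds to an exact eigenstate:

```python
import numpy as np

# Sketch of the trace-ratio estimator E_RDM = Tr[H @ G] / Tr[G].
# A random symmetric 4x4 matrix stands in for the Hamiltonian.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 4))
H = 0.5 * (H + H.T)                  # symmetrize

w, v = np.linalg.eigh(H)             # exact eigenpairs for reference
psi = v[:, 0]                        # exact ground state
G = np.outer(psi, psi)               # corresponding density matrix

E_rdm = np.trace(H @ G) / np.trace(G)
print(abs(E_rdm - w[0]) < 1e-12)     # True: exact for an exact state
```

In the stochastic setting, $\hat{\Gamma}^n$ is built from two independent replicas, so $E_{\textrm{RDM}}$ carries larger random error than projected estimates, as discussed above.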
", "Descriptive_question1": "What is the dipole moment for the ground state (n=0) of the BH molecule in the aug-cc-pVQZ basis set as shown in table_2?", "Descriptive_question2": "What is the energy gap for state n=2 of the BH molecule in the aug-cc-pVTZ basis set according to table_2?", "Reasoning_question1": "How does the dipole moment of the BH molecule for state n=4 vary across the different basis sets (aug-cc-pVDZ, aug-cc-pVTZ, aug-cc-pVQZ) in table_2, and what might this indicate about basis set sensitivity?", "Reasoning_question2": "Compare the oscillator strengths for transitions from the ground state to state n=2 across the three basis sets in table_2. What trend can be observed, and what could be the underlying reason for this behavior?", "Descriptive_answer1": "0.54914", "Descriptive_answer2": "0.238668", "Reasoning_answer1": "Looking at the dipole moment for state n=4 of the BH molecule across the basis sets in table_2, I observe the following values: 3.5459 in aug-cc-pVDZ, 3.4088 in aug-cc-pVTZ, and 3.2889 in aug-cc-pVQZ. 
This shows a decreasing trend as the basis set size increases from double-zeta (pVDZ) to quadruple-zeta (pVQZ). Analyzing this, the dipole moment appears to be converging toward a lower value with larger basis sets, which include more diffuse orbitals and provide a better description of the electron density, especially in the outer regions of the molecule. This variation indicates a high basis set sensitivity for the dipole moment in excited states like n=4, suggesting that accurate calculations require larger basis sets to capture the electronic structure properly, as smaller basis sets might overestimate the dipole moment due to insufficient flexibility in describing the wave function.", "Reasoning_answer2": "Examining the oscillator strengths for transitions from the ground state to state n=2 across the basis sets in table_2, I find the values are 0.13822 in aug-cc-pVDZ, 0.12464 in aug-cc-pVTZ, and 0.1206 in aug-cc-pVQZ. This reveals a clear decreasing trend as the basis set size increases. Considering the underlying reason, oscillator strength depends on both the energy gap and the transition dipole moment. From the table, the energy gap for state n=2 remains relatively stable across the basis sets (0.23727, 0.238668, 0.23876), so the decrease in oscillator strength is primarily driven by the reduction in transition dipole moment (0.93478, 0.88508, 0.8704). This suggests that larger basis sets provide a more accurate representation of the electronic wave functions, leading to a refined calculation of the transition dipole moment that decreases slightly as the basis set improves, likely due to better accounting for electron distribution and interactions in the excited state." 
}, { "paper_id": "1704.00864.json", "table_id": "table_3", "table_content": "\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}c|ccc|ccc@{}}\n\\hline\n\\hline\nState $n$ & \\multicolumn{3}{c|}{ Energy/$E_{\\textrm{h}}$ } & \\multicolumn{3}{c}{ Dipole moment ($\\mu_n$) /$ea_0$ } \\\\\n\\hline\n & CCSD & CCSDT & FCIQMC & CCSD & CCSDT & FCIQMC \\\\\n\\hline\n0 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.632 & -274.651 & -274.654 & 2.590 & 2.398 & 2.382 \\\\\n1 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.531 & -274.559 & -274.564 & 1.811 & 2.008 & 2.289 \\\\\n2 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.480 & -274.514 & -274.517 & 0.297 & 0.847 & 1.154 \\\\\n3 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.440 & -274.478 & -274.480 & -0.366 & 0.529 & 1.198 \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Energies and dipole moments for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Error bars on FCIQMC results are not given, but are smaller than the order to which results are presented. FCIQMC energies are taken from the RDM-based estimates, $E_{\\textrm{RDM}}$. CCSD and CCSDT values were obtained from NWChem\\cite{NWChem}.}\n\\label{tab:mgo}\n\\end{center}\n\\end{table*}", "caption": "Energies and dipole moments for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Error bars on FCIQMC results are not given, but are smaller than the order to which results are presented. FCIQMC energies are taken from the RDM-based estimates, $E_{\\textrm{RDM}}$. 
CCSD and CCSDT values were obtained from NWChem\\cite{NWChem}.", "label": "tab:mgo", "section_info": "5 Results\n\\section{Results}\n\\label{sec:results}\n\nAs an initial test of these ideas, we consider the calculation of dipole moments, transition dipole moments, and oscillator strengths for low-lying states of small diatomic molecules. These quantities are of great importance for understanding various properties of molecular systems. The oscillator strength in particular is required to explain optical spectra, as it determines the probabilities of absorption and emission of photons coupling different electronic states. Nonetheless, dipole moments are challenging to calculate accurately, even for small molecules, because they are very sensitive to the quality of the wave function and single-particle basis set used, generally requiring many diffuse orbitals for an accurate description, with far greater basis set sensitivity than the energy\\cite{Green1974}.\n\nWe therefore begin by considering the LiH and BH molecules in aug-cc-pVDZ, aug-cc-pVTZ and aug-cc-pVQZ, containing $32$, $69$ and $126$ spatial orbitals respectively. The aug-cc-pVQZ basis 2-RDM was unobtainable in the previous RDM implementation, despite the small molecular size. We then consider the MgO molecule in an aug-cc-pVDZ basis set. 
We note that while the calculation of dipole moments only requires the 1-RDM, for these calculations we obtain the 1-RDM by contracting the 2-RDM, which we also use to calculate the energy using the estimator
\begin{equation}
(E_{\textrm{RDM}})_n = \frac{ \textrm{Tr} \big[ \hat{H} \; \hat{\Gamma}^n \big] }{ \textrm{Tr} \big[ \hat{\Gamma}^n \big] }.
\label{eq:rdm_energy}
\end{equation}
Therefore, the following is a good test of the newly-introduced ideas, as well as providing further insight into the effect of the initiator adaptation for different estimators and excited states.

The dipole moment for the state $|\Phi^n\ket$ is defined by
\begin{equation}
\bs{\mu}_{n} = \sum_{pq} \gamma_{p,q}^{n} \bra p | \hat{\bs{r}} | q \ket,
\end{equation}
while a transition dipole moment, $\bs{t}_{nm}$, is defined by Eq.~(\ref{eq:trans_dip_mom}), and the corresponding oscillator strength by
\begin{equation}
f_{nm} = \frac{2}{3} \Delta E_{nm} |\bs{t}_{nm}|^2,
\end{equation}
for an energy gap of $\Delta E_{nm}$ between states $|\Phi^n\ket$ and $|\Phi^m\ket$.

\begin{figure*}[t!]
\includegraphics{lih.eps}
\caption{Initiator error convergence for the five lowest energy states of LiH in an aug-cc-pVQZ basis, at an internuclear distance of $1.5957$\AA~as the number of walkers in each distribution is increased. Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\textrm{RDM}}$. $N_w$ denotes the number of walkers for \emph{each} state and replica sampled. 
Simulations were typically averaged over 5 simulations to obtain error bars.}\n\\label{fig:lih_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.3251372(2) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.130434(1) & 2.01947(4) & 0.965189(7) & 0.081007(1) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.2149799(6) & -3.3543(9) & 0.37471(1) & 0.020123(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.229077(4) & 5.0832(8) & 0.09126(8) & 0.001271(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246350(3) & -0.2958(3) & 0.56074(2) & 0.051639(4) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.306440(9) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132458(3) & 2.02541(7) & 0.93538(2) & 0.077262(4) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.216705(6) & -3.794(1) & 0.41146(2) & 0.024459(2) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.230621(2) & 5.533(1) & 0.07042(8) & 0.000762(2) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.246520(2) & -0.6235(7) & 0.693170(7) & 0.078966(2) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & -2.30168(3) & - & - \\\\\n & 1 $\\;$ (${}^1\\Sigma^+$) & 0.132943(7) & 2.0188(1) & 0.92658(4) & 0.076093(7) \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.217616(7) & -3.696(2) & 0.3984(1) & 0.02303(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.231229(2) & 6.211(2) & 0.1083(2) & 0.001809(6) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.242846(9) & -1.998(2) & 0.6201(5) & 0.06224(9) \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Final converged estimates for the LiH molecule at an internuclear distance of $1.5957$\\AA. 
Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\textrm{even}$ quantum numbers (which happen to all be ${}^1\Sigma^+$ states). $n=0$ refers to the ground state, $n \geq 1$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\cite{pyscf}. In the small aug-cc-pVDZ basis, all results were verified against exact FCI results obtained from PySCF (not shown here).}
\label{tab:lih}
\end{center}
\end{table*}

For all simulations, the initial restricted Hartree--Fock (RHF) calculation was performed by PySCF\cite{pyscf}. Integrals from PySCF were then passed to our FCIQMC program, \url{NECI}, for the main calculation, which output one- and two-body density matrices. These were then contracted with integrals from PySCF to calculate final dipole moment estimates. Energy estimates were calculated on-the-fly in \url{NECI}.

The five lowest energy states were calculated for LiH and BH, and the four lowest states of MgO, considering only states with $M_S=0$ and using the $A_1$ irreducible representation (irrep) of the $C_{2v}$ point group. Also, time-reversal symmetrized functions\cite{Smeyers1973} were used as the many-particle basis states, therefore restricting the total spin quantum number, $S$, to be even, and thus removing triplet states. In all cases, the FCIQMC simulation time step was varied in the initial iterations so as to prevent ``bloom'' events, where many walkers can be created in a single spawning event (which often leads to large initiator error).

We also note that in generating excitations for the walker spawning step, we use an approach that greatly improves efficiency compared to the uniform sampling used in early FCIQMC results\cite{Booth2009}. 
In this approach, the pair of orbital labels from which electrons are excited, $(i,j)$, is chosen uniformly, while the orbitals excited to, $(a,b)$, are selected with probabilities drawn from a Cauchy-Schwarz distribution, namely $p(ab|ij) \propto \sqrt{\langle ia|ia \rangle \langle jb|jb \rangle}$.\cite{Smart_unpublished} Another approach to select connections efficiently was considered by Holmes \emph{et al.}\cite{Holmes2016}, but not used here.

All simulations used the semi-stochastic adaptation to reduce stochastic errors\cite{Petruzielo2012, Blunt2015}. For the LiH molecule, the deterministic space consisted of all configurations up to and including double excitations from the Hartree--Fock determinant. For the BH and MgO molecules, the deterministic space was formed from the $10^4$ most populated configurations across all wave functions sampled, once the simulations were deemed to have largely converged, using the approach described in Ref.~(\onlinecite{Blunt2015}).

\subsection{LiH}

Simulations on LiH were performed using between $1.25 \times 10^4$ and $10^6$ walkers per simulation (i.e., for each state and replica sampled), in order to converge initiator error for all states. Density matrices were typically averaged over $10^5$ iterations, once convergence was deemed to have been reached for all states and all estimators. 
These entire simulations were then repeated five times with different initial RNG seeds, and the results averaged in order to calculate error estimates.

Figure~\ref{fig:lih_init} shows initiator convergence for LiH in the aug-cc-pVQZ basis set, for the lowest five energy eigenstates, and for four different estimators: dipole moments, transition dipole moments, and energies calculated from both the RDM-based energy estimator, Eq.~(\ref{eq:rdm_energy}), and from a trial wave function-projected estimator:
\begin{equation}
(E_{\textrm{Trial}})_n = \frac{ \bra \Psi_{\textrm{Trial}}^n | \hat{H} | \Psi^n \ket }{ \bra \Psi_{\textrm{Trial}}^n | \Psi^n \ket }.
\label{eq:trial_energy}
\end{equation}
Here, $| \Psi_{\textrm{Trial}}^n \ket$ is a trial wave function designed to have a large overlap with the exact state $| \Phi^n \ket$. We have discussed the use of such trial wave function estimators in excited-state FCIQMC in Ref.~(\onlinecite{Blunt2015_3}). To generate $| \Psi_{\textrm{Trial}}^n \ket$, we calculate the configuration interaction singles and doubles (CISD) wave functions for the lowest fifteen energy states. Then, once convergence of all FCIQMC simulations is deemed to have been reached, we assign each simulation one trial wave function by choosing the CISD solution with the largest overlap in each case. The reason for obtaining more CISD solutions than FCIQMC simulations is that CISD solutions can have a different energy ordering to FCI solutions. Averaging of each $E_{\textrm{Trial}}$ estimate was performed from roughly the same point that RDM sampling began, and so both RDM and trial energy estimates are obtained from a similar number of iterations, usually $10^5$.

The initiator-FCIQMC estimates in Figure~\ref{fig:lih_init} are all plotted relative to their values at the largest walker population considered, $N_{w}=10^6$. 
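A useful property of the projected estimator of Eq.~(\ref{eq:trial_energy}) is that it is exact whenever $|\Psi^n\ket$ is an exact eigenstate, for any trial with nonzero overlap. A minimal synthetic sketch (a random symmetric matrix stands in for $\hat{H}$; not data from these calculations):

```python
import numpy as np

# Sketch: E_Trial = <Psi_T|H|Psi> / <Psi_T|Psi> is exact when |Psi>
# is an exact eigenstate, regardless of the (imperfect) trial used.
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 5))
H = 0.5 * (H + H.T)                          # symmetrize

w, v = np.linalg.eigh(H)
psi = v[:, 0]                                # exact ground state
psi_trial = psi + 0.3 * rng.normal(size=5)   # imperfect trial

E_trial = (psi_trial @ H @ psi) / (psi_trial @ psi)
print(abs(E_trial - w[0]) < 1e-10)           # True
```

In practice $|\Psi^n\ket$ is only sampled, so the quality of $|\Psi_{\textrm{Trial}}^n\ket$ does affect both the residual bias and the stochastic error of the estimate.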
Here, convergence has been largely reached in all cases, and so the figures effectively plot initiator error against walker population. Reassuringly, initiator error in energy estimates is incredibly small for both estimators and for all states. Indeed, the largest error at the smallest walker population tested is less than $\sim 0.5$ m$E_\textrm{h}$ for $E_{\textrm{Trial}}$.

Interestingly, initiator error in $E_{\textrm{RDM}}$ is much smaller than in $E_{\textrm{Trial}}$. This is a trend that we have often observed, although exceptions do occur (and in the limit of an exact $| \Psi_{\textrm{Trial}}^n \ket$, the initiator error is zero). Initiator error in the $E_{\textrm{RDM}}$ energies is variational in all cases within stochastic errors, while it is not strictly enforced (though common) for this to also be the case for $E_{\textrm{Trial}}$. For RDM-based energy estimates, this variationality is effectively ensured by the Hylleraas-Undheim-McDonald theorem\cite{Hylleraas1930, McDonald1933}, which is expected to approximately hold for FCIQMC-sampled wave functions. Initiator error is larger for excited states, as previously observed\cite{Blunt2015_3}. This is expected due to the more multi-configurational nature of excited states. It remains to be seen whether orbital optimization can increase this rate of convergence for excited states. Random errors, however, are larger in the RDM-based energy estimates, which is expected because two uncorrelated simulations (from the two replicas) contribute to this quantity. However, error bars are extremely small in all cases here, always being smaller than $10^{-2}$ m$E_{\textrm{h}}$.

\begin{figure*}[t!]
\includegraphics{bh.eps}
\caption{Initiator convergence for the five lowest energy states of BH in an aug-cc-pVTZ basis, at an internuclear distance of $1.2324$\AA. 
Results are shifted relative to their values at the largest walker population considered, therefore approximately representing the initiator error. (a) Dipole moments. (b) Transition dipole moments from the ground state. (c) Energy calculated from a trial estimator, $E_{\\textrm{Trial}}$. (d) Energy calculated from the RDM estimator, $E_{\\textrm{RDM}}$. $N_w$ denotes the number of walkers for \\emph{each} state and replica sampled. Simulations were typically averaged over 5 simulations to obtain error bars.}\n\\label{fig:bh_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}lccccc@{}}\n\\hline\n\\hline\nBasis & State $n$ & Energy gap ($\\Delta E_{0n}$) & Dipole moment ($\\mu_n$) & Transition dipole moment ($t_{0n}$) & Oscillator strength ($f_{0n}$) \\\\\n\\hline\naug-cc-pVDZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.528082(7) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.216230(3) & -0.18983(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23727(1) & -1.4146(5) & 0.93478(3) & 0.13822(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.257587(4) & -0.3219(3) & 0.2102(1) & 0.007590(9) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.282665(1) & 3.5459(1) & 0.44725(4) & 0.037696(7) \\\\\n\\hline\naug-cc-pVTZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54561(2) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.211482(6) & -0.19271(7) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.238668(8) & -1.2943(5) & 0.88508(5) & 0.12464(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.253574(4) & -0.4973(6) & 0.1454(2) & 0.00358(1) \\\\\n & 4 $\\;$ (${}^1\\Sigma^+$) & 0.283481(2) & 3.4088(2) & 0.35740(7) & 0.024141(9) \\\\\n\\hline\naug-cc-pVQZ & 0 $\\;$ (${}^1\\Sigma^+$) & - & 0.54914(6) & - & - \\\\\n & 1 $\\;$ ($\\; {}^1\\Delta \\;$) & 0.21059(2) & -0.1968(3) & 0.0 & 0.0 \\\\\n & 2 $\\;$ (${}^1\\Sigma^+$) & 0.23876(3) & -1.268(3) & 0.8704(3) & 0.1206(1) \\\\\n & 3 $\\;$ (${}^1\\Sigma^+$) & 0.25261(3) & -0.504(3) & 0.139(1) & 0.00327(7) 
\\
 & 4 $\;$ (${}^1\Sigma^+$) & 0.28289(1) & 3.2889(9) & 0.3138(1) & 0.01857(2) \\
\hline
\hline
\end{tabular}
}
\caption{Final converged estimates for the BH molecule at an internuclear distance of $1.2324$\AA. Results are for the five lowest energy states in the $A_1$ irrep of the $C_{2v}$ point group, with $M_S=0$ and $S=\textrm{even}$ quantum numbers. $n=0$ refers to the ground state, $n \geq 1$ to excited states. Numbers in parentheses denote stochastic error, not initiator error. Energy gaps ($\Delta E_{0n}$) were calculated using RDM-based energy estimates, Eq.~(\ref{eq:rdm_energy}). Integrals were generated using the PySCF program\cite{pyscf}.}
\label{tab:bh}
\end{center}
\end{table*}

The calculation of dipole moments provides a more interesting test, due to their greater dependence on more highly-excited determinants and diffuse single-particle orbitals. The relative initiator error is much larger, particularly for certain excited states (i.e. $\mu_2$ and $\mu_3$). The transition dipole moments considered involve transitions from the ground ($n=0$) state to excited ($n \geq 1$) states. Because they always involve the ground state, it is to be expected that they have smaller relative initiator and stochastic error, compared to the corresponding non-transition dipole moment (i.e. $t_{0n}$ compared to $\mu_n$). This expectation is borne out in the results, with initiator and stochastic error in $t_{0n}$ often being $\sim 5$ times smaller than for $\mu_n$. For the calculation of dipole moments from FCIQMC-sampled RDMs, relative stochastic errors are clearly much larger than for energies, and so the use of the semi-stochastic adaptation is of great importance here, whereas its use can be somewhat unnecessary in small ground-state energy calculations.

Clearly, the accurate calculation of dipole moments is more challenging than energies, requiring larger walker populations to obtain similar relative errors. 
However, this is not uniquely a feature of the initiator approximation in FCIQMC, but is equally true in other approximate methods, where properties such as the dipole moment are far more sensitive to the basis set and quality of the wavefunction than ground state energetics. That we are able to observe systematic convergence of these quantities, with respect to a single simulation parameter, is reassuring.

Table~\ref{tab:lih} gives final results for the aug-cc-pV$X$Z basis sets, with $X=2,3,4$. Results in the small $X=2$ basis were fully converged at the smallest walker populations considered, $N_w = 1.25 \times 10^4$, as confirmed by comparison to FCI results from the PySCF program. As expected, dipole moments vary quite substantially with basis set, particularly for the second, third and fourth excited states, demonstrating the importance of large basis sets with diffuse functions. Errors in brackets denote stochastic error bars, not initiator error, which is larger. However, given the careful convergence of initiator error, as shown in Figure~\ref{fig:lih_init}, we expect dipole moments to be converged to around $10^{-3}e a_0$ in most cases, and energies to be converged \emph{substantially} beyond chemical accuracy.

\subsection{BH}

Figure~\ref{fig:bh_init} shows results for BH in the aug-cc-pVTZ basis set and at an internuclear distance of $1.2324$\AA, demonstrating similar initiator convergence plots to those in Figure~\ref{fig:lih_init}. Here, results used between $1.25 \times 10^4$ and $2 \times 10^6$ walkers per simulation. RDM estimators and $E_{\textrm{Trial}}$ were averaged over $5 \times 10^4$ iterations, once convergence was achieved for all states and estimators. 
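As a quick arithmetic consistency check, the oscillator strengths in Table~\ref{tab:bh} can be re-derived from the tabulated energy gaps and transition dipole moments via $f_{0n} = \frac{2}{3} \Delta E_{0n} |t_{0n}|^2$ (all quantities in atomic units). A short sketch using the aug-cc-pVTZ, $n=2$ entries:

```python
# Recompute f_{0n} = (2/3) * dE_{0n} * t_{0n}**2 from tabulated values
# (BH, aug-cc-pVTZ, n = 2, Table tab:bh; atomic units).
dE_02, t_02 = 0.238668, 0.88508
f_02 = (2.0 / 3.0) * dE_02 * t_02**2
print(round(f_02, 5))   # 0.12464, matching the tabulated value
```

The same relation reproduces the other tabulated $f_{0n}$ values to within the quoted stochastic errors.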
Here, instead of using CISD solutions as trial wave functions for $E_{\textrm{Trial}}$, a slightly different approach was used: a ``trial space'' was defined, consisting of the $2 \times 10^3$ most populated configurations across all simulations, once convergence had been approximately reached. Trial wave functions were then obtained as the eigenstates of $\hat{H}$ within this subspace. This is similar to the approach used to generate the deterministic space, as described above\cite{Blunt2015}, and allows important basis states to be selected while keeping the calculation of each $| \Psi_{\textrm{Trial}}^n \ket$ inexpensive.

Results contain the same features as observed for LiH. Initiator error in the energy estimates is extremely small in all cases, particularly for estimates obtained from contraction of the RDM, and initiator convergence always occurs variationally. Stochastic error bars are larger for $E_{\textrm{RDM}}$, as well as for excited states, but remain extremely small. For dipole moments, similar trends occur. Initiator and stochastic relative errors for the dipole moment are very small for the ground and first excited states ($\mu_0$ and $\mu_1$) and for the corresponding transition dipole moment ($t_{01}$), even at small walker populations. However, results for higher excited states contain larger errors, although we once again observe that errors in $t_{0n}$ are smaller than errors in $\mu_n$ for each $n$, presumably because each transition dipole moment considered involves the ground state, which is well converged at lower walker populations.

Table~\ref{tab:bh} shows final results in aug-cc-pV$X$Z basis sets, for $X=2,3,4$. Results for $X=2$ used $2 \times 10^5$ walkers per simulation, while results for $X=3$ and $X=4$ used $2 \times 10^6$ walkers per simulation. The expected strong dependence of dipole moments on the basis set is once again observed. 
This is particularly true for the second, third and fourth excited states ($n=2,3,4$). We note that these three states also contained the largest initiator error at small walker populations, as seen in Figure~\ref{fig:bh_init}. This is probably not a coincidence, since the initiator approximation will inevitably result in a poorer description of highly excited regions of the wave function, presumably including excitations into high-energy diffuse functions, which appear important for accurate calculation of dipole moments for these particular states. Although initiator error is larger here than for energy estimates, we note that the space is still substantially undersampled: $2 \times 10^6$ walkers are used for a space of size $\sim 7 \times 10^9$ in the aug-cc-pVQZ basis, even for this small molecule, and the benefits of Monte Carlo sampling typically increase with system size.\n\n\subsection{MgO}\n\n\begin{figure*}[t!]\n\includegraphics{mgo.eps}\n\caption{Initiator convergence for dipole moments (left) and energies (right), for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\AA, and with 4 core electrons frozen. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\textrm{even}$ enforced (all ${}^1\Sigma^+$ states). Energies are calculated from both RDM ($E_{\textrm{RDM}}$) and trial wave function ($E_{\textrm{Trial}}$) based estimates, and become equal to good accuracy at large walker number, $N_w$. Dipole moments appear mostly converged at $N_w=3.2 \times 10^7$, except for $\mu_1$. 
Error bars are only available for $N_w < 10^6$, but are small by this point and should only decrease in magnitude for larger walker populations.}\n\\label{fig:mgo_init}\n\\end{figure*}\n\n\\begin{table*}[t]\n\\begin{center}{\\footnotesize\n\\begin{tabular}{@{\\extracolsep{4pt}}c|ccc|ccc@{}}\n\\hline\n\\hline\nState $n$ & \\multicolumn{3}{c|}{ Energy/$E_{\\textrm{h}}$ } & \\multicolumn{3}{c}{ Dipole moment ($\\mu_n$) /$ea_0$ } \\\\\n\\hline\n & CCSD & CCSDT & FCIQMC & CCSD & CCSDT & FCIQMC \\\\\n\\hline\n0 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.632 & -274.651 & -274.654 & 2.590 & 2.398 & 2.382 \\\\\n1 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.531 & -274.559 & -274.564 & 1.811 & 2.008 & 2.289 \\\\\n2 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.480 & -274.514 & -274.517 & 0.297 & 0.847 & 1.154 \\\\\n3 $\\;$ (${}^1\\Sigma^+$) $\\;$ & -274.440 & -274.478 & -274.480 & -0.366 & 0.529 & 1.198 \\\\\n\\hline\n\\hline\n\\end{tabular}\n}\n\\caption{Energies and dipole moments for MgO in an aug-cc-pVDZ basis set, at an internuclear distance of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. The four lowest-energy states are considered in the $A_1$ irrep of $C_{2v}$ and with $S=\\textrm{even}$ enforced (all ${}^1\\Sigma^+$ states). Error bars on FCIQMC results are not given, but are smaller than the order to which results are presented. FCIQMC energies are taken from the RDM-based estimates, $E_{\\textrm{RDM}}$. CCSD and CCSDT values were obtained from NWChem\\cite{NWChem}.}\n\\label{tab:mgo}\n\\end{center}\n\\end{table*}\n\nTo study a more challenging problem, we consider the calculation of energies and dipole moments for the MgO molecule, at its ground state equilibrium separation of $1.749$\\AA, and with 4 core electrons frozen at the Hartree--Fock level. Thus, a total of 16 electrons are correlated in $48$ spatial orbitals. 
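The size of this determinant space can be estimated by simple counting. A rough sketch, assuming the $A_1$ irrep retains roughly one quarter of the $M_S=0$ determinants ($C_{2v}$ has four irreps) and that time-reversal symmetrization roughly halves the count again:

```python
from math import comb

n_orb, n_alpha, n_beta = 48, 8, 8        # 16 correlated electrons, M_S = 0

# Slater determinants with M_S = 0: choose the alpha and beta
# occupations of the 48 spatial orbitals independently.
n_ms0 = comb(n_orb, n_alpha) * comb(n_orb, n_beta)

# Rough symmetry reductions (assumptions): the A_1 irrep keeps ~1/4 of
# the determinants, and time-reversal symmetrization (enforcing
# S = even) roughly halves the count again.
estimate = n_ms0 / 4 / 2
print(f"{estimate:.2e}")                 # ~1.8e16, consistent with the quoted space size
```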
Enforcing $M_S=0$, using the $A_1$ irrep of the $C_{2v}$ point group, and working with time-reversal symmetrized functions\cite{Smeyers1973} (to enforce $S=\textrm{even}$), results in a space size of roughly $1.8 \times 10^{16}$ basis functions. This is a large space, particularly given the challenges of converging initiator error in excited-state dipole moments, as seen above.\n\nFigure~\ref{fig:mgo_init} presents initiator convergence for walker populations (per state and per replica), $N_w$, ranging from $2.5 \times 10^4$ to $3.2 \times 10^7$. The ground state and first three excited states are calculated. For $N_w \le 4 \times 10^5$, error bars are calculated by averaging over 5 repeated calculations with varying RNG seeds. Due to the expense of these calculations, repeats were not performed for $N_w > 4 \times 10^5$, and so error bars were not obtained. However, these error bars should, in general, only decrease with increasing $N_w$, and are already small at $N_w = 4 \times 10^5$. Therefore, at the largest walker populations considered, stochastic error should be much smaller than initiator error.\n\nInitiator profiles of both $E_{\textrm{RDM}}$ and $E_{\textrm{Trial}}$ estimators are presented in Figure~\ref{fig:mgo_init}. At convergence, these should clearly become equal. By $N_w = 3.2 \times 10^7$, this is the case to much better than $1$m$E_\textrm{h}$ accuracy. 
However, we believe any significant deviations are unlikely, although it is clear that $\mu_1$ in particular is not fully converged on the scale shown.\n\nTable~\ref{tab:mgo} presents FCIQMC energies and dipole moments, using $N_w = 3.2 \times 10^7$, and with energies taken from the $E_{\textrm{RDM}}$ estimator. For comparison, coupled cluster results are shown, using both singles and doubles (CCSD) and singles, doubles and triples (CCSDT). These values were calculated using the NWChem package\cite{NWChem}, with the equation-of-motion (EOM-CCSD and EOM-CCSDT) variants used for excited states. As expected, energies obtained from CCSDT are accurate compared to FCIQMC values, even for excited states. Meanwhile, dipole moments show greater differences, particularly for the $n=3$ state. For this state, EOM-CCSD and EOM-CCSDT values also greatly differ, with a flipped dipole moment resulting from EOM-CCSD. These results are consistent with those observed in FCIQMC in regions of large initiator error: the relative error in dipole moments is much greater than in energies. We again expect that this is primarily due to the increased dependence on highly-excited determinants, and such configurations have particularly large amplitudes in excited states. 
CCSD and CCSDT appear to be unable to describe the wave function with sufficient accuracy in this region of configuration space, for this system, and for these challenging states.\n\n", "Descriptive_question1": "What is the FCIQMC energy value for the ground state (n=0) of MgO in table_3?", "Descriptive_question2": "What is the CCSD dipole moment value for state n=3 of MgO in table_3?", "Reasoning_question1": "How does the FCIQMC energy for state n=1 compare to the CCSD and CCSDT energies in table_3, and what might this indicate about the accuracy of these methods for excited states?", "Reasoning_question2": "Why might the dipole moment values for state n=3 in table_3 show such a large discrepancy between CCSD, CCSDT, and FCIQMC, and what does this suggest about the challenges in calculating dipole moments for higher excited states?", "Descriptive_answer1": "-274.654", "Descriptive_answer2": "-0.366", "Reasoning_answer1": "Let's compare the FCIQMC energy for state n=1 of MgO in table_3 with the corresponding CCSD and CCSDT values. The FCIQMC energy is -274.564 E_h, while the CCSD energy is -274.531 E_h and the CCSDT energy is -274.559 E_h. First, I note that FCIQMC provides a lower energy value compared to both CCSD and CCSDT, indicating it is closer to the true ground state energy for this excited state, as lower energy often corresponds to a better approximation in variational methods. The difference between CCSD and FCIQMC is 0.033 E_h, which is significant, suggesting that CCSD might not capture all necessary electron correlations for this excited state. CCSDT, with a difference of 0.005 E_h from FCIQMC, performs much better, indicating that including triple excitations improves the accuracy. 
This comparison suggests that for excited states like n=1, higher-order correlation methods like CCSDT or full configuration interaction approaches like FCIQMC are more reliable than CCSD, which may struggle with the multi-configurational nature of excited states.", "Reasoning_answer2": "Examining the dipole moment values for state n=3 in table_3, I observe a large discrepancy: CCSD gives -0.366 ea_0, CCSDT gives 0.529 ea_0, and FCIQMC gives 1.198 ea_0. Let's analyze why this might be the case. First, dipole moments are highly sensitive to the quality of the wavefunction, especially for higher excited states, which often involve more complex electronic configurations. The negative value in CCSD suggests a different orientation of the dipole compared to the positive values in CCSDT and FCIQMC, indicating that CCSD might fail to correctly describe the electron distribution in this state. The shift from CCSD to CCSDT shows that including triple excitations significantly alters the wavefunction, pushing the dipole moment to a positive value, though still far from FCIQMC. FCIQMC, being a near-exact method, likely captures more of the electron correlation and contributions from highly excited determinants, which are crucial for accurate dipole moment calculations. This discrepancy suggests that calculating dipole moments for higher excited states is particularly challenging due to their dependence on diffuse orbitals and multi-configurational effects, which lower-order methods like CCSD and even CCSDT may not fully account for. It highlights the need for high-accuracy methods like FCIQMC to reliably predict such properties in complex systems." 
}, { "paper_id": "1812.04217.json", "table_id": "table_1", "table_content": "\\begin{table}\n\\begin{tabular}{c|c|c}\n& Green's function & Out-of-time-ordered correlator (OTOC) \\\\\n\\hline\n\\hline\nMatsubara (time) & \n$C_{AB}^M(\\tau)=-\\langle \\hat A(\\tau)\\hat B(0)\\rangle$\n&\n$C_{(AB)^2}^M(\\tau)=-\\langle \\hat A(\\tau)\\hat B(0), \\hat A(\\tau)\\hat B(0) \\rangle$\n\\\\\n& ($0\\le\\tau\\le\\beta\\hbar$)\n& ($0\\le \\tau\\le \\frac{\\beta\\hbar}{2}$)\n\\\\\n& $C_{AB}^M(\\tau)=\\mp \\langle \\hat B(0)\\hat A(\\tau)\\rangle$\n& $C_{(AB)^2}^M(\\tau)=-\\langle \\hat B(0)\\hat A(\\tau), \\hat B(0)\\hat A(\\tau) \\rangle$\n\\\\\n& ($-\\beta\\hbar\\le\\tau<0$) & ($-\\tfrac{\\beta\\hbar}{2} \\le \\tau < 0$)\n\\\\\nperiodicity & $C_{AB}^M(\\tau+\\beta\\hbar)=\\pm C_{AB}^M(\\tau)$\n& $C_{(AB)^2}^M(\\tau+\\frac{\\beta\\hbar}{2})=C_{(AB)^2}^M(\\tau)$\n\\\\\nMatsubara (frequency) \n& $\\displaystyle\nC_{AB}^M(i\\omega_n)=\\int_0^{\\beta\\hbar}d\\tau e^{i\\omega_n\\tau}C_{AB}^M(\\tau)$\n& $\\displaystyle\nC_{(AB)^2}^M(i\\varpi_n)=\\int_0^{\\frac{\\beta\\hbar}{2}}d\\tau e^{i\\varpi_n\\tau}C_{(AB)^2}^M(\\tau)$\n\\\\\n& $\\omega_n=\n\\begin{cases}\n2n\\pi/\\beta\\hbar \\\\\n(2n+1)\\pi/\\beta\\hbar\n\\end{cases}\n(n\\in\\mathbb Z)$\n& $\\varpi_n=4n\\pi/\\beta\\hbar \\;\\;\\; (n\\in \\mathbb Z)$\n\\\\\nretarded \n& $C_{AB}^R(t,t')=-i\\theta(t-t')\\langle[\\hat A(t), \\hat B(t')]_\\mp\\rangle$\n& $C_{(AB)^2}^R(t,t')=-i\\theta(t-t')[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle$\n\\\\\n& &\n\\qquad $-\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle]$\n\\\\\nadvanced \n& $C_{AB}^A(t,t')=i\\theta(t'-t)\\langle[\\hat A(t), \\hat B(t')]_\\mp\\rangle$\n& $C_{(AB)^2}^A(t,t')=i\\theta(t'-t)[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle$\n\\\\\n& &\n\\qquad $-\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle]$\n\\\\\nKeldysh \n& $C_{AB}^K(t,t')=-i\\theta(t-t')\\langle[\\hat A(t), \\hat B(t')]_\\pm\\rangle$\n& 
$C_{(AB)^2}^K(t,t')=-i\\theta(t-t')[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle$\n\\\\\n& &\n\\qquad $+\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle]$\n\\\\\nanalytic continuation & $C_{AB}^M(i\\omega_n)\\longrightarrow C_{AB}^R(\\omega)$\n& $C_{(AB)^2}^M(i\\varpi_n)\\longrightarrow C_{(AB)^2}^R(\\omega)$\n\\\\\n& ($i\\omega_n\\to\\omega+i\\delta$) & ($i\\varpi_n\\to\\omega+i\\delta$)\n\\\\\nFDT & $C_{AB}^K(\\omega)=\n\\coth\\left(\\frac{\\beta\\hbar\\omega}{2}\\right)^{\\pm 1}$\n& $C_{(AB)^2}^K(\\omega)=\n\\coth\\left(\\frac{\\beta\\hbar\\omega}{4}\\right)$\n\\\\\n& \\qquad\\qquad\\qquad\\quad $\\times[C_{AB}^R(\\omega)-C_{AB}^A(\\omega)]$\n& \\qquad\\qquad\\qquad\\qquad\\qquad $\\times[C_{(AB)^2}^R(\\omega)-C_{(AB)^2}^A(\\omega)]$\n\\end{tabular}\n\\caption{Comparison between Green's function and OTOC. \nFor the Green's function, the statistical average $\\langle \\hat X\\rangle\\equiv{\\rm Tr}(e^{-\\beta\\hat H}\\hat X)/Z$ is used,\nwhile for the OTOC, the bipartite statistical average $\\langle \\hat X, \\hat Y\\rangle\\equiv{\\rm Tr}(e^{-\\frac{\\beta}{2}\\hat H}\\hat X e^{-\\frac{\\beta}{2}\\hat H}\\hat Y)/Z$ is used.\nIn the Green's function column, the upper sign is taken when either $\\hat A$ or $\\hat B$ is bosonic,\nand the lower sign is taken when both $\\hat A$ and $\\hat B$ are fermionic.}\n\\label{table:comparison}\n\\end{table}", "caption": "Comparison between Green's function and OTOC. 
\nFor the Green's function, the statistical average $\\langle \\hat X\\rangle\\equiv{\\rm Tr}(e^{-\\beta\\hat H}\\hat X)/Z$ is used,\nwhile for the OTOC, the bipartite statistical average $\\langle \\hat X, \\hat Y\\rangle\\equiv{\\rm Tr}(e^{-\\frac{\\beta}{2}\\hat H}\\hat X e^{-\\frac{\\beta}{2}\\hat H}\\hat Y)/Z$ is used.\nIn the Green's function column, the upper sign is taken when either $\\hat A$ or $\\hat B$ is bosonic,\nand the lower sign is taken when both $\\hat A$ and $\\hat B$ are fermionic.", "label": "table:comparison", "section_info": "3 Analytic continuation to real-time OTOCs\n\\section{Analytic continuation to real-time OTOCs}\n\\label{real-time OTOC}\n\nIn the previous section, we have introduced the imaginary-frequency four-point function $C_{(AB)^2}^M(i\\varpi_n)$ (\\ref{matsubara OTOC}). Let us recall that the Matsubara Green's function $C_{AB}^M(i\\omega_n)$ (\\ref{matsubara TOC})\ncan be analytically continued to the retarded Green's function $C_{AB}^R(\\omega)$\nvia the replacement $i\\omega_n\\to \\omega+i\\delta$. 
\nHere $C_{AB}^R(\\omega)=\\int_{-\\infty}^{\\infty}dt\\, e^{i\\omega t}C_{AB}^R(t,0)$ is the Fourier transform of\n\\begin{align}\nC_{AB}^R(t,t')\n&\\equiv\n-i\\theta(t-t')\\langle[\\hat A(t), \\hat B(t')]_\\mp \\rangle\n\\label{retarded Green}\n\\end{align}\nwith $[,]_\\mp$ representing the anticommutator ($\\{,\\}$) when both $\\hat A$ and $\\hat B$ are fermionic\nand the commutator ($[,]$) otherwise.\nIt is thus natural to ask what kind of function corresponds to the analytic continuation\nof $C_{(AB)^2}^M(i\\varpi_n)$.\n\nBelow we show that the analytic continuation of $C_{(AB)^2}^M(i\\varpi_n)$ through $i\\varpi_n\\to \\omega+i\\delta$\nis given by what we call {\\it the retarded OTOC} $C_{(AB)^2}^R(\\omega)$,\nwhich is defined by the Fourier transform $C_{(AB)^2}^R(\\omega)=\\int_{-\\infty}^{\\infty} dt\\, e^{i\\omega t}C_{(AB)^2}^R(t,0)$\nof\n\\begin{align}\nC_{(AB)^2}^R(t,t')\n&\\equiv\n-i\\theta(t-t')[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle\n-\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle].\n\\label{retarded OTOC}\n\\end{align}\nHere $\\theta(t)$ is the step function defined by $\\theta(t)=1$ ($t\\ge 0$) and $=0$ ($t<0$), and we used the notation of the bipartite statistical average\n\\begin{align}\n\\langle \\hat X, \\hat Y\\rangle\n&\\equiv{\\rm Tr}(\\hat\\rho^{\\frac{1}{2}}\\hat X\\hat\\rho^{\\frac{1}{2}}\\hat Y)\n\\end{align}\n(with $\\hat\\rho=e^{-\\beta\\hat H}/Z$ being the density matrix),\nwhich has previously appeared in the study of OTOCs \\cite{MaldacenaShenkerStanford2016, Yao2016,\nPatelSachdev2017, Patel2017, TsujiShitaraUeda2018a, LiaoGalitski2018}. 
\nIn terms of the bipartite statistical average,\nthe imaginary-time four-point function introduced in the previous section can be written as\n\\begin{align}\nC_{(AB)^2}^M(\\tau)\n&=\n\\begin{cases}\n-\\langle \\hat A(\\tau)\\hat B(0), \\hat A(\\tau)\\hat B(0) \\rangle\n& 0\\le \\tau\\le \\frac{\\beta\\hbar}{2},\n\\\\\n-\\langle \\hat B(0)\\hat A(\\tau), \\hat B(0)\\hat A(\\tau) \\rangle\n& -\\tfrac{\\beta\\hbar}{2} \\le \\tau < 0.\n\\end{cases}\n\\end{align}\nIf we introduce the commutator-anticommutator representation of OTOCs,\n\\begin{align}\nC_{[A,B]_{\\alpha_1}[A,B]_{\\alpha_2}}(t,t')\n&\\equiv\n\\langle[\\hat A(t), \\hat B(t')]_{\\alpha_1}, [\\hat A(t), \\hat B(t')]_{\\alpha_2} \\rangle\n\\quad\n(\\alpha_1, \\alpha_2=\\pm),\n\\end{align}\n$C_{(AB)^2}^R(t,t')$ can be written in the form \n\\begin{align}\nC_{(AB)^2}^R(t,t')\n&=\n-i\\theta(t-t')C_{\\{A,B\\}[A,B]}(t,t').\n\\label{retarded OTOC 2}\n\\end{align}\n\nThe original motivation to employ this form\nwas that the squared commutator $\\langle[\\hat A(t), \\hat B(t')]^2\\rangle$ might be ill-defined\nin the context of quantum field theory, because two operators can approach each other arbitrarily close in time,\nwhich may cause divergences. In this situation, one usually needs to regularize the squared commutator.\nOne prescription to regularize it is to take the bipartite statistical average,\n$\\langle[\\hat A(t), \\hat B(t')], [\\hat A(t), \\hat B(t')] \\rangle$, with which the two commutators are separated\nin the imaginary-time direction \\cite{MaldacenaShenkerStanford2016}. 
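Since the bipartite average is symmetric, $\langle \hat X, \hat Y\rangle=\langle \hat Y, \hat X\rangle$ by cyclicity of the trace, the cross terms in $\langle\{\hat A,\hat B\},[\hat A,\hat B]\rangle$ cancel, which is how Eq.~(\ref{retarded OTOC 2}) follows from Eq.~(\ref{retarded OTOC}). This is easy to verify numerically; a minimal sketch with arbitrary random $4\times 4$ Hermitian operators and $\hbar=1$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 4, 1.0                      # arbitrary small system, hbar = 1

def rand_herm(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (M + M.conj().T) / 2

H, A, B = rand_herm(d), rand_herm(d), rand_herm(d)
E, U = np.linalg.eigh(H)
Z = np.exp(-beta * E).sum()
rho_half = U @ np.diag(np.exp(-0.5 * beta * E) / np.sqrt(Z)) @ U.conj().T

def bipartite(X, Y):
    """Bipartite statistical average <X, Y> = Tr(rho^{1/2} X rho^{1/2} Y)."""
    return np.trace(rho_half @ X @ rho_half @ Y)

# Symmetry <X, Y> = <Y, X>, by cyclicity of the trace
assert np.isclose(bipartite(A, B), bipartite(B, A))

# Hence the cross terms cancel and <{A,B}, [A,B]> = <AB, AB> - <BA, BA>,
# i.e. the identity behind Eq. (retarded OTOC 2), here at t = t'
lhs = bipartite(A @ B + B @ A, A @ B - B @ A)
rhs = bipartite(A @ B, A @ B) - bipartite(B @ A, B @ A)
assert np.isclose(lhs, rhs)
print("identity verified:", lhs)
```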
\nThere is an information-theoretic meaning \nof the difference between the usual and bipartite statistical averages, which is given by\nthe Wigner-Yanase (WY) skew information \\cite{WignerYanase1963},\n\\begin{align}\nI_{\\frac{1}{2}}(\\hat\\rho,\\hat O)\n&\\equiv\n-\\frac{1}{2}{\\rm Tr}([\\hat\\rho^{\\frac{1}{2}}, \\hat O]^2)\n=\n\\langle \\hat O^2\\rangle-\\langle \\hat O, \\hat O \\rangle,\n\\end{align}\nfor a quantum state $\\hat\\rho$ and an observable $\\hat O$ (which is a hermitian operator).\nIt represents the information content of quantum fluctuations of the observable $\\hat O$\ncontained in the quantum state $\\hat\\rho$ (for further details on the WY skew information in the present context,\nwe refer to Refs.~\\cite{TsujiShitaraUeda2018a, Luo2005}).\nIf quantum fluctuations are suppressed (e.g., in the semiclassical regime), \none expects that OTOCs in the form of the usual and bipartite statistical averages would share\ncommon semiclassical features such as the chaotic exponential growth in the short-time regime (butterfly effect).\n\nIt has also recently been pointed out that OTOCs in the form of the usual statistical average\nmay involve scattering processes that contribute to the exponential growth but are not relevant to\nmany-body chaos, while OTOCs with the bipartite statistical average correctly capture \nchaotic properties \\cite{LiaoGalitski2018}.\nHereafter we focus on OTOCs in the form of the bipartite statistical average.\n\nThe relation between $C_{(AB)^2}^M(i\\varpi_n)$ and $C_{(AB)^2}^R(\\omega)$ is most clearly seen in the spectral\nrepresentation. 
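The equality of the two expressions for the WY skew information given above, and its non-negativity, can be verified directly; a minimal sketch with an arbitrary random thermal state and observable:

```python
import numpy as np

rng = np.random.default_rng(1)
d, beta = 4, 1.0                      # arbitrary small state and observable

def rand_herm(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (M + M.conj().T) / 2

H, O = rand_herm(d), rand_herm(d)     # thermal state rho = e^{-beta H}/Z, observable O
E, U = np.linalg.eigh(H)
Z = np.exp(-beta * E).sum()
rho = U @ np.diag(np.exp(-beta * E) / Z) @ U.conj().T
rho_half = U @ np.diag(np.exp(-0.5 * beta * E) / np.sqrt(Z)) @ U.conj().T

# I_{1/2}(rho, O) = -(1/2) Tr([rho^{1/2}, O]^2)
C = rho_half @ O - O @ rho_half
I_comm = -0.5 * np.trace(C @ C).real

# I_{1/2}(rho, O) = <O^2> - <O, O>
I_diff = (np.trace(rho @ O @ O) - np.trace(rho_half @ O @ rho_half @ O)).real

assert np.isclose(I_comm, I_diff)
assert I_comm >= 0.0                  # the skew information is non-negative
print("WY skew information:", I_comm)
```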
To obtain the spectral representation for $C_{(AB)^2}^M(i\\varpi_n)$, we expand it \nin the basis of eigenstates of $\\hat H$ denoted by $|n\\rangle$ with eigenenergies $E_n$,\n\\begin{align}\nC_{(AB)^2}^M(\\tau)\n&=\n-\\frac{1}{Z}\\sum_{klmn} e^{-\\frac{\\beta}{2}(E_k+E_m)}\ne^{\\frac{1}{\\hbar}(E_k-E_l+E_m-E_n)\\tau}\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle\n\\notag\n\\\\\n&=\n-\\frac{1}{Z}\\int_{-\\infty}^{\\infty}d\\omega'\\, e^{-\\omega'\\tau}\\sum_{klmn} e^{-\\frac{\\beta}{2}(E_k+E_m)}\n\\delta(\\omega'+\\tfrac{1}{\\hbar}(E_k-E_l+E_m-E_n))\n\\notag\n\\\\\n&\\quad\\times\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle.\n\\end{align}\nFrom the first to the second line, we inserted \n$1=\\int_{-\\infty}^{\\infty} d\\omega'\\, \\delta(\\omega'+\\frac{1}{\\hbar}(E_k-E_l+E_m-E_n))$,\nwhere $\\delta(\\omega)$ is the delta function. By Fourier transforming $C_{(AB)^2}^M(\\tau)$, we obtain\n\\begin{align}\nC_{(AB)^2}^M(i\\varpi_n)\n&=\n\\frac{1}{Z}\\int_{-\\infty}^{\\infty}d\\omega'\\, \\frac{1-e^{-\\frac{\\beta\\hbar\\omega'}{2}}}{i\\varpi_n-\\omega'}\\sum_{klmn} e^{-\\frac{\\beta}{2}(E_k+E_m)}\n\\delta(\\omega'+\\tfrac{1}{\\hbar}(E_k-E_l+E_m-E_n))\n\\notag\n\\\\\n&\\quad\\times\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle\n\\notag\n\\\\\n&=\n\\frac{1}{Z}\\int_{-\\infty}^{\\infty}d\\omega'\\, \\frac{1}{i\\varpi_n-\\omega'}\\sum_{klmn} \n(e^{-\\frac{\\beta}{2}(E_k+E_m)}-e^{-\\frac{\\beta}{2}(E_l+E_n)})\n\\delta(\\omega'+\\tfrac{1}{\\hbar}(E_k-E_l+E_m-E_n))\n\\notag\n\\\\\n&\\quad\\times\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle.\n\\end{align}\nMotivated by the above expression, let us define the spectral function for the OTOC by\n\\begin{align}\n\\mathscr 
A_{(AB)^2}(\\omega)\n&\\equiv\n\\frac{1}{Z}\\sum_{klmn} \n(e^{-\\frac{\\beta}{2}(E_k+E_m)}-e^{-\\frac{\\beta}{2}(E_l+E_n)})\n\\delta(\\omega+\\tfrac{1}{\\hbar}(E_k-E_l+E_m-E_n))\n\\notag\n\\\\\n&\\quad\\times\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle.\n\\label{OTOC spectral function}\n\\end{align}\nNote that $\\mathscr A_{(AB)^2}(\\omega)$ takes real values \nwhen $\\hat B=\\hat A^\\dagger$, since $\\mathscr A_{(AB)^2}(\\omega)^\\ast=\\mathscr A_{(B^\\dagger A^\\dagger)^2}(\\omega)$.\nHowever, in this case $\\mathscr A_{(AA^\\dagger)^2}(\\omega)$ is not necessarily positive semidefinite for $\\omega\\ge 0$. \nOne exception is the low-temperature limit, where $\\mathscr A_{(AA^\\dagger)^2}(\\omega)$ becomes positive semidefinite for $\\omega\\ge 0$. To see this, let us denote the ground state as $|g\\rangle$\nwith the eigenenergy $E_g$. In the zero-temperature limit, the spectral function approaches\n\\begin{align}\n\\mathscr A_{(AA^\\dagger)^2}(\\omega)\n&\\to\n\\frac{1}{Z}\\sum_{ln} \ne^{-\\beta E_g}\n\\delta(\\omega+\\tfrac{1}{\\hbar}(2E_g-E_l-E_n))\n\\langle g|\\hat A|l\\rangle \\langle l|\\hat A^\\dagger|g\\rangle\n\\langle g|\\hat A|n\\rangle \\langle n|\\hat A^\\dagger|g\\rangle\n\\notag\n\\\\\n&\\quad\n-\\frac{1}{Z}\\sum_{km} \ne^{-\\beta E_g}\n\\delta(\\omega+\\tfrac{1}{\\hbar}(E_k+E_m-2E_g))\n\\langle k|\\hat A|g\\rangle \\langle g|\\hat A^\\dagger|m\\rangle\n\\langle m|\\hat A|g\\rangle \\langle g|\\hat A^\\dagger|k\\rangle\n\\notag\n\\\\\n&=\n\\sum_{km} \n[\\delta(\\omega-\\tfrac{1}{\\hbar}(E_k+E_m-2E_g))\n-\\delta(\\omega+\\tfrac{1}{\\hbar}(E_k+E_m-2E_g))]\n|\\langle g|\\hat A|k\\rangle|^2\n|\\langle g|\\hat A|m\\rangle|^2\n\\notag\n\\\\\n&\\ge\n0\n\\quad\n(\\omega\\ge 0).\n\\end{align}\nThe spectral sum is given by\n\\begin{align}\n\\int_{-\\infty}^{\\infty} d\\omega\\, \\mathscr A_{(AB)^2}(\\omega)\n&=\n\\langle \\{\\hat A, \\hat B\\}, [\\hat A, \\hat B] 
\\rangle\n=:\nc_{AB}.\n\\end{align}\nUsing the spectral function $\\mathscr A_{(AB)^2}(\\omega)$, \nthe imaginary-frequency function $C_{(AB)^2}^M(i\\varpi_n)$ can be written as\n\\begin{align}\nC_{(AB)^2}^M(i\\varpi_n)\n&=\n\\int_{-\\infty}^{\\infty} d\\omega'\\, \\frac{\\mathscr A_{(AB)^2}(\\omega')}{i\\varpi_n-\\omega'}.\n\\label{matsubara OTOC Lehmann}\n\\end{align}\nThis is analogous to the Lehmann representation for the Matsubara Green's function,\n\\begin{align}\nC_{AB}^M(i\\omega_n)\n&=\n\\int_{-\\infty}^{\\infty} d\\omega' \\frac{\\mathscr A_{AB}(\\omega')}{i\\omega_n-\\omega'},\n\\end{align}\nwhere $\\mathscr A_{AB}(\\omega)$ is the spectral function for the Matsubara Green's function defined by\n\\begin{align}\n\\mathscr A_{AB}(\\omega)\n&\\equiv\n\\frac{1}{Z}\\sum_{kl} (e^{-\\beta E_k}\\mp e^{-\\beta E_l})\\delta(\\omega+\\tfrac{1}{\\hbar}(E_k-E_l))\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|k\\rangle.\n\\end{align}\nHere the sign $+$ is taken when both $\\hat A$ and $\\hat B$ are fermionic and the sign $-$ is taken otherwise.\n\nIn a similar manner, we can obtain the spectral representation of the retarded OTOC,\nwhich is expanded in the eigenbasis of the Hamiltonian as\n\\begin{align}\nC_{(AB)^2}^R(t,t')\n&=\n-i\\theta(t-t')\\frac{1}{Z}\\sum_{klmn}\ne^{-\\frac{\\beta}{2}(E_k+E_m)}\n\\Big[e^{\\frac{i}{\\hbar}(E_k-E_l+E_m-E_n)(t-t')}\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle\n\\notag\n\\\\\n&\\quad\n-e^{-\\frac{i}{\\hbar}(E_k-E_l+E_m-E_n)(t-t')}\n\\langle k|\\hat B|l\\rangle \\langle l|\\hat A|m\\rangle\n\\langle m|\\hat B|n\\rangle \\langle n|\\hat A|k\\rangle\\Big].\n\\label{retarded OTOC 3}\n\\end{align}\nWe permute the summation labels for the second term in Eq.~(\\ref{retarded OTOC 3}) \nas $k\\to l\\to m\\to n\\to k$ to 
obtain\n\\begin{align}\nC_{(AB)^2}^R(t,t')\n&=\n-i\\theta(t-t')\\frac{1}{Z}\\sum_{klmn}\n\\big[e^{-\\frac{\\beta}{2}(E_k+E_m)}-e^{-\\frac{\\beta}{2}(E_l+E_n)}\\big]\ne^{\\frac{i}{\\hbar}(E_k-E_l+E_m-E_n)(t-t')}\n\\notag\n\\\\\n&\\quad\\times\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle.\n\\end{align}\nBy using the expression for the Fourier transformation of the step function\n\\begin{align}\n\\theta(t)\n&=\n\\frac{i}{2\\pi}\\int_{-\\infty}^{\\infty} d\\omega' \\frac{e^{-i\\omega't}}{\\omega'+i\\delta}\n\\end{align}\nwith a positive infinitesimal constant $\\delta$,\nwe can Fourier transform the retarded OTOC as\n\\begin{align}\nC_{(AB)^2}^R(\\omega)\n&=\n\\int_{-\\infty}^{\\infty} d\\omega' \\frac{1}{\\omega'+i\\delta}\n\\frac{1}{Z}\\sum_{klmn}\n\\big[e^{-\\frac{\\beta}{2}(E_k+E_m)}-e^{-\\frac{\\beta}{2}(E_l+E_n)}\\big]\n\\notag\n\\\\\n&\\quad\\times\n\\delta(\\omega-\\omega'+\\tfrac{1}{\\hbar}(E_k-E_l+E_m-E_n))\n\\langle k|\\hat A|l\\rangle \\langle l|\\hat B|m\\rangle\n\\langle m|\\hat A|n\\rangle \\langle n|\\hat B|k\\rangle.\n\\end{align}\nOne notices that the same form of the spectral function $\\mathscr A_{(AB)^2}(\\omega)$ (\\ref{OTOC spectral function})\nhas appeared in the above expression. 
Thus, we find that the retarded OTOC has a spectral representation \n\\begin{align}\nC_{(AB)^2}^R(\\omega)\n&=\n\\int_{-\\infty}^{\\infty} d\\omega' \\frac{\\mathscr A_{(AB)^2}(\\omega')}{\\omega-\\omega'+i\\delta}.\n\\label{retarded OTOC Lehmann}\n\\end{align}\nOne can see that $C_{(AB)^2}^R(\\omega)$ is analytic in the upper half of the complex plane.\nIn the limit of $\\omega\\to\\infty$, it behaves as\n\\begin{align}\nC_{(AB)^2}^R(\\omega)\n&\\sim\n\\frac{c_{AB}}{\\omega}.\n\\label{1/omega}\n\\end{align}\nBy comparing Eqs.~(\\ref{matsubara OTOC Lehmann}) and (\\ref{retarded OTOC Lehmann}),\nwe prove that the imaginary-frequency function $C_{(AB)^2}^M(i\\varpi_n)$\ncan be analytically continued to the retarded OTOC $C_{(AB)^2}^R(\\omega)$\nthrough $i\\varpi_n\\to\\omega+i\\delta$,\n\\begin{align}\nC_{(AB)^2}^M(i\\varpi_n)\n&\\xrightarrow{i\\varpi_n\\to\\omega+i\\delta}\nC_{(AB)^2}^R(\\omega).\n\\end{align}\n\nSince $C_{(AB)^2}^R(\\omega)$ is analytic in the upper half plane and uniformly decays to zero as in Eq.~(\\ref{1/omega})\nfor $\\omega\\to\\infty$, it should satisfy the Kramers-Kronig relations,\n\\begin{align}\n{\\rm Re}\\, C_{(AB)^2}^R(\\omega)\n&=\n-\\frac{1}{\\pi}\\mathcal P\\int_{-\\infty}^{\\infty} d\\omega' \n\\frac{{\\rm Im}\\, C_{(AB)^2}^R(\\omega')}{\\omega-\\omega'},\n\\\\\n{\\rm Im}\\, C_{(AB)^2}^R(\\omega)\n&=\n\\frac{1}{\\pi}\\mathcal P\\int_{-\\infty}^{\\infty} d\\omega' \n\\frac{{\\rm Re}\\, C_{(AB)^2}^R(\\omega')}{\\omega-\\omega'}.\n\\end{align}\n\nWe also define the advanced OTOC as\n\\begin{align}\nC_{(AB)^2}^A(t,t')\n&\\equiv\ni\\theta(t'-t)C_{\\{A,B\\},[A,B]}(t,t')\n\\notag\n\\\\\n&=\ni\\theta(t'-t)[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle\n-\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle].\n\\label{advanced OTOC}\n\\end{align}\nIn the same way as for the retarded OTOC, the advanced OTOC has the spectral representation \n\\begin{align}\nC_{(AB)^2}^A(\\omega)\n&=\n\\int_{-\\infty}^{\\infty} d\\omega' 
\\frac{\\mathscr A_{(AB)^2}(\\omega')}{\\omega-\\omega'-i\\delta}.\n\\label{advanced OTOC Lehmann}\n\\end{align}\nHence the advanced OTOC $C_{(AB)^2}^A(\\omega)$ is analytic in the lower half plane.\nBy comparing Eq.~(\\ref{matsubara OTOC Lehmann}) and Eq.~(\\ref{advanced OTOC Lehmann}), we can see that\n$C_{(AB)^2}^A(\\omega)$ is obtained by analytic continuation from $C_{(AB)^2}^M(i\\varpi_n)$ via $i\\varpi_n\\to \\omega-i\\delta$.\nThe retarded and advanced OTOCs are related via\n\\begin{align}\nC_{(AB)^2}^R(\\omega)^\\ast\n&=\nC_{(B^\\dagger A^\\dagger)^2}^A(\\omega).\n\\end{align}\nIn the case of $\\hat B=\\hat A^\\dagger$, the spectral function $\\mathscr A_{(AB)^2}(\\omega)$\n(which is real in this case) is given by the imaginary part of the retarded OTOC,\n\\begin{align}\n\\mathscr A_{(AA^\\dagger)^2}(\\omega)\n&=\n-\\frac{1}{\\pi}{\\rm Im}\\, C_{(AA^\\dagger)^2}^R(\\omega).\n\\end{align}\n\nSo far, we have explained how to obtain the retarded and advanced OTOCs\nby analytic continuation of the imaginary-time four-point function $C_{(AB)^2}^M(i\\varpi_n)$.\nThis allows us to access $\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle-\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle$ [see Eqs.~(\\ref{retarded OTOC}) and (\\ref{advanced OTOC})].\nIn order to get the full information on OTOCs, we also need to calculate \nthe complementary part,\n$\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle+\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle$. 
This can be done by using the out-of-time-order fluctuation-dissipation theorem,\nwhich is the out-of-time-order extension of the conventional fluctuation-dissipation theorem, \nexpressed as \n\\begin{align}\nC_{AB}^K(\\omega)\n&=\n\\begin{cases}\n\\displaystyle\n\\coth\\left(\\frac{\\beta\\hbar\\omega}{2}\\right)[C_{AB}^R(\\omega)-C_{AB}^A(\\omega)] & \\mbox{either $\\hat A$ or $\\hat B$ is bosonic},\n\\\\\n\\displaystyle\n\\tanh\\left(\\frac{\\beta\\hbar\\omega}{2}\\right)[C_{AB}^R(\\omega)-C_{AB}^A(\\omega)] & \\mbox{both $\\hat A$ and $\\hat B$ are fermionic}.\n\\label{FD}\n\\end{cases}\n\\end{align}\nHere we have defined the Keldysh Green's function\n\\begin{align}\nC_{AB}^K(t,t')\n&=\n-i\\langle[\\hat A(t), \\hat B(t')]_\\pm \\rangle\n\\end{align}\nwith the sign $+$ taken if either $\\hat A$ or $\\hat B$ is bosonic\nand the sign $-$ taken if both $\\hat A$ and $\\hat B$ are fermionic.\nFollowing the analogy between the Green's functions and OTOCs,\nlet us define the ``Keldysh'' component of OTOCs as\n\\begin{align}\nC_{(AB)^2}^K(t,t')\n&\\equiv\n-\\frac{i}{2}[C_{\\{A,B\\}^2}(t,t')+C_{[A,B]^2}(t,t')]\n\\notag\n\\\\\n&=\n-i[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle\n+\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle].\n\\end{align}\nOne can see that $C_{(AB)^2}^K(t,t')$ is exactly the complementary part that we needed\nto reconstruct OTOCs from the imaginary-time data.\nThe out-of-time-order fluctuation-dissipation theorem has an analogous form\nto the conventional one,\n\\begin{align}\nC_{(AB)^2}^K(\\omega)\n&=\n\\coth\\left(\\frac{\\beta\\hbar\\omega}{4}\\right)[C_{(AB)^2}^R(\\omega)-C_{(AB)^2}^A(\\omega)].\n\\end{align}\nNote that the argument of the hyperbolic cotangent factor ($\\frac{\\beta\\hbar\\omega}{4}$) is just half of that\nfor the conventional fluctuation-dissipation theorem (\\ref{FD}).\nThe out-of-time-order fluctuation-dissipation theorem takes the same form\nfor arbitrary statistics (bosonic or fermionic) for the 
operators $\\hat A$ and $\\hat B$.\nIn Table~\\ref{table:comparison}, we list the definitions and properties of the Green's function and OTOC.\nOne can see a clear parallelism between the two \ntypes of correlation functions. \n\n\\begin{table}\n\\begin{tabular}{c|c|c}\n& Green's function & Out-of-time-ordered correlator (OTOC) \\\\\n\\hline\n\\hline\nMatsubara (time) & \n$C_{AB}^M(\\tau)=-\\langle \\hat A(\\tau)\\hat B(0)\\rangle$\n&\n$C_{(AB)^2}^M(\\tau)=-\\langle \\hat A(\\tau)\\hat B(0), \\hat A(\\tau)\\hat B(0) \\rangle$\n\\\\\n& ($0\\le\\tau\\le\\beta\\hbar$)\n& ($0\\le \\tau\\le \\frac{\\beta\\hbar}{2}$)\n\\\\\n& $C_{AB}^M(\\tau)=\\mp \\langle \\hat B(0)\\hat A(\\tau)\\rangle$\n& $C_{(AB)^2}^M(\\tau)=-\\langle \\hat B(0)\\hat A(\\tau), \\hat B(0)\\hat A(\\tau) \\rangle$\n\\\\\n& ($-\\beta\\hbar\\le\\tau<0$) & ($-\\tfrac{\\beta\\hbar}{2} \\le \\tau < 0$)\n\\\\\nperiodicity & $C_{AB}^M(\\tau+\\beta\\hbar)=\\pm C_{AB}^M(\\tau)$\n& $C_{(AB)^2}^M(\\tau+\\frac{\\beta\\hbar}{2})=C_{(AB)^2}^M(\\tau)$\n\\\\\nMatsubara (frequency) \n& $\\displaystyle\nC_{AB}^M(i\\omega_n)=\\int_0^{\\beta\\hbar}d\\tau e^{i\\omega_n\\tau}C_{AB}^M(\\tau)$\n& $\\displaystyle\nC_{(AB)^2}^M(i\\varpi_n)=\\int_0^{\\frac{\\beta\\hbar}{2}}d\\tau e^{i\\varpi_n\\tau}C_{(AB)^2}^M(\\tau)$\n\\\\\n& $\\omega_n=\n\\begin{cases}\n2n\\pi/\\beta\\hbar \\\\\n(2n+1)\\pi/\\beta\\hbar\n\\end{cases}\n(n\\in\\mathbb Z)$\n& $\\varpi_n=4n\\pi/\\beta\\hbar \\;\\;\\; (n\\in \\mathbb Z)$\n\\\\\nretarded \n& $C_{AB}^R(t,t')=-i\\theta(t-t')\\langle[\\hat A(t), \\hat B(t')]_\\mp\\rangle$\n& $C_{(AB)^2}^R(t,t')=-i\\theta(t-t')[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle$\n\\\\\n& &\n\\qquad $-\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle]$\n\\\\\nadvanced \n& $C_{AB}^A(t,t')=i\\theta(t'-t)\\langle[\\hat A(t), \\hat B(t')]_\\mp\\rangle$\n& $C_{(AB)^2}^A(t,t')=i\\theta(t'-t)[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle$\n\\\\\n& &\n\\qquad $-\\langle \\hat B(t')\\hat 
A(t), \\hat B(t')\\hat A(t)\\rangle]$\n\\\\\nKeldysh \n& $C_{AB}^K(t,t')=-i\\langle[\\hat A(t), \\hat B(t')]_\\pm\\rangle$\n& $C_{(AB)^2}^K(t,t')=-i[\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle$\n\\\\\n& &\n\\qquad $+\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle]$\n\\\\\nanalytic continuation & $C_{AB}^M(i\\omega_n)\\longrightarrow C_{AB}^R(\\omega)$\n& $C_{(AB)^2}^M(i\\varpi_n)\\longrightarrow C_{(AB)^2}^R(\\omega)$\n\\\\\n& ($i\\omega_n\\to\\omega+i\\delta$) & ($i\\varpi_n\\to\\omega+i\\delta$)\n\\\\\nFDT & $C_{AB}^K(\\omega)=\n\\coth\\left(\\frac{\\beta\\hbar\\omega}{2}\\right)^{\\pm 1}$\n& $C_{(AB)^2}^K(\\omega)=\n\\coth\\left(\\frac{\\beta\\hbar\\omega}{4}\\right)$\n\\\\\n& \\qquad\\qquad\\qquad\\quad $\\times[C_{AB}^R(\\omega)-C_{AB}^A(\\omega)]$\n& \\qquad\\qquad\\qquad\\qquad\\qquad $\\times[C_{(AB)^2}^R(\\omega)-C_{(AB)^2}^A(\\omega)]$\n\\end{tabular}\n\\caption{Comparison between Green's function and OTOC. \nFor the Green's function, the statistical average $\\langle \\hat X\\rangle\\equiv{\\rm Tr}(e^{-\\beta\\hat H}\\hat X)/Z$ is used,\nwhile for the OTOC, the bipartite statistical average $\\langle \\hat X, \\hat Y\\rangle\\equiv{\\rm Tr}(e^{-\\frac{\\beta}{2}\\hat H}\\hat X e^{-\\frac{\\beta}{2}\\hat H}\\hat Y)/Z$ is used.\nIn the Green's function column, the upper sign is taken when either $\\hat A$ or $\\hat B$ is bosonic,\nand the lower sign is taken when both $\\hat A$ and $\\hat B$ are fermionic.}\n\\label{table:comparison}\n\\end{table}\n\n\\begin{figure}[htbp]\n\\includegraphics[width=10cm]{analytic-continuation.eps}\n\\caption{The procedure to compute the out-of-time-ordered correlation function\n$\\langle A(t)B(0), A(t)B(0)\\rangle \\pm \\langle B(0)A(t), B(0)A(t)\\rangle$ from the imaginary-time data\nobtained by QMC calculations.}\n\\label{analytic continuation}\n\\end{figure}\n\nBy using the out-of-time-order fluctuation-dissipation theorem, we obtain\n$C_{(AB)^2}^K(\\omega)$ from 
$C_{(AB)^2}^R(\\omega)$ and $C_{(AB)^2}^A(\\omega)$.\nFinally, we perform the inverse Fourier transformation of $C_{(AB)^2}^K(\\omega)$\nand $C_{(AB)^2}^R(\\omega)-C_{(AB)^2}^A(\\omega)$ to derive\n$\\langle \\hat A(t)\\hat B(t'), \\hat A(t)\\hat B(t')\\rangle \\pm\n\\langle \\hat B(t')\\hat A(t), \\hat B(t')\\hat A(t)\\rangle$.\nWe summarize the procedure of deriving real-time OTOCs\nfrom the measurement of the imaginary-time four-point function \nin Fig.~\\ref{analytic continuation}.\n\n\n\n", "Descriptive_question1": "What type of correlation function is represented in the first column of table_1?", "Descriptive_question2": "What is the periodicity of the Matsubara Green's function in table_1 when considering the range τ + βℏ?", "Reasoning_question1": "How does the use of bipartite statistical average in OTOC, as shown in table_1, differ from the statistical average used in Green's function, and what might be the significance of this difference in capturing chaotic properties?", "Reasoning_question2": "Based on the information in table_1, why might the argument of the cotangent factor in the out-of-time-order fluctuation-dissipation theorem for OTOC be half of that in the conventional fluctuation-dissipation theorem for Green's function?", "Descriptive_answer1": "Green's function", "Descriptive_answer2": "periodic", "Reasoning_answer1": "The bipartite statistical average used in OTOC, defined as ⟨X, Y⟩ = Tr(e^{-βH/2}X e^{-βH/2}Y)/Z, differs from the statistical average in Green's function, defined as ⟨X⟩ = Tr(e^{-βH}X)/Z, by splitting the thermal density matrix into two halves. This separation in the imaginary-time direction regularizes the squared commutator in quantum field theory, avoiding divergences when operators approach each other closely in time. 
The significance of this difference lies in the ability of the bipartite average to isolate quantum fluctuations relevant to many-body chaos, as it connects to the Wigner-Yanase skew information, which quantifies information content of quantum fluctuations. Unlike the usual statistical average, which may include irrelevant scattering processes contributing to exponential growth, the bipartite average in OTOC better captures true chaotic properties, especially in semiclassical regimes where quantum fluctuations are suppressed, allowing for the observation of features like the butterfly effect.", "Reasoning_answer2": "In table_1, the cotangent factor in the out-of-time-order fluctuation-dissipation theorem (FDT) for OTOC is coth(βℏω/4), while for Green's function it is coth(βℏω/2) or tanh(βℏω/2) depending on bosonic or fermionic statistics. This halving of the argument likely stems from the bipartite nature of the statistical average in OTOC, where the thermal factor e^{-βH} is split into e^{-βH/2} for each part of the correlation. This effectively halves the thermal energy scale in the exponential weighting, reflecting a different thermal averaging structure. Additionally, since OTOC involves four-point correlations compared to the two-point correlations in Green's function, the effective energy or frequency scale in the thermal factor might be adjusted, resulting in the halved argument to account for the doubled operator interactions in the correlation structure." 
}, { "paper_id": "1511.03946.json", "table_id": "table_1", "table_content": "\\begin{table}\n\\begin{center}\n\\begin{tabular}{@{} l *3c @{}}\n\\toprule\n\\multicolumn{1}{c}{Calculation} & Number of & Configurations \\\\\n & levels & included \\\\\n \\midrule\n \\multicolumn{1}{c}{PBP} & 124 & 3s$^2$3p$^4$, 3s3p$^5$, 3p$^6$, 3s$^2$3p$^3$3d\\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^3$[4s, 4p, 4d] \\\\ \n \n \\multicolumn{1}{c}{} & & \\\\\n \n \\multicolumn{1}{c}{DARC1} & 209 & 3s$^2$3p$^4$, 3s3p$^5$, 3p$^6$, 3s$^2$3p$^3$3d\\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^3$[4s, 4p] + \\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^2$3d$^2$ + 3p$^5$3d \\\\\n \n \\multicolumn{1}{c}{} & & \\\\ \n \n \\multicolumn{1}{c}{DARC2} & 257 & DARC1 + \\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^3$[4d, 5s] \\\\\n\n \\multicolumn{1}{c}{} & & \\\\\n\n \\multicolumn{1}{c}{DARC3} & 557 & DARC1 + \\\\\n \\multicolumn{1}{c}{} & & 3s3p$^4$3d + 3s3p$^3$3d$^2$ \\\\\n \\multicolumn{1}{c}{} & & + 2s$^2$2p$^5$3s$^2$3p$^5$ \\\\\n\n \\bottomrule\n \\end{tabular}\n \\caption{The list of calculations performed throughout this paper are recorded and indexed for reference in the first column. The configurations and levels associated are also retained. \\label{tab:calculations}}\n \\end{center}\n\\end{table}", "caption": "The list of calculations performed throughout this paper are recorded and indexed for reference in the first column. The configurations and levels associated are also retained. 
\\label{tab:calculations}", "label": "tab:calculations", "section_info": "2 Structure model\n\\section{Structure model}\\label{sec:structure}\nThe photoionization processes of interest can be described by the following equations,\n\\begin{equation}\\label{eq:ground}\nh\\nu + (2p^63s^23p^5) ^2\\rm{P}^{o}_{3/2, 1/2} \\rightarrow Ar^{2+} + \\rm{e}^-\n\\end{equation}\n\\begin{equation}\\label{eq:excited}\nh\\nu + (2p^63s3p^6) ^2\\rm{S}^{e}_{1/2} \\rightarrow Ar^{2+} + \\rm{e}^-\n\\end{equation}\nwhere it is found that the dominant contributions to the total photoionization come from the Ar$^{2+}$ 3s$^2$3p$^4$ and 3s3p$^5$ levels. We have investigated two methods for generating an appropriate basis set expansion of the Ar {\\sc iii} ion. The first is carried out through a Breit-Pauli approach using the computer code {\\sc civ3} \\citep{1975CoPhC...9..141H, 1991CoPhC..64..455H}; the second uses the relativistic computer code {\\sc grasp0} \\citep{1996CoPhC..94..249P}. This stage of the calculation is crucially important, enabling an accurate representation of both the initial target and the residual ion to be constructed and incorporated into the \\textbf{R}-matrix method. 
\n\n\\begin{table}\n\\begin{center}\n\\begin{tabular}{@{} l *3c @{}}\n\\toprule\n\\multicolumn{1}{c}{Calculation} & Number of & Configurations \\\\\n & levels & included \\\\\n \\midrule\n \\multicolumn{1}{c}{PBP} & 124 & 3s$^2$3p$^4$, 3s3p$^5$, 3p$^6$, 3s$^2$3p$^3$3d\\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^3$[4s, 4p, 4d] \\\\ \n \n \\multicolumn{1}{c}{} & & \\\\\n \n \\multicolumn{1}{c}{DARC1} & 209 & 3s$^2$3p$^4$, 3s3p$^5$, 3p$^6$, 3s$^2$3p$^3$3d\\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^3$[4s, 4p] + \\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^2$3d$^2$ + 3p$^5$3d \\\\\n \n \\multicolumn{1}{c}{} & & \\\\ \n \n \\multicolumn{1}{c}{DARC2} & 257 & DARC1 + \\\\\n \\multicolumn{1}{c}{} & & 3s$^2$3p$^3$[4d, 5s] \\\\\n\n \\multicolumn{1}{c}{} & & \\\\\n\n \\multicolumn{1}{c}{DARC3} & 557 & DARC1 + \\\\\n \\multicolumn{1}{c}{} & & 3s3p$^4$3d + 3s3p$^3$3d$^2$ \\\\\n \\multicolumn{1}{c}{} & & + 2s$^2$2p$^5$3s$^2$3p$^5$ \\\\\n\n \\bottomrule\n \\end{tabular}\n \\caption{The list of calculations performed throughout this paper are recorded and indexed for reference in the first column. The configurations and levels associated are also retained. \\label{tab:calculations}}\n \\end{center}\n\\end{table}\n\n\\subsection{Breit-Pauli approach}\nWe employed an analytic Slater type orbital description for the bound orbitals up to 3p from the tables of \\citet{1974ADNDT..14..177C}. The computer package {\\sc civ3} was then utilized to extend this basis expansion by including the 3d, 4s, 4p and 4d orbitals. These additional orbitals have been optimised in an $LS\\pi$ coupling scheme on the lowest quintet states of the configurations 3s$^2$3p$^3$[3d, 4s, 4p, 4d] respectively. A total of 124 $J\\pi$ levels were included in the basis set with configurations from 3s$^2$3p$^4$, 3s3p$^5$, 3p$^6$ and 3s$^2$3p$^3$[3d, 4s, 4p, 4d]. Configuration-interaction terms are also included to account for additional correlation in each wavefunction. 
These configuration-interaction expansions of the target wavefunctions employ a Breit-Pauli approach through one-body perturbative corrections to the non-relativistic Hamiltonian operator. These corrections are described in full in the literature \\citep{1980JPhB...13.4299S} and carried through to be used consistently in the Breit-Pauli (PBP) \\textbf{R}-matrix method. This, the first of our Ar {\\sc iii} models, is labelled in Table \\ref{tab:calculations} as PBP.\n\n\\subsection{Relativistic approach}\nThe computer code {\\sc grasp0} has also been used to construct a bound orbital basis set for Ar {\\sc iii}. The method involves the Dirac-Coulomb Hamiltonian,\n\\[\nH_{D}= \\sum_i -ic\\bm{\\alpha} \\nabla_i + (\\bm{\\beta} - 1)c^2-\\frac{Z}{r_i} + \\sum_{i10$, throughout this work we consider predictions based on both the intrinsic luminosities and the dust attenuated luminosities.\n\nA consequence of the desire to fit observations of the $z\\sim 8$ far-UV luminosity function is the prediction that there exist a number of massive, heavily dust-obscured galaxies. The existence of these galaxies, which would not appear in Lyman-break selected samples, explains the discrepancy between predictions from \\bluetides\\ and current observational constraints on the galaxy stellar mass function and star formation rate distribution function (see \\S\\ref{sec:physical.GSMF} and \\S\\ref{sec:physical.SFRDF}). Unfortunately, the relative faintness and rarity of these objects mean they are unlikely to be identified in current IR/sub-mm observations. However, massive, heavily obscured, intensely star-forming galaxies have been identified at lower redshift \\citep[e.g. 
HFLS3 at $z=6.34$:][]{Riechers2013} suggesting that such objects can and do exist in the relatively early Universe.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/L_fesc_StellarMass.pdf}\n\\caption{The effective escape fraction of far-UV ($150\\,{\\rm nm}$) photons as a function of stellar mass. The median far-UV escape fractions in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:L_fesc}.}\n\\label{fig:L_fesc}\n\\end{figure}\n\n\n\n\n\\subsection{Spectral Energy Distributions}\\label{sec:photometric.SED}\n\n\nThe resulting average intrinsic (including nebular continuum and line emission) and observed specific\\footnote{That is, expressed per unit stellar mass.} spectral energy distributions are shown, for three mass bins at $z=8$, in Fig. \\ref{fig:SED_M}. \n\nThe average intrinsic SEDs are generally very blue, reflecting the ongoing star formation activity, young ages, and low metallicities in the sample. While the shape of the SEDs in each mass bin is very similar, the most massive galaxies have slightly redder SEDs reflecting the higher metallicity of the stellar populations. A more detailed analysis of the pure stellar and intrinsic SEDs is contained in \\citet{Wilkins2016c}. \n\nAs noted in the previous section, the most massive galaxies also suffer much higher attenuation due to dust resulting in redder observed SEDs and higher mass-to-light ratios. The trend of higher mass-to-light ratios at higher stellar mass can be seen more clearly in Fig. \\ref{fig:MTOL}. Fig. \\ref{fig:MTOL} also shows the evolution with redshift demonstrating that stellar mass-to-light ratios increase to lower redshift. 
This predominantly reflects the increasing age of the stellar populations to lower redshift.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/SED_M.pdf}\n\\caption{The average observed and unattenuated SEDs (expressed per unit stellar mass) in three mass bins at $z=8$.}\n\\label{fig:SED_M}\n\\end{figure*}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/MTOL.pdf}\n\\caption{The intrinsic and dust attenuated far-UV mass-to-light ratios as a function of stellar mass and redshift. The median intrinsic and observed far-UV mass-to-light ratios in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:MTOL}.}\n\\label{fig:MTOL}\n\\end{figure}\n\n\n\\subsection{Luminosity Functions}\\label{sec:photometric.UVLF}\n\nThe luminosity function (LF) is an incredibly useful statistical description of the galaxy population. In Fig. \\ref{fig:UVLF} we present both the intrinsic and dust attenuated far-UV luminosity functions at $z=8\\to 15$. In Fig. \\ref{fig:UVLF_multi} we show the intrinsic and attenuated UV LFs at $z\\in\\{8,9,10\\}$ together with current observational constraints. Both the intrinsic and observed luminosity functions demonstrate the expected rapid build-up of the galaxy population at high redshift. For example, the number of $M=-19$ objects increases by a factor of around 1000 from $z=15\\to 8$. The rapid decline of the LF to high redshift poses challenges for the observational identification of galaxy populations at $z>12$ even using \\jwst. This is explored in more detail in Wilkins et al. {\\em submitted}, where we make predictions for the surface density of sources at $z>8$ including the effects of field-to-field, or cosmic, variance.\n\nThe observed LF is generally similar to the intrinsic LF at faint luminosities ($M>-20$). At brighter luminosities there is stronger steepening of the LF, reflecting the increasing strength of dust attenuation. 
As noted earlier, our dust model is tuned to match the $z\\approx 8$ observed UV LF. However, it is important to stress that this only makes a significant difference at relatively bright luminosities ($M<-20$); at fainter luminosities there simply is not the surface density of metals (and therefore inferred dust) to yield significant attenuation. The excellent fit at fainter luminosities is then simply a consequence of the physics employed in the model and not a result of tuning the dust model. However, while the faint end of the LF is unaffected by our choice of dust model, it can be systematically affected by the choice of initial mass function (and to a lesser extent choice of SPS model); see \\citet{Wilkins2016c}. Adopting an IMF yielding more low-mass stars than our assumed IMF \\citep[e.g. a pure][IMF extended down to $0.1\\,{\\rm M_{\\odot}}$]{Salpeter1955} would uniformly reduce the luminosities of our galaxies, shifting the LF to fainter luminosities.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/UVLF.pdf}\n\\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\\,{\\rm nm}$) luminosity functions. Observations at $z\\approx 8$ and $10.4$ from Bouwens et al.\\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. Tabulated quantities of the \\bluetides\\ predictions are given in Table \\ref{tab:UVLF}.}\n\\label{fig:UVLF}\n\\end{figure*}\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/UVLF_multi.pdf}\n\\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\\,{\\rm nm}$) luminosity functions. Observations at $z\\approx 8$ and $10.4$ from Bouwens et al.\\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. 
Tabulated quantities of the \\bluetides\\ predictions are given in Table \\ref{tab:UVLF}.}\n\\label{fig:UVLF_multi}\n\\end{figure*}\n\nWe also fit the dust attenuated far-UV LF by a Schechter function and find that the function provides a good overall fit to the shape of the LF, as shown in Fig. \\ref{fig:UVLF_multi} at $z\\in\\{8,9,10\\}$. The evolution of the Schechter function parameters is shown in Fig. \\ref{fig:parameters_redshift} with the parameters listed in Table \\ref{tab:parameters_redshift} alongside various observational constraints at $z=4-10$. All three parameters decrease to higher redshift and overlap with observational constraints (and extrapolations from lower-redshift).\n\n\\begin{table}\n\\caption{Best fit Schechter function parameters for the observed UV luminosity function.}\n\\label{tab:parameters_redshift}\n\\begin{tabular}{cccc}\n\\hline\n$z$ & $M^{*}$ & $\\log_{10}(\\phi^{*}/{\\rm Mpc^{-3}})$ & $\\alpha$ \\\\\n\\hline\n13 & -19.91 & -5.71 & -2.54\\\\\n12 & -19.92 & -5.09 & -2.35\\\\\n11 & -20.17 & -4.79 & -2.27\\\\\n10 & -20.69 & -4.70 & -2.27\\\\\n9 & -20.68 & -4.20 & -2.10\\\\\n8 & -20.93 & -3.92 & -2.04\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/parameters_redshift.pdf}\n\\caption{Redshift evolution of the best fit Schechter function parameters of the simulated observed UV LF.}\n\\label{fig:parameters_redshift}\n\\end{figure}\n\n", "Descriptive_question1": "What is the value of M* at z=8 in table_1?", "Descriptive_question2": "What is the alpha value at z=13 in table_1?", "Reasoning_question1": "How does the characteristic luminosity M* change with increasing redshift from z=8 to z=13 in table_1, and what might this indicate about galaxy evolution at high redshifts?", "Reasoning_question2": "What trend can be observed in the parameter phi* from z=8 to z=13 in table_1, and what could this suggest about the number density of galaxies at these redshifts?", "Descriptive_answer1": "-20.93", "Descriptive_answer2": "-2.54", "Reasoning_answer1": "Looking at the data in table_1, the characteristic luminosity M* changes from -20.93 at z=8 to -19.91 at z=13. This indicates a trend of M* becoming less negative, or fainter, as redshift increases. Breaking this down step by step, at z=8, M* is -20.93, at z=9 it is -20.68, at z=10 it is -20.69, at z=11 it is -20.17, at z=12 it is -19.92, and finally at z=13 it is -19.91. 
This shows a general increase in the value of M* (becoming less negative), meaning that the characteristic luminosity of galaxies is decreasing, or they are becoming fainter with increasing redshift. This could indicate that at higher redshifts, corresponding to earlier times in the universe, galaxies were generally less luminous, possibly due to lower stellar masses or less intense star formation activity. This aligns with the idea of galaxy evolution where galaxies build up mass and luminosity over time, so at earlier epochs (higher z), the brightest galaxies are not as luminous as those at lower redshifts.", "Reasoning_answer2": "Analyzing the data in table_1, the parameter phi* (expressed as log10(phi*/Mpc^-3)) shows a clear trend from z=8 to z=13. At z=8, phi* is -3.92, at z=9 it is -4.20, at z=10 it is -4.70, at z=11 it is -4.79, at z=12 it is -5.09, and at z=13 it is -5.71. Step by step, this indicates a consistent decrease in the value of log10(phi*), which means that phi* itself (the characteristic number density of galaxies) is decreasing significantly with increasing redshift. Since phi* represents the normalization of the luminosity function, a decreasing phi* suggests that the number density of galaxies is much lower at higher redshifts. This could imply that in the early universe (higher z), there were fewer galaxies, which is consistent with the ongoing process of galaxy formation and the build-up of the galaxy population over cosmic time. The rapid decline in phi* indicates that the universe at z=13 had a much sparser population of galaxies compared to z=8." }, { "paper_id": "1704.00954.json", "table_id": "table_2", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the stellar to dark matter mass ratio used in Fig. 
\\ref{fig:DM_stellar}.}\n\\label{tab:DM_stellar}\n\\begin{tabular}{cccccccc}\n\\hline\n & \\multicolumn{7}{c}{$\\log_{10}(M_{\\rm DM}/{\\rm M_{\\odot}})=$} \\\\\n$z$ & $10.5$-$10.75$ & $10.75$-$11.0$ & $11.0$-$11.25$ & $11.25$-$11.5$ & $11.5$-$11.75$ & $11.75$-$12.0$ & $12.0$-$12.25$ \\\\\n\\hline\n & \\multicolumn{7}{c}{{\\bf stellar-to-dark matter mass ratio} - $\\log_{10}(M_*/M_{\\rm DM})$} \\\\\n\\hline\n 13.0 & $-2.75 $ & $-2.56 $ & - & - & - & - & -\\\\\n 12.0 & $-2.67 $ & $-2.54 $ & $-2.46 $ & - & - & - & -\\\\\n 11.0 & $-2.66 $ & $-2.52 $ & $-2.38 $ & $-2.28 $ & - & - & -\\\\\n 10.0 & $-2.6 $ & $-2.45 $ & $-2.3 $ & $-2.21 $ & - & - & -\\\\\n 9.0 & $-2.57 $ & $-2.4 $ & $-2.26 $ & $-2.11 $ & $-1.99 $ & $-2.05 $ & -\\\\\n 8.0 & $-2.52 $ & $-2.34 $ & $-2.18 $ & $-2.05 $ & $-1.89 $ & $-1.75 $ & $-1.87 $ \\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the stellar to dark matter mass ratio used in Fig. \\ref{fig:DM_stellar}.", "label": "tab:DM_stellar", "section_info": "3 Physical Properties\n\\section{Physical Properties}\\label{sec:physical}\n\n\\subsection{Dark Matter - Stellar Mass Connection}\\label{sec:physical.DM}\n\nWe begin by investigating the link between the dark matter and stellar masses of galaxies predicted by \\bluetides. In Fig. \\ref{fig:DM_stellar} we show the ratio of the stellar to dark matter masses of galaxies. This ratio increases to higher stellar mass (increasing by approximately 0.5 dex as the dark matter mass increases by 1 dex) and to lower redshift. The shape of this relationship broadly matches the extrapolation of the \\citet{Moster2013} abundance matching model, however there is a significant difference ($\\approx 0.4\\,{\\rm dex}$) in normalisation. In Fig. \\ref{fig:DM_stellar} we also compare our results to the \\citet{Behroozi2013} model this time finding a significant difference in both normalisation and shape (at $M_{h}>10^{11}\\,{\\rm M_{\\odot}}$). 
The exact reason for this is unclear but may reflect that the \\citet{Moster2013} and \\citet{Behroozi2013} models are calibrated at lower redshift, and thus rely on extrapolation to produce the high-redshift relationship.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/DM_stellar.pdf}\n\\caption{The ratio of the stellar to dark matter mass as a function of dark matter mass predicted by \\bluetides. The top panel shows the full distribution of sources at $z=8$ with large points denoting the median and the error-bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. Tabulated values of the median ratios are given in Table \\ref{tab:DM_stellar}.}\n\\label{fig:DM_stellar}\n\\end{figure}\n\n\n\n\\subsection{The Galaxy Stellar Mass Function}\\label{sec:physical.GSMF}\n\n\nThe galaxy stellar mass function (GSMF) predicted by \\bluetides\\ is shown in Fig. \\ref{fig:GSMF}. At $z=8$ \\bluetides\\ simulated a sufficiently large volume to robustly model the GSMF to stellar masses of $>10^{10}\\,{\\rm M_{\\odot}}$. From $z=15\\to 8$ the number of $>10^{8}\\,{\\rm M_{\\odot}}$ galaxies within the simulation increases from a handful at $z=15$ to almost 120,000 by $z=8$ demonstrating the rapid assembly of the galaxy population during this epoch. Over the period the shape of the GSMF also evolved, with the number density of massive galaxies increasing faster. For example, from $z=10\\to 8$ the number density of galaxies with $M_*\\approx 10^{9.5}\\,{\\rm M_{\\odot}}$ increased a factor of $\\approx 4\\times$ faster than those with $M_*\\approx 10^{8}\\,{\\rm M_{\\odot}}$.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/GSMF.pdf}\n\\caption{The galaxy stellar mass function predicted by \\bluetides\\ at $z\\in\\{8,9,10,11,12,13,14,15\\}$. 
The right-hand axis shows the total number of galaxies in \\bluetides\\ in each $\\Delta\\log_{10}M=0.2$ mass bin. The points and grey line show observational constraints from \\citet{Song2015} at $z\\approx 8$ corrected to assume a \\citet{Chabrier2003} initial mass function. The inset panel shows the number of objects with $\\log_{10}(M_*/{\\rm M_{\\odot}})>8$ in the simulation volume as a function of redshift $z=15\\to 8$. Tabulated quantities from \\bluetides\\ are given in Table \\ref{tab:GSMF}.}\n\\label{fig:GSMF}\n\\end{figure}\n\nIt is now possible, by combining deep {\\em Hubble} observations with {\\em Spitzer}/IRAC photometry, to probe the rest-frame UV-optical spectral energy distributions of galaxies at very-high redshift, and thus measure robust stellar masses and, in turn, the galaxy stellar mass function. \n\nWhile several studies have constrained the GSMF at very-high redshift \\citep{Gonzalez2011, Duncan2014,Grazian2015,Song2015}, only \\citet{Song2015} have extended observational measurements of the GSMF to $z\\approx 8$, overlapping with \\bluetides. The \\citet{Song2015} results are shown in Fig. \\ref{fig:GSMF} and closely match the \\bluetides\\ predictions over much of the simulated and observed mass range. The possible exception to this otherwise excellent agreement is at high masses $M_*>10^{10}\\,{\\rm M_{\\odot}}$, where \\bluetides\\ appears to predict more galaxies than are currently observed (although the observational uncertainties are very large). While this may reflect modelling issues, it is also likely there exist observational biases at these large masses. The most-massive systems are predicted to be heavily obscured, even at $z\\approx 8$, and may fall out of UV-selected samples.\n\nIt is also important to note that there are large differences between the observed GSMFs presented by different studies at very-high redshift. 
For example, despite using a largely overlapping set of observations, \\citet{Song2015} find number densities (at $M_{*}>10^{9}\\,{\\rm M_{\\odot}}$) almost an order of magnitude lower than \\citet{Duncan2014}; for a discussion of the many issues regarding observational estimates of the GSMF, see \\citet{Grazian2015} and \\citet{Song2015}. Observational estimates of the GSMF are sensitive to the choice of initial mass function (IMF). Assuming a \\citet{Salpeter1955} IMF for example would lead to observational mass estimates systematically increasing by approximately $0.17\\,{\\rm dex}$.\n\n\n\n\n\n\n\n\\subsection{The Star Formation Rate Distribution Function}\\label{sec:physical.SFRDF}\n\nAnother fundamental description of the galaxy population is the star formation rate (SFR) distribution function (SFR-DF). \\bluetides\\ predictions for the SFR-DF are shown, alongside observational constraints at $z\\in\\{4.9,6.8,7.9\\}$ from \\citet{Mashian2016}, in Fig. \\ref{fig:SFRDF}. The general shape of the predicted SFR-DF is similar to the galaxy stellar mass function and similarly lacks a strong break. However, the SFR-DF also evolves more slowly than the galaxy stellar mass function. The \\citet{Mashian2016} $z\\approx 7.9$ distribution function has both a higher normalisation at low SFRs and fewer high-SFR galaxies. The lack of high-SFR galaxies may again suggest a modelling issue, though it may also reflect an observational bias. This is discussed in more depth in \\S\\ref{sec:photometric.modelling.dust} where we present predictions for dust attenuation. \n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SFRDF.pdf}\n\\caption{The star formation rate distribution function predicted by \\bluetides. The right-hand axis shows the total number of galaxies in \\bluetides\\ in each $\\Delta\\log_{10}SFR=0.2$ bin. 
Solid lines show the dust-corrected (intrinsic) star formation rate distribution functions measured by \\citet{Mashian2016} at $z\\in\\{7.9, 6.8, 4.9\\}$. The \\citet{Mashian2016} curves are corrected to assume a \\citet{Chabrier2003} IMF using the calibrations proposed by \\citet{KE2012}. Tabulated quantities of the \\bluetides\\ predictions are given in Table \\ref{tab:SFRDF}.}\n\\label{fig:SFRDF}\n\\end{figure}\n\n\n\n\n\\subsection{Star Formation Histories}\\label{sec:physical.SFH}\n\n\nAt all the redshifts simulated by \\bluetides\\ the average star formation activity in galaxies is increasing rapidly, though the rate of this increase slows at later times. The average star formation histories of galaxies with stellar masses $>10^{8}\\,{\\rm M_{\\odot}}$ are shown in Fig. \\ref{fig:SFH}. Within the range probed by \\bluetides\\ there is little variation in the shape of the star formation history with stellar mass. This can also be seen in Figs. \\ref{fig:M_sSFR} and \\ref{fig:ages} where we show the average specific star formation and mean stellar ages in different mass bins. Both quantities show no correlation with stellar mass over the range to which we are sensitive, suggesting that star formation has not yet been quenched in these systems. The lack of quenching in our simulated galaxies is not entirely surprising as the mass range does not yet encompass many galaxies with $M_{h}>10^{12}\\,{\\rm M_{\\odot}}$ where inflows, and thus star formation, is expected to be suppressed \\citep[e.g.][]{Finlator2011}. It is worth noting, however, that there is a tentative indication of some suppression in the most massive halos; at $z=8$ there are not yet enough of them to give a clear picture.\n\nWhile there is no correlation with stellar mass, both the average specific star formation rate and mean stellar age evolve strongly with redshift. 
For example, from $z=14\\to 8$ average mass-weighted stellar ages increase from approximately $30\\to 90\\,{\\rm Myr}$ while specific star formation rates drop by around $0.5\\,{\\rm dex}$.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SFH.pdf}\n\\caption{The average star formation histories of galaxies with $M_{*}>10^{8}\\,{\\rm M_{\\odot}}$ at $z\\in\\{14,12,10,8\\}$. The figure shows the fraction of the total star formation occurring in each $\\Delta t=10\\,{\\rm Myr}$ age-bin.}\n\\label{fig:SFH}\n\\end{figure}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/M_sSFR.pdf}\n\\caption{The relationship between the specific star formation rate (${\\rm SFR}/M_{*}$) and stellar mass predicted for galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$ by \\bluetides. The top panel demonstrates the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. The median specific star formation rates in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:M_sSFR}\n\\end{figure}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/ages.pdf}\n\\caption{The relationship between the mean stellar age and stellar mass predicted for galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$ by \\bluetides. The top panel demonstrates the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. 
The median ages in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:ages}\n\\end{figure}\n\n\n\n\n\\subsection{Metal Enrichment}\\label{sec:physical.Z}\n\n\nAs galaxies assemble stellar mass in the simulation, the average metallicity of both the gas and stars increases. This can be seen in Fig. \\ref{fig:metallicities} where we show both the average mass-weighted stellar and star forming gas phase metallicity as a function of stellar mass. Metallicity increases with stellar mass as ${\\rm d}\\log_{10}Z/{\\rm d}\\log_{10}M_*\\approx 0.4$. This trend is similar to observational measurements, using rest-frame optical strong line diagnostics, from \\citet{Maiolino2008} (at $z\\approx 3.5$) and \\citet{Mannucci2009} (at $z\\approx 3.1$). The normalisation of the simulated mass-metallicity relationship at $z\\approx 8$ is also similar to that found at $z\\sim 3$ by \\citet{Maiolino2008} and \\citet{Mannucci2009} using rest-frame optical diagnostics and at $z\\sim 5$ by \\citet{Faisst2016} using rest-UV absorption complexes. \n\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/metallicities.pdf}\n\\caption{The stellar (light points) and star forming gas (dark points) metallicities of galaxies in \\bluetides. The 2D histogram in the top panel shows all objects with $M>10^{8}\\,{\\rm M_{\\odot}}$ at $z=8$. Points denote the median and central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. Observational constraints from \\citet{Maiolino2008}, \\citet{Mannucci2009}, and \\citet{Faisst2016} at $z\\sim 3.5$, $z\\sim 3.1$, and $z\\sim 5$, respectively, are also shown. Observational measurements of the stellar mass assume a \\citet{Chabrier2003} initial mass function and metallicities were converted to a mass-fraction assuming $12+\\log_{10}(O/H)_{\\odot} = 8.69$ and $Z_{\\odot}=0.02$. 
The median metallicities in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:metallicities}\n\\end{figure}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n3.1 Dark Matter - Stellar Mass Connection\n\\subsection{Dark Matter - Stellar Mass Connection}\\label{sec:physical.DM}\n\nWe begin by investigating the link between the dark matter and stellar masses of galaxies predicted by \\bluetides. In Fig. \\ref{fig:DM_stellar} we show the ratio of the stellar to dark matter masses of galaxies. This ratio increases to higher stellar mass (increasing by approximately 0.5 dex as the dark matter mass increases by 1 dex) and to lower redshift. The shape of this relationship broadly matches the extrapolation of the \\citet{Moster2013} abundance matching model, however there is a significant difference ($\\approx 0.4\\,{\\rm dex}$) in normalisation. In Fig. \\ref{fig:DM_stellar} we also compare our results to the \\citet{Behroozi2013} model this time finding a significant difference in both normalisation and shape (at $M_{h}>10^{11}\\,{\\rm M_{\\odot}}$). The exact reason for this is unclear but may reflect that the \\citet{Moster2013} and \\citet{Behroozi2013} models are calibrated at lower redshift, and thus rely on extrapolation to produce the high-redshift relationship.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/DM_stellar.pdf}\n\\caption{The ratio of the stellar to dark matter mass as a function of dark matter mass predicted by \\bluetides. The top panel shows the full distribution of sources at $z=8$ with large points denoting the median and the error-bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. 
Tabulated values of the median ratios are given in Table \\ref{tab:DM_stellar}.}\n\\label{fig:DM_stellar}\n\\end{figure}\n\n\n\n", "Descriptive_question1": "What is the redshift value focused on in the top panel of table_2?", "Descriptive_question2": "What statistical measure is represented by large points in table_2?", "Reasoning_question1": "How does the ratio of stellar to dark matter mass change with increasing dark matter mass according to table_2, and what might this imply about galaxy formation?", "Reasoning_question2": "Why might there be a difference in normalization and shape when comparing the results in table_2 to the Behroozi2013 model at high dark matter masses?", "Descriptive_answer1": "8", "Descriptive_answer2": "median", "Reasoning_answer1": "According to the context provided with table_2, the ratio of stellar to dark matter mass increases with increasing dark matter mass, specifically by approximately 0.5 dex for every 1 dex increase in dark matter mass. This suggests a positive correlation between the two masses. To understand this trend, consider that as dark matter mass increases, it provides a deeper gravitational potential well, which can trap more baryonic matter, leading to increased stellar mass formation. This implies that in the early universe, galaxies with higher dark matter content are likely to host more significant stellar populations, reflecting an essential aspect of galaxy formation where dark matter acts as a scaffold for baryonic matter assembly. Additionally, this trend might indicate that the efficiency of star formation could be influenced by the total mass of the dark matter halo, pointing to a key physical mechanism in galaxy evolution models.", "Reasoning_answer2": "The difference in normalization and shape when comparing the results in table_2 to the Behroozi2013 model at high dark matter masses (above 10^11 solar masses) could stem from multiple factors. 
First, consider that the Behroozi2013 model is calibrated at lower redshifts, whereas table_2 reflects data from high redshifts as simulated by Bluetides. This discrepancy in redshift calibration means the Behroozi2013 model relies on extrapolation to predict high-redshift relationships, which may not accurately capture the physical conditions or evolutionary processes at those early epochs. Second, the normalization difference of about 0.4 dex suggests a systematic offset in the predicted stellar-to-dark matter mass ratios, possibly due to different assumptions about star formation efficiency or feedback processes in the models. Lastly, the shape difference might indicate variations in how stellar mass accumulates relative to dark matter mass at high masses, potentially influenced by unique high-redshift environmental factors or simulation-specific physics not accounted for in the Behroozi2013 model. This highlights the challenges in modeling galaxy formation across cosmic time and the need for models tailored to specific redshift regimes." }, { "paper_id": "1704.00954.json", "table_id": "table_3", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the galaxy stellar mass function (GSMF) used in Fig. \\ref{fig:GSMF}. 
}\n\\label{tab:GSMF}\n\\begin{tabular}{ccccccc}\n\\hline\n & \\multicolumn{6}{c}{$\\log_{10}(\\phi/{\\rm dex^{-1}Mpc^{-3}})$} \\\\\n \\hline\n$\\log_{10}(M_*/{\\rm M_{\\odot}})$ & $z=13$ & $z=12$ & $z=11$ & $z=10$ & $z=9$ & $z=8$ \\\\\n\\hline\n $ 8.0 $ - $8.2 $ & $ -5.61 $ & $ -4.94 $ & $ -4.31 $ & $ -3.72 $ & $ -3.22 $ & $ -2.76 $ \\\\\n $ 8.2 $ - $8.4 $ & $ -6.02 $ & $ -5.3 $ & $ -4.64 $ & $ -4.01 $ & $ -3.45 $ & $ -2.97 $ \\\\\n $ 8.4 $ - $8.6 $ & $ -6.4 $ & $ -5.7 $ & $ -4.97 $ & $ -4.3 $ & $ -3.71 $ & $ -3.19 $ \\\\\n $ 8.6 $ - $8.8 $ & - & $ -6.14 $ & $ -5.35 $ & $ -4.61 $ & $ -3.98 $ & $ -3.42 $ \\\\\n $ 8.8 $ - $9.0 $ & - & $ -6.53 $ & $ -5.7 $ & $ -4.96 $ & $ -4.27 $ & $ -3.66 $ \\\\\n $ 9.0 $ - $9.2 $ & - & - & $ -6.21 $ & $ -5.34 $ & $ -4.58 $ & $ -3.92 $ \\\\\n $ 9.2 $ - $9.4 $ & - & - & - & $ -5.65 $ & $ -4.93 $ & $ -4.2 $ \\\\\n $ 9.4 $ - $9.6 $ & - & - & - & $ -6.11 $ & $ -5.28 $ & $ -4.5 $ \\\\\n $ 9.6 $ - $9.8 $ & - & - & - & - & $ -5.65 $ & $ -4.87 $ \\\\\n $ 9.8 $ - $10.0 $ & - & - & - & - & $ -6.07 $ & $ -5.17 $ \\\\\n $ 10.0 $ - $10.2 $ & - & - & - & - & - & $ -5.59 $ \\\\\n $ 10.2 $ - $10.4 $ & - & - & - & - & - & $ -5.96 $ \\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the galaxy stellar mass function (GSMF) used in Fig. \\ref{fig:GSMF}. ", "label": "tab:GSMF", "section_info": "3 Physical Properties\n\\section{Physical Properties}\\label{sec:physical}\n\n\\subsection{Dark Matter - Stellar Mass Connection}\\label{sec:physical.DM}\n\nWe begin by investigating the link between the dark matter and stellar masses of galaxies predicted by \\bluetides. In Fig. \\ref{fig:DM_stellar} we show the ratio of the stellar to dark matter masses of galaxies. This ratio increases to higher stellar mass (increasing by approximately 0.5 dex as the dark matter mass increases by 1 dex) and to lower redshift. 
The shape of this relationship broadly matches the extrapolation of the \\citet{Moster2013} abundance matching model, however there is a significant difference ($\\approx 0.4\\,{\\rm dex}$) in normalisation. In Fig. \\ref{fig:DM_stellar} we also compare our results to the \\citet{Behroozi2013} model this time finding a significant difference in both normalisation and shape (at $M_{h}>10^{11}\\,{\\rm M_{\\odot}}$). The exact reason for this is unclear but may reflect that the \\citet{Moster2013} and \\citet{Behroozi2013} models are calibrated at lower redshift, and thus rely on extrapolation to produce the high-redshift relationship.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/DM_stellar.pdf}\n\\caption{The ratio of the stellar to dark matter mass as a function of dark matter mass predicted by \\bluetides. The top panel shows the full distribution of sources at $z=8$ with large points denoting the median and the error-bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. Tabulated values of the median ratios are given in Table \\ref{tab:DM_stellar}.}\n\\label{fig:DM_stellar}\n\\end{figure}\n\n\n\n\\subsection{The Galaxy Stellar Mass Function}\\label{sec:physical.GSMF}\n\n\nThe galaxy stellar mass function (GSMF) predicted by \\bluetides\\ is shown in Fig. \\ref{fig:GSMF}. At $z=8$ \\bluetides\\ simulated a sufficiently large volume to robustly model the GSMF to stellar masses of $>10^{10}\\,{\\rm M_{\\odot}}$. From $z=15\\to 8$ the number of $>10^{8}\\,{\\rm M_{\\odot}}$ galaxies within the simulation increases from a handful at $z=15$ to almost 120,000 by $z=8$ demonstrating the rapid assembly of the galaxy population during this epoch. Over the period the shape of the GSMF also evolved, with the number density of massive galaxies increasing faster. 
For example, from $z=10\\to 8$ the number density of galaxies with $M_*\\approx 10^{9.5}\\,{\\rm M_{\\odot}}$ increased a factor of $\\approx 4\\times$ faster than those with $M_*\\approx 10^{8}\\,{\\rm M_{\\odot}}$.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/GSMF.pdf}\n\\caption{The galaxy stellar mass function predicted by \\bluetides\\ at $z\\in\\{8,9,10,11,12,13,14,15\\}$. The right-hand axis shows the total number of galaxies in \\bluetides\\ in each $\\Delta\\log_{10}M=0.2$ mass bin. The points and grey line show observational constraints from \\citet{Song2015} at $z\\approx 8$ corrected to assume a \\citet{Chabrier2003} initial mass function. The inset panel shows the number of objects with $\\log_{10}(M_*/{\\rm M_{\\odot}})>8$ in the simulation volume as a function of redshift $z=15\\to 8$. Tabulated quantities from \\bluetides\\ are given in Table \\ref{tab:GSMF}.}\n\\label{fig:GSMF}\n\\end{figure}\n\nIt is now possible, by combining deep {\\em Hubble} observations with {\\em Spitzer}/IRAC photometry, to probe the rest-frame UV-optical spectral energy distributions of galaxies at very-high redshift, and thus measure robust stellar masses and, in turn, the galaxy stellar mass function. \n\nWhile several studies have constrained the GSMF at very-high redshift \\citep{Gonzalez2011, Duncan2014,Grazian2015,Song2015}, only \\citet{Song2015} have extended observational measurements of the GSMF to $z\\approx 8$, overlapping with \\bluetides. The \\citet{Song2015} results are shown in Fig. \\ref{fig:GSMF} and closely match the \\bluetides\\ predictions over much of the simulated and observed mass range. The possible exception to this otherwise excellent agreement is at high masses $M_*>10^{10}\\,{\\rm M_{\\odot}}$, where \\bluetides\\ appears to predict more galaxies than are currently observed (although the observational uncertainties are very large). 
While this may reflect modelling issues, it is also likely there exist observational biases at these large masses. The most-massive systems are predicted to be heavily obscured, even at $z\\approx 8$, and may fall out of UV-selected samples.\n\nIt is also important to note that there are large differences between the observed GSMFs presented by different studies at very-high redshift. For example, despite using a largely overlapping set of observations, \\citet{Song2015} find number densities (at $M_{*}>10^{9}\\,{\\rm M_{\\odot}}$) almost an order of magnitude lower than \\citet{Duncan2014}; for a discussion of the many issues regarding observational estimates of the GSMF, see \\citet{Grazian2015} and \\citet{Song2015}. Observational estimates of the GSMF are sensitive to the choice of initial mass function (IMF). Assuming a \\citet{Salpeter1955} IMF for example would lead to observational mass estimates systematically increasing by approximately $0.17\\,{\\rm dex}$.\n\n\n\n\n\n\n\n\\subsection{The Star Formation Rate Distribution Function}\\label{sec:physical.SFRDF}\n\nAnother fundamental description of the galaxy population is the star formation rate (SFR) distribution function (SFR-DF). \\bluetides\\ predictions for the SFR-DF are shown, alongside observational constraints at $z\\in\\{4.9,6.8,7.9\\}$ from \\citet{Mashian2016}, in Fig. \\ref{fig:SFRDF}. The general shape of the predicted SFR-DF is similar to the galaxy stellar mass function and similarly lacks a strong break. However, the SFR-DF also evolves more slowly than the galaxy stellar mass function. The \\citet{Mashian2016} $z\\approx 7.9$ distribution function has both a higher normalisation at low SFRs and fewer high-SFR galaxies. The lack of high-SFR galaxies may again suggest a modelling issue, though it may also reflect an observational bias. This is discussed in more depth in \\S\\ref{sec:photometric.modelling.dust} where we present predictions for dust attenuation. 
\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SFRDF.pdf}\n\\caption{The star formation rate distribution function predicted by \\bluetides. The right-hand axis shows the total number of galaxies in \\bluetides\\ in each $\\Delta\\log_{10}SFR=0.2$ bin. Solid lines show the dust-corrected (intrinsic) star formation rate distribution functions measured by \\citet{Mashian2016} at $z\\in\\{7.9, 6.8, 4.9\\}$. The \\citet{Mashian2016} curves are corrected to assume a \\citet{Chabrier2003} IMF using the calibrations proposed by \\citet{KE2012}. Tabulated quantities of the \\bluetides\\ predictions are given in Table \\ref{tab:SFRDF}.}\n\\label{fig:SFRDF}\n\\end{figure}\n\n\n\n\n\\subsection{Star Formation Histories}\\label{sec:physical.SFH}\n\n\nAt all the redshifts simulated by \\bluetides\\ the average star formation activity in galaxies is increasing rapidly, though the rate of this increase slows at later times. The average star formation histories of galaxies with stellar masses $>10^{8}\\,{\\rm M_{\\odot}}$ are shown in Fig. \\ref{fig:SFH}. Within the range probed by \\bluetides\\ there is little variation in the shape of the star formation history with stellar mass. This can also be seen in Figs. \\ref{fig:M_sSFR} and \\ref{fig:ages} where we show the average specific star formation and mean stellar ages in different mass bins. Both quantities show no correlation with stellar mass over the range to which we are sensitive, suggesting that star formation has not yet been quenched in these systems. The lack of quenching in our simulated galaxies is not entirely surprising as the mass range does not yet encompass many galaxies with $M_{h}>10^{12}\\,{\\rm M_{\\odot}}$ where inflows, and thus star formation, is expected to be suppressed \\citep[e.g.][]{Finlator2011}. 
It is worth noting, however, that there is a tentative indication of some suppression in the most massive halos; at $z=8$ there are not yet enough of them to give a clear picture.\n\nWhile there is no correlation with stellar mass, both the average specific star formation rate and mean stellar age evolve strongly with redshift. For example, from $z=14\\to 8$ average mass-weighted stellar ages increase from approximately $30\\to 90\\,{\\rm Myr}$ while specific star formation rates drop by around $0.5\\,{\\rm dex}$.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SFH.pdf}\n\\caption{The average star formation histories of galaxies with $M_{*}>10^{8}\\,{\\rm M_{\\odot}}$ at $z\\in\\{14,12,10,8\\}$. The figure shows the fraction of the total star formation occurring in each $\\Delta t=10\\,{\\rm Myr}$ age-bin.}\n\\label{fig:SFH}\n\\end{figure}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/M_sSFR.pdf}\n\\caption{The relationship between the specific star formation rate (${\\rm SFR}/M_{*}$) and stellar mass predicted for galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$ by \\bluetides. The top panel demonstrates the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. The median specific star formation rates in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:M_sSFR}\n\\end{figure}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/ages.pdf}\n\\caption{The relationship between the mean stellar age and stellar mass predicted for galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$ by \\bluetides. The top panel demonstrates the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\\%$ range. 
The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. The median ages in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:ages}\n\\end{figure}\n\n\n\n\n\\subsection{Metal Enrichment}\\label{sec:physical.Z}\n\n\nAs galaxies assemble stellar mass in the simulation, the average metallicity of both the gas and stars increases. This can be seen in Fig. \\ref{fig:metallicities} where we show both the average mass-weighted stellar and star forming gas phase metallicity as a function of stellar mass. Metallicity increases with stellar mass as ${\\rm d}\\log_{10}Z/{\\rm d}\\log_{10}M_*\\approx 0.4$. This trend is similar to observational measurements, using rest-frame optical strong line diagnostics, from \\citet{Maiolino2008} (at $z\\approx 3.5$) and \\citet{Mannucci2009} (at $z\\approx 3.1$). The normalisation of the simulated mass-metallicity relationship at $z\\approx 8$ is also similar to that found at $z\\sim 3$ by \\citet{Maiolino2008} and \\citet{Mannucci2009} using rest-frame optical diagnostics and at $z\\sim 5$ by \\citet{Faisst2016} using rest-UV absorption complexes. \n\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/metallicities.pdf}\n\\caption{The stellar (light points) and star forming gas (dark points) metallicities of galaxies in \\bluetides. The 2D histogram in the top panel shows all objects with $M>10^{8}\\,{\\rm M_{\\odot}}$ at $z=8$. Points denote the median and central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. Observational constraints from \\citet{Maiolino2008}, \\citet{Mannucci2009}, and \\citet{Faisst2016} at $z\\sim 3.5$, $z\\sim 3.1$, and $z\\sim 5$, respectively, are also shown. 
Observational measurements of the stellar mass assume a \\citet{Chabrier2003} initial mass function and metallicities were converted to a mass-fraction assuming $12+\\log_{10}(O/H)_{\\odot} = 8.69$ and $Z_{\\odot}=0.02$. The median metallicities in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:metallicities}\n\\end{figure}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n3.2 The Galaxy Stellar Mass Function\n\\subsection{The Galaxy Stellar Mass Function}\\label{sec:physical.GSMF}\n\n\nThe galaxy stellar mass function (GSMF) predicted by \\bluetides\\ is shown in Fig. \\ref{fig:GSMF}. At $z=8$ \\bluetides\\ simulated a sufficiently large volume to robustly model the GSMF to stellar masses of $>10^{10}\\,{\\rm M_{\\odot}}$. From $z=15\\to 8$ the number of $>10^{8}\\,{\\rm M_{\\odot}}$ galaxies within the simulation increases from a handful at $z=15$ to almost 120,000 by $z=8$ demonstrating the rapid assembly of the galaxy population during this epoch. Over the period the shape of the GSMF also evolved, with the number density of massive galaxies increasing faster. For example, from $z=10\\to 8$ the number density of galaxies with $M_*\\approx 10^{9.5}\\,{\\rm M_{\\odot}}$ increased a factor of $\\approx 4\\times$ faster than those with $M_*\\approx 10^{8}\\,{\\rm M_{\\odot}}$.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/GSMF.pdf}\n\\caption{The galaxy stellar mass function predicted by \\bluetides\\ at $z\\in\\{8,9,10,11,12,13,14,15\\}$. The right-hand axis shows the total number of galaxies in \\bluetides\\ in each $\\Delta\\log_{10}M=0.2$ mass bin. The points and grey line show observational constraints from \\citet{Song2015} at $z\\approx 8$ corrected to assume a \\citet{Chabrier2003} initial mass function. The inset panel shows the number of objects with $\\log_{10}(M_*/{\\rm M_{\\odot}})>8$ in the simulation volume as a function of redshift $z=15\\to 8$. 
Tabulated quantities from \bluetides\ are given in Table \ref{tab:GSMF}.}\n\label{fig:GSMF}\n\end{figure}\n\nIt is now possible, by combining deep {\em Hubble} observations with {\em Spitzer}/IRAC photometry, to probe the rest-frame UV-optical spectral energy distributions of galaxies at very-high redshift, and thus measure robust stellar masses and, in turn, the galaxy stellar mass function. \n\nWhile several studies have constrained the GSMF at very-high redshift \citep{Gonzalez2011, Duncan2014, Grazian2015, Song2015}, only \citet{Song2015} have extended observational measurements of the GSMF to $z\approx 8$, overlapping with \bluetides. The \citet{Song2015} results are shown in Fig. \ref{fig:GSMF} and closely match the \bluetides\ predictions over much of the simulated and observed mass range. The possible exception to this otherwise excellent agreement is at high masses ($M_*>10^{10}\,{\rm M_{\odot}}$), where \bluetides\ appears to predict more galaxies than are currently observed (although the observational uncertainties are very large). While this may reflect modelling issues, it is also likely that there exist observational biases at these large masses. The most-massive systems are predicted to be heavily obscured, even at $z\approx 8$, and may fall out of UV-selected samples.\n\nIt is also important to note that there are large differences between the observed GSMFs presented by different studies at very-high redshift. For example, despite using a largely overlapping set of observations, \citet{Song2015} find number densities (at $M_{*}>10^{9}\,{\rm M_{\odot}}$) almost an order of magnitude lower than \citet{Duncan2014}; for a discussion of the many issues regarding observational estimates of the GSMF see \citet{Grazian2015} and \citet{Song2015}. Observational estimates of the GSMF are sensitive to the choice of initial mass function (IMF). 
Assuming a \\citet{Salpeter1955} IMF for example would lead to observational mass estimates systematically increasing by approximately $0.17\\,{\\rm dex}$.\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the highest redshift value shown in table_3 for the galaxy stellar mass function?", "Descriptive_question2": "At which stellar mass does table_3 report data for z=8?", "Reasoning_question1": "How does the number density of massive galaxies change compared to less massive ones from z=10 to z=8 in table_3, and what does this suggest about galaxy assembly during this period?", "Reasoning_question2": "What might explain the discrepancy between the Bluetides predictions and observational data at high stellar masses (>10^10 M_{\\odot}) in table_3, and how could observational biases affect this comparison?", "Descriptive_answer1": "z=13", "Descriptive_answer2": "10.0-10.2", "Reasoning_answer1": "To analyze the change in number density of galaxies from z=10 to z=8, I first looked at the values of log10(φ/dex^-1 Mpc^-3) in table_3 for different stellar mass ranges. For massive galaxies, around M_* ≈ 10^9.5 M_{\\odot} (stellar mass bin 9.4-9.6), the value changes from -6.11 at z=10 to -4.5 at z=8, an increase in number density by a factor of approximately 10^(6.11-4.5) = 10^1.61 ≈ 40. For less massive galaxies, around M_* ≈ 10^8 M_{\\odot} (stellar mass bin 8.0-8.2), the value changes from -3.72 at z=10 to -2.76 at z=8, an increase by a factor of 10^(3.72-2.76) = 10^0.96 ≈ 9.1. This shows that the number density of massive galaxies increased roughly 4 times faster than that of less massive galaxies (40/9.1 ≈ 4.4). 
This disparity suggests that during the period from z=10 to z=8, the assembly of massive galaxies was significantly more rapid, indicating a preferential build-up of larger structures, possibly due to more efficient merging or accretion processes at higher masses during this epoch of rapid galaxy formation.", "Reasoning_answer2": "Examining the discrepancy between Bluetides predictions and observational data at high stellar masses (>10^10 M_{\\odot}), I note from the context that Bluetides predicts more galaxies than observed in this range, as seen in table_3 where data for z=8 extends to 10.2-10.4 with a value of -5.96, indicating a significant predicted number density. One potential explanation for this discrepancy could be modeling issues within Bluetides, such as overestimation of star formation or merging rates at high masses, which might not fully account for physical processes like feedback mechanisms that could suppress galaxy formation. Additionally, observational biases likely play a significant role; the context suggests that the most massive systems at z≈8 are heavily obscured by dust, causing them to drop out of UV-selected samples used in observational studies like Song2015. This means that current observational data might undercount these massive galaxies, leading to an apparent discrepancy. Future observations with instruments sensitive to obscured systems, or improved dust modeling in simulations, could help resolve this difference." }, { "paper_id": "1704.00954.json", "table_id": "table_4", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the star formation rate distribution function (SFRDF) used in Fig. 
\\ref{fig:SFRDF}.}\n\\label{tab:SFRDF}\n\\begin{tabular}{ccccccc}\n\\hline\n & \\multicolumn{6}{c}{$\\log_{10}(\\phi/{\\rm dex^{-1}Mpc^{-3}})$} \\\\\n \\hline\n$\\log_{10}({\\rm SFR}/{\\rm M_{\\odot}\\,yr^{-1}})$ & $z=13$ & $z=12$ & $z=11$ & $z=10$ & $z=9$ & $z=8$ \\\\\n\\hline\n $ -0.6 $ - $ -0.4 $ & $-3.93 $ & $-3.5 $ & $-3.11 $ & $-2.8 $ & $-2.51 $ & $-2.22 $\\\\\n $ -0.4 $ - $ -0.2 $ & $-4.21 $ & $-3.77 $ & $-3.36 $ & $-3.02 $ & $-2.71 $ & $-2.41 $\\\\\n $ -0.2 $ - $ 0.0 $ & $-4.54 $ & $-4.07 $ & $-3.62 $ & $-3.26 $ & $-2.93 $ & $-2.61 $\\\\\n $ 0.0 $ - $ 0.2 $ & $-4.92 $ & $-4.42 $ & $-3.93 $ & $-3.52 $ & $-3.16 $ & $-2.8 $\\\\\n $ 0.2 $ - $ 0.4 $ & $-5.28 $ & $-4.72 $ & $-4.22 $ & $-3.78 $ & $-3.39 $ & $-3.01 $\\\\\n $ 0.4 $ - $ 0.6 $ & $-5.67 $ & $-5.13 $ & $-4.54 $ & $-4.06 $ & $-3.64 $ & $-3.22 $\\\\\n $ 0.6 $ - $ 0.8 $ & $-6.04 $ & $-5.49 $ & $-4.89 $ & $-4.39 $ & $-3.9 $ & $-3.46 $\\\\\n $ 0.8 $ - $ 1.0 $ & $-6.4 $ & $-5.82 $ & $-5.21 $ & $-4.69 $ & $-4.14 $ & $-3.7 $\\\\\n $ 1.0 $ - $ 1.2 $ & - & $-6.23 $ & $-5.63 $ & $-5.05 $ & $-4.44 $ & $-3.97 $\\\\\n $ 1.2 $ - $ 1.4 $ & - & - & $-5.98 $ & $-5.41 $ & $-4.79 $ & $-4.25 $\\\\\n $ 1.4 $ - $ 1.6 $ & - & - & $-6.32 $ & $-5.77 $ & $-5.12 $ & $-4.57 $\\\\\n $ 1.6 $ - $ 1.8 $ & - & - & - & $-6.21 $ & $-5.44 $ & $-4.91 $\\\\\n $ 1.8 $ - $ 2.0 $ & - & - & - & - & $-5.91 $ & $-5.28 $\\\\\n $ 2.0 $ - $ 2.2 $ & - & - & - & - & $-6.34 $ & $-5.65 $\\\\\n $ 2.2 $ - $ 2.4 $ & - & - & - & - & - & $-6.12 $\\\\\n $ 2.4 $ - $ 2.6 $ & - & - & - & - & - & $-6.57 $\\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the star formation rate distribution function (SFRDF) used in Fig. 
\ref{fig:SFRDF}.", "label": "tab:SFRDF", "section_info": "3 Physical Properties\n\section{Physical Properties}\label{sec:physical}\n\n\subsection{Dark Matter - Stellar Mass Connection}\label{sec:physical.DM}\n\nWe begin by investigating the link between the dark matter and stellar masses of galaxies predicted by \bluetides. In Fig. \ref{fig:DM_stellar} we show the ratio of the stellar to dark matter masses of galaxies. This ratio increases towards higher stellar mass (by approximately 0.5 dex as the dark matter mass increases by 1 dex) and towards lower redshift. The shape of this relationship broadly matches the extrapolation of the \citet{Moster2013} abundance matching model; however, there is a significant difference ($\approx 0.4\,{\rm dex}$) in normalisation. In Fig. \ref{fig:DM_stellar} we also compare our results to the \citet{Behroozi2013} model, this time finding a significant difference in both normalisation and shape (at $M_{h}>10^{11}\,{\rm M_{\odot}}$). The exact reason for this is unclear, but it may reflect the fact that the \citet{Moster2013} and \citet{Behroozi2013} models are calibrated at lower redshift, and thus rely on extrapolation to produce the high-redshift relationship.\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/DM_stellar.pdf}\n\caption{The ratio of the stellar to dark matter mass as a function of dark matter mass predicted by \bluetides. The top panel shows the full distribution of sources at $z=8$ with large points denoting the median and the error-bars showing the central $68\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\in\{14,13,12,11,10,9,8\}$. Tabulated values of the median ratios are given in Table \ref{tab:DM_stellar}.}\n\label{fig:DM_stellar}\n\end{figure}\n\n\n\n\subsection{The Galaxy Stellar Mass Function}\label{sec:physical.GSMF}\n\n\nThe galaxy stellar mass function (GSMF) predicted by \bluetides\ is shown in Fig. 
\ref{fig:GSMF}. At $z=8$ \bluetides\ simulated a sufficiently large volume to robustly model the GSMF to stellar masses of $>10^{10}\,{\rm M_{\odot}}$. From $z=15\to 8$ the number of $>10^{8}\,{\rm M_{\odot}}$ galaxies within the simulation increases from a handful at $z=15$ to almost 120,000 by $z=8$, demonstrating the rapid assembly of the galaxy population during this epoch. Over this period the shape of the GSMF also evolved, with the number density of massive galaxies increasing faster than that of less massive galaxies. For example, from $z=10\to 8$ the number density of galaxies with $M_*\approx 10^{9.5}\,{\rm M_{\odot}}$ increased $\approx 4\times$ faster than that of galaxies with $M_*\approx 10^{8}\,{\rm M_{\odot}}$.\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/GSMF.pdf}\n\caption{The galaxy stellar mass function predicted by \bluetides\ at $z\in\{8,9,10,11,12,13,14,15\}$. The right-hand axis shows the total number of galaxies in \bluetides\ in each $\Delta\log_{10}M=0.2$ mass bin. The points and grey line show observational constraints from \citet{Song2015} at $z\approx 8$, corrected to assume a \citet{Chabrier2003} initial mass function. The inset panel shows the number of objects with $\log_{10}(M_*/{\rm M_{\odot}})>8$ in the simulation volume as a function of redshift $z=15\to 8$. Tabulated quantities from \bluetides\ are given in Table \ref{tab:GSMF}.}\n\label{fig:GSMF}\n\end{figure}\n\nIt is now possible, by combining deep {\em Hubble} observations with {\em Spitzer}/IRAC photometry, to probe the rest-frame UV-optical spectral energy distributions of galaxies at very-high redshift, and thus measure robust stellar masses and, in turn, the galaxy stellar mass function. \n\nWhile several studies have constrained the GSMF at very-high redshift \citep{Gonzalez2011, Duncan2014, Grazian2015, Song2015}, only \citet{Song2015} have extended observational measurements of the GSMF to $z\approx 8$, overlapping with \bluetides. 
The \citet{Song2015} results are shown in Fig. \ref{fig:GSMF} and closely match the \bluetides\ predictions over much of the simulated and observed mass range. The possible exception to this otherwise excellent agreement is at high masses ($M_*>10^{10}\,{\rm M_{\odot}}$), where \bluetides\ appears to predict more galaxies than are currently observed (although the observational uncertainties are very large). While this may reflect modelling issues, it is also likely that there exist observational biases at these large masses. The most-massive systems are predicted to be heavily obscured, even at $z\approx 8$, and may fall out of UV-selected samples.\n\nIt is also important to note that there are large differences between the observed GSMFs presented by different studies at very-high redshift. For example, despite using a largely overlapping set of observations, \citet{Song2015} find number densities (at $M_{*}>10^{9}\,{\rm M_{\odot}}$) almost an order of magnitude lower than \citet{Duncan2014}; for a discussion of the many issues regarding observational estimates of the GSMF see \citet{Grazian2015} and \citet{Song2015}. Observational estimates of the GSMF are sensitive to the choice of initial mass function (IMF). Assuming a \citet{Salpeter1955} IMF, for example, would systematically increase observational mass estimates by approximately $0.17\,{\rm dex}$.\n\n\subsection{The Star Formation Rate Distribution Function}\label{sec:physical.SFRDF}\n\nAnother fundamental description of the galaxy population is the star formation rate (SFR) distribution function (SFR-DF). \bluetides\ predictions for the SFR-DF are shown, alongside observational constraints at $z\in\{4.9,6.8,7.9\}$ from \citet{Mashian2016}, in Fig. \ref{fig:SFRDF}. The general shape of the predicted SFR-DF is similar to the galaxy stellar mass function and similarly lacks a strong break. However, the SFR-DF also evolves more slowly than the galaxy stellar mass function. 
The \citet{Mashian2016} $z\approx 7.9$ distribution function has a higher normalisation at low SFRs and contains fewer high-SFR galaxies. The lack of high-SFR galaxies may again suggest a modelling issue, though it may also reflect an observational bias. This is discussed in more depth in \S\ref{sec:photometric.modelling.dust}, where we discuss predictions for dust attenuation. \n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/SFRDF.pdf}\n\caption{The star formation rate distribution function predicted by \bluetides. The right-hand axis shows the total number of galaxies in \bluetides\ in each $\Delta\log_{10}SFR=0.2$ bin. Solid lines show the dust-corrected (intrinsic) star formation rate distribution functions measured by \citet{Mashian2016} at $z\in\{7.9, 6.8, 4.9\}$. The \citet{Mashian2016} curves are corrected to assume a \citet{Chabrier2003} IMF using the calibrations proposed by \citet{KE2012}. Tabulated values of the \bluetides\ predictions are given in Table \ref{tab:SFRDF}.}\n\label{fig:SFRDF}\n\end{figure}\n\n\n\n\n\subsection{Star Formation Histories}\label{sec:physical.SFH}\n\n\nAt all the redshifts simulated by \bluetides, the average star formation activity in galaxies is increasing rapidly, though the rate of this increase slows at later times. The average star formation histories of galaxies with stellar masses $>10^{8}\,{\rm M_{\odot}}$ are shown in Fig. \ref{fig:SFH}. Within the range probed by \bluetides, there is little variation in the shape of the star formation history with stellar mass. This can also be seen in Figs. \ref{fig:M_sSFR} and \ref{fig:ages}, where we show the average specific star formation rates and mean stellar ages in different mass bins. Both quantities show no correlation with stellar mass over the range to which we are sensitive, suggesting that star formation has not yet been quenched in these systems. 
The lack of quenching in our simulated galaxies is not entirely surprising, as the mass range does not yet encompass many galaxies with $M_{h}>10^{12}\,{\rm M_{\odot}}$, where inflows, and thus star formation, are expected to be suppressed \citep[e.g.][]{Finlator2011}. It is worth noting, however, that there is a tentative indication of some suppression in the most massive halos, although at $z=8$ there are not yet enough of them to give a clear picture.\n\nWhile there is no correlation with stellar mass, both the average specific star formation rate and mean stellar age evolve strongly with redshift. For example, from $z=14\to 8$ average mass-weighted stellar ages increase from approximately $30\to 90\,{\rm Myr}$ while specific star formation rates drop by around $0.5\,{\rm dex}$.\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/SFH.pdf}\n\caption{The average star formation histories of galaxies with $M_{*}>10^{8}\,{\rm M_{\odot}}$ at $z\in\{14,12,10,8\}$. The figure shows the fraction of the total star formation occurring in each $\Delta t=10\,{\rm Myr}$ age-bin.}\n\label{fig:SFH}\n\end{figure}\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/M_sSFR.pdf}\n\caption{The relationship between the specific star formation rate (${\rm SFR}/M_{*}$) and stellar mass predicted for galaxies at $z\in\{14,13,12,11,10,9,8\}$ by \bluetides. The top panel shows the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\in\{14,13,12,11,10,9,8\}$. 
The median specific star formation rates in stellar mass bins predicted by \bluetides\ are tabulated in Table \ref{tab:physical}.}\n\label{fig:M_sSFR}\n\end{figure}\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/ages.pdf}\n\caption{The relationship between the mean stellar age and stellar mass predicted for galaxies at $z\in\{14,13,12,11,10,9,8\}$ by \bluetides. The top panel shows the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\in\{14,13,12,11,10,9,8\}$. The median ages in stellar mass bins predicted by \bluetides\ are tabulated in Table \ref{tab:physical}.}\n\label{fig:ages}\n\end{figure}\n\n\n\n\n\subsection{Metal Enrichment}\label{sec:physical.Z}\n\n\nAs galaxies assemble stellar mass in the simulation, the average metallicity of both the gas and stars increases. This can be seen in Fig. \ref{fig:metallicities} where we show both the average mass-weighted stellar and star forming gas phase metallicity as a function of stellar mass. Metallicity increases with stellar mass with a slope of ${\rm d}\log_{10}Z/{\rm d}\log_{10}M_*\approx 0.4$. This trend is similar to observational measurements, using rest-frame optical strong line diagnostics, from \citet{Maiolino2008} (at $z\approx 3.5$) and \citet{Mannucci2009} (at $z\approx 3.1$). The normalisation of the simulated mass-metallicity relationship at $z\approx 8$ is also similar to that found at $z\sim 3$ by \citet{Maiolino2008} and \citet{Mannucci2009} using rest-frame optical diagnostics and at $z\sim 5$ by \citet{Faisst2016} using rest-UV absorption complexes. \n\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/metallicities.pdf}\n\caption{The stellar (light points) and star forming gas (dark points) metallicities of galaxies in \bluetides. 
The 2D histogram in the top panel shows all objects with $M>10^{8}\,{\rm M_{\odot}}$ at $z=8$. Points denote the median and the central $68\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\in\{14,13,12,11,10,9,8\}$. Observational constraints from \citet{Maiolino2008}, \citet{Mannucci2009}, and \citet{Faisst2016} at $z\sim 3.5$, $z\sim 3.1$, and $z\sim 5$, respectively, are also shown. Observational measurements of the stellar mass assume a \citet{Chabrier2003} initial mass function, and metallicities were converted to a mass fraction assuming $12+\log_{10}(O/H)_{\odot} = 8.69$ and $Z_{\odot}=0.02$. The median metallicities in stellar mass bins predicted by \bluetides\ are tabulated in Table \ref{tab:physical}.}\n\label{fig:metallicities}\n\end{figure}\n\n\n3.3 The Star Formation Rate Distribution Function\n\subsection{The Star Formation Rate Distribution Function}\label{sec:physical.SFRDF}\n\nAnother fundamental description of the galaxy population is the star formation rate (SFR) distribution function (SFR-DF). \bluetides\ predictions for the SFR-DF are shown, alongside observational constraints at $z\in\{4.9,6.8,7.9\}$ from \citet{Mashian2016}, in Fig. \ref{fig:SFRDF}. The general shape of the predicted SFR-DF is similar to the galaxy stellar mass function and similarly lacks a strong break. However, the SFR-DF also evolves more slowly than the galaxy stellar mass function. The \citet{Mashian2016} $z\approx 7.9$ distribution function has a higher normalisation at low SFRs and contains fewer high-SFR galaxies. The lack of high-SFR galaxies may again suggest a modelling issue, though it may also reflect an observational bias. This is discussed in more depth in \S\ref{sec:photometric.modelling.dust}, where we discuss predictions for dust attenuation. 
\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/SFRDF.pdf}\n\caption{The star formation rate distribution function predicted by \bluetides. The right-hand axis shows the total number of galaxies in \bluetides\ in each $\Delta\log_{10}SFR=0.2$ bin. Solid lines show the dust-corrected (intrinsic) star formation rate distribution functions measured by \citet{Mashian2016} at $z\in\{7.9, 6.8, 4.9\}$. The \citet{Mashian2016} curves are corrected to assume a \citet{Chabrier2003} IMF using the calibrations proposed by \citet{KE2012}. Tabulated values of the \bluetides\ predictions are given in Table \ref{tab:SFRDF}.}\n\label{fig:SFRDF}\n\end{figure}\n\n\n\n\n", "Descriptive_question1": "What is the highest redshift value mentioned in table_4 for the star formation rate distribution function?", "Descriptive_question2": "Which IMF is assumed for the corrections in table_4?", "Reasoning_question1": "How does the normalization of the star formation rate distribution function at low-SFRs in table_4 compare between the Bluetides predictions and the Mashian2016 observations at z≈7.9, and what might this indicate?", "Reasoning_question2": "What could be the possible reasons for the discrepancy in the number of high-SFR galaxies between Bluetides predictions and Mashian2016 observations in table_4, and how might observational biases play a role?", "Descriptive_answer1": "z=13", "Descriptive_answer2": "Chabrier2003", "Reasoning_answer1": "To compare the normalization of the star formation rate distribution function (SFRDF) at low-SFRs between Bluetides predictions and Mashian2016 observations at z≈7.9, I first consider the context provided in the text. The text states that the Mashian2016 distribution function at z≈7.9 has a higher normalization at low-SFRs compared to Bluetides predictions. 
This means that for galaxies with lower star formation rates, Mashian2016 reports a higher number density or frequency of such galaxies than what Bluetides simulates. Looking at the table, for z=8 (closest to z=7.9), the SFRDF values at low SFRs (e.g., log(SFR) between -0.6 and -0.4) are around -2.22, and these values decrease (become more negative) at higher redshifts, indicating lower number densities in Bluetides predictions. Since Mashian2016 shows higher normalization, it suggests that their observational data captures more low-SFR galaxies. This discrepancy could indicate that Bluetides might be underpredicting the number of low-SFR galaxies at high redshifts, possibly due to limitations in the simulation's resolution or assumptions about star formation processes at these early epochs. Alternatively, Mashian2016's higher normalization might reflect observational sensitivities or selection effects that favor detecting more faint galaxies.", "Reasoning_answer2": "Analyzing the discrepancy in the number of high-SFR galaxies between Bluetides predictions and Mashian2016 observations at z≈7.9, I refer to the text which notes that Mashian2016 contains fewer high-SFR galaxies compared to Bluetides. In the table, for z=8, Bluetides predicts SFRDF values even at high SFR ranges (e.g., log(SFR) between 2.2 and 2.6 with values like -6.12 and -6.57), suggesting a presence of high-SFR galaxies, while Mashian2016 data indicates fewer such galaxies. This difference could stem from modeling issues in Bluetides, where the simulation might overestimate star formation efficiency or feedback mechanisms in massive galaxies, leading to more predicted high-SFR systems. On the other hand, observational biases could significantly contribute to this discrepancy. As discussed in the text, high-SFR galaxies at high redshifts might be heavily obscured by dust, making them less likely to be detected in UV-selected observational samples like those of Mashian2016. 
This dust attenuation could cause an undercount of such galaxies in observations. Additionally, the limited volume or sensitivity of observational surveys might miss rare, high-SFR objects, whereas simulations like Bluetides cover a larger representative volume and include all theoretical formations, thus predicting more high-SFR galaxies." }, { "paper_id": "1704.00954.json", "table_id": "table_5", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the median specific star formation rate, mass-weighted stellar age, star forming gas metallicity, and stellar metallicity used in Figures \\ref{fig:M_sSFR}, {fig:ages} and \\ref{fig:metallicities}.}\n\\label{tab:physical}\n\\begin{tabular}{cccccccccc}\n\\hline\n & \\multicolumn{9}{c}{$\\log_{10}(M_*/{\\rm M_{\\odot}})=$} \\\\\n$z$ & $8.0$-$8.25$ & $8.25$-$8.5$ & $8.5$-$8.75$ & $8.75$-$9.0$ & $9.0$-$9.25$ & $9.25$-$9.50$ & $9.50$-$9.75$ & $9.75$-$10.0$ & $10.0$-$10.25$ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median specific star formation rate} - $\\log_{10}[({\\rm SFR}/M_{*})/{\\rm yr^{-1}}]$} \\\\\n\\hline\n 13.0 & $-7.7 $ & $-7.65 $ & - & - & - & - & - & - & -\\\\\n 12.0 & $-7.76 $ & $-7.77 $ & $-7.74 $ & $-7.74 $ & - & - & - & - & -\\\\\n 11.0 & $-7.81 $ & $-7.81 $ & $-7.79 $ & $-7.76 $ & $-7.74 $ & - & - & - & -\\\\\n 10.0 & $-7.91 $ & $-7.91 $ & $-7.91 $ & $-7.9 $ & $-7.89 $ & $-7.89 $ & $-7.96 $ & - & -\\\\\n 9.0 & $-8.01 $ & $-8.0 $ & $-7.99 $ & $-7.97 $ & $-7.98 $ & $-7.96 $ & $-7.97 $ & $-7.96 $ & -\\\\\n 8.0 & $-8.1 $ & $-8.09 $ & $-8.09 $ & $-8.1 $ & $-8.1 $ & $-8.1 $ & $-8.11 $ & $-8.11 $ & $-8.1 $ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median mass-weighted age} - $age/{\\rm Myr}$} \\\\\n\\hline\n 13.0 & 33 & 33 & - & - & - & - & - & - & -\\\\\n 12.0 & 39 & 41 & 40 & 40 & - & - & - & - & -\\\\\n 11.0 & 47 & 48 & 48 & 46 & 49 & - & - & - & -\\\\\n 10.0 & 57 & 56 & 56 & 56 & 56 & 56 & 56 & - & -\\\\\n 9.0 & 71 & 71 & 70 & 70 & 70 & 70 & 70 & 71 & -\\\\\n 8.0 & 89 & 88 & 88 & 88 & 87 & 88 
& 88 & 89 & 89\\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median star forming gas metallicity} - $\\log_{10}Z_{\\rm SFG}$} \\\\\n\\hline\n 13.0 & $-2.95 $ & $-2.85 $ & - & - & - & - & - & - & -\\\\\n 12.0 & $-2.99 $ & $-2.86 $ & $-2.75 $ & $-2.62 $ & - & - & - & - & -\\\\\n 11.0 & $-3.0 $ & $-2.89 $ & $-2.79 $ & $-2.65 $ & $-2.56 $ & - & - & - & -\\\\\n 10.0 & $-3.02 $ & $-2.91 $ & $-2.81 $ & $-2.71 $ & $-2.58 $ & $-2.5 $ & $-2.38 $ & - & -\\\\\n 9.0 & $-3.02 $ & $-2.92 $ & $-2.82 $ & $-2.71 $ & $-2.61 $ & $-2.5 $ & $-2.39 $ & $-2.26 $ & -\\\\\n 8.0 & $-3.03 $ & $-2.92 $ & $-2.82 $ & $-2.72 $ & $-2.61 $ & $-2.5 $ & $-2.4 $ & $-2.27 $ & $-2.17 $ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median stellar metallicity} - $\\log_{10}Z_{*}$} \\\\\n\\hline\n 13.0 & $-3.11 $ & $-3.01 $ & - & - & - & - & - & - & -\\\\\n 12.0 & $-3.15 $ & $-3.04 $ & $-2.92 $ & $-2.82 $ & - & - & - & - & -\\\\\n 11.0 & $-3.17 $ & $-3.06 $ & $-2.95 $ & $-2.83 $ & $-2.75 $ & - & - & - & -\\\\\n 10.0 & $-3.17 $ & $-3.07 $ & $-2.98 $ & $-2.87 $ & $-2.75 $ & $-2.66 $ & $-2.56 $ & - & -\\\\\n 9.0 & $-3.18 $ & $-3.08 $ & $-2.98 $ & $-2.88 $ & $-2.78 $ & $-2.67 $ & $-2.57 $ & $-2.45 $ & -\\\\\n 8.0 & $-3.19 $ & $-3.09 $ & $-2.99 $ & $-2.89 $ & $-2.79 $ & $-2.68 $ & $-2.58 $ & $-2.47 $ & $-2.38 $ \\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the median specific star formation rate, mass-weighted stellar age, star forming gas metallicity, and stellar metallicity used in Figures \\ref{fig:M_sSFR}, {fig:ages} and \\ref{fig:metallicities}.", "label": "tab:physical", "section_info": "3 Physical Properties\n\\section{Physical Properties}\\label{sec:physical}\n\n\\subsection{Dark Matter - Stellar Mass Connection}\\label{sec:physical.DM}\n\nWe begin by investigating the link between the dark matter and stellar masses of galaxies predicted by \\bluetides. In Fig. \\ref{fig:DM_stellar} we show the ratio of the stellar to dark matter masses of galaxies. 
This ratio increases towards higher stellar mass (by approximately 0.5 dex as the dark matter mass increases by 1 dex) and towards lower redshift. The shape of this relationship broadly matches the extrapolation of the \citet{Moster2013} abundance matching model; however, there is a significant difference ($\approx 0.4\,{\rm dex}$) in normalisation. In Fig. \ref{fig:DM_stellar} we also compare our results to the \citet{Behroozi2013} model, this time finding a significant difference in both normalisation and shape (at $M_{h}>10^{11}\,{\rm M_{\odot}}$). The exact reason for this is unclear, but it may reflect the fact that the \citet{Moster2013} and \citet{Behroozi2013} models are calibrated at lower redshift, and thus rely on extrapolation to produce the high-redshift relationship.\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/DM_stellar.pdf}\n\caption{The ratio of the stellar to dark matter mass as a function of dark matter mass predicted by \bluetides. The top panel shows the full distribution of sources at $z=8$ with large points denoting the median and the error-bars showing the central $68\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\in\{14,13,12,11,10,9,8\}$. Tabulated values of the median ratios are given in Table \ref{tab:DM_stellar}.}\n\label{fig:DM_stellar}\n\end{figure}\n\n\n\n\subsection{The Galaxy Stellar Mass Function}\label{sec:physical.GSMF}\n\n\nThe galaxy stellar mass function (GSMF) predicted by \bluetides\ is shown in Fig. \ref{fig:GSMF}. At $z=8$ \bluetides\ simulated a sufficiently large volume to robustly model the GSMF to stellar masses of $>10^{10}\,{\rm M_{\odot}}$. From $z=15\to 8$ the number of $>10^{8}\,{\rm M_{\odot}}$ galaxies within the simulation increases from a handful at $z=15$ to almost 120,000 by $z=8$, demonstrating the rapid assembly of the galaxy population during this epoch. 
Over this period the shape of the GSMF also evolved, with the number density of massive galaxies increasing faster than that of less massive galaxies. For example, from $z=10\to 8$ the number density of galaxies with $M_*\approx 10^{9.5}\,{\rm M_{\odot}}$ increased $\approx 4\times$ faster than that of galaxies with $M_*\approx 10^{8}\,{\rm M_{\odot}}$.\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/GSMF.pdf}\n\caption{The galaxy stellar mass function predicted by \bluetides\ at $z\in\{8,9,10,11,12,13,14,15\}$. The right-hand axis shows the total number of galaxies in \bluetides\ in each $\Delta\log_{10}M=0.2$ mass bin. The points and grey line show observational constraints from \citet{Song2015} at $z\approx 8$, corrected to assume a \citet{Chabrier2003} initial mass function. The inset panel shows the number of objects with $\log_{10}(M_*/{\rm M_{\odot}})>8$ in the simulation volume as a function of redshift $z=15\to 8$. Tabulated quantities from \bluetides\ are given in Table \ref{tab:GSMF}.}\n\label{fig:GSMF}\n\end{figure}\n\nIt is now possible, by combining deep {\em Hubble} observations with {\em Spitzer}/IRAC photometry, to probe the rest-frame UV-optical spectral energy distributions of galaxies at very-high redshift, and thus measure robust stellar masses and, in turn, the galaxy stellar mass function. \n\nWhile several studies have constrained the GSMF at very-high redshift \citep{Gonzalez2011, Duncan2014, Grazian2015, Song2015}, only \citet{Song2015} have extended observational measurements of the GSMF to $z\approx 8$, overlapping with \bluetides. The \citet{Song2015} results are shown in Fig. \ref{fig:GSMF} and closely match the \bluetides\ predictions over much of the simulated and observed mass range. 
The possible exception to this otherwise excellent agreement is at high masses ($M_*>10^{10}\,{\rm M_{\odot}}$), where \bluetides\ appears to predict more galaxies than are currently observed (although the observational uncertainties are very large). While this may reflect modelling issues, it is also likely that there exist observational biases at these large masses. The most-massive systems are predicted to be heavily obscured, even at $z\approx 8$, and may fall out of UV-selected samples.\n\nIt is also important to note that there are large differences between the observed GSMFs presented by different studies at very-high redshift. For example, despite using a largely overlapping set of observations, \citet{Song2015} find number densities (at $M_{*}>10^{9}\,{\rm M_{\odot}}$) almost an order of magnitude lower than \citet{Duncan2014}; for a discussion of the many issues regarding observational estimates of the GSMF see \citet{Grazian2015} and \citet{Song2015}. Observational estimates of the GSMF are sensitive to the choice of initial mass function (IMF). Assuming a \citet{Salpeter1955} IMF, for example, would systematically increase observational mass estimates by approximately $0.17\,{\rm dex}$.\n\n\subsection{The Star Formation Rate Distribution Function}\label{sec:physical.SFRDF}\n\nAnother fundamental description of the galaxy population is the star formation rate (SFR) distribution function (SFR-DF). \bluetides\ predictions for the SFR-DF are shown, alongside observational constraints at $z\in\{4.9,6.8,7.9\}$ from \citet{Mashian2016}, in Fig. \ref{fig:SFRDF}. The general shape of the predicted SFR-DF is similar to the galaxy stellar mass function and similarly lacks a strong break. However, the SFR-DF also evolves more slowly than the galaxy stellar mass function. The \citet{Mashian2016} $z\approx 7.9$ distribution function has a higher normalisation at low SFRs and contains fewer high-SFR galaxies. 
The lack of high-SFR galaxies may again suggest a modelling issue, though it may also reflect an observational bias. This is discussed in more depth in \\S\\ref{sec:photometric.modelling.dust}, where we present predictions for dust attenuation. \n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SFRDF.pdf}\n\\caption{The star formation rate distribution function predicted by \\bluetides. The right-hand axis shows the total number of galaxies in \\bluetides\\ in each $\\Delta\\log_{10}SFR=0.2$ bin. Solid lines show the dust-corrected (intrinsic) star formation rate distribution functions measured by \\citet{Mashian2016} at $z\\in\\{7.9, 6.8, 4.9\\}$. The \\citet{Mashian2016} curves are corrected to assume a \\citet{Chabrier2003} IMF using the calibrations proposed by \\citet{KE2012}. Tabulated \\bluetides\\ predictions are given in Table \\ref{tab:SFRDF}.}\n\\label{fig:SFRDF}\n\\end{figure}\n\n\\subsection{Star Formation Histories}\\label{sec:physical.SFH}\n\nAt all the redshifts simulated by \\bluetides\\ the average star formation activity in galaxies is increasing rapidly, though the rate of this increase slows at later times. The average star formation histories of galaxies with stellar masses $>10^{8}\\,{\\rm M_{\\odot}}$ are shown in Fig. \\ref{fig:SFH}. Within the range probed by \\bluetides\\ there is little variation in the shape of the star formation history with stellar mass. This can also be seen in Figs. \\ref{fig:M_sSFR} and \\ref{fig:ages}, where we show the average specific star formation rates and mean stellar ages in different mass bins. Both quantities show no correlation with stellar mass over the range to which we are sensitive, suggesting that star formation has not yet been quenched in these systems. 
The lack of quenching in our simulated galaxies is not entirely surprising, as the mass range does not yet encompass many galaxies with $M_{h}>10^{12}\\,{\\rm M_{\\odot}}$, where inflows, and thus star formation, are expected to be suppressed \\citep[e.g.][]{Finlator2011}. There is, however, a tentative indication of some suppression in the most massive halos, though at $z=8$ there are not yet enough such systems to give a clear picture.\n\nWhile there is no correlation with stellar mass, both the average specific star formation rate and mean stellar age evolve strongly with redshift. For example, from $z=14\\to 8$ average mass-weighted stellar ages increase from approximately $30\\to 90\\,{\\rm Myr}$ while specific star formation rates drop by around $0.5\\,{\\rm dex}$.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SFH.pdf}\n\\caption{The average star formation histories of galaxies with $M_{*}>10^{8}\\,{\\rm M_{\\odot}}$ at $z\\in\\{14,12,10,8\\}$. The figure shows the fraction of the total star formation occurring in each $\\Delta t=10\\,{\\rm Myr}$ age bin.}\n\\label{fig:SFH}\n\\end{figure}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/M_sSFR.pdf}\n\\caption{The relationship between the specific star formation rate (${\\rm SFR}/M_{*}$) and stellar mass predicted for galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$ by \\bluetides. The top panel shows the full distribution of sources at $z=8$, with the points denoting the median and the error bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. 
The median specific star formation rates in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:M_sSFR}\n\\end{figure}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/ages.pdf}\n\\caption{The relationship between the mean stellar age and stellar mass predicted for galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$ by \\bluetides. The top panel shows the full distribution of sources at $z=8$, with the points denoting the median and the error bars showing the central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. The median ages in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:ages}\n\\end{figure}\n\n\\subsection{Metal Enrichment}\\label{sec:physical.Z}\n\nAs galaxies assemble stellar mass in the simulation, the average metallicity of both the gas and stars increases. This can be seen in Fig. \\ref{fig:metallicities}, where we show both the average mass-weighted stellar and star-forming gas-phase metallicity as a function of stellar mass. Metallicity increases with stellar mass with a slope of ${\\rm d}\\log_{10}Z/{\\rm d}\\log_{10}M_*\\approx 0.4$. This trend is similar to observational measurements, using rest-frame optical strong-line diagnostics, from \\citet{Maiolino2008} (at $z\\approx 3.5$) and \\citet{Mannucci2009} (at $z\\approx 3.1$). The normalisation of the simulated mass-metallicity relationship at $z\\approx 8$ is also similar to that found at $z\\sim 3$ by \\citet{Maiolino2008} and \\citet{Mannucci2009} using rest-frame optical diagnostics and at $z\\sim 5$ by \\citet{Faisst2016} using rest-UV absorption complexes. \n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/metallicities.pdf}\n\\caption{The stellar (light points) and star forming gas (dark points) metallicities of galaxies in \\bluetides. 
The 2D histogram in the top panel shows all objects with $M>10^{8}\\,{\\rm M_{\\odot}}$ at $z=8$; points denote the median and central $68\\%$ range. The lower panel shows only the median values for bins containing $>10$ galaxies at $z\\in\\{14,13,12,11,10,9,8\\}$. Observational constraints from \\citet{Maiolino2008}, \\citet{Mannucci2009}, and \\citet{Faisst2016} at $z\\sim 3.5$, $z\\sim 3.1$, and $z\\sim 5$ respectively are also shown. Observational measurements of the stellar mass assume a \\citet{Chabrier2003} initial mass function and metallicities were converted to a mass fraction assuming $12+\\log_{10}(O/H)_{\\odot} = 8.69$ and $Z_{\\odot}=0.02$. The median metallicities in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:physical}.}\n\\label{fig:metallicities}\n\\end{figure}\n", "Descriptive_question1": "What is the median specific star formation rate for galaxies at z=8 in table_5?", "Descriptive_question2": "What is the median stellar age for galaxies at z=10 in table_5?", "Reasoning_question1": "How does the trend in median specific star formation rate with redshift in table_5 reflect the evolution of galaxy formation processes from z=14 to z=8?", "Reasoning_question2": "What can be inferred about the relationship between stellar mass and metallicity from the data in table_5, and how does this compare to observational trends at lower redshifts?", "Descriptive_answer1": "-8.1", "Descriptive_answer2": "56", "Reasoning_answer1": "Examining the median specific star formation rate (sSFR) in table_5 across different redshifts reveals a clear trend of decreasing sSFR from z=13 to z=8. 
For instance, at z=13, the sSFR is around -7.7 for the lowest stellar mass bin, while at z=8, it drops to around -8.1 across multiple stellar mass bins. This decline of approximately 0.4 dex indicates a reduction in the rate of star formation relative to stellar mass over time. The trend reflects the evolution of galaxy formation processes, where early galaxies at higher redshifts (z=13 to z=10) experience more intense star formation activity, likely due to abundant gas inflows and lower stellar masses, leading to higher sSFR. As redshift decreases (z=10 to z=8), galaxies grow in stellar mass, and the availability of gas for new star formation may decrease due to consumption or feedback processes, resulting in a lower sSFR. This evolution suggests a transition from rapid, bursty star formation in the early universe to a more regulated star formation phase as galaxies mature.", "Reasoning_answer2": "From table_5, a positive relationship between stellar mass and both star-forming gas metallicity (log10 Z_SFG) and stellar metallicity (log10 Z_*) can be observed. For example, at z=8, as stellar mass increases from log10(M_*/M⊙)=8.0-8.25 to 10.0-10.25, gas metallicity rises from -3.03 to -2.17, and stellar metallicity increases from -3.19 to -2.38, indicating an increase of about 0.8-0.9 dex in metallicity over a 2 dex increase in stellar mass. This suggests a slope of approximately dlog10Z/dlog10M_* ≈ 0.4-0.45, where metallicity increases with stellar mass. This trend likely arises because more massive galaxies have undergone more star formation, producing and retaining more metals through supernova feedback and stellar evolution. Comparing this to observational trends at lower redshifts (z≈3-5) mentioned in the context, the simulated mass-metallicity relationship at z=8 shows a similar slope (dlog10Z/dlog10M_* ≈ 0.4) and normalization to observations by Maiolino (2008) and Mannucci (2009) at z≈3.1-3.5, as well as Faisst (2016) at z≈5. 
This similarity suggests that the physical processes driving metal enrichment in galaxies, such as star formation and feedback, are consistent across these cosmic epochs within the simulation's framework, despite the higher redshift of the data in table_5." }, { "paper_id": "1704.00954.json", "table_id": "table_6", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the median far-UV photon escape fraction in stellar mass bins used in Fig. \\ref{fig:L_fesc}.}\n\\label{tab:L_fesc}\n\\begin{tabular}{cccccccccc}\n\\hline\n & \\multicolumn{9}{c}{$\\log_{10}(M_*/{\\rm M_{\\odot}})=$} \\\\\n$z$ & $8.0$-$8.25$ & $8.25$-$8.5$ & $8.5$-$8.75$ & $8.75$-$9.0$ & $9.0$-$9.25$ & $9.25$-$9.50$ & $9.50$-$9.75$ & $9.75$-$10.0$ & $10.0$-$10.25$ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median far-UV photon escape fraction}} \\\\\n\\hline\n 13.0 & 0.87 & 0.77 & - & - & - & - & - & - & -\\\\\n 12.0 & 0.91 & 0.78 & 0.6 & 0.43 & - & - & - & - & -\\\\\n 11.0 & 0.9 & 0.79 & 0.62 & 0.39 & 0.32 & - & - & - & -\\\\\n 10.0 & 0.91 & 0.8 & 0.67 & 0.5 & 0.33 & 0.26 & 0.18 & - & -\\\\\n 9.0 & 0.9 & 0.8 & 0.65 & 0.49 & 0.35 & 0.24 & 0.18 & 0.14 & -\\\\\n 8.0 & 0.89 & 0.8 & 0.67 & 0.52 & 0.38 & 0.27 & 0.2 & 0.15 & 0.12\\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the median far-UV photon escape fraction in stellar mass bins used in Fig. \\ref{fig:L_fesc}.", "label": "tab:L_fesc", "section_info": "4 Photometric Properties\n\\section{Photometric Properties}\\label{sec:photometric}\n\n\\subsection{Modelling Galaxy Photometry}\\label{sec:photometric.modelling}\n\nWe build up the spectral energy distribution (SED) of each galaxy on a star particle by star particle basis. Firstly, we assign a pure stellar SED to each particle on the basis of its mass, age, and chemical composition. 
We adopt the {\\sc Pegase.2} \\citep{pegase} stellar population synthesis (SPS) model combined with a \\citet{Chabrier2003} initial mass function (IMF) over $0.1-100\\,{\\rm M_{\\odot}}$. The emission from each star particle is then modified to take into account reprocessing by both dust and gas as described below. \n\n\\subsubsection{Nebular Continuum and Line Emission Modelling}\\label{sec:photometric.modelling.nebular}\n\nWe use the {\\sc cloudy} photoionisation code to model the effect of reprocessing by H{\\sc ii} regions surrounding stars. The hydrogen density is chosen to be $100\\,{\\rm cm^{-3}}$, and the chemical composition of the gas is set to the metallicity of the star particle scaled by solar abundances. We assume a uniform covering fraction of $0.85$, thereby leaving sufficient LyC photons to reionise the Universe. \n\nThe implications of the choice of SPS model, initial mass function, and Lyman continuum (LyC) escape fraction on the spectral energy distributions are discussed in more detail in \\citet{Wilkins2016b} and \\citet{Wilkins2016c}. While these assumptions can result in large systematic effects, the effect on the rest-frame far-UV ($150\\,{\\rm nm}$) is relatively small, as nebular emission contributes only around $10\\%$ of the total luminosity, with variations due to the choice of model typically changing luminosities by $<0.1\\,{\\rm dex}$ \\citep{Wilkins2016c}.\n\n\\subsubsection{Dust Attenuation}\\label{sec:photometric.modelling.dust}\n\nTo estimate the dust attenuation in \\bluetides\\ we employ a scheme which links the metal density integrated along parallel lines of sight to the dust optical depth $\\tau$. 
\n\nIn this model the rest-frame $V$-band ($0.55\\,{\\rm \\mu m}$) dust optical depth ($\\tau_{V}(x, y, z)$) is\n\\begin{equation}\n\\tau_{V}(x, y, z) = \\kappa \\Sigma (x, y, z) = \\int_{z'=0}^{z} \\kappa \\rho_\\mathrm{metal}(x, y, z')\\,{\\rm d}z',\n\\end{equation}\nwhere $\\rho_\\mathrm{metal}(x, y, z')$ is the metal density, and we have chosen the $z$ direction to be the line-of-sight direction. $\\kappa$ is a normalisation factor, a free parameter that is tuned so that the model matches the observed $z\\approx 8$ luminosity function (see \\S\\ref{sec:photometric.UVLF}).\n\nFirst, the metal mass is painted onto a three-dimensional image with a resolution of $0.2 h^{-1}\\,{\\rm ckpc}$. The image is passed through a Gaussian smoothing filter with a width of $r_s = 0.5 h^{-1}\\,{\\rm ckpc}$, the most probable smoothing length of gas particles that have collapsed into galaxies in the simulation. The parameter $r_s$ is also degenerate with $\\kappa$. Second, we compute the cumulative sum of the image along the line-of-sight direction ($z$). After this procedure, the image contains the surface density of metals ($\\Sigma(x, y, z)$) that contributes to the attenuation at any spatial location. Finally, we read off the values from the image at the location of each star particle. \n\nWe employ an individual stellar cluster (ISC) approximation in the implementation. The star clusters are identified with a Friends-of-Friends algorithm with a linking length of $l = 2.0 h^{-1}\\,{\\rm ckpc}$. For each star cluster, we perform the above calculation for metal mass in the bounding box of the star cluster with a buffer region of $b=2.0 h^{-1}\\,{\\rm ckpc}$. We tested that the approximation is stable to reasonable changes in the linking length $l$ or the size of the buffer region. The ISC approximation allows us to focus computational resources on the locations in the simulation where the dust attenuation is most relevant. 
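The paint-smooth-sum-sample procedure described above can be sketched in a few lines. The following is a schematic reimplementation under assumed conventions (nearest-grid-point painting, a hand-rolled separable Gaussian filter, and illustrative function names), not the actual \bluetides\ code:

```python
import numpy as np

def _gaussian_smooth(field, sigma_cells):
    """Separable Gaussian smoothing (a stand-in for a library filter
    such as scipy.ndimage.gaussian_filter)."""
    radius = max(1, int(3 * sigma_cells))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma_cells) ** 2)
    kernel /= kernel.sum()
    for axis in range(field.ndim):
        field = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, field)
    return field

def tau_v_at_stars(metal_mass, metal_pos, star_pos, box, n_cells, kappa, r_s):
    """Schematic line-of-sight metal-column dust model: paint metal mass
    onto a 3D grid, smooth with width r_s, cumulatively sum the density
    along the z (line-of-sight) axis to obtain the metal surface density
    Sigma(x, y, z), then read off tau_V = kappa * Sigma at each star."""
    cell = box / n_cells
    # Nearest-grid-point painting of the metal mass (assumed scheme).
    grid, _ = np.histogramdd(metal_pos, bins=n_cells,
                             range=[(0.0, box)] * 3, weights=metal_mass)
    rho = _gaussian_smooth(grid / cell**3, r_s / cell)
    sigma = np.cumsum(rho, axis=2) * cell   # metals in front of each cell
    idx = np.clip((star_pos / cell).astype(int), 0, n_cells - 1)
    return kappa * sigma[idx[:, 0], idx[:, 1], idx[:, 2]]
```

Because the column is cumulative along $z$, a star deeper along the line of sight through the same $(x, y)$ cell always sees at least as large a $\tau_V$, mirroring the surface-density construction in the text.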
At the high redshifts ($z \\ge 8$) simulated by \\bluetides, the ISC approximation provides a significant computational advantage compared to a full-volume ray-tracing approach. At such high redshift, the attenuation due to chance-aligned galaxies can be neglected because the abundance of galaxies with very high metallicities is low. \n\nThe optical depth at an arbitrary wavelength $\\lambda$ is related to the $V$-band optical depth through an attenuation curve. We parameterise the attenuation curve as a power law with index $\\gamma$,\n\\begin{equation}\n\\tau_\\lambda = \\tau_V\\times\\left(\\frac{\\lambda}{0.55\\,{\\rm\\mu m}}\\right)^{\\gamma}.\n\\end{equation}\nFor $\\gamma$ we choose a value of $-1$, yielding an attenuation curve slightly flatter in the UV than the Pei et al. (1992) Small Magellanic Cloud curve, but not as flat as the \\citet{Calzetti2000} ``Starburst'' curve. \n\nThe predicted surface density of metals is strongly correlated with the stellar mass and intrinsic luminosity. This results in a strong trend of the average UV attenuation with both the stellar mass and intrinsic UV luminosity, albeit with considerable scatter (see Fig. \\ref{fig:L_fesc}). At a fixed stellar mass, the attenuation is predicted to decrease slightly to higher redshift. \n\nHowever, while linked to some degree, dust and metals are not expected to trace one another exactly \\citep[see modelling by][]{Mancini2015}. Consequently, such a simple model is unlikely to fully capture the redshift and luminosity dependence of dust attenuation, especially at the highest redshifts where the formation of dust in AGB stars or in-situ in the ISM has not had time to occur. This may then suggest that our dust model produces too much attenuation at the highest redshifts. Indeed, this is perhaps hinted at by the recent discovery \\citep{Oesch2016} of an exceptionally bright ($M\\approx -22$) and blue (and therefore likely dust-poor) galaxy at $z\\approx 11$. 
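For the power-law curve above with $\gamma=-1$, converting the $V$-band optical depth to any other wavelength is a one-liner; the magnitude conversion $A = 2.5\log_{10}(e)\,\tau$ is the standard optical-depth-to-magnitudes relation (helper names here are illustrative):

```python
import math

def tau_lam(tau_v, lam_um, gamma=-1.0):
    """Power-law attenuation curve from the text:
    tau_lambda = tau_V * (lambda / 0.55 um) ** gamma, with gamma = -1."""
    return tau_v * (lam_um / 0.55) ** gamma

def attenuation_mag(tau):
    """Optical depth to attenuation in magnitudes: A = 2.5 * log10(e) * tau."""
    return 2.5 * math.log10(math.e) * tau

# With gamma = -1 the far-UV (0.15 um) optical depth is tau_V * (0.55 / 0.15),
# i.e. about 3.7 times the V-band value; for a single sightline the
# transmitted fraction of photons is exp(-tau).
```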
While the discovery of this object is consistent with predictions based on intrinsic luminosities \\citep{Waters2016a}, it would be very unexpected based on dust attenuated luminosities predicted using our model. Given the uncertainties introduced by the dust model, particularly at $z>10$, throughout this work we consider predictions based on both the intrinsic luminosities and the dust attenuated luminosities.\n\nA consequence of the desire to fit observations of the $z\\sim 8$ far-UV luminosity function is the prediction that there exists a population of massive, heavily dust-obscured galaxies. The existence of these galaxies, which would not appear in Lyman-break selected samples, explains the discrepancy between predictions from \\bluetides\\ and current observational constraints on the galaxy stellar mass function and star formation rate distribution function (see \\S\\ref{sec:physical.GSMF} and \\S\\ref{sec:physical.SFRDF}). Unfortunately, the relative faintness and rarity of these objects mean they are unlikely to be identified in current IR/sub-mm observations. However, massive, heavily obscured, intensely star-forming galaxies have been identified at lower redshift \\citep[e.g. HFLS3 at $z=6.34$:][]{Riechers2013}, suggesting that such objects can and do exist in the relatively early Universe.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/L_fesc_StellarMass.pdf}\n\\caption{The effective escape fraction of far-UV ($150\\,{\\rm nm}$) photons as a function of stellar mass. 
The median far-UV escape fractions in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:L_fesc}.}\n\\label{fig:L_fesc}\n\\end{figure}\n\n\\subsection{Spectral Energy Distributions}\\label{sec:photometric.SED}\n\nThe resulting average intrinsic (including nebular continuum and line emission) and observed specific\\footnote{That is, expressed per unit stellar mass.} spectral energy distributions are shown, for three mass bins at $z=8$, in Fig. \\ref{fig:SED_M}. \n\nThe average intrinsic SEDs are generally very blue, reflecting the ongoing star formation activity, young ages, and low metallicities in the sample. While the shape of the SEDs in each mass bin is very similar, the most massive galaxies have slightly redder SEDs, reflecting the higher metallicity of the stellar populations. A more detailed analysis of the pure stellar and intrinsic SEDs is contained in \\citet{Wilkins2016c}. \n\nAs noted in the previous section, the most massive galaxies also suffer much higher attenuation due to dust, resulting in redder observed SEDs and higher mass-to-light ratios. The trend of higher mass-to-light ratios at higher stellar mass can be seen more clearly in Fig. \\ref{fig:MTOL}. Fig. \\ref{fig:MTOL} also shows the evolution with redshift, demonstrating that stellar mass-to-light ratios increase to lower redshift. This predominantly reflects the increasing age of the stellar populations to lower redshift.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/SED_M.pdf}\n\\caption{The average observed and unattenuated SEDs (expressed per unit stellar mass) in three mass bins at $z=8$.}\n\\label{fig:SED_M}\n\\end{figure*}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/MTOL.pdf}\n\\caption{The intrinsic and dust attenuated far-UV mass-to-light ratios as a function of stellar mass and redshift. 
The median intrinsic and observed far-UV mass-to-light ratios in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:MTOL}.}\n\\label{fig:MTOL}\n\\end{figure}\n\n\\subsection{Luminosity Functions}\\label{sec:photometric.UVLF}\n\nThe luminosity function (LF) is an incredibly useful statistical description of the galaxy population. In Fig. \\ref{fig:UVLF} we present both the intrinsic and dust attenuated far-UV luminosity functions at $z=8\\to 15$. In Fig. \\ref{fig:UVLF_multi} we show the intrinsic and attenuated UV LFs at $z\\in\\{8,9,10\\}$ together with current observational constraints. Both the intrinsic and observed luminosity functions demonstrate the rapid expected build-up of the galaxy population at high redshift. For example, the number of $M=-19$ objects increases by a factor of around 1000 from $z=15\\to 8$. The rapid decline of the LF to high redshift poses challenges for the observational identification of galaxy populations at $z>12$, even using \\jwst. This is explored in more detail in Wilkins et al. {\\em submitted}, where we make predictions for the surface density of sources at $z>8$, including the effects of field-to-field, or cosmic, variance.\n\nThe observed LF is generally similar to the intrinsic LF at faint luminosities ($M>-20$). At brighter luminosities there is stronger steepening of the LF, reflecting the increasing strength of dust attenuation. As noted earlier, our dust model is tuned to match the $z\\approx 8$ observed UV LF. However, it is important to stress that this only makes a significant difference at relatively bright luminosities ($M<-20$); at fainter luminosities there simply is not the surface density of metals (and therefore inferred dust) to yield significant attenuation. The excellent fit at fainter luminosities is then simply a consequence of the physics employed in the model and not a result of tuning via the dust model. 
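The observed LF is later summarised by Schechter-function fits (Table \ref{tab:parameters_redshift}). As a sketch, the standard absolute-magnitude form of the Schechter function can be evaluated as follows (the functional form is the usual convention, assumed here; the $z=8$ parameter values are taken from that table purely as an example):

```python
import numpy as np

def schechter_mag(M, M_star, log10_phi_star, alpha):
    """Schechter function in absolute-magnitude form (standard convention):
    phi(M) = 0.4 ln(10) phi* x**(alpha + 1) exp(-x),
    with x = 10**(-0.4 * (M - M*)); units Mpc^-3 mag^-1 if phi* is Mpc^-3."""
    x = 10.0 ** (-0.4 * (M - M_star))
    return (0.4 * np.log(10.0) * 10.0 ** log10_phi_star
            * x ** (alpha + 1.0) * np.exp(-x))

# z = 8 values from the table: M* = -20.93, log10(phi*) = -3.92, alpha = -2.04.
phi = schechter_mag(np.array([-22.0, -20.0, -18.0]), -20.93, -3.92, -2.04)
```

With the steep faint-end slope ($\alpha < -2$) the number density rises rapidly toward fainter magnitudes, consistent with the lack of a strong break discussed for the mass and SFR distribution functions.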
However, while the faint end of the LF is unaffected by our choice of dust model, it can be systematically affected by the choice of initial mass function (and, to a lesser extent, the choice of SPS model); see \\citet{Wilkins2016c}. Adopting an IMF yielding more low-mass stars than our assumed IMF \\citep[e.g. a pure][IMF extended down to $0.1\\,{\\rm M_{\\odot}}$]{Salpeter1955} would uniformly reduce the luminosities of our galaxies, shifting the LF to fainter luminosities.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/UVLF.pdf}\n\\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\\,{\\rm nm}$) luminosity functions. Observations at $z\\approx 8$ and $10.4$ from Bouwens et al.\\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. Tabulated \\bluetides\\ predictions are given in Table \\ref{tab:UVLF}.}\n\\label{fig:UVLF}\n\\end{figure*}\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/UVLF_multi.pdf}\n\\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\\,{\\rm nm}$) luminosity functions. Observations at $z\\approx 8$ and $10.4$ from Bouwens et al.\\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. Tabulated \\bluetides\\ predictions are given in Table \\ref{tab:UVLF}.}\n\\label{fig:UVLF_multi}\n\\end{figure*}\n\nWe also fit the dust attenuated far-UV LF by a Schechter function and find that the function provides a good overall fit to the shape of the LF, as shown in Fig. \\ref{fig:UVLF_multi} at $z\\in\\{8,9,10\\}$. The evolution of the Schechter function parameters is shown in Fig. 
\\ref{fig:parameters_redshift} with the parameters listed in Table \\ref{tab:parameters_redshift} alongside various observational constraints at $z=4-10$. All three parameters decrease to higher redshift and overlap with observational constraints (and extrapolations from lower redshift).\n\n\\begin{table}\n\\caption{Best-fit Schechter function parameters for the observed UV luminosity function.}\n\\label{tab:parameters_redshift}\n\\begin{tabular}{cccc}\n\\hline\n$z$ & $M^{*}$ & $\\log_{10}(\\phi^{*}/{\\rm Mpc^{-3}})$ & $\\alpha$ \\\\\n\\hline\n13 & -19.91 & -5.71 & -2.54\\\\\n12 & -19.92 & -5.09 & -2.35\\\\\n11 & -20.17 & -4.79 & -2.27\\\\\n10 & -20.69 & -4.70 & -2.27\\\\\n9 & -20.68 & -4.20 & -2.10\\\\\n8 & -20.93 & -3.92 & -2.04\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/parameters_redshift.pdf}\n\\caption{Redshift evolution of the best-fit Schechter function parameters of the simulated observed UV LF.}\n\\label{fig:parameters_redshift}\n\\end{figure}\n\n4.1 Modelling Galaxy Photometry\n\\subsection{Modelling Galaxy Photometry}\\label{sec:photometric.modelling}\n\nWe build up the spectral energy distribution (SED) of each galaxy on a star particle by star particle basis. Firstly, we assign a pure stellar SED to each particle on the basis of its mass, age, and chemical composition. We adopt the {\\sc Pegase.2} \\citep{pegase} stellar population synthesis (SPS) model combined with a \\citet{Chabrier2003} initial mass function (IMF) over $0.1-100\\,{\\rm M_{\\odot}}$. The emission from each star particle is then modified to take into account reprocessing by both dust and gas as described below. \n\n\\subsubsection{Nebular Continuum and Line Emission Modelling}\\label{sec:photometric.modelling.nebular}\n\nWe use the {\\sc cloudy} photoionisation code to model the effect of reprocessing by H{\\sc ii} regions surrounding stars. 
The hydrogen density is chosen to be $100\\,{\\rm cm^{-3}}$ and the chemical composition of the gas is set to the metallicity of the star particle scaled by solar abundances. We assume a uniform covering fraction of $0.85$ thereby leaving sufficient LyC photons to reionise the Universe. \n\nThe implications of the choice of SPS model, initial mass function, and Lyman continuum (LyC) escape fraction on the spectral energy distributions are discussed in more detail in \\citet{Wilkins2016b} and \\citet{Wilkins2016c}. While these assumptions can result in large systematic effects, the effect on the rest frame far-UV ($150\\,{\\rm nm}$) is relatively small as nebular emission contributes only around $10\\%$ of the total luminosity and variations due to the choice of model typically changing luminosities by $<0.1\\,{\\rm dex}$ \\citep{Wilkins2016c}.\n\n\\subsubsection{Dust Attenuation}\\label{sec:photometric.modelling.dust}\n\nTo estimate the dust attenuation in \\bluetides\\ we employ a scheme which links the metal density integrated along parallel lines of sight to the dust optical depth $\\tau$. \n\nIn this model the rest-frame $V$-band ($0.55\\,{\\rm \\mu m}$) dust optical depth ($\\tau_{V}(x, y, z)$) is,\n\\begin{equation}\n\\tau_{V}(x, y, z) = \\kappa \\Sigma (x, y, z) = \\int_{z'=0}^{z} \\kappa \\rho_\\mathrm{metal}(x, y, z')\\,{\\rm d}z',\n\\end{equation}\nwhere $\\rho_\\mathrm{metal}(x, y, z')$ is the metal density, and we have chosen the $z$ direction to be the line of sight direction. $\\kappa$ is a normalization factor, a free parameter that is tuned to match the model with the observed $z\\approx 8$ luminosity function (see \\S\\ref{sec:photometric.UVLF}).\n\nFirst, the metal mass is painted to a 3-dimensional image with resolution of $0.2 h^{-1}\\,{\\rm ckpc}$. 
The image is passed through a Gaussian smoothing filter with a width of $r_s = 0.5 h^{-1}\,{\rm ckpc}$, the most probable smoothing length of gas particles that have collapsed into galaxies in the simulation. The parameter $r_s$ is also degenerate with $\kappa$. Secondly, we compute the cumulative sum of the image along the line of sight direction ($z$). After this procedure, the image contains the surface density of metals ($\Sigma(x, y, z)$) that contributes to the attenuation at any spatial location. Finally, we read off the values from the image at the location of each star particle. \n\nWe employ an individual stellar cluster (ISC) approximation in the implementation. The star clusters are identified with a Friends-of-Friends algorithm with a linking length of $l = 2.0 h^{-1}\,{\rm ckpc}$. For each star cluster, we perform the above calculation for metal mass in the bounding box of the star cluster with a buffer region of $b=2.0 h^{-1}\,{\rm ckpc}$. We tested that the approximation is stable to reasonable changes in the linking length $l$ or the size of the buffer region. The ISC approximation allows us to focus the computational resources on locations in the simulation where the dust attenuation is most relevant. At the high redshifts ($z \ge 8$) simulated by \bluetides\ the ISC approximation provides a significant computational advantage compared with a full volume ray tracing approach. At such high redshift, the attenuation due to chance-aligned galaxies can be neglected because the abundance of galaxies with very high metallicities is low. \n\nThe optical depth at an arbitrary wavelength $\lambda$ is related to the $V$-band optical depth through an attenuation curve. 
We parameterise the attenuation curve as a power-law with index $\gamma$,\n\begin{equation}\n\tau_\lambda = \tau_V\times\left(\frac{\lambda}{0.55\,{\rm\mu m}}\right)^{\gamma}.\n\end{equation}\nFor $\gamma$ we choose a value of $-1$, yielding an attenuation curve slightly flatter in the UV than the Pei et al. (1992) Small Magellanic Cloud curve, but not as flat as the \citet{Calzetti2000} ``Starburst'' curve. \n\nThe predicted surface density of metals is strongly correlated with the stellar mass and intrinsic luminosity. This results in a strong trend of the average UV attenuation with both the stellar mass and intrinsic UV luminosity, albeit with considerable scatter (see Fig. \ref{fig:L_fesc}). At a fixed stellar mass the attenuation is predicted to decrease slightly to higher redshift. \n\nHowever, dust and metal formation, while linked to some degree, are not expected to trace one another exactly \citep[see modelling by][]{Mancini2015}. Consequently, such a simple model is unlikely to fully capture the redshift and luminosity dependence of dust attenuation, especially at the highest redshifts where the formation of dust in AGB stars or in-situ in the ISM has not had time to occur. This may then suggest that our dust model produces too much attenuation at the highest redshifts. Indeed, this is perhaps hinted at by the recent discovery \citep{Oesch2016} of an exceptionally bright ($M\approx -22$) and blue (and therefore likely dust-poor) galaxy at $z\approx 11$. While the discovery of this object is consistent with predictions based on intrinsic luminosities \citep{Waters2016a}, it would be very unexpected based on dust attenuated luminosities predicted using our model. 
Given the uncertainties introduced by the dust model, particularly at $z>10$, throughout this work we consider predictions based on both the intrinsic luminosities and the dust attenuated luminosities.\n\nA consequence of the desire to fit observations of the $z\sim 8$ far-UV luminosity function is the prediction that there exist a number of massive, heavily dust-obscured galaxies. The existence of these galaxies, which would not appear in Lyman-break selected samples, explains the discrepancy between predictions from \bluetides\ and current observational constraints on the galaxy stellar mass function and star formation rate distribution function (see \S\ref{sec:physical.GSMF} and \S\ref{sec:physical.SFRDF}). Unfortunately, the relative faintness and rarity of these objects means they are unlikely to be identified in current IR/sub-mm observations. However, massive heavily obscured intensely star forming galaxies have been identified at lower redshift \citep[e.g. HFLS3 at $z=6.34$:][]{Riechers2013} suggesting that such objects can and do exist in the relatively early Universe.\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/L_fesc_StellarMass.pdf}\n\caption{The effective escape fraction of far-UV ($150\,{\rm nm}$) photons as a function of stellar mass. The median far-UV escape fractions in stellar mass bins predicted by \bluetides\ are tabulated in Table \ref{tab:L_fesc}.}\n\label{fig:L_fesc}\n\end{figure}\n\n\n\n\n", "Descriptive_question1": "What is the focus of the data presented in table_6 regarding photon escape?", "Descriptive_question2": "Which type of photons are analyzed in table_6?", "Reasoning_question1": "How does the median far-UV photon escape fraction vary across different stellar mass bins as shown in table_6, and what might this indicate about the relationship between stellar mass and photon escape?", "Reasoning_question2": "Based on the data in table_6, why might higher stellar mass bins show different escape fractions compared to lower stellar mass bins, considering the effects of dust attenuation?", "Descriptive_answer1": "photon escape", "Descriptive_answer2": "far-UV", "Reasoning_answer1": "Looking at table_6, the median far-UV photon escape fraction decreases as stellar mass increases across the bins. For example, at redshift z=8.0, the escape fraction is 0.89 for the stellar mass bin 8.0-8.25 (log10(M*/M⊙)) and drops to 0.12 for the bin 10.0-10.25. This trend is consistent across other redshifts, such as z=10.0, where the fraction decreases from 0.91 to 0.18 as stellar mass increases. This indicates a negative correlation between stellar mass and photon escape fraction, suggesting that galaxies with higher stellar mass are less effective at allowing far-UV photons to escape, possibly due to increased dust or other attenuating factors in more massive galaxies.", "Reasoning_answer2": "Analyzing table_6, higher stellar mass bins consistently show lower far-UV photon escape fractions compared to lower stellar mass bins. For instance, at z=9.0, the escape fraction is 0.9 for the lowest mass bin (8.0-8.25) and only 0.14 for the highest bin (9.75-10.0). 
This pattern suggests that dust attenuation, which is noted in the context to be strongly correlated with stellar mass, plays a significant role. In more massive galaxies, the higher metal content likely leads to greater dust formation, which absorbs or scatters far-UV photons, reducing the escape fraction. In contrast, lower mass galaxies may have less dust due to lower metallicity, allowing a higher fraction of photons to escape. This relationship highlights how dust attenuation scales with stellar mass, impacting the observed escape fractions." }, { "paper_id": "1704.00954.json", "table_id": "table_7", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the median intrinsic and observed far-UV mass-to-light ratio used in Fig. \\ref{fig:MTOL}.}\n\\label{tab:MTOL}\n\\begin{tabular}{cccccccccc}\n\\hline\n & \\multicolumn{9}{c}{$\\log_{10}(M_*/{\\rm M_{\\odot}})=$} \\\\\n$z$ & $8.0$-$8.25$ & $8.25$-$8.5$ & $8.5$-$8.75$ & $8.75$-$9.0$ & $9.0$-$9.25$ & $9.25$-$9.50$ & $9.50$-$9.75$ & $9.75$-$10.0$ & $10.0$-$10.25$ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median intrinsic mass-to-light ratio} - $\\log_{10}[(M_{*}/{\\rm M_{\\odot}})/(L_{\\nu, {\\rm fuv}}/{\\rm erg s^{-1} Hz^{-1}})]$} \\\\\n\\hline\n 13.0 & $ -20.42 $ & $ -20.44 $ & - & - & - & - & - & - & -\\\\\n 12.0 & $ -20.38 $ & $ -20.34 $ & $ -20.37 $ & $ -20.37 $ & - & - & - & - & -\\\\\n 11.0 & $ -20.3 $ & $ -20.3 $ & $ -20.31 $ & $ -20.33 $ & $ -20.37 $ & - & - & - & -\\\\\n 10.0 & $ -20.25 $ & $ -20.25 $ & $ -20.25 $ & $ -20.24 $ & $ -20.23 $ & $ -20.23 $ & $ -20.21 $ & - & -\\\\\n 9.0 & $ -20.15 $ & $ -20.16 $ & $ -20.17 $ & $ -20.17 $ & $ -20.17 $ & $ -20.17 $ & $ -20.18 $ & $ -20.17 $ & -\\\\\n 8.0 & $ -20.08 $ & $ -20.09 $ & $ -20.09 $ & $ -20.09 $ & $ -20.08 $ & $ -20.08 $ & $ -20.07 $ & $ -20.06 $ & $ -20.08 $ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median observed mass-to-light ratio} - $\\log_{10}[(M_{*}/{\\rm M_{\\odot}})/(L_{\\nu, {\\rm fuv}}/{\\rm erg s^{-1} Hz^{-1}})]$} 
\\\\\n\\hline\n 13.0 & $-20.37 $ & $-20.33 $ & - & - & - & - & - & - & -\\\\\n 12.0 & $-20.32 $ & $-20.23 $ & $-20.15 $ & $-20.0 $ & - & - & - & - & -\\\\\n 11.0 & $-20.25 $ & $-20.19 $ & $-20.1 $ & $-19.94 $ & $-19.81 $ & - & - & - & -\\\\\n 10.0 & $-20.2 $ & $-20.15 $ & $-20.07 $ & $-19.93 $ & $-19.77 $ & $-19.62 $ & $-19.5 $ & - & -\\\\\n 9.0 & $-20.1 $ & $-20.05 $ & $-19.97 $ & $-19.86 $ & $-19.72 $ & $-19.56 $ & $-19.44 $ & $-19.32 $ & -\\\\\n 8.0 & $-20.03 $ & $-19.98 $ & $-19.91 $ & $-19.8 $ & $-19.67 $ & $-19.51 $ & $-19.38 $ & $-19.25 $ & $-19.21 $ \\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the median intrinsic and observed far-UV mass-to-light ratio used in Fig. \\ref{fig:MTOL}.", "label": "tab:MTOL", "section_info": "4 Photometric Properties\n\\section{Photometric Properties}\\label{sec:photometric}\n\n\\subsection{Modelling Galaxy Photometry}\\label{sec:photometric.modelling}\n\nWe build up the spectral energy distribution (SED) of each galaxy on a star particle by star particle basis. Firstly, we assign a pure stellar SED to each particle on the basis of its mass, age, and chemical composition. We adopt the {\\sc Pegase.2} \\citep{pegase} stellar population synthesis (SPS) model combined with a \\citet{Chabrier2003} initial mass function (IMF) over $0.1-100\\,{\\rm M_{\\odot}}$. The emission from each star particle is then modified to take into account reprocessing by both dust and gas as described below. \n\n\\subsubsection{Nebular Continuum and Line Emission Modelling}\\label{sec:photometric.modelling.nebular}\n\nWe use the {\\sc cloudy} photoionisation code to model the effect of reprocessing by H{\\sc ii} surrounding stars. The hydrogen density is chosen to be $100\\,{\\rm cm^{-3}}$ and the chemical composition of the gas is set to the metallicity of the star particle scaled by solar abundances. We assume a uniform covering fraction of $0.85$ thereby leaving sufficient LyC photons to reionise the Universe. 
\n\nThe implications of the choice of SPS model, initial mass function, and Lyman continuum (LyC) escape fraction on the spectral energy distributions are discussed in more detail in \citet{Wilkins2016b} and \citet{Wilkins2016c}. While these assumptions can result in large systematic effects, the effect on the rest frame far-UV ($150\,{\rm nm}$) is relatively small as nebular emission contributes only around $10\%$ of the total luminosity, and variations due to the choice of model typically change luminosities by $<0.1\,{\rm dex}$ \citep{Wilkins2016c}.\n\n\subsubsection{Dust Attenuation}\label{sec:photometric.modelling.dust}\n\nTo estimate the dust attenuation in \bluetides\ we employ a scheme which links the metal density integrated along parallel lines of sight to the dust optical depth $\tau$. \n\nIn this model the rest-frame $V$-band ($0.55\,{\rm \mu m}$) dust optical depth ($\tau_{V}(x, y, z)$) is,\n\begin{equation}\n\tau_{V}(x, y, z) = \kappa \Sigma (x, y, z) = \int_{z'=0}^{z} \kappa \rho_\mathrm{metal}(x, y, z')\,{\rm d}z',\n\end{equation}\nwhere $\rho_\mathrm{metal}(x, y, z')$ is the metal density, and we have chosen the $z$ direction to be the line of sight direction. $\kappa$ is a normalization factor, a free parameter that is tuned to match the model with the observed $z\approx 8$ luminosity function (see \S\ref{sec:photometric.UVLF}).\n\nFirst, the metal mass is painted to a 3-dimensional image with resolution of $0.2 h^{-1}\,{\rm ckpc}$. The image is passed through a Gaussian smoothing filter with a width of $r_s = 0.5 h^{-1}\,{\rm ckpc}$, the most probable smoothing length of gas particles that have collapsed into galaxies in the simulation. The parameter $r_s$ is also degenerate with $\kappa$. Secondly, we compute the cumulative sum of the image along the line of sight direction ($z$). 
After this procedure, the image contains the surface density of metals ($\Sigma(x, y, z)$) that contributes to the attenuation at any spatial location. Finally, we read off the values from the image at the location of each star particle. \n\nWe employ an individual stellar cluster (ISC) approximation in the implementation. The star clusters are identified with a Friends-of-Friends algorithm with a linking length of $l = 2.0 h^{-1}\,{\rm ckpc}$. For each star cluster, we perform the above calculation for metal mass in the bounding box of the star cluster with a buffer region of $b=2.0 h^{-1}\,{\rm ckpc}$. We tested that the approximation is stable to reasonable changes in the linking length $l$ or the size of the buffer region. The ISC approximation allows us to focus the computational resources on locations in the simulation where the dust attenuation is most relevant. At the high redshifts ($z \ge 8$) simulated by \bluetides\ the ISC approximation provides a significant computational advantage compared with a full volume ray tracing approach. At such high redshift, the attenuation due to chance-aligned galaxies can be neglected because the abundance of galaxies with very high metallicities is low. \n\nThe optical depth at an arbitrary wavelength $\lambda$ is related to the $V$-band optical depth through an attenuation curve. We parameterise the attenuation curve as a power-law with index $\gamma$,\n\begin{equation}\n\tau_\lambda = \tau_V\times\left(\frac{\lambda}{0.55\,{\rm\mu m}}\right)^{\gamma}.\n\end{equation}\nFor $\gamma$ we choose a value of $-1$, yielding an attenuation curve slightly flatter in the UV than the Pei et al. (1992) Small Magellanic Cloud curve, but not as flat as the \citet{Calzetti2000} ``Starburst'' curve. \n\nThe predicted surface density of metals is strongly correlated with the stellar mass and intrinsic luminosity. 
This results in a strong trend of the average UV attenuation with both the stellar mass and intrinsic UV luminosity, albeit with considerable scatter (see Fig. \ref{fig:L_fesc}). At a fixed stellar mass the attenuation is predicted to decrease slightly to higher redshift. \n\nHowever, dust and metal formation, while linked to some degree, are not expected to trace one another exactly \citep[see modelling by][]{Mancini2015}. Consequently, such a simple model is unlikely to fully capture the redshift and luminosity dependence of dust attenuation, especially at the highest redshifts where the formation of dust in AGB stars or in-situ in the ISM has not had time to occur. This may then suggest that our dust model produces too much attenuation at the highest redshifts. Indeed, this is perhaps hinted at by the recent discovery \citep{Oesch2016} of an exceptionally bright ($M\approx -22$) and blue (and therefore likely dust-poor) galaxy at $z\approx 11$. While the discovery of this object is consistent with predictions based on intrinsic luminosities \citep{Waters2016a}, it would be very unexpected based on dust attenuated luminosities predicted using our model. Given the uncertainties introduced by the dust model, particularly at $z>10$, throughout this work we consider predictions based on both the intrinsic luminosities and the dust attenuated luminosities.\n\nA consequence of the desire to fit observations of the $z\sim 8$ far-UV luminosity function is the prediction that there exist a number of massive, heavily dust-obscured galaxies. The existence of these galaxies, which would not appear in Lyman-break selected samples, explains the discrepancy between predictions from \bluetides\ and current observational constraints on the galaxy stellar mass function and star formation rate distribution function (see \S\ref{sec:physical.GSMF} and \S\ref{sec:physical.SFRDF}). 
Unfortunately, the relative faintness and rarity of these objects means they are unlikely to be identified in current IR/sub-mm observations. However, massive heavily obscured intensely star forming galaxies have been identified at lower redshift \\citep[e.g. HFLS3 at $z=6.34$:][]{Riechers2013} suggesting that such objects can and do exist in the relatively early Universe.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/L_fesc_StellarMass.pdf}\n\\caption{The effective escape fraction of far-UV ($150\\,{\\rm nm}$) photons as a function of stellar mass. The median far-UV escape fractions in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:L_fesc}.}\n\\label{fig:L_fesc}\n\\end{figure}\n\n\n\n\n\\subsection{Spectral Energy Distributions}\\label{sec:photometric.SED}\n\n\nThe resulting average intrinsic (including nebular continuum and line emission) and observed specific\\footnote{That is, expressed per unit stellar mass.} spectral energy distributions are shown, for three mass bins at $z=8$, in Fig. \\ref{fig:SED_M}. \n\nThe average intrinsic SEDs are generally very blue, reflecting the ongoing star formation activity, young ages, and low metallicities in the sample. While the shape of the SEDs in each mass bin is very similar, the most massive galaxies have slightly redder SEDs reflecting the higher metallicity of the stellar populations. A more detailed analysis of the pure stellar and intrinsic SEDs is contained in \\citet{Wilkins2016c}. \n\nAs noted in the previous section, the most massive galaxies also suffer much higher attenuation due to dust resulting in redder observed SEDs and higher mass-to-light ratios. The trend of higher mass-to-light ratios at higher stellar mass can be seen more clearly in Fig. \\ref{fig:MTOL}. Fig. \\ref{fig:MTOL} also shows the evolution with redshift demonstrating that stellar mass-to-light ratios increase to lower redshift. 
This predominantly reflects the increasing age of the stellar populations to lower redshift.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/SED_M.pdf}\n\\caption{The average observed and unattenuated SEDs (expressed per unit stellar mass) in three mass bins at $z=8$.}\n\\label{fig:SED_M}\n\\end{figure*}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/MTOL.pdf}\n\\caption{The intrinsic and dust attenuated far-UV mass-to-light ratios as function of stellar mass and redshift. The median intrinsic and observed far-UV mass-to-light ratios in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:MTOL}.}\n\\label{fig:MTOL}\n\\end{figure}\n\n\n\\subsection{Luminosity Functions}\\label{sec:photometric.UVLF}\n\nThe luminosity function (LF) is an incredibly useful statistical description of the galaxy population. In Fig. \\ref{fig:UVLF} we present both the intrinsic and dust attenuated far-UV luminosity functions at $z=8\\to 15$. In Fig. \\ref{fig:UVLF_multi} we show the intrinsic and attenuated UV LFs at $z\\in\\{8,9,10\\}$ together with current observational constraints. Both the intrinsic and observed luminosity functions demonstrate the rapid expected build up of the galaxy population at high-redshift. For example, the number of $M=-19$ objects increases by a factor of around 1000 from $z=15\\to 8$. The rapid decline of the LF to high-redshift poses challenges for the observational identification of galaxy populations at $z>12$ even using \\jwst. This is explored in more detail in Wilkins et al. {\\em submitted} where we make predictions for the surface density of sources at $z>8$ including the effects of field-to-field, or cosmic, variance.\n\nThe observed LF is generally similar to the intrinsic LF at faint luminosities ($M>-20$). At brighter luminosities there is stronger steepening of the LF reflecting the increasing strength of dust attenuation. 
As noted earlier, our dust model is tuned to match the $z\approx 8$ observed UV LF. However, it is important to stress that this only makes a significant difference at relatively bright luminosities ($M<-20$); at fainter luminosities there simply is not the surface density of metals (and therefore inferred dust) to yield significant attenuation. The excellent fit at fainter luminosities is then simply a consequence of the physics employed in the model and not a result of tuning using the dust model. However, while the faint end of the LF is unaffected by our choice of dust model, it can be systematically affected by the choice of initial mass function (and to a lesser extent choice of SPS model); see \citet{Wilkins2016c}. Adopting an IMF yielding more low-mass stars than our assumed IMF \citep[e.g. a pure][IMF extended down to $0.1\,{\rm M_{\odot}}$]{Salpeter1955} would uniformly reduce the luminosities of our galaxies, shifting the LF to fainter luminosities.\n\n\begin{figure*}\n\centering\n\includegraphics[width=40pc]{figures/UVLF.pdf}\n\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\,{\rm nm}$) luminosity functions. Observations at $z\approx 8$ and $10.4$ from Bouwens et al.\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. Tabulated values of the \bluetides\ predictions are given in Table \ref{tab:UVLF}.}\n\label{fig:UVLF}\n\end{figure*}\n\n\begin{figure*}\n\centering\n\includegraphics[width=40pc]{figures/UVLF_multi.pdf}\n\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\,{\rm nm}$) luminosity functions. Observations at $z\approx 8$ and $10.4$ from Bouwens et al.\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. 
Tabulated values of the \bluetides\ predictions are given in Table \ref{tab:UVLF}.}\n\label{fig:UVLF_multi}\n\end{figure*}\n\nWe also fit the dust attenuated far-UV LF by a Schechter function and find that the function provides a good overall fit to the shape of the LF, as shown in Fig. \ref{fig:UVLF_multi} at $z\in\{8,9,10\}$. The evolution of the Schechter function parameters is shown in Fig. \ref{fig:parameters_redshift} with the parameters listed in Table \ref{tab:parameters_redshift} alongside various observational constraints at $z=4-10$. All three parameters decrease to higher redshift and overlap with observational constraints (and extrapolations from lower-redshift).\n\n\begin{table}\n\caption{Best fit Schechter function parameters for the observed UV luminosity function.}\n\label{tab:parameters_redshift}\n\begin{tabular}{cccc}\n\hline\n$z$ & $M^{*}$ & $\log_{10}(\phi^{*}/{\rm Mpc^{-3}})$ & $\alpha$ \\\\\n\hline\n13 & -19.91 & -5.71 & -2.54\\\\\n12 & -19.92 & -5.09 & -2.35\\\\\n11 & -20.17 & -4.79 & -2.27\\\\\n10 & -20.69 & -4.70 & -2.27\\\\\n9 & -20.68 & -4.20 & -2.10\\\\\n8 & -20.93 & -3.92 & -2.04\\\\\n\hline\n\end{tabular}\n\end{table}\n\n\begin{figure}\n\centering\n\includegraphics[width=20pc]{figures/parameters_redshift.pdf}\n\caption{Redshift evolution of the best fit Schechter function parameters of the simulated observed UV LF.}\n\label{fig:parameters_redshift}\n\end{figure}\n\n\n", "Descriptive_question1": "What type of mass-to-light ratios are presented in table_7?", "Descriptive_question2": "Which simulation predicts the values in table_7?", "Reasoning_question1": "Why might the observed far-UV mass-to-light ratios in table_7 be higher than the intrinsic ratios for the most massive galaxies?", "Reasoning_question2": "How does the trend of median far-UV mass-to-light ratios in table_7 reflect the evolution of stellar populations across different redshifts?", "Descriptive_answer1": "far-UV", "Descriptive_answer2": "bluetides", "Reasoning_answer1": "The observed far-UV mass-to-light ratios are likely higher than the intrinsic ratios for the most massive galaxies due to the effect of dust attenuation. As noted in the context, massive galaxies suffer higher dust attenuation, which reduces the observed far-UV luminosity compared to the intrinsic luminosity. This reduction in luminosity, while the stellar mass remains the same, results in a higher mass-to-light ratio for the observed values. The table data supports this, showing consistently higher observed ratios compared to intrinsic ones, especially at higher stellar masses across various redshifts.", "Reasoning_answer2": "The trend of median far-UV mass-to-light ratios in table_7, where ratios increase to lower redshifts, reflects the evolution of stellar populations primarily through aging. At higher redshifts (e.g., z=13), galaxies are younger with more active star formation, leading to brighter far-UV luminosity and lower mass-to-light ratios. As redshift decreases (e.g., to z=8), stellar populations age, and the proportion of older, less luminous stars increases, reducing overall far-UV luminosity per unit mass and thus increasing the mass-to-light ratio. 
Additionally, as mentioned in the context, dust attenuation may contribute more at lower redshifts due to increased metal and dust content over time, further elevating observed ratios. This evolutionary trend is evident in the table data, where both intrinsic and observed ratios generally increase from higher to lower redshifts for a given stellar mass bin." }, { "paper_id": "1704.00954.json", "table_id": "table_8", "table_content": "\\begin{table*}\n\\caption{Tabulated values of the intrinsic and observed (dust-attenuated) far-UV luminosity functions used in Figure \\ref{fig:UVLF}.}\n\\label{tab:UVLF}\n\\begin{tabular}{ccccccc}\n\\hline\n & \\multicolumn{6}{c}{$\\log_{10}(\\phi/{\\rm mag^{-1}Mpc^{-3}})$} \\\\\n \\hline\n$M_{\\rm fuv}$ & $z=13$ & $z=12$ & $z=11$ & $z=10$ & $z=9$ & $z=8$ \\\\\n\\hline\n \\multicolumn{7}{c}{{\\bf intrinsic far-UV luminosity function}} \\\\\n\\hline\n $ -25.0$ - $-24.5 $ & - & - & - & - & - & $-6.71$\\\\\n $ -24.5$ - $-24.0 $ & - & - & - & - & - & $-6.26$\\\\\n $ -24.0$ - $-23.5 $ & - & - & - & - & $-6.54$ & $-5.89$\\\\\n $ -23.5$ - $-23.0 $ & - & - & - & - & $-6.19$ & $-5.46$\\\\\n $ -23.0$ - $-22.5 $ & - & - & - & $-6.37$ & $-5.71$ & $-5.13$\\\\\n $ -22.5$ - $-22.0 $ & - & - & $-6.63$ & $-6.04$ & $-5.36$ & $-4.77$\\\\\n $ -22.0$ - $-21.5 $ & - & - & $-6.35$ & $-5.66$ & $-5.05$ & $-4.48$\\\\\n $ -21.5$ - $-21.0 $ & - & $-6.63$ & $-5.92$ & $-5.29$ & $-4.71$ & $-4.21$\\\\\n $ -21.0$ - $-20.5 $ & $-6.82$ & $-6.15$ & $-5.54$ & $-4.94$ & $-4.44$ & $-3.96$\\\\\n $ -20.5$ - $-20.0 $ & $-6.43$ & $-5.77$ & $-5.22$ & $-4.64$ & $-4.17$ & $-3.71$\\\\\n $ -20.0$ - $-19.5 $ & $-5.96$ & $-5.37$ & $-4.87$ & $-4.33$ & $-3.91$ & $-3.49$\\\\\n $ -19.5$ - $-19.0 $ & $-5.6$ & $-5.04$ & $-4.56$ & $-4.04$ & $-3.66$ & $-3.26$\\\\\n $ -19.0$ - $-18.5 $ & $-5.18$ & $-4.66$ & $-4.23$ & $-3.77$ & $-3.42$ & $-3.05$\\\\\n $ -18.5$ - $-18.0 $ & $-4.84$ & $-4.33$ & $-3.92$ & $-3.5$ & $-3.19$ & $-2.84$\\\\\n $ -18.0$ - $-17.5 $ & $-4.46$ & $-3.99$ & $-3.63$ & 
$-3.25$ & $-2.98$ & $-2.64$\\\\\n $ -17.5$ - $-17.0 $ & $-4.12$ & $-3.69$ & $-3.37$ & $-3.02$ & $-2.77$ & $-2.44$\\\\\n\\hline\n \\multicolumn{7}{c}{{\\bf observed (dust-corrected) far-UV luminosity function}} \\\\\n\\hline\n $ -23.0$ - $-22.5 $ & - & - & - & - & - & $-6.89$\\\\\n $ -22.5$ - $-22.0 $ & - & - & - & - & $-6.61$ & $-6.07$\\\\\n $ -22.0$ - $-21.5 $ & - & - & - & $-6.59$ & $-6.06$ & $-5.35$\\\\\n $ -21.5$ - $-21.0 $ & - & - & $-6.52$ & $-5.91$ & $-5.37$ & $-4.8$\\\\\n $ -21.0$ - $-20.5 $ & - & $-6.49$ & $-5.91$ & $-5.28$ & $-4.76$ & $-4.27$\\\\\n $ -20.5$ - $-20.0 $ & $-6.59$ & $-5.9$ & $-5.33$ & $-4.73$ & $-4.29$ & $-3.85$\\\\\n $ -20.0$ - $-19.5 $ & $-5.98$ & $-5.39$ & $-4.88$ & $-4.35$ & $-3.93$ & $-3.5$\\\\\n $ -19.5$ - $-19.0 $ & $-5.6$ & $-5.01$ & $-4.55$ & $-4.04$ & $-3.65$ & $-3.25$\\\\\n $ -19.0$ - $-18.5 $ & $-5.17$ & $-4.66$ & $-4.23$ & $-3.76$ & $-3.41$ & $-3.03$\\\\\n $ -18.5$ - $-18.0 $ & $-4.84$ & $-4.32$ & $-3.92$ & $-3.49$ & $-3.18$ & $-2.82$\\\\\n $ -18.0$ - $-17.5 $ & $-4.45$ & $-3.99$ & $-3.63$ & $-3.25$ & $-2.97$ & $-2.63$\\\\\n $ -17.5$ - $-17.0 $ & $-4.12$ & $-3.69$ & $-3.37$ & $-3.02$ & $-2.77$ & $-2.44$\\\\\n\\hline\n\\end{tabular}\n\\end{table*}", "caption": "Tabulated values of the intrinsic and observed (dust-attenuated) far-UV luminosity functions used in Figure \\ref{fig:UVLF}.", "label": "tab:UVLF", "section_info": "4 Photometric Properties\n\\section{Photometric Properties}\\label{sec:photometric}\n\n\\subsection{Modelling Galaxy Photometry}\\label{sec:photometric.modelling}\n\nWe build up the spectral energy distribution (SED) of each galaxy on a star particle by star particle basis. Firstly, we assign a pure stellar SED to each particle on the basis of its mass, age, and chemical composition. We adopt the {\\sc Pegase.2} \\citep{pegase} stellar population synthesis (SPS) model combined with a \\citet{Chabrier2003} initial mass function (IMF) over $0.1-100\\,{\\rm M_{\\odot}}$. 
The emission from each star particle is then modified to take into account reprocessing by both dust and gas as described below. \n\n\\subsubsection{Nebular Continuum and Line Emission Modelling}\\label{sec:photometric.modelling.nebular}\n\nWe use the {\\sc cloudy} photoionisation code to model the effect of reprocessing by H{\\sc ii} regions surrounding stars. The hydrogen density is chosen to be $100\\,{\\rm cm^{-3}}$ and the chemical composition of the gas is set to the metallicity of the star particle scaled by solar abundances. We assume a uniform covering fraction of $0.85$, thereby leaving sufficient LyC photons to reionise the Universe. \n\nThe implications of the choice of SPS model, initial mass function, and Lyman continuum (LyC) escape fraction on the spectral energy distributions are discussed in more detail in \\citet{Wilkins2016b} and \\citet{Wilkins2016c}. While these assumptions can result in large systematic effects, the effect on the rest-frame far-UV ($150\\,{\\rm nm}$) is relatively small, as nebular emission contributes only around $10\\%$ of the total luminosity and variations due to the choice of model typically change luminosities by $<0.1\\,{\\rm dex}$ \\citep{Wilkins2016c}.\n\n\\subsubsection{Dust Attenuation}\\label{sec:photometric.modelling.dust}\n\nTo estimate the dust attenuation in \\bluetides\\ we employ a scheme which links the metal density integrated along parallel lines of sight to the dust optical depth $\\tau$. \n\nIn this model the rest-frame $V$-band ($0.55\\,{\\rm \\mu m}$) dust optical depth ($\\tau_{V}(x, y, z)$) is,\n\\begin{equation}\n\\tau_{V}(x, y, z) = \\kappa \\Sigma (x, y, z) = \\int_{z'=0}^{z} \\kappa \\rho_\\mathrm{metal}(x, y, z')\\,{\\rm d}z',\n\\end{equation}\nwhere $\\rho_\\mathrm{metal}(x, y, z')$ is the metal density, and we have chosen the $z$ direction to be the line of sight direction. 
$\\kappa$ is a normalization factor, a free parameter that is tuned to match the model with the observed $z\\approx 8$ luminosity function (see \\S\\ref{sec:photometric.UVLF}).\n\nFirst, the metal mass is painted onto a 3-dimensional image with a resolution of $0.2 h^{-1}\\,{\\rm ckpc}$. The image is passed through a Gaussian smoothing filter with a width of $r_s = 0.5 h^{-1}\\,{\\rm ckpc}$, the most probable smoothing length of gas particles that have collapsed into galaxies in the simulation. The parameter $r_s$ is also degenerate with $\\kappa$. Secondly, we compute the cumulative sum of the image along the line of sight direction ($z$). After this procedure, the image contains the surface density of metals ($\\Sigma(x, y, z)$) that contributes to the attenuation at any spatial location. Finally, we read off the values from the image at the location of each star particle. \n\nWe employ an individual stellar cluster (ISC) approximation in the implementation. The star clusters are identified with a Friends-of-Friends algorithm with a linking length of $l = 2.0 h^{-1}\\,{\\rm ckpc}$. For each star cluster, we perform the above calculation for metal mass in the bounding box of the star cluster with a buffer region of $b=2.0 h^{-1}\\,{\\rm ckpc}$. We tested that the approximation is stable to reasonable changes in the linking length $l$ or the size of the buffer region. The ISC approximation allows us to focus computational resources on locations in the simulation where the dust attenuation is most relevant. At the high redshifts ($z \\ge 8$) simulated by \\bluetides\\ the ISC approximation provides a significant computational advantage compared to a full volume ray tracing approach. At such high redshift, the attenuation due to chance-aligned galaxies can be neglected because the abundance of galaxies with very high metallicities is low. 
\n\nThe optical depth at an arbitrary wavelength $\\lambda$ is related to the $V$-band optical depth through an attenuation curve. We parameterise the attenuation curve as a power-law with index $\\gamma$,\n\\begin{equation}\n\\tau_\\lambda = \\tau_V\\times\\left(\\frac{\\lambda}{0.55\\,{\\rm\\mu m}}\\right)^{\\gamma}.\n\\end{equation}\nFor $\\gamma$ we choose a value of $-1$, yielding an attenuation curve slightly flatter in the UV than the Pei et al. (1992) Small Magellanic Cloud curve, but not as flat as the \\citet{Calzetti2000} ``Starburst'' curve. \n\nThe predicted surface density of metals is strongly correlated with the stellar mass and intrinsic luminosity. This results in a strong trend of the average UV attenuation with both the stellar mass and intrinsic UV luminosity, albeit with considerable scatter (see Fig. \\ref{fig:L_fesc}). At a fixed stellar mass the attenuation is predicted to decrease slightly to higher redshift. \n\nHowever, the formation of dust and the formation of metals, while linked to some degree, are not expected to trace one another exactly \\citep[see modelling by][]{Mancini2015}. Consequently, such a simple model is unlikely to fully capture the redshift and luminosity dependence of dust attenuation, especially at the highest redshifts where the formation of dust in AGB stars or in-situ in the ISM has not had time to occur. This may then suggest that our dust model produces too much attenuation at the highest redshifts. Indeed, this is perhaps hinted at by the recent discovery \\citep{Oesch2016} of an exceptionally bright ($M\\approx -22$) and blue (and therefore likely dust-poor) galaxy at $z\\approx 11$. While the discovery of this object is consistent with predictions based on intrinsic luminosities \\citep{Waters2016a}, it would be very unexpected based on dust attenuated luminosities predicted using our model. 
Given the uncertainties introduced by the dust model, particularly at $z>10$, throughout this work we consider predictions based on both the intrinsic luminosities and the dust attenuated luminosities.\n\nA consequence of the desire to fit observations of the $z\\sim 8$ far-UV luminosity function is the prediction that there exist a number of massive, heavily dust-obscured galaxies. The existence of these galaxies, which would not appear in Lyman-break selected samples, explains the discrepancy between predictions from \\bluetides\\ and current observational constraints on the galaxy stellar mass function and star formation rate distribution function (see \\S\\ref{sec:physical.GSMF} and \\S\\ref{sec:physical.SFRDF}). Unfortunately, the relative faintness and rarity of these objects means they are unlikely to be identified in current IR/sub-mm observations. However, massive heavily obscured intensely star forming galaxies have been identified at lower redshift \\citep[e.g. HFLS3 at $z=6.34$:][]{Riechers2013} suggesting that such objects can and do exist in the relatively early Universe.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/L_fesc_StellarMass.pdf}\n\\caption{The effective escape fraction of far-UV ($150\\,{\\rm nm}$) photons as a function of stellar mass. The median far-UV escape fractions in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:L_fesc}.}\n\\label{fig:L_fesc}\n\\end{figure}\n\n\n\n\n\\subsection{Spectral Energy Distributions}\\label{sec:photometric.SED}\n\n\nThe resulting average intrinsic (including nebular continuum and line emission) and observed specific\\footnote{That is, expressed per unit stellar mass.} spectral energy distributions are shown, for three mass bins at $z=8$, in Fig. \\ref{fig:SED_M}. \n\nThe average intrinsic SEDs are generally very blue, reflecting the ongoing star formation activity, young ages, and low metallicities in the sample. 
While the shape of the SEDs in each mass bin is very similar, the most massive galaxies have slightly redder SEDs, reflecting the higher metallicity of the stellar populations. A more detailed analysis of the pure stellar and intrinsic SEDs is contained in \\citet{Wilkins2016c}. \n\nAs noted in the previous section, the most massive galaxies also suffer much higher attenuation due to dust, resulting in redder observed SEDs and higher mass-to-light ratios. The trend of higher mass-to-light ratios at higher stellar mass can be seen more clearly in Fig. \\ref{fig:MTOL}. Fig. \\ref{fig:MTOL} also shows the evolution with redshift, demonstrating that stellar mass-to-light ratios increase to lower redshift. This predominantly reflects the increasing age of the stellar populations to lower redshift.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/SED_M.pdf}\n\\caption{The average observed and unattenuated SEDs (expressed per unit stellar mass) in three mass bins at $z=8$.}\n\\label{fig:SED_M}\n\\end{figure*}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/MTOL.pdf}\n\\caption{The intrinsic and dust attenuated far-UV mass-to-light ratios as a function of stellar mass and redshift. The median intrinsic and observed far-UV mass-to-light ratios in stellar mass bins predicted by \\bluetides\\ are tabulated in Table \\ref{tab:MTOL}.}\n\\label{fig:MTOL}\n\\end{figure}\n\n\n\\subsection{Luminosity Functions}\\label{sec:photometric.UVLF}\n\nThe luminosity function (LF) is an incredibly useful statistical description of the galaxy population. In Fig. \\ref{fig:UVLF} we present both the intrinsic and dust attenuated far-UV luminosity functions at $z=8\\to 15$. In Fig. \\ref{fig:UVLF_multi} we show the intrinsic and attenuated UV LFs at $z\\in\\{8,9,10\\}$ together with current observational constraints. Both the intrinsic and observed luminosity functions demonstrate the expected rapid build-up of the galaxy population at high redshift. 
For example, the number of $M=-19$ objects increases by a factor of around 1000 from $z=15\\to 8$. The rapid decline of the LF to high-redshift poses challenges for the observational identification of galaxy populations at $z>12$ even using \\jwst. This is explored in more detail in Wilkins et al. {\\em submitted} where we make predictions for the surface density of sources at $z>8$ including the effects of field-to-field, or cosmic, variance.\n\nThe observed LF is generally similar to the intrinsic LF at faint luminosities ($M>-20$). At brighter luminosities there is a stronger steepening of the LF, reflecting the increasing strength of dust attenuation. As noted earlier, our dust model is tuned to match the $z\\approx 8$ observed UV LF. However, it is important to stress that this only makes a significant difference at relatively bright luminosities ($M<-20$); at fainter luminosities there simply is not the surface density of metals (and therefore inferred dust) to yield significant attenuation. The excellent fit at fainter luminosities is then simply a consequence of the physics employed in the model and not a result of tuning using the dust model. However, while the faint end of the LF is unaffected by our choice of dust model, it can be systematically affected by the choice of initial mass function (and to a lesser extent choice of SPS model); see \\citet{Wilkins2016c}. Adopting an IMF yielding more low-mass stars than our assumed IMF \\citep[e.g. a pure][IMF extended down to $0.1\\,{\\rm M_{\\odot}}$]{Salpeter1955} would uniformly reduce the luminosities of our galaxies, shifting the LF to fainter luminosities.\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/UVLF.pdf}\n\\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\\,{\\rm nm}$) luminosity functions. Observations at $z\\approx 8$ and $10.4$ from Bouwens et al.\\ (2015) are shown for comparison. 
The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. Tabulated quantities of the \\bluetides\\ predictions are given in Table \\ref{tab:UVLF}.}\n\\label{fig:UVLF}\n\\end{figure*}\n\n\\begin{figure*}\n\\centering\n\\includegraphics[width=40pc]{figures/UVLF_multi.pdf}\n\\caption{Intrinsic (left panel) and dust attenuated (observed, right panel) rest-frame far-UV ($150\\,{\\rm nm}$) luminosity functions. Observations at $z\\approx 8$ and $10.4$ from Bouwens et al.\\ (2015) are shown for comparison. The scale of the right-hand axis shows the number of galaxies in each magnitude bin in the simulation. Tabulated quantities of the \\bluetides\\ predictions are given in Table \\ref{tab:UVLF}.}\n\\label{fig:UVLF_multi}\n\\end{figure*}\n\nWe also fit the dust attenuated far-UV LF by a Schechter function and find that the function provides a good overall fit to the shape of the LF, as shown in Fig. \\ref{fig:UVLF_multi} at $z\\in\\{8,9,10\\}$. The evolution of the Schechter function parameters is shown in Fig. \\ref{fig:parameters_redshift} with the parameters listed in Table \\ref{tab:parameters_redshift} alongside various observational constraints at $z=4-10$. 
All three parameters decrease to higher redshift and overlap with observational constraints (and extrapolations from lower-redshift).\n\n\\begin{table}\n\\caption{Best fit Schechter function parameters for the observed UV luminosity function.}\n\\label{tab:parameters_redshift}\n\\begin{tabular}{cccc}\n\\hline\n$z$ & $M^{*}$ & $\\log_{10}(\\phi^{*}/{\\rm Mpc^{-3}})$ & $\\alpha$ \\\\\n\\hline\n13 & -19.91 & -5.71 & -2.54\\\\\n12 & -19.92 & -5.09 & -2.35\\\\\n11 & -20.17 & -4.79 & -2.27\\\\\n10 & -20.69 & -4.70 & -2.27\\\\\n9 & -20.68 & -4.20 & -2.10\\\\\n8 & -20.93 & -3.92 & -2.04\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/parameters_redshift.pdf}\n\\caption{Redshift evolution of the best fit Schechter function parameters of the simulated observed UV LF.}\n\\label{fig:parameters_redshift}\n\\end{figure}\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the redshift value at which the intrinsic and dust attenuated far-UV luminosity functions are presented in table_8?", "Descriptive_question2": "What type of luminosity functions are tabulated in table_8?", "Reasoning_question1": "How does the dust attenuation affect the shape of the far-UV luminosity function at brighter luminosities as shown in table_8?", "Reasoning_question2": "What can be inferred about the build-up of the galaxy population at high redshift from the data in table_8?", "Descriptive_answer1": "z=8", "Descriptive_answer2": "far-UV", "Reasoning_answer1": "Dust attenuation significantly impacts the shape of the far-UV luminosity function at brighter luminosities. Looking at the data in the table, for luminosities brighter than M = -20, the observed (dust-attenuated) values of log(φ) are consistently lower than the intrinsic values. For example, at z=8 and M between -22.5 and -22.0, the intrinsic value is -4.77, while the observed value is -6.07, indicating a substantial reduction due to dust. 
This trend holds across redshifts, showing that dust causes a stronger steepening of the luminosity function at the brighter end, as more luminous galaxies tend to have higher dust content, leading to greater attenuation of their far-UV light.", "Reasoning_answer2": "From the data in the table, we can infer a rapid build-up of the galaxy population at high redshift. Examining the values of log(φ) across redshifts from z=13 to z=8, there is a consistent increase in the luminosity function values at each magnitude bin. For instance, at M between -20.0 and -19.5, the intrinsic log(φ) value increases from -5.96 at z=13 to -3.49 at z=8, indicating a significant growth in the number density of galaxies. This trend suggests that as redshift decreases from z=13 to z=8, corresponding to later cosmic times, the galaxy population builds up rapidly, with more galaxies forming and becoming luminous in the far-UV range, reflecting active star formation and galaxy evolution during this period." }, { "paper_id": "1704.00954.json", "table_id": "table_9", "table_content": "\\begin{table*}\n\\caption{Median SMBH masses in stellar mass bins used in Fig. 
\\ref{fig:M_MSMBH}.}\n\\label{tab:SMBH}\n\\begin{tabular}{cccccccccc}\n\\hline\n & \\multicolumn{9}{c}{$\\log_{10}(M_*/{\\rm M_{\\odot}})=$} \\\\\n$z$ & $8.0$-$8.25$ & $8.25$-$8.5$ & $8.5$-$8.75$ & $8.75$-$9.0$ & $9.0$-$9.25$ & $9.25$-$9.50$ & $9.50$-$9.75$ & $9.75$-$10.0$ & $10.0$-$10.25$ \\\\\n\\hline\n & \\multicolumn{9}{c}{{\\bf median SMBH mass} - $\\log_{10}(M_{\\rm SMBH}/M_{\\odot})$} \\\\\n\\hline\n 13.0 & - & - & - & - & - & - & - & - & -\\\\\n 12.0 & - & - & - & 5.93 & - & - & - & - & -\\\\\n 11.0 & - & - & - & 5.95 & 6.17 & - & - & - & -\\\\\n 10.0 & - & - & - & 5.96 & 6.19 & 6.49 & 6.69 & - & -\\\\\n 9.0 & - & - & - & 5.97 & 6.18 & 6.44 & 6.68 & 6.99 & -\\\\\n 8.0 & - & - & 5.86 & 5.98 & 6.17 & 6.4 & 6.65 & 6.94 & 7.12\\\\\n \\hline\n\\end{tabular}\n\\end{table*}", "caption": "Median SMBH masses in stellar mass bins used in Fig. \\ref{fig:M_MSMBH}.", "label": "tab:SMBH", "section_info": "5 Super-massive black-holes\n\\section{Super-massive black-holes}\\label{sec:SMBHs}\n\n\nSuper-massive black-holes (SMBHs) are incorporated into \\bluetides\\ by first seeding halos with $5\\times 10^5\\,h^{-1}\\,{\\rm M_{\\odot}}$ mass black holes once they reach a dark matter mass greater than $>5\\times 10^{10}\\,h^{-1}\\,{\\rm M_{\\odot}}$. Fig. \\ref{fig:M_MSMBH} shows both the fraction of galaxies hosting a SMBH and the SMBH mass as a function of stellar mass. The majority of galaxies with stellar masses below $\\sim 10^{8.5}\\,{\\rm M_{\\odot}}$ occupy halos that have yet to be seeded with a SMBH while virtually all above $\\sim 10^{9}\\,{\\rm M_{\\odot}}$ are in halos containing a SMBH. At higher stellar masses the SMBH and stellar mass are strongly correlated, albeit with significant scatter. The formation and evolution of super-massive black-holes in \\bluetides\\ is discussed in more detail in Di Matteo et al. 
{\\em in-prep}.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/M_MSMBH.pdf}\n\\caption{{\\em Top-panel} The fraction of galaxies in halos hosting a SMBH. {\\em Middle/bottom-panels} The relationship between the mass of the central super-massive black hole and the stellar mass of galaxies in \\bluetides. The top panel demonstrates the full distribution of sources at $z=8$ with the points denoting the median and the error-bars showing the central $68\\%$ range within uniform stellar mass bins. The lower panel shows only the median of stellar mass bins for $z\\in\\{14,13,12,11,10,9,8\\}$. Median SMBH masses in stellar mass bins are tabulated in Table \\ref{tab:SMBH}.}\n\\label{fig:M_MSMBH}\n\\end{figure}\n\n\n\\subsection{Contribution to far-UV luminosities}\\label{sec:SMBHs.UV}\n\nThe rate at which mass is accreted from the halo onto the SMBH ${\\rm d}M_{\\bullet}/{\\rm d}t$ can be used to estimate the bolometric luminosity $L_{\\rm bol}$, \n\\begin{equation}\nL_{\\rm bol} = \\eta \\frac{{\\rm d}M_{\\bullet}c^2}{{\\rm d}t}\n\\end{equation}\nwhere $\\eta$ is the efficiency and is assumed to be $0.1$. \n\nAssuming a bolometric correction of $2.25$\\footnote{A larger bolometric correction as suggested by \\citet{Runnoe2012} would reduce the predicted luminosities of the SMBHs and thus the fractional contribution to the galaxy luminosity.} \\citep{Fontanot2012} we estimate the far-UV luminosities of the SMBHs. In Figure \\ref{fig:SMBH_LCont} we show the fractional contribution of the SMBH to the total far-UV luminosity. In galaxies hosting a SMBH the SMBH on average contributes only approximately $5\\%$ of the total (intrinsic stellar + SMBH) far-UV luminosity. The fraction of galaxies hosting a SMBH that contributes $>10\\%$ of the total far-UV luminosity increases with stellar mass. 
In galaxies at $z=8$ with stellar masses $>10^{10}\\,{\\rm M_{\\odot}}$ approximately $25\\%$ of galaxies host a SMBH which contributes $>10\\%$ of the far-UV luminosity. In six of the most massive ($M_*=10^{10-10.6}\\,{\\rm M_{\\odot}}$) galaxies the total far-UV luminosity is dominated by accretion on to the central SMBH.\n\n\\begin{figure}\n\\centering\n\\includegraphics[width=20pc]{figures/SMBH_LCont.pdf}\n\\caption{{\\em Top-panel} The fraction of galaxies in which the SMBH contributes $>10\\%$ of the far-UV ($150\\,{\\rm nm}$) luminosity. {\\em Middle/bottom-panels} The fractional contribution of SMBHs to the total far-UV luminosity as a function of stellar mass. Note: the statistics are calculated for all galaxies (i.e. including those galaxies which have not yet been seeded with a SMBH). The top panel shows the full distribution of sources at $z=8$ with the large points denoting the median and the error-bars showing the central $68\\%$ range within uniform stellar mass bins. The small dark individual grey points denote objects where the SMBH accounts for $>50\\%$ of the total far-UV luminosity. 
The lower panel shows only the median of stellar mass bins for $z\\in\\{11,10,9,8\\}$.}\n\\label{fig:SMBH_LCont}\n\\end{figure}\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the median SMBH mass for the lowest stellar mass bin in table_9?", "Descriptive_question2": "At which redshift is the median SMBH mass data provided for the highest stellar mass bin in table_9?", "Reasoning_question1": "How does the median SMBH mass change with increasing stellar mass bins in table_9, and what might this imply about the relationship between stellar mass and SMBH growth?", "Reasoning_question2": "Based on the data in table_9, why might there be a significant scatter in SMBH masses at higher stellar masses, and what factors could contribute to this variation?", "Descriptive_answer1": "5.86", "Descriptive_answer2": "8.0", "Reasoning_answer1": "Looking at the data in table_9, I observe that as stellar mass bins increase from 8.5-8.75 to 10.0-10.25 at redshift z=8.0, the median SMBH mass increases consistently from 5.86 to 7.12 (in log10(M_SMBH/M_⊙)). This trend is also visible at other redshifts, for example at z=9.0, where the median SMBH mass rises from 5.97 to 6.99 across the available stellar mass bins. This consistent increase suggests a strong positive correlation between stellar mass and SMBH mass. The implication is that as galaxies grow in stellar mass, the central SMBH also grows, possibly due to more available material for accretion or a more massive host halo that supports larger black hole formation. This relationship may reflect a co-evolution between galaxies and their central black holes, where feedback processes or accretion rates scale with the galaxy's stellar content.", "Reasoning_answer2": "Analyzing table_9, I notice that data for higher stellar mass bins (e.g., 9.75-10.0 and 10.0-10.25) are only available at lower redshifts like z=8.0 and z=9.0, with median SMBH masses reaching up to 7.12. 
This limited data at higher masses, combined with the context from the provided text that mentions significant scatter in the SMBH-stellar mass relationship at higher stellar masses, leads me to infer several contributing factors. First, at higher stellar masses, galaxies might have diverse formation histories, leading to variations in accretion rates onto the SMBH. Some galaxies could experience episodic accretion due to mergers or gas inflows, while others might have more stable growth. Second, feedback mechanisms, such as AGN activity, could vary in strength, suppressing or enhancing SMBH growth differently across similar stellar mass bins. Lastly, environmental factors, like the density of the surrounding medium or interactions with neighboring galaxies, could introduce additional variability in SMBH mass growth, resulting in the observed scatter." }, { "paper_id": "2107.07598.json", "table_id": "table_1", "table_content": "\\begin{table}[h!]\n\t\\caption{Initial mineralogical composition for the 2 cement systems (calculated with CEMDATA18 \\citep{Lothenbach2018}). 
Units are mol/dm$^3$.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccc}\n\t\t\t\\hline\n\t\t\t& & & \\\\\n\t\t\tMineral & End members & System 1 & System 2\\\\\n\t\t\tPortlandite & & 5.9 $\\times$ 10$^{-2}$ & 3.8 $\\times$ 10$^{-2}$\\\\\n\t\t\tMonocarbonate & & & 3.1 $\\times$ 10$^{-3}$ \\\\\n\t\t\tEttringite & $NA$* & & 1.2 $\\times$ 10$^{-3}$ \\\\\n\t\t\tStraetlingite & & & 0 \\\\\n\t\t\tCalcite & & & 1.7 $\\times$ 10$^{-3}$\\\\\n\t\t\tC-S-H (ideal solid solution) & & & \\\\\n\t\t\t& CSHQ-JenD & 1.7 $\\times$ 10$^{-2}$ & 1.7 $\\times$ 10$^{-2}$ \\\\\n\t\t\t& CSHQ-JenH & 1.1 $\\times$ 10$^{-2}$ & 1.1 $\\times$ 10$^{-2}$ \\\\\n\t\t\t& CSHQ-TobD & 1.3 $\\times$ 10$^{-2}$ & 1.3 $\\times$ 10$^{-2}$ \\\\\n\t\t\t& CSHQ-TobH & 5.6 $\\times$ 10$^{-4}$ & 5.6 $\\times$ 10$^{-4}$ \\\\\n\t\t\t\\hline\n\t\t\t\\multicolumn{4}{l}{*Not Applicable.} \n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table1}\n\\end{table}", "caption": "Initial mineralogical composition for the 2 cement systems (calculated with CEMDATA18 \\citep{Lothenbach2018}). Units are mol/dm$^3$.", "label": "table1", "section_info": "2 Methods\n\\section{Methods}\n\\label{methods}\n\\subsection{Reactive transport problems}\n\\label{rt_prob}\n\nOur reactive transport setups calculate leaching of hardened cement paste under diffusive or advective-dispersive transport conditions. Two cement systems are considered: (i) a relatively simple Ca-Si-O-H system, and (ii) a more representative Al-C-Ca-S-Si-H-O system. Both systems consider portlandite and a solid solution representation of the calcium silicate hydrates (C-S-H) using the CSHQ model by \\citet{Kulik2011}. The second system also contains calcite, straetlingite, monocarboaluminate and ettringite. Thermodynamic properties for the aqueous and solid phases are taken from CEMDATA18 \\citep[][]{Lothenbach2018}. For both cement systems, the problem is defined with a small amount of hardened cement paste to obtain leaching fronts in affordable calculation times. 
The initial condition is obtained by hydrating 10 g/dm$^{3}$ of cement clinkers with the composition $\\rm{\\left[CaO, SiO_2, CO_2, Al_2O_3, SO_3\\right] = \\left[1.11, 0.314, 0.0477, 0.432, 0.0375\\right]}$ mol/100 g with a water-cement ratio of 0.5 (for cement system 1, only CaO and SiO$_2$ are used). A porosity of 0.5 is considered. The initial aqueous phase is in equilibrium with the hardened cement paste (Table \\ref{table1}). The two-dimensional flow and transport field measures 3 $\\times$ 3 cm$^2$. All boundaries are closed, except a 1 cm wide open boundary at the top right and bottom left sides. In case of advective-dispersive transport and for solving the Richards equation for water flow with HPx, the boundary conditions at the open parts are set by defining a constant pressure head of 30 cm at the top and 0 cm at the bottom. For solute transport, a constant concentration (first type) and a constant concentration flux (third type) boundary condition are assumed for the diffusive and advective-dispersive transport conditions, respectively. Dilute water (with 1 $\\mu$mol of each of the chemical components) is entering the system at the top boundary of the system. By considering equilibrium, dissolution/leaching of the cement hydrates is simulated considering the minerals listed in Table \\ref{table1}. \n\nWe test with 61 $\\times$ 61 and 121 $\\times$ 121 grids for both transport conditions. This choice is dictated by computational expense and our available hardware: 4 CPUs Intel i7 2.70 Ghz. To keep the computations required to obtain the benchmark simulations tractable, the considered simulation time period varies between 2 and 6 years. Computational time is detailed for each experiment later on.\n\n\\begin{table}[h!]\n\t\\caption{Initial mineralogical composition for the 2 cement systems (calculated with CEMDATA18 \\citep{Lothenbach2018}). 
Units are mol/dm$^3$.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccc}\n\t\t\t\\hline\n\t\t\t& & & \\\\\n\t\t\tMineral & End members & System 1 & System 2\\\\\n\t\t\tPortlandite & & 5.9 $\\times$ 10$^{-2}$ & 3.8 $\\times$ 10$^{-2}$\\\\\n\t\t\tMonocarbonate & & & 3.1 $\\times$ 10$^{-3}$ \\\\\n\t\t\tEttringite & $NA$* & & 1.2 $\\times$ 10$^{-3}$ \\\\\n\t\t\tStraetlingite & & & 0 \\\\\n\t\t\tCalcite & & & 1.7 $\\times$ 10$^{-3}$\\\\\n\t\t\tC-S-H (ideal solid solution) & & & \\\\\n\t\t\t& CSHQ-JenD & 1.7 $\\times$ 10$^{-2}$ & 1.7 $\\times$ 10$^{-2}$ \\\\\n\t\t\t& CSHQ-JenH & 1.1 $\\times$ 10$^{-2}$ & 1.1 $\\times$ 10$^{-2}$ \\\\\n\t\t\t& CSHQ-TobD & 1.3 $\\times$ 10$^{-2}$ & 1.3 $\\times$ 10$^{-2}$ \\\\\n\t\t\t& CSHQ-TobH & 5.6 $\\times$ 10$^{-4}$ & 5.6 $\\times$ 10$^{-4}$ \\\\\n\t\t\t\\hline\n\t\t\t\\multicolumn{4}{l}{*Not Applicable.} \n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table1}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsection{Emulation strategy and implementation}\n\\label{rt_emul}\n\nThe coupled reactive transport model for leaching of hardened cement paste is implemented in the HPx code that couples Hydrus \\citep{Simunek2013} with PHREEQC \\citep{Parkhurst-Appelo2013} using a sequential non-iterative approach \\citep{Jacques2018}. Transport is calculated for each chemical component, i.e. in terms of total aqueous concentration of the given element. After the independent transport calculations in each time step, geochemical calculations with PHREEQC are done for each grid node to calculate the equilibrium solid phase and aqueous composition. As stated earlier, we test replacing the PHREEQC geochemical solver of HPx by a trained nonlinear regressor which we refer to as an emulator (also called commonly metamodel, surrogate model or proxy model). 
For each time step and grid node of a given reactive transport simulation, we emulate the components’ aqueous concentrations from the total components’ amounts and then re-calculate the components’ solid amounts by subtracting for each component the new aqueous concentration from the total amount. The total mass before and after a reactive transport step in a single cell is therefore fully conserved, although because of the emulation error it can be wrongly distributed between solid and aqueous phases. \n\nOur emulators are Python-based and a call to the Python language is introduced within the C/C++ written HPx code. We refer to the resulting HPx variant as HPx$_{\\rm py}$. When benchmarking against HPx, we consider both the open-mp version where the PHREEQC calculations are parallelized over the physical cores of the computer (in our case 4 cores), that we refer to as four-core HPx or HPx$_{\\rm{4C}}$, and single-threaded or single-core HPx that we call HPx$_{\\rm{1C}}$. Regarding terminology, we refer to the HPx-simulated data as ``original\" data and the simulated data by HPx$_{\\rm py}$-DNN and HPx$_{\\rm py}$-kNN as emulated data. Importantly, for the problems considered herein transport calculations roughly represent 10 \\% to 20 \\% of the total reactive transport simulation time with HPx$_{\\rm{4C}}$. Defining speedup as ``number of times faster\", this means that the corresponding maximum possible speedup, which would be obtained if the PHREEQC-based geochemical computations would incur no computational cost at all, ranges between 5 and 10. 
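The per-cell emulation step described at the beginning of this subsection can be sketched as follows. This is a minimal illustration with simplified units (amounts rather than concentrations, ignoring the water-mass correction); `emulator` and `geochemistry_step` are illustrative names, not part of HPx$_{\rm py}$:

```python
import numpy as np

def geochemistry_step(totals, emulator):
    """One emulated geochemistry step for a single grid cell.

    totals   : 1D array of total component amounts (aqueous + solid)
    emulator : callable mapping total amounts -> emulated aqueous amounts
    Returns (aqueous, solid). Their sum equals `totals` by construction,
    so total mass is conserved exactly, even if the emulation error
    mis-splits it between the two phases.
    """
    aqueous = np.asarray(emulator(totals))
    solid = np.asarray(totals) - aqueous   # mass balance closes the step
    return aqueous, solid

# Toy emulator: pretend 10 % of each component is aqueous.
aq, so = geochemistry_step(np.array([1.0, 0.5]), lambda t: 0.1 * np.asarray(t))
assert np.allclose(aq + so, [1.0, 0.5])    # conservation holds exactly
```

Whatever the regressor predicts, the subtraction step guarantees the conservation property stated above; only the phase split can be wrong.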
If HPx$_{\\rm{1C}}$ is used, that is, if all reactive transport computations are performed on a single thread, then the runtime fraction associated with transport decreases to between approximately 3 \\% and 5.5 \\%, while the associated maximum possible speedup increases to between 18 and 33.\n\nThe emulation techniques investigated in this study are k-nearest neighbors \\citep[kNN, e.g.,][]{Hastie2009} and deep neural networks \\citep[DNN, e.g.,][]{Goodfellow2016}. The main reason for this choice is that these techniques are very fast, and high prediction speed is needed for the emulator to compete against geochemical solvers such as PHREEQC, which for the considered two cement systems and hardware performs either about 670 (system 1) or 210 (system 2) calculations per second on a single thread (Intel\\textsuperscript{\\textregistered} i7 2.70GHz CPU). Furthermore, we deal herein with multi-output regression, and both kNN and DNN attempt to honor, in their own distinctive ways, the relationships between the different output targets. That makes kNN and DNN attractive compared to emulation approaches that require training a separate regressor for each output target, which (i) does not leverage any possible relation between targets and (ii) is likely to be slower than multi-output emulators. As further detailed later on, for our cement system 1 (2 inputs - 4 outputs) and a training base of 400,000 examples, single-threaded kNN is found to be about 300 times faster than single-threaded PHREEQC for performing 10,000 calculations. This speedup becomes as large as 4000 for our trained DNN when run on an NVIDIA Quadro M2000M GPU. For our cement system 2 (5 inputs - 7 outputs), our trained GPU-based DNN remains approximately 1000 (3000) times faster than single-threaded PHREEQC when run on an NVIDIA Quadro M2000M (NVIDIA Quadro P6000) GPU.
For this second problem, standard (and single-threaded) kNN becomes prohibitively slow compared to HPx$_{\\rm{4C}}$ and we thus rely on an approximate GPU-compatible kNN algorithm that, as detailed later, we run on an NVIDIA Quadro P6000 GPU. For a training base of 1,000,000 samples, this GPU-powered approximate kNN algorithm is also about 3000 times faster than single-threaded PHREEQC for producing 10,000 predictions at once. Importantly, for all of these comparisons PHREEQC is run in batch mode and is thus initialized only once.\n\nThe kNN technique basically finds a number of similar instances to a presented example within a training base using a given distance measure, and then interpolates between them. For the first cement system, our kNN regressor is the one implemented in the Python scikit-learn toolbox \\citep[][]{sklearn}, using the default automatic selection between the ``ball tree\" and ``k-d tree\" methods for exact nearest neighbor search \\citep[see,][for details]{sklearn}. We search for the 5 closest neighbors with respect to the euclidean distance and perform an inverse-distance weighted interpolation. Furthermore, we run our scikit-learn kNN on a single thread since we observed a drop in computation speed using the multi-threading option, when called from within our Windows-based HPx$_{\\rm py}$ framework. As mentioned above, for the considered second cement system the scikit-learn kNN regression approach becomes too slow compared to HPx$_{\\rm{4C}}$. Therefore, we implemented our own kNN regression based on the GPU-powered FAISS package for kNN search, using an approximate search method \\citep[see][for algorithmic details]{faiss2017}.\n\nNeural networks basically define the (possibly complex) relationships existing between input, $\\textbf{x}$, and output, $\\textbf{y}$, data vectors by using combinations of computational units that are called neurons.
A neuron is an operator of the form:\n\n\\begin{equation}\nh\\left(\\textbf{x}\\right) = f\\left(\\langle \\textbf{x}, \\textbf{w} \\rangle + b \\right),\n\\label{dnn1}\n\\end{equation}\n\nwhere $h\\left(\\cdot \\right)$ is the scalar output of the neuron, $f\\left(\\cdot \\right)$ is a nonlinear function that is called the ``activation function\", $\\langle\\cdot,\\cdot\\rangle$ signifies the scalar product, $\\textbf{w} = \\left[w_1, \\cdots, w_N\\right]$ is a set of weights of same dimension, $N$, as $\\textbf{x}$ and $b$ represents the bias associated with the neuron. For a given task, the values for $\\textbf{w}$ and $b$ associated with each neuron must be optimized or ``learned\" such that the resulting neural network performs as well as possible. When $f\\left(\\cdot \\right)$ is differentiable, $\\textbf{w}$ and $b$ can be learned by gradient descent. Common forms of $f\\left(\\cdot \\right)$ include the rectified linear unit (ReLU), sigmoid function and hyperbolic tangent function.\n\nWhen there are no directed loops or cycles across neurons or combinations thereof, the network is said to be feedforward (FFN). In the FFN architecture, the neurons are organized in layers. A standard layer is given by\n\n\\begin{equation}\n\\textbf{h}\\left(\\textbf{x}\\right)=f\\left(\\textbf{W}\\textbf{x} + \\textbf{b} \\right),\n\\label{dnn2}\n\\end{equation}\n\nwhere $\\textbf{W}$ and $\\textbf{b}$ are now a matrix of weights and a vector of biases, respectively. The name multilayer perceptron (MLP) designates an FFN with more than one layer that is fully connected (FC), that is, where every neuron of a given layer is connected to all neurons of the next layer.
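The layer equation above is a one-liner in numpy; a single neuron is the special case where the weight matrix has one row (ReLU is chosen here purely for illustration):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(z, 0), applied elementwise."""
    return np.maximum(z, 0.0)

def layer(x, W, b, f=relu):
    """A standard fully connected layer: h(x) = f(Wx + b)."""
    return f(W @ x + b)

# Single neuron: W has one row, so the output is a length-1 array.
x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.2, 0.1, -0.4]])
b = np.array([0.05])
h = layer(x, W, b)   # pre-activation is -0.15, so relu gives array([0.])
```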
A typical network is the 2-layer MLP, which consists of two layers with the outputs of the first-layer neurons becoming inputs to the second-layer neurons\n\n\\begin{equation}\n\\textbf{y}=\\textbf{g}\\left[\\textbf{h}\\left(\\textbf{x}\\right)\\right] \\equiv f_2\\left[\\textbf{W}_2\\cdot f_1\\left(\\textbf{W}_1\\textbf{x} + \\textbf{b}_1 \\right) + \\textbf{b}_2 \\right],\n\\label{dnn3}\n\\end{equation}\n\nwhere $\\textbf{g}\\left(\\cdot\\right)$ and $\\textbf{h}\\left(\\cdot\\right)$ are referred to as output layer and hidden layer, respectively.\n\nIn theory, the two-layer MLP described in equation (\\ref{dnn3}) is a universal approximator as it can approximate any underlying process between $\\textbf{y}$ and $\\textbf{x}$ \\citep{Cybenko1989,Hornik1991}. However, this only works if the dimension of $\\textbf{h}\\left(\\cdot\\right)$ is (potentially many orders of magnitude) larger than that of the input $\\textbf{x}$, thereby making learning practically infeasible and the two-layer MLP approximator useless for large $N$ (typically $N \\geq$ 5 - 10). Practitioners have found that it is much more efficient to use many hidden layers rather than increasing the size of a single hidden layer \\citep[e.g.,][]{Goodfellow2016}. When an FFN/MLP has more than one hidden layer it is considered to be deep. Nevertheless, current deep networks are not necessarily purely feedforward but may combine feedforward layers with other architectures, such as convolutional neural networks (CNN) and recurrent neural networks \\citep[RNN, see, e.g.,][]{Goodfellow2016}.\n\nOur selected DNN networks are detailed in the Appendix. They basically consist of a 6-layer FC neural network with scaled exponential linear units (SELUs) as activation functions. SELUs have been shown to outperform the widely used rectified linear units (ReLUs) for training fully connected networks \\citep[][]{Klambauer2017}.
The size of our FC layers progressively increases from input dimensionality to a maximum size of either 128 (cement system 1) or 512 neurons (cement system 2), and then decreases stepwise towards the output dimensionality. Our DNNs are implemented within the pytorch framework \\citep{pytorch2017} and training is performed by stochastic gradient descent with the Adam algorithm \\citep[][]{Kingma-Ba2015}. \n\nAll GPU calculations were performed on an NVIDIA Quadro M2000M GPU for cement system 1, and on a more recent NVIDIA Quadro P6000 GPU for cement system 2 (mainly because the Windows-based FAISS kNN GPU implementation used for cement system 2 is not compatible with a GPU as old as the NVIDIA Quadro M2000M). Running DNNs and kNN on a GPU (or more if available) is significantly faster than running on CPUs.\n\n\\subsection{Metrics for training performance assessment}\n\\label{metrics}\nWe resort to two metrics to assess the training quality of each emulator using an independent test set of 10,000 samples, $\\left[\\textbf{X}^*,\\textbf{Y}^*\\right]$, that is therefore not used for training. The $Q_2$ coefficient (also often referred to as coefficient of determination) is given by\n\n\\begin{equation}\nQ_2 = 1-\\frac{\\sum_{i = 1}^{n^*}\\sum_{j = 1}^{d_y}\\left(y^*_{i,j} - y^*_{s,i,j}\\right)^2}{\\sum_{i = 1}^{n^*}\\sum_{j = 1}^{d_y}\\left(y^*_{i,j} - \\overline{\\textbf{Y}^*}\\right)^2},\n\\label{q2}\n\\end{equation}\n\nwhere $\\textbf{Y}^*_s$ is an $n^* \\times d_y$ array of simulated outputs and $\\overline{\\textbf{Y}^*}$ denotes the mean of $\\textbf{Y}^*$.
Furthermore, the root-mean-square-error (RMSE) is defined as\n\n\\begin{equation}{\\rm RMSE} =\\sqrt{\\frac{\\sum_{i = 1}^{n^*}\\sum_{j = 1}^{d_y}\\left(y^*_{i,j} - y^*_{s,i,j}\\right)^2}{n^*d_y}}.\n\\label{rmse}\n\\end{equation}
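Both metrics can be computed in a few lines of numpy, pooling all $n^* \\times d_y$ test entries exactly as in the two equations above:

```python
import numpy as np

def q2(Y_true, Y_sim):
    """Coefficient of determination pooled over all n* x d_y entries."""
    Y_true, Y_sim = np.asarray(Y_true), np.asarray(Y_sim)
    ss_res = np.sum((Y_true - Y_sim) ** 2)
    ss_tot = np.sum((Y_true - Y_true.mean()) ** 2)   # global mean of Y*
    return 1.0 - ss_res / ss_tot

def rmse(Y_true, Y_sim):
    """Root-mean-square error pooled over all n* x d_y entries."""
    Y_true, Y_sim = np.asarray(Y_true), np.asarray(Y_sim)
    return np.sqrt(np.mean((Y_true - Y_sim) ** 2))

Yt = np.array([[1.0, 2.0], [3.0, 4.0]])
assert q2(Yt, Yt) == 1.0          # perfect emulation
assert rmse(Yt, Yt) == 0.0
```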
\\section{Results}\n\\label{results}\n\n\\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol). \n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are first trained using a common set of 400,000 examples. This training set is obtained by randomly sampling the two-dimensional input space by latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$ to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$.
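The LHS construction of the training inputs can be sketched in pure numpy as follows (the bounds below are arbitrary placeholders, and `run_phreeqc` is a hypothetical stand-in for the actual PHREEQC call, not a real API):

```python
import numpy as np

def latin_hypercube(n, lower, upper, rng=None):
    """n LHS samples: one point per stratum in each dimension, with the
    strata randomly paired across dimensions."""
    rng = np.random.default_rng(rng)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    d = lower.size
    # Each column gets a random permutation of strata 0..n-1, jittered
    # uniformly within its stratum, then rescaled to [lower, upper].
    u = (rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T
         + rng.random((n, d))) / n
    return lower + u * (upper - lower)

X = latin_hypercube(10_000, [0.0, 0.0], [1e-2, 5e-3], rng=42)  # bounds illustrative
assert X.shape == (10_000, 2)
# Y = np.array([run_phreeqc(x) for x in X])   # placeholder: one solver call per sample
```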
The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts of $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $x_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on the Intel\\textsuperscript{\\textregistered} i7 CPU used.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input, $X$, and output, $Y$, domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization.
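One simple realization, sketched below, standardizes in log10-space (the base-10 choice and column-wise statistics are assumptions made for illustration); the inverse transform is included because emulated values must be mapped back to concentrations:

```python
import numpy as np

def fit_log_standardizer(A):
    """Column-wise mean and std of the log10-transformed data."""
    L = np.log10(A)
    return L.mean(axis=0), L.std(axis=0)

def transform(A, mean, std):
    """Log-transform then standardize: zero mean, unit std per column."""
    return (np.log10(A) - mean) / std

def inverse_transform(Z, mean, std):
    """Undo standardization and the log-transform."""
    return 10.0 ** (Z * std + mean)

# Concentrations spanning many orders of magnitude, as in the text.
A = np.array([[1e-8, 1e-2], [1e-4, 1e-3], [1e-6, 1e-4]])
m, s = fit_log_standardizer(A)
Z = transform(A, m, s)
assert np.allclose(Z.mean(axis=0), 0.0) and np.allclose(Z.std(axis=0), 1.0)
assert np.allclose(inverse_transform(Z, m, s), A)   # round-trip is exact
```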
Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate. The DNN, however, shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU are larger, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once. \n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively.
The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It is seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the performance of kNN degrades markedly as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table2}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw).
ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. 
As written above, the domain sizes are 61 $\\times$ 61 and 121 $\\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \\ref{fig3} and \\ref{fig4} present time series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\\rm py}$-kNN) and DNN-based (HPx$_{\\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN both yield quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \\ref{fig5} - \\ref{fig6} provide more insights into the HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected so as to be well representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \\ref{fig5}). For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \\ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \\ref{fig6}d - f). Nevertheless, the Si concentration remains globally well predicted. Furthermore, Figures \\ref{fig7} - \\ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \\ref{fig5} - \\ref{fig6}.
The original solid amount profiles are overall well approximated by HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN for the advection-dispersion case (Figure \\ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \\ref{fig8}), the same kind of mismatch is observed for the emulated solid amounts of H by HPx$_{\\rm py}$-kNN, while the emulated profiles by HPx$_{\\rm py}$-DNN show somewhat larger discrepancies. Though we decided to present raw emulation results, we would like to stress that some if not all of the observed artifacts could likely be smoothed out by post-filtering, such as median filtering.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure3.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig3}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure4.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs.
4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig4}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure5.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig5}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure6.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig6}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure7.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case.
RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig7}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure8.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig8}\n\\end{figure}\n\nThe speedups associated with the considered problem are detailed in Table \\ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal. Indeed, the DNN speedups overall represent 85 \\% to 95 \\% of the maximum possible speedups (that is, the speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \\% - 65 \\% of the corresponding maximum speedups.
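The gap between achieved and maximum possible speedups follows directly from how the wall time splits between transport and geochemistry. The sketch below illustrates this accounting; the timing decomposition (`t_tr`, `t_geo`, `t_emu`) is hypothetical, merely chosen to be consistent with the first row of Table \ref{table3}:

```python
# Hypothetical split of a run's wall time into a transport part (t_tr) and a
# geochemistry part (t_geo); t_emu is the assumed emulator cost. Replacing
# geochemistry with an emulator gives
#   speedup     = (t_tr + t_geo) / (t_tr + t_emu)
#   max speedup = (t_tr + t_geo) / t_tr   (geochemistry at zero cost)

def speedup(t_tr, t_geo, t_emu):
    """Achieved speedup when the geochemical solver is replaced by an emulator."""
    return (t_tr + t_geo) / (t_tr + t_emu)

def max_speedup(t_tr, t_geo):
    """Upper bound obtained when geochemistry comes at no cost at all."""
    return (t_tr + t_geo) / t_tr

t_tr, t_geo, t_emu = 415.0, 2774.0, 55.0   # seconds (made-up values)
sp, sp_max = speedup(t_tr, t_geo, t_emu), max_speedup(t_tr, t_geo)
print(round(sp, 1), round(sp_max, 1))      # 6.8 7.7
print(round(100 * sp / sp_max))            # 88 (% of the maximum)
```

Any fixed per-call overhead (such as the C/C++ to Python data exchange discussed below) enters `t_emu` and pushes the achieved speedup away from the bound.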
As detailed in section \\ref{train_res1}, the kNN and DNN implementations used are found to be respectively 300 and 4000 times faster than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent say 90 \\% (kNN) or 99 \\% (DNN) of the maximum possible ones. A large part of the gap between achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the machine learning method used for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size.
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & TC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\\\n\t\t\tDNN & DIF & 61 $\\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\\\n\t\t\tkNN & DIF & 61 $\\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table3}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsection{Al-C-Ca-S-Si Problem}\n\nFor this second cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Al, C, Ca, S, Si, H and O aqueous concentrations (mol/kgw) from the (input) total amounts of Al, C, Ca, S and Si (mol). Here we focus on advective-dispersive transport only while, as for cement system 1, the considered domain sizes are 61 $\\times$ 61 and 121 $\\times$ 121. Moreover, the simulation time period is set to 6 years. \n\nAs mentioned earlier, for this higher-dimensional problem the scikit-learn kNN implementation used becomes prohibitively slow compared to HPx$_{\\rm{4C}}$. We found that, to get good emulation accuracy, the kNN training base needs to contain 1,000,000 samples (or more).
This training base's size together with a 5-dimensional search space leads to an HPx$_{\\rm py}$-kNN reactive transport simulation time that is comparable to that of HPx$_{\\rm{4C}}$. Therefore, for this second cement system we built a custom kNN regressor around another kNN implementation contained in the FAISS package \\citep{faiss2017}. The FAISS variant we use allows for GPU computing and is much faster than scikit-learn for this cement system, but is slightly less accurate due to the use of an approximate rather than exact nearest neighbor search \\citep[see][for details]{faiss2017}.\n\n\\subsubsection{Training the DNN}\n\\label{train_res2}\n\nBuilding a good training set to perform a kNN search and learn the weights and biases of our DNN turned out to be a complicated task in this case. This is because, to make useful kNN predictions and/or learn a useful DNN, the training set must be sufficiently representative of the geochemical conditions encountered during the reactive transport simulation one wishes to perform with HPx$_{\\rm py}$-kNN and the trained HPx$_{\\rm py}$-DNN. In contrast to cement system 1, creating the training set by sampling the $X$-space with a controlled randomness between predefined lower and upper bounds did not prove successful. We tried that strategy by drawing as many as 4,000,000 5-dimensional $\\textbf{x}$ vectors from the $X$-space using a Sobol low-discrepancy sequence \\citep[][]{Sobol1967, Joe-Kuo2003}. Such a low-discrepancy sampling scheme covers the 5-dimensional hypercube more uniformly than LHS. Despite a good performance on the test set (not shown), the resulting DNN accuracy in reactive transport mode was never deemed satisfactory. In other words, no satisfactory ``global\" or ``universal\" DNN emulator could be devised for this cement system.
This is probably caused by the fact that, for this problem, the input (5 total amounts) and output (7 aqueous concentrations) spaces are quite nonlinearly related and both cover 6 to 10 orders of magnitude depending on the considered element. Therefore, we resorted to the alternative training strategy detailed below. The latter basically tries to capture the complex correlations and higher-order dependencies that exist between the elements of $\\textbf{x}$ (total amounts, input space) for a given reactive transport simulation setup, in order to produce a training set that honors these between-input relationships.\n\n\\begin{itemize}\n\t\\item Perform a ``cheap\" full reactive transport simulation under the transport conditions and geochemistry of interest and collect the resulting $\\textbf{x}$-$\\textbf{y}$ pairs of examples (for the considered grid nodes and time steps). Computational demand controls what domain size and simulation time period can be used for this cheap calculation. We used a modest 16 $\\times$ 16 domain and a simulation time period of 10 years. The associated HPx$_{\\rm{4C}}$ runtime is 180 s.\n\t\n\t\\item Fit a kernel density estimator (KDE) with a Gaussian kernel to the collected $\\textbf{x}$ vectors (encapsulated in the $\\textbf{X}$ array) and generate a fixed number of new input vectors, $\\textbf{x}_{KDE}$. Then run PHREEQC for the $\\textbf{X}_{KDE}$ set to get the corresponding output set, $\\textbf{Y}_{KDE}$. Now apply the correction for porosity described in section \\ref{train_res1} to the $\\textbf{X}_{KDE}$ set and form the training set by merging the ensemble of $\\textbf{x}$-$\\textbf{y}$ pairs with that of the $\\textbf{x}_{KDE}$-$\\textbf{y}_{KDE}$ pairs. The number of unique examples produced by the considered cheap HPx$_{\\rm{4C}}$ simulations varied between 10,000 and 50,000.
The KDE-based enrichment of this dataset was deemed necessary to provide more input variability, thereby avoiding overfitting of the trained DNN and improving the kNN accuracy, while still honoring the complex between-input relationships. The number of KDE-generated samples was set so as to obtain a total training set size of 1,000,000 examples. A key component of the approach is the bandwidth parameter of the Gaussian KDE kernel, which controls how much the KDE-generated samples depart from the original ensemble. After limited trial and error, we fixed the kernel bandwidth to 0.0025 for the considered case studies.\n\t\n\\end{itemize}\n\nThe scatter plots in Figure \\ref{fig9} illustrate our training set creation procedure. The orange dots depict the pairwise relationships observed between the 5 elements of $\\textbf{x}$ in the cheap simulation. The red and cyan dots in Figure \\ref{fig9} represent the KDE-generated samples with the selected bandwidth, before and after applying the correction for porosity, respectively. Training of the DNN is achieved using the ensemble of original and KDE-corrected input points. We refer to this kind of dataset as RT-based, since it is based on a full, albeit cheap, RT simulation.
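The KDE sampling step above is simple to sketch: for a Gaussian kernel, drawing from the fitted KDE amounts to picking one of the stored input vectors at random and perturbing it with Gaussian noise scaled by the bandwidth. A minimal numpy sketch (array sizes are hypothetical stand-ins, and the PHREEQC runs and porosity correction are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_sample(X, n_new, bandwidth=0.0025):
    """Draw n_new samples from a Gaussian KDE fitted to the rows of X.

    For a Gaussian kernel, a KDE sample is a randomly chosen data point
    plus isotropic Gaussian noise with standard deviation `bandwidth`.
    """
    idx = rng.integers(0, len(X), size=n_new)
    return X[idx] + bandwidth * rng.standard_normal((n_new, X.shape[1]))

# Stand-in for the 5-dimensional inputs collected from the cheap RT run.
X_cheap = rng.random((20_000, 5))
X_kde = kde_sample(X_cheap, 980_000)   # enrich towards 1,000,000 examples
print(X_kde.shape)                     # (980000, 5)
```

With a small bandwidth such as 0.0025, the generated points stay close to the original ensemble, which is how the between-input relationships of Figure \ref{fig9} are preserved.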
Furthermore, we refer to the obtained DNN and kNN emulators as ``local\" emulators, since, as opposed to the emulators constructed for cement system 1, the current emulators are only valid for the input conditions encapsulated in the RT-based training set.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1.5cm}\\includegraphics[width=50pc]{Figure9.png}\n\t\\caption{Scatter plots of the complex relationships between the five considered inputs for (1) the computationally cheap RT simulation of cement system 2 (orange dots) performed with the original HPx code, (2) the corresponding sampled points by kernel density estimation (KDE) using the selected bandwidth (turquoise dots) and, (3) the same sampled KDE points after porosity correction (red dots).}\n\t\\label{fig9}\n\\end{figure}\n\nTraining performance of the DNN emulator is presented in Figure \\ref{fig10}. As for cement system 1, about 90\\% of the available data were used for the actual training of the DNN while the remaining 10\\% were used as a validation set to control overfitting. Lastly, performance is evaluated for both the trained DNN and kNN emulators using an independent test set of 10,000 examples. Overall, the accuracy of our ``local\" DNN emulator for this RT-based dataset is rather high, with $Q_2$ values always greater than or equal to 0.998. The performance of the corresponding local kNN emulator is equally good (not shown). Regarding speedup, for this problem single-threaded PHREEQC achieves about 210 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU while the GPU-based DNN and kNN emulators are both about 3000 times faster when predicting the 10,000 test points at once.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure10.png}\n\t\\caption{1-1 plots of local DNN emulation performance obtained for system 2 when the local DNN is trained using 1,000,000 samples.
The $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig10}\n\\end{figure}\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob2res}\nOur ``local\" DNN performs rather well when applied to the 61 $\\times$ 61 grid size and a time period of 6 years (Figures \\ref{fig11} - \\ref{fig12}). This holds for all components but C, for which some localized deviations appear between original and emulated 2D concentration profiles towards the end of the simulation period (Figure \\ref{fig12}). When applied to the 121 $\\times$ 121 grid, HPx$_{\\rm py}$-DNN produces additional discrepancies for O, Si and Al towards the end of the simulation period (Figures \\ref{fig13} - \\ref{fig14}). Yet most of the observed artifacts could probably be smoothed out by using post-filtering. The associated speedups are listed in Table \\ref{table4}. These speedups are larger than those obtained for cement system 1, with values between 8 and 9 when evaluated against HPx$_{\\rm{4C}}$. These speedups represent about 85 \\% to 90 \\% of the maximum possible speedups (Table \\ref{table4}). Overall, these findings indicate that for the considered problem, our RT-based training of a local DNN only works if the training set is sufficiently representative of the particular geochemical conditions encountered in the computationally demanding simulations, which is arguably not easy to achieve. This limitation is further discussed in section \\ref{discussion}.\n\nWe note a more uniform behavior for HPx$_{\\rm py}$-kNN across grid sizes than for HPx$_{\\rm py}$-DNN.
Here the results for the 121 $\\times$ 121 grid are only slightly less accurate than those associated with the 61 $\\times$ 61 grid (see Figures \\ref{fig15} - \\ref{fig16}, where for brevity we only show concentration profiles for the C, Al and S components). Furthermore, whenever observed, the discrepancies between original and emulated profiles are more regularly scattered than for HPx$_{\\rm py}$-DNN. Note also that here too, post-filtering could likely smooth out a large part of these deviations. In addition, owing to the use of a GPU to perform the kNN calculations, the speedups provided by HPx$_{\\rm py}$-kNN are as large as those provided by HPx$_{\\rm py}$-DNN (Table \\ref{table4}).\n\nWith respect to the emulated solid amounts, the HPx$_{\\rm py}$-DNN results look visually good for the 61 $\\times$ 61 grid. This is shown in Figure \\ref{fig17} for the C, Al and S chemical components, while emulation of the H, O, Ca and Si chemical components is globally of similar quality (not shown). Nevertheless, for the 121 $\\times$ 121 grid, significant deviations appear towards the end of the simulation period for every chemical component (see Figure \\ref{fig18} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components shows the same level of mismatch). The HPx$_{\\rm py}$-kNN predictions are also fairly accurate for the 61 $\\times$ 61 grid (see Figure \\ref{fig19} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components exhibits a globally similar quality) while some discrepancies show up at the end of the simulation for the 121 $\\times$ 121 grid (Figure \\ref{fig20}). However, the mismatch is less pronounced than for HPx$_{\\rm py}$-DNN.
Overall, HPx$_{\\rm py}$-kNN appears to be somewhat more robust than HPx$_{\\rm py}$-DNN for this cement system, while providing the same (large) speedup.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure11.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig11}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure12.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig12}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure13.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. 
The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig13}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure14.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig14}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure15.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig15}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure16.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig16}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure17.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. 
RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig17}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure18.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig18}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure19.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig19}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure20.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN).
The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig20}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the machine learning method used for emulation, TC denotes transport conditions (ADV: advection-dispersion) and GS is the grid size. The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & TC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table4}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol).
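The type of kNN interpolation used by the emulators ($k$ nearest neighbors combined through inverse-distance weights) is easy to sketch for this two-input, four-output mapping. A minimal numpy illustration; all arrays below are random stand-ins, not PHREEQC output:

```python
import numpy as np

def knn_predict(X_train, Y_train, X_query, k=5):
    """Inverse-distance-weighted kNN regression: for each query point,
    average the outputs of the k nearest training points, weighted by
    the inverse of their Euclidean distances."""
    # Pairwise distances, shape (n_query, n_train)
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]            # indices of k nearest neighbors
    dk = np.take_along_axis(d, idx, axis=1)
    w = 1.0 / np.maximum(dk, 1e-12)               # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)             # normalize to sum to 1
    return np.einsum('qk,qkf->qf', w, Y_train[idx])

rng = np.random.default_rng(2)
X_train = rng.random((1000, 2))    # stand-ins for [Ca_tot, Si_tot]
Y_train = rng.random((1000, 4))    # stand-ins for [Ca, Si, H, O] concentrations
Y_pred = knn_predict(X_train, Y_train, rng.random((10, 2)))
print(Y_pred.shape)                # (10, 4)
```

Because each prediction is a convex combination of stored training outputs, kNN can never extrapolate outside the training base, which is why its accuracy is tied so directly to training set size and coverage.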
\n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are first trained using the same set of 400,000 training examples. This training set is obtained by sampling the two-dimensional input space with Latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$, to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts of $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $\\textbf{x}_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, Euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data).
The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input, $X$, and output, $Y$, domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized to zero mean and unit standard deviation.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate. The DNN, however, shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation, which implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU are larger, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once.
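The log-space standardization described above can be sketched in a few lines of numpy; the synthetic array below merely stands in for PHREEQC-simulated concentrations spanning many orders of magnitude:

```python
import numpy as np

def fit_log_standardizer(A):
    """Return a (transform, inverse) pair for an (n_samples, n_features)
    array of strictly positive values: log-transform, then scale each
    column to zero mean and unit standard deviation."""
    logA = np.log(A)
    mu, sigma = logA.mean(axis=0), logA.std(axis=0)
    fwd = lambda a: (np.log(a) - mu) / sigma
    inv = lambda z: np.exp(z * sigma + mu)
    return fwd, inv

rng = np.random.default_rng(1)
# Synthetic "concentrations" spanning 10 orders of magnitude
Y = 10.0 ** rng.uniform(-10, 0, size=(1000, 4))
fwd, inv = fit_log_standardizer(Y)
Z = fwd(Y)
print(np.allclose(Z.mean(axis=0), 0), np.allclose(Z.std(axis=0), 1))  # True True
print(np.allclose(inv(Z), Y))                                         # True
```

The inverse transform is applied to the DNN outputs at prediction time, so the emulator is trained entirely in the standardized log-space while the transport code sees concentrations in physical units.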
\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples. It is seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the performance of kNN markedly degrades as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table2}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data.
Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 
0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. As written above, the domain sizes are both 61 $\\times$ 61 and 121 $\\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \\ref{fig3} and \\ref{fig4} present time series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\\rm py}$-kNN) and DNN-based (HPx$_{\\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN both achieve quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \\ref{fig5} - \\ref{fig6} provide more insights into the HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected so as to be well representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \\ref{fig5}). 
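The coupling behind these simulations is a sequential operator split: a transport step moves the total amounts, then the geochemical emulator is called on all grid cells at once. The following is a schematic sketch only, with toy stand-ins for the Hydrus transport operator and the kNN/DNN emulator, not the actual HPx$_{\rm py}$ implementation:

```python
import numpy as np

def step_reactive_transport(totals, transport_step, emulate_chemistry):
    """One sequential operator-splitting step: first move the total amounts
    with the transport operator, then call the geochemical emulator on all
    grid cells at once (batched, which is where the speedup comes from)."""
    totals = transport_step(totals)                  # (ny, nx, n_comp)
    flat = totals.reshape(-1, totals.shape[-1])      # batch all cells
    conc = emulate_chemistry(flat)                   # (ny*nx, n_out)
    return totals, conc.reshape(totals.shape[0], totals.shape[1], -1)

# Toy stand-ins: explicit diffusion stencil and a linear "emulator"
diffuse = lambda t: t + 0.1 * (np.roll(t, 1, axis=0) - 2 * t + np.roll(t, -1, axis=0))
emulator = lambda x: np.hstack([0.1 * x, 0.05 * x])  # totals -> 4 concentrations

totals0 = np.ones((61, 61, 2))
totals1, conc1 = step_reactive_transport(totals0, diffuse, emulator)
```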
For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \\ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \\ref{fig6}d - f). Nevertheless, the Si concentration remains globally well predicted. Furthermore, Figures \\ref{fig7} - \\ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \\ref{fig5} - \\ref{fig6}. The original solid amount profiles are overall well approximated by HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN for the advection-dispersion case (Figure \\ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \\ref{fig8}), the same kind of mismatch is observed for the solid amounts of H emulated by HPx$_{\\rm py}$-kNN, while the profiles emulated by HPx$_{\\rm py}$-DNN show somewhat larger discrepancies. Though we decided to present raw emulation results, we would like to stress that some if not all of the observed artifacts could likely be smoothed out by post-filtering such as median filtering.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure3.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. 
The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig3}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure4.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig4}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure5.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HP$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig5}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure6.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. 
RTM means the original HP$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig6}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure7.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HP$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig7}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure8.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HP$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. 
The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig8}\n\\end{figure}\n\nThe speedups associated with the considered problem are detailed in Table \\ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal. Indeed, the DNN speedups overall represent 85 \\% to 95 \\% of the maximum possible speedups (that is, speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \\% - 65 \\% of the corresponding maximum speedups. As detailed in section \\ref{train_res1}, the used kNN and DNN implementations are found to be respectively 300 and 4000 times faster than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent say 90 \\% (kNN) or 99 \\% (DNN) of the maximum possible ones. A large part of the gaps between achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the used machine learning method for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size. 
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & TC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\\\n\t\t\tDNN & DIF & 61 $\\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\\\n\t\t\tkNN & DIF & 61 $\\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table3}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are first trained using a set of 400,000 training examples for both. This training set is obtained by randomly sampling the two-dimensional input space by Latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$, to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. 
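The training-set construction just described can be sketched as follows. The LHS routine is a minimal NumPy version (one stratified draw per interval, independently shuffled per dimension), and `toy_phreeqc` is a hypothetical stand-in for a real PHREEQC wrapper, not part of the original setup:

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Minimal LHS: one sample per 1/n-wide stratum in each of d dimensions,
    with the strata shuffled independently per dimension."""
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    for j in range(d):
        rng.shuffle(u[:, j])
    return u

def build_training_set(n, ca_max, si_max, run_phreeqc, seed=0):
    """Scale unit-cube LHS samples to [0, ca_max] x [0, si_max] and run the
    geochemical solver once per input sample."""
    rng = np.random.default_rng(seed)
    X = latin_hypercube(n, 2, rng) * np.array([ca_max, si_max])
    Y = np.array([run_phreeqc(x) for x in X])   # one PHREEQC call per sample
    return X, Y

# Hypothetical stand-in for the real PHREEQC call (totals -> concentrations)
toy_phreeqc = lambda x: np.array([0.1 * x[0], 0.05 * x[1], 110.0, 55.0])
X, Y = build_training_set(1000, ca_max=2.0e-2, si_max=1.0e-2, run_phreeqc=toy_phreeqc)
```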
It is worth noting that the total amounts in $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $x_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, Euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input, $X$, and output, $Y$, domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. 
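The kNN settings quoted above ($k = 5$, Euclidean distance, inverse-distance interpolation) amount to the following prediction rule. This is a minimal NumPy sketch, assuming the inputs and outputs have already been log-transformed as described, not the scikit-learn implementation actually used:

```python
import numpy as np

def knn_predict(X_train, Y_train, x_query, k=5):
    """kNN regression with Euclidean distance and inverse-distance weighting
    of the k nearest stored outputs; an exact match returns the stored output
    directly (the weight would otherwise be infinite)."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argpartition(d, k - 1)[:k]       # indices of the k smallest distances
    if np.any(d[idx] == 0.0):                 # exact hit in the training base
        return Y_train[idx[np.argmin(d[idx])]]
    w = 1.0 / d[idx]
    return (w[:, None] * Y_train[idx]).sum(axis=0) / w.sum()
```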
Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with a standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate. The DNN, however, shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup, and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU are higher, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once. \n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. 
The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It is seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the performance of kNN markedly degrades as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table1}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). 
ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n", "Descriptive_question1": "What is the concentration of Portlandite in System 1 as shown in table_1?", "Descriptive_question2": "Which mineral has a concentration of 0 in System 2 according to table_1?", "Reasoning_question1": "Why might the concentration of Portlandite be higher in System 1 compared to System 2 in table_1?", 
"Reasoning_question2": "What could be the implications of having a variety of minerals like Monocarbonate and Ettringite in System 2 but not in System 1 as presented in table_1?", "Descriptive_answer1": "5.9 × 10^{-2}", "Descriptive_answer2": "Straetlingite", "Reasoning_answer1": "The higher concentration of Portlandite in System 1 compared to System 2 could be attributed to differences in the chemical composition and complexity of the two systems. System 1 is described as a relatively simple Ca-Si-O-H system, which likely focuses on fewer minerals and interactions, allowing for a higher concentration of Portlandite (5.9 × 10^{-2} mol/dm³) as a primary hydrate. In contrast, System 2 is a more complex Al-C-Ca-S-Si-H-O system with additional minerals like Monocarbonate, Ettringite, and Calcite. This increased complexity might lead to a distribution of calcium across multiple mineral phases, reducing the concentration of Portlandite to 3.8 × 10^{-2} mol/dm³. Additionally, the presence of other elements and competing reactions in System 2 may limit the formation or stability of Portlandite compared to the simpler environment in System 1.", "Reasoning_answer2": "The presence of a variety of minerals such as Monocarbonate (3.1 × 10^{-3} mol/dm³), Ettringite (1.2 × 10^{-3} mol/dm³), and Calcite (1.7 × 10^{-3} mol/dm³) in System 2, but not in System 1, suggests that System 2 represents a more chemically diverse and realistic cement paste environment, as it incorporates additional elements like aluminum, carbon, and sulfur. This diversity could imply that System 2 is more representative of actual cement behavior under real-world conditions, where multiple mineral phases interact during hydration and leaching processes. The implication for System 1, lacking these minerals, is that it may oversimplify the chemical interactions, potentially leading to less accurate predictions of long-term durability or leaching behavior. 
For System 2, the presence of these minerals could affect the overall stability, porosity, and resistance to degradation, as each mineral contributes differently to the cement matrix—Ettringite, for instance, may influence volume stability, while Monocarbonate could impact carbonation resistance. Therefore, System 2 likely provides a better model for studying complex reactive transport phenomena, though at the cost of increased computational demand due to the additional geochemical interactions." }, { "paper_id": "2107.07598.json", "table_id": "table_2", "table_content": "\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 
10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}", "caption": "Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.", "label": "table2", "section_info": "3 Results\n\\section{Results}\n\\label{results}\n\n\\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol). \n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are firstly trained using a set of 400,000 test examples for both. This training set is obtained by randomly sampling the two-dimensional input space by latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$ to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. 
The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts in $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $x_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, Euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input, $X$, and output, $Y$, domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. 
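The log-space standardization described here can be sketched as follows. This is a minimal NumPy version; the `eps` floor is an assumption added to guard against zero amounts, which is not stated in the original setup:

```python
import numpy as np

def standardize_log(V, eps=1e-30):
    """log10-transform each column, then z-score it to zero mean and unit
    standard deviation, as described for the DNN inputs and outputs."""
    L = np.log10(np.maximum(V, eps))    # eps (assumed) guards against zeros
    mu, sd = L.mean(axis=0), L.std(axis=0)
    return (L - mu) / sd, mu, sd

def destandardize_log(Z, mu, sd):
    """Invert the transform to recover concentrations / total amounts."""
    return 10.0 ** (Z * sd + mu)
```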
Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with a standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate. The DNN, however, shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup, and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU are higher, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once. \n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. 
The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It is seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the performance of kNN markedly degrades as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table1}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). 
ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. 
As written above, the domain sizes are both 61 $\\times$ 61 and 121 $\\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \\ref{fig3} and \\ref{fig4} present time series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\\rm py}$-kNN) and DNN-based (HPx$_{\\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN both achieve quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \\ref{fig5} - \\ref{fig6} provide more insights into the HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected so as to be well representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \\ref{fig5}). For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \\ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \\ref{fig6}d - f). Nevertheless, the Si concentration remains globally well predicted. Furthermore, Figures \\ref{fig7} - \\ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \\ref{fig5} - \\ref{fig6}. 
The original solid amount profiles are overall well approximated by HPx$_{\rm py}$-kNN and HPx$_{\rm py}$-DNN for the advection-dispersion case (Figure \ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \ref{fig8}), the same kind of mismatch is observed for the emulated solid amounts of H by HPx$_{\rm py}$-kNN while the emulated profiles by HPx$_{\rm py}$-DNN show some more discrepancies. Although we chose to present raw emulation results, we stress that most if not all of the observed artifacts could likely be smoothed out by post-filtering such as median filtering.\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure3.png}\n\t\caption{Time series of original (RTM, solid green lines) and HPx$_{\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\left[x,y\right]$ locations (in cm). Obs. 1: $\left[0.5, 2.5\right]$, Obs. 2: $\left[1, 2\right]$, Obs. 3: $\left[2, 2\right]$, Obs. 4: $\left[1, 1\right]$, Obs. 5: $\left[2, 1\right]$. The results for the 121 $\times$ 121 grid size are rather similar.}\n\t\label{fig3}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure4.png}\n\t\caption{Time series of original (RTM, solid green lines) and HPx$_{\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\left[x,y\right]$ locations (in cm). Obs. 1: $\left[0.5, 2.5\right]$, Obs. 2: $\left[1, 2\right]$, Obs. 3: $\left[2, 2\right]$, Obs.
4: $\left[1, 1\right]$, Obs. 5: $\left[2, 1\right]$. The results for the 121 $\times$ 121 grid size are rather similar.}\n\t\label{fig4}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure5.png}\n\t\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{conc}}$, $Si^{\rm{conc}}$, $H^{\rm{conc}}$, and $O^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61. The results for the 121 $\times$ 121 grid are rather similar.}\n\t\label{fig5}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure6.png}\n\t\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{conc}}$, $Si^{\rm{conc}}$, $H^{\rm{conc}}$, and $O^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig6}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure7.png}\n\t\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case.
RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{solid}}$, $Si^{\rm{solid}}$, $H^{\rm{solid}}$, and $O^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61. The results for the 121 $\times$ 121 grid are rather similar.}\n\t\label{fig7}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure8.png}\n\t\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{solid}}$, $Si^{\rm{solid}}$, $H^{\rm{solid}}$, and $O^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig8}\n\end{figure}\n\nThe speedups associated with the considered problem are detailed in Table \ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal. Indeed, the DNN speedups overall represent 85 \% to 95 \% of the maximum possible speedups (that is, speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \% - 65 \% of the corresponding maximum speedups.
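These fractions are simply the achieved speedups divided by the maximum possible ones. A minimal sketch of this bookkeeping, with the values transcribed from Table \ref{table3}:

```python
# Achieved vs. maximum possible speedups for cement system 1
# (values transcribed from Table 3; the "max" speedups correspond to
# the hypothetical case of zero-cost geochemical calculations).
cases = {
    ("DNN", "ADV", "61x61"):   (6.8, 7.7),
    ("DNN", "ADV", "121x121"): (5.0, 5.2),
    ("DNN", "DIF", "61x61"):   (4.2, 4.9),
    ("kNN", "ADV", "61x61"):   (5.0, 7.7),
    ("kNN", "ADV", "121x121"): (3.4, 5.2),
    ("kNN", "DIF", "61x61"):   (2.8, 4.9),
}
fractions = {key: sp / sp_max for key, (sp, sp_max) in cases.items()}
for (ml, tc, gs), frac in fractions.items():
    print(f"{ml} {tc} {gs}: {100 * frac:.0f}% of the maximum speedup")
```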
As detailed in section \ref{train_res1}, the kNN and DNN implementations used are found to be 300 and 4000 times faster, respectively, than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent say 90 \% (kNN) or 99 \% (DNN) of the maximum possible ones. A large part of the gaps between achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.\n\n\begin{table}[h!]\n\t\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the machine learning method used for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size.
The maximum possible speedups associated with HPx$_{\rm{4C}}$ and HPx$_{\rm{1C}}$, Max SP HPx$_{\rm{4C}}$ and Max SP HPx$_{\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\begin{center}\n\t\t\begin{tabular}{cccccccc}\n\t\t\t\hline\n\t\t\t& & & & & & & \\\n\t\t\tML & TC & GS & HPx$_{\rm{4C}}$ time (s) & SP HPx$_{\rm{4C}}$ & Max SP HPx$_{\rm{4C}}$ & SP HPx$_{\rm{1C}}$ & Max SP HPx$_{\rm{1C}}$ \\\n\t\t\tDNN & ADV & 61 $\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\n\t\t\tkNN & ADV & 61 $\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\n\t\t\tDNN & ADV & 121 $\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\n\t\t\tkNN & ADV & 121 $\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\n\t\t\tDNN & DIF & 61 $\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\n\t\t\tkNN & DIF & 61 $\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\n\t\t\t\hline\n\t\t\end{tabular}\n\t\end{center}\n\n\t\label{table3}\n\end{table}\n\n\FloatBarrier\n\n\subsection{Al-C-Ca-S-Si Problem}\n\nFor this second cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Al, C, Ca, S, Si, H and O aqueous concentrations (mol/kgw) from the (input) total amounts of Al, C, Ca, S and Si (mol). Here we focus on advective-dispersive transport only while, as for cement system 1, the considered domain sizes are 61 $\times$ 61 and 121 $\times$ 121. Moreover, the simulation time period is set to 6 years. \n\nAs mentioned earlier, for this higher-dimensional problem it is observed that the scikit-learn kNN implementation used becomes prohibitively slow compared to HPx$_{\rm{4C}}$. We found that to get good emulation accuracy, the kNN training base needs to contain 1,000,000 samples (or more).
This training base's size together with a 5-dimensional search space leads to an HPx$_{\rm py}$-kNN reactive transport simulation time that is comparable to that of HPx$_{\rm{4C}}$. Therefore, for this second cement system we built a custom kNN regressor around another kNN implementation contained in the FAISS package \citep{faiss2017}. The FAISS variant we used allows for GPU computing and is much faster than scikit-learn for this cement system, but is slightly less accurate due to the use of an approximate rather than exact nearest neighbor search \citep[see][for details]{faiss2017}.\n\n\subsubsection{Training the DNN}\n\label{train_res2}\n\nBuilding a good training set to perform a kNN search and learn the weights and biases of our DNN turned out to be a complicated task in this case. This is because, to make useful kNN predictions and/or learn a useful DNN, the training set must be sufficiently representative of the geochemical conditions encountered during the reactive transport simulation one wishes to perform with HPx$_{\rm py}$-kNN and the trained HPx$_{\rm py}$-DNN. In contrast to cement system 1, creating the training set by sampling the $X$-space with a controlled randomness between predefined lower and upper bounds did not prove successful. We tried that strategy by drawing as many as 4,000,000 5-dimensional $\textbf{x}$ vectors from the $X$-space using a Sobol low-discrepancy sequence \citep[][]{Sobol1967, Joe-Kuo2003}. Such a low-discrepancy sampling scheme covers the 5-dimensional hypercube more uniformly than LHS. Despite a good performance on the test set (not shown), the resulting DNN accuracy in reactive transport mode was never deemed satisfactory. In other words, no satisfactory ``global\" or ``universal\" DNN emulator could be devised for this cement system.
This is probably caused by the fact that, for this problem, the input (5 total amounts) and output (7 aqueous concentrations) spaces are quite nonlinearly related and both cover 6 to 10 orders of magnitude depending on the considered element. Therefore, we resorted to the alternative training strategy detailed below. The latter basically tries to capture the complex correlations and higher-order dependencies that exist between the elements of $\textbf{x}$ (total amounts, input space) for a given reactive transport simulation setup, in order to produce a training set that honors these between-input relationships.\n\n\begin{itemize}\n\t\item Perform a ``cheap\" full reactive transport simulation under the transport conditions and geochemistry of interest and collect the resulting $\textbf{x}$-$\textbf{y}$ pairs of examples (for the considered grid nodes and time steps). Computational demand controls what domain size and simulation time period can be used for this cheap calculation. We used a modest 16 $\times$ 16 domain and a simulation time period of 10 years. The associated HPx$_{\rm{4C}}$ runtime is 180 s.\n\t\n\t\item Fit a kernel density estimator (KDE) with a Gaussian kernel to the collected $\textbf{x}$ vectors (encapsulated in the $\textbf{X}$ array) and generate a fixed number of new input vectors, $\textbf{x}_{KDE}$. Then run PHREEQC for the $\textbf{X}_{KDE}$ set to get the corresponding output set, $\textbf{Y}_{KDE}$. Now apply the correction for porosity described in section \ref{train_res1} to the $\textbf{X}_{KDE}$ set and form the training set by merging the ensemble of $\textbf{x}$-$\textbf{y}$ pairs with that of the $\textbf{x}_{KDE}$-$\textbf{y}_{KDE}$ pairs. The number of unique examples produced by the considered cheap HPx$_{\rm{4C}}$ simulations varied between 10,000 and 50,000.
The KDE-based enrichment of this dataset was deemed necessary to provide more input variability, thereby avoiding overfitting of the trained DNN and improving the kNN accuracy, while still honoring the complex between-input relationships. The number of KDE-generated samples was set so as to obtain a total training set size of 1,000,000 examples. A key component of the approach is the bandwidth parameter of the Gaussian KDE kernel, which controls how much the KDE-generated samples depart from the original ensemble. After limited trial and error, we fixed the kernel bandwidth to 0.0025 for the considered case studies.\n\t\n\end{itemize}\n\nThe scatter plots in Figure \ref{fig9} illustrate our training set creation procedure. The orange dots depict the pairwise relationships observed between the 5 elements of $\textbf{x}$ in the cheap simulation. The turquoise and red dots in Figure \ref{fig9} represent the KDE-generated samples with the selected bandwidth, before and after applying the correction for porosity, respectively. Training of the DNN is achieved using the ensemble of original and KDE-corrected input points. We refer to this kind of dataset as RT-based, since it is based on a full, albeit cheap, RT simulation.
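The KDE enrichment step can be sketched as follows. This is a minimal illustration only: it uses scikit-learn's KernelDensity with the bandwidth of 0.0025 quoted above, synthetic standardized stand-in data instead of the actual collected $\textbf{x}$ vectors, and reduced set sizes; the variable names are ours.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Stand-in for the x-vectors collected from the "cheap" RT run:
# examples of the 5 total amounts (synthetic and standardized here;
# the paper collects 10,000 - 50,000 unique examples).
X_cheap = rng.normal(size=(20_000, 5))

# Fit a Gaussian KDE to the collected inputs. The bandwidth controls how
# far generated samples may depart from the original ensemble
# (fixed to 0.0025 in the paper after limited trial and error).
kde = KernelDensity(kernel="gaussian", bandwidth=0.0025).fit(X_cheap)

# Enrich the training set up to the target size by sampling new inputs
# that honor the between-input correlations of the cheap simulation.
n_target = 100_000                      # 1,000,000 in the paper
X_kde = kde.sample(n_target - len(X_cheap), random_state=0)

# Merge; the paper then applies the porosity correction to X_kde and
# runs PHREEQC on it to obtain the corresponding outputs.
X_train = np.vstack([X_cheap, X_kde])
```

With such a small bandwidth, each generated sample is essentially an original point plus tiny Gaussian noise, which is exactly why the enriched set stays close to the cheap simulation's input manifold.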
Furthermore, we refer to the obtained DNN and kNN emulators as ``local\" emulators since, as opposed to the emulators constructed for cement system 1, the current emulators are only valid for the input conditions encapsulated in the RT-based training set.\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1.5cm}\includegraphics[width=50pc]{Figure9.png}\n\t\caption{Scatter plots of the complex relationships between the five considered inputs for (1) the computationally cheap RT simulation of cement system 2 (orange dots) performed with the original HPx code, (2) the corresponding points sampled by kernel density estimation (KDE) using the selected bandwidth (turquoise dots) and (3) the same sampled KDE points after porosity correction (red dots).}\n\t\label{fig9}\n\end{figure}\n\nTraining performance of the DNN emulator is presented in Figure \ref{fig10}. As for cement system 1, about 90\% of the available data was used for the actual training of the DNN while the remaining 10\% were used as a validation set to control overfitting. Lastly, performance is evaluated for both the trained DNN and kNN emulators using an independent test set of 10,000 examples. Overall, the accuracy of our ``local\" DNN emulator for this RT-inspired dataset is rather high, with $Q_2$ values always greater than or equal to 0.998. Training performance of the corresponding local kNN emulator is equally good (not shown). Regarding speedup, for this problem single-threaded PHREEQC achieves about 210 geochemical calculations per second on our Intel\textsuperscript{\textregistered} i7 CPU while the GPU-based DNN and kNN emulators are both about 3000 times faster when predicting the 10,000 test points at once.\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure10.png}\n\t\caption{1-1 plots of local DNN emulation performance obtained for system 2 when the local DNN is trained using 1,000,000 samples.
The $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\label{fig10}\n\end{figure}\n\n\subsubsection{Reactive transport simulation}\n\label{prob2res}\nOur ``local\" DNN performs rather well when applied to the 61 $\times$ 61 grid size and a time period of 6 years (Figures \ref{fig11} - \ref{fig12}). This holds for all components but C, for which some localized deviations appear between original and emulated 2D concentration profiles towards the end of the simulation period (Figure \ref{fig12}). When applied to the 121 $\times$ 121 grid, HPx$_{\rm py}$-DNN produces additional discrepancies for O, Si and Al towards the end of the simulation period (Figures \ref{fig13} - \ref{fig14}). Yet most of the observed artifacts could probably be smoothed out by post-filtering. The associated speedups are listed in Table \ref{table4}. These speedups are larger than those obtained for cement system 1, with values between 8 and 9 when evaluated against HPx$_{\rm{4C}}$. These speedups represent about 85 \% to 90 \% of the maximum possible speedups (Table \ref{table4}). Overall, these findings indicate that for the considered problem, our RT-based training of a local DNN only works if the training set is sufficiently representative of the particular geochemical conditions encountered in the computationally demanding simulations, which is arguably not easy to achieve. This limitation is further discussed in section \ref{discussion}.\n\nWe note a more uniform behavior for HPx$_{\rm py}$-kNN across grid sizes than for HPx$_{\rm py}$-DNN.
Here the results for the 121 $\times$ 121 grid are only slightly less accurate than those associated with the 61 $\times$ 61 grid (see Figures \ref{fig15} - \ref{fig16} where for brevity we only show concentration profiles for the C, Al and S components). Furthermore, whenever discrepancies between original and emulated profiles are observed, they are more regularly scattered than for HPx$_{\rm py}$-DNN. Note also that here too, post-filtering could likely smooth out a large part of these deviations. In addition, owing to the use of a GPU to achieve the kNN calculations, the speedups provided by HPx$_{\rm py}$-kNN are as large as those provided by HPx$_{\rm py}$-DNN (Table \ref{table4}).\n\nWith respect to the emulated solid amounts, the HPx$_{\rm py}$-DNN results look visually good for the 61 $\times$ 61 grid. This is shown in Figure \ref{fig17} for the C, Al and S chemical components, while emulation of the H, O, Ca and Si chemical components is globally of similar quality (not shown). Nevertheless, for the 121 $\times$ 121 grid significant deviations appear towards the end of the simulation period for every chemical component (see Figure \ref{fig18} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components shows the same level of mismatch). The HPx$_{\rm py}$-kNN predictions are also fairly accurate for the 61 $\times$ 61 grid (see Figure \ref{fig19} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components exhibits a globally similar quality) while some discrepancies show up at the end of the simulation on the 121 $\times$ 121 grid (Figure \ref{fig20}). However, the mismatch is less pronounced than for HPx$_{\rm py}$-DNN.
Overall, HPx$_{\\rm py}$-kNN appears to be somewhat more robust than HPx$_{\\rm py}$-DNN for this cement system, while providing the same (large) speedup.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure11.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig11}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure12.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig12}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure13.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. 
The considered grid size is 121 $\times$ 121.}\n\t\label{fig13}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure14.png}\n\t\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to third rows present profiles for $C^{\rm{conc}}$, $Al^{\rm{conc}}$, and $S^{\rm{conc}}$, respectively. The considered grid size is 121 $\times$ 121.}\n\t\label{fig14}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure15.png}\n\t\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN). The first to third rows present profiles for $C^{\rm{conc}}$, $Al^{\rm{conc}}$, and $S^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig15}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure16.png}\n\t\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN). The first to third rows present profiles for $C^{\rm{conc}}$, $Al^{\rm{conc}}$, and $S^{\rm{conc}}$, respectively. The considered grid size is 121 $\times$ 121.}\n\t\label{fig16}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure17.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years.
RTM means the original HPx$_{\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to third rows present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig17}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure18.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to third rows present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 121 $\times$ 121.}\n\t\label{fig18}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure19.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN). The first to third rows present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig19}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure20.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN).
The first to third rows present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 121 $\times$ 121.}\n\t\label{fig20}\n\end{figure}\n\n\begin{table}[h!]\n\t\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the machine learning method used for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. The maximum possible speedups associated with HPx$_{\rm{4C}}$ and HPx$_{\rm{1C}}$, Max SP HPx$_{\rm{4C}}$ and Max SP HPx$_{\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\begin{center}\n\t\t\begin{tabular}{cccccccc}\n\t\t\t\hline\n\t\t\t& & & & & & & \\\n\t\t\tML & BC & GS & HPx$_{\rm{4C}}$ time (s) & SP HPx$_{\rm{4C}}$ & Max SP HPx$_{\rm{4C}}$ & SP HPx$_{\rm{1C}}$ & Max SP HPx$_{\rm{1C}}$ \\\n\t\t\tDNN & ADV & 61 $\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\n\t\t\tkNN & ADV & 61 $\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\n\t\t\tDNN & ADV & 121 $\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\n\t\t\tkNN & ADV & 121 $\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\n\t\t\t\hline\n\t\t\end{tabular}\n\t\end{center}\n\n\t\label{table4}\n\end{table}\n\n\FloatBarrier\n\n\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol).
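Schematically, replacing the geochemical solver by an emulator of this mapping amounts, at each operator-split time step, to a transport step followed by one batched geochemistry call over all grid nodes. The sketch below is ours, not the actual HPx$_{\rm py}$ code; the two callables are hypothetical placeholders for the transport solver and for the trained emulator (or PHREEQC itself).

```python
import numpy as np

def rt_step(totals, transport_step, emulator):
    """One operator-split reactive transport step (hypothetical sketch).

    totals         : (n_nodes, 2) array of [Ca_tot, Si_tot] per node (mol)
    transport_step : placeholder advancing the transported amounts one step
    emulator       : placeholder mapping totals to (n_nodes, 4) aqueous
                     [Ca, Si, H, O] concentrations (mol/kgw), standing in
                     for the per-node geochemical calculations
    """
    totals = transport_step(totals)   # advective-dispersive or diffusive step
    concs = emulator(totals)          # single batched call for all grid nodes
    return totals, concs
```

The batched call is the point of the design: both the kNN and DNN emulators predict all grid nodes at once instead of solving node-by-node, which is where the reported speedups come from.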
\n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are firstly trained using a set of 400,000 test examples for both. This training set is obtained by randomly sampling the two-dimensional input space by latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$ to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$ are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts of $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $x_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our used Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 sample were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). 
The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input ($X$) and output ($Y$) domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. Here both the $\rm{log\left(\textbf{x}_i\right)}$ and $\rm{log\left(\textbf{y}_i\right)}$ vectors are standardized around 0 with a standard deviation of 1.\n\nFigure \ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \ref{fig1}a - d) and DNN (Figures \ref{fig1}e - h) appear to be rather accurate. The DNN nevertheless shows a slight degradation for the larger concentration values (Figures \ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup, and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU are higher, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once.
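Put together, the kNN pipeline just described (LHS sampling of the input space, log-transform, standardization, then kNN with the default settings $k=5$, Euclidean distance and inverse-distance interpolation) can be sketched as follows. The synthetic response merely stands in for PHREEQC, and the bounds are placeholders, not the actual $\rm{Ca^{tot}_{max}}$ and $\rm{Si^{tot}_{max}}$ values.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.neighbors import KNeighborsRegressor

# 1) Latin hypercube sample of the 2D input space [Ca_tot, Si_tot];
#    the bounds below are placeholders.
sampler = qmc.LatinHypercube(d=2, seed=0)
X = qmc.scale(sampler.random(n=10_000), [1e-8, 1e-8], [1e-2, 1e-2])

# 2) Stand-in for the PHREEQC-computed outputs [Ca, Si, H, O].
Y = np.column_stack([X[:, 0], X[:, 1], X.sum(axis=1), X.prod(axis=1)])

# 3) Emulate in log-space and standardize to zero mean, unit variance,
#    since amounts/concentrations span many orders of magnitude.
logX, logY = np.log(X), np.log(Y)
mx, sx = logX.mean(axis=0), logX.std(axis=0)
my, sy = logY.mean(axis=0), logY.std(axis=0)

# 4) kNN with the stated defaults: k = 5, Euclidean distance (p=2),
#    inverse-distance interpolation of the neighbors' outputs.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance", p=2)
knn.fit((logX - mx) / sx, (logY - my) / sy)

def emulate(x_new):
    # Standardize in log-space, query, then undo both transforms.
    z = knn.predict((np.log(x_new) - mx) / sx)
    return np.exp(z * sy + my)
```

Predicting all query points in one `knn.predict` call mirrors how the emulator is invoked in batched mode inside the RT simulation.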
\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the behavior of kNN appears to markedly decrease as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table1}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. 
Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 
0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. As written above, the domain sizes are both 61 $\\times$ 61 and 121 $\\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \\ref{fig3} and \\ref{fig4} present times series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\\rm py}$-kNN) and DNN-based (HPx$_{\\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN both induce a quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \\ref{fig5} - \\ref{fig6} provide more insights into the HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected as to be well representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \\ref{fig5}). 
For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \ref{fig6}d - f). Notwithstanding, the Si concentration remains globally well predicted. Furthermore, Figures \ref{fig7} - \ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \ref{fig5} - \ref{fig6}. The original solid amount profiles are overall well approximated by HPx$_{\rm py}$-kNN and HPx$_{\rm py}$-DNN for the advection-dispersion case (Figure \ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \ref{fig8}), the same kind of mismatch is observed for the solid amounts of H emulated by HPx$_{\rm py}$-kNN, while the profiles emulated by HPx$_{\rm py}$-DNN show somewhat larger discrepancies. Though we decided to present raw emulation results, we would like to stress that some, if not all, of the observed artifacts could likely be smoothed out by post-filtering such as median filtering.

\begin{figure}[h!]
	\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure3.png}
	\caption{Time series of original (RTM, solid green lines) and HPx$_{\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\left[x,y\right]$ locations (in cm). Obs. 1: $\left[0.5, 2.5\right]$, Obs. 2: $\left[1, 2\right]$, Obs. 3: $\left[2, 2\right]$, Obs. 4: $\left[1, 1\right]$, Obs. 5: $\left[2, 1\right]$. The results for the 121 $\times$ 121 grid size are rather similar.}
	\label{fig3}
\end{figure}

\begin{figure}[h!]
	\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure4.png}
	\caption{Time series of original (RTM, solid green lines) and HPx$_{\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\left[x,y\right]$ locations (in cm). Obs. 1: $\left[0.5, 2.5\right]$, Obs. 2: $\left[1, 2\right]$, Obs. 3: $\left[2, 2\right]$, Obs. 4: $\left[1, 1\right]$, Obs. 5: $\left[2, 1\right]$. The results for the 121 $\times$ 121 grid size are rather similar.}
	\label{fig4}
\end{figure}

\begin{figure}[h!]
	\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure5.png}
	\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HP$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{conc}}$, $Si^{\rm{conc}}$, $H^{\rm{conc}}$, and $O^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61. The results for the 121 $\times$ 121 grid are rather similar.}
	\label{fig5}
\end{figure}

\begin{figure}[h!]
	\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure6.png}
	\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HP$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{conc}}$, $Si^{\rm{conc}}$, $H^{\rm{conc}}$, and $O^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61.}
	\label{fig6}
\end{figure}

\begin{figure}[h!]
	\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure7.png}
	\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HP$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{solid}}$, $Si^{\rm{solid}}$, $H^{\rm{solid}}$, and $O^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61. The results for the 121 $\times$ 121 grid are rather similar.}
	\label{fig7}
\end{figure}

\begin{figure}[h!]
	\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure8.png}
	\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HP$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth rows present profiles for $Ca^{\rm{solid}}$, $Si^{\rm{solid}}$, $H^{\rm{solid}}$, and $O^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61.}
	\label{fig8}
\end{figure}

The speedups associated with the considered problem are detailed in Table \ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal: the DNN speedups overall represent 85 \% to 95 \% of the maximum possible speedups (that is, the speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \% - 65 \% of the corresponding maximum speedups. As detailed in Section \ref{train_res1}, the kNN and DNN implementations used are found to be 300 and 4000 times faster, respectively, than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent, say, 90 \% (kNN) or 99 \% (DNN) of the maximum possible ones. A large part of the gap between the achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.

\begin{table}[h!]
	\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the used machine learning method for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size. The maximum possible speedups associated with HPx$_{\rm{4C}}$ and HPx$_{\rm{1C}}$, Max SP HPx$_{\rm{4C}}$ and Max SP HPx$_{\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}
	\begin{center}
		\begin{tabular}{cccccccc}
			\hline
			ML & TC & GS & HPx$_{\rm{4C}}$ time (s) & SP HPx$_{\rm{4C}}$ & Max SP HPx$_{\rm{4C}}$ & SP HPx$_{\rm{1C}}$ & Max SP HPx$_{\rm{1C}}$ \\
			\hline
			DNN & ADV & 61 $\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\
			kNN & ADV & 61 $\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\
			DNN & ADV & 121 $\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\
			kNN & ADV & 121 $\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\
			DNN & DIF & 61 $\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\
			kNN & DIF & 61 $\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\
			\hline
		\end{tabular}
	\end{center}
	\label{table3}
\end{table}

\FloatBarrier

\subsubsection{Training the emulators}
\label{train_res1}
Here the kNN and DNN emulators are first trained using a set of 400,000 training examples each. This training set is obtained by randomly sampling the two-dimensional input space by Latin hypercube sampling (LHS) between $\left[0,0\right]$ and $\rm{\left[Ca^{tot}_{max},Si^{tot}_{max}\right]}$, and running PHREEQC for each input sample, $\rm{\textbf{x}_i = \left[Ca^{tot}_i, Si^{tot}_i\right]}$, to get the corresponding output vectors, $\rm{\textbf{y}_i = \left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\right]}$. The upper bounds, $\rm{Ca^{tot}_{max}}$ and $\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes.
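The LHS sampling step can be sketched with SciPy's quasi-Monte Carlo module. The upper bounds below are hypothetical placeholders for $\rm{Ca^{tot}_{max}}$ and $\rm{Si^{tot}_{max}}$; this is an illustration under assumed values, not the authors' actual script, and each sampled point would then be passed to PHREEQC to obtain its output vector:

```python
from scipy.stats import qmc

# Placeholder upper bounds standing in for Ca_tot_max and Si_tot_max (mol)
ca_tot_max, si_tot_max = 1.0e-2, 5.0e-3

# Latin hypercube sample of the 2D input space [0, Ca_tot_max] x [0, Si_tot_max]
sampler = qmc.LatinHypercube(d=2, seed=42)
unit_samples = sampler.random(n=1000)                 # points in [0, 1)^2
x = qmc.scale(unit_samples, [0.0, 0.0], [ca_tot_max, si_tot_max])
```

Each row of `x` corresponds to one input sample $\textbf{x}_i = [Ca^{tot}_i, Si^{tot}_i]$ for which a geochemical solver run would produce the matching $\textbf{y}_i$.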
It is worth noting that the total amounts in $\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \% of the post-corrected $\textbf{x}_i$ values exceed their pre-defined upper bounds, and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our Intel\textsuperscript{\textregistered} i7 CPU.

With respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, Euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \% of the data) and a validation set (10 \% of the data).
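For concreteness, the two emulator configurations just described can be sketched with scikit-learn. The data, layer sizes and sample counts below are synthetic placeholders (the paper's actual DNN architecture is not reproduced here), while the kNN settings match the quoted defaults: $k=5$, Euclidean distance, inverse-distance weighting. The MLP stand-in mirrors the 90/10 train/validation split with early stopping:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for the log-standardized (x_i, y_i) training pairs
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))         # standardized log inputs
Y = np.column_stack([X[:, 0] + X[:, 1],            # 2 synthetic outputs
                     X[:, 0] * X[:, 1]])

# kNN emulator: k = 5 neighbors, Euclidean distance, inverse-distance
# interpolation (scikit-learn's weights="distance")
knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean",
                          weights="distance").fit(X, Y)

# DNN stand-in: hold out 10 % of the samples for validation and stop
# training once the validation score stops improving (early stopping);
# the (32, 32) architecture is a placeholder, not the paper's network
dnn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                   early_stopping=True, validation_fraction=0.1,
                   random_state=0).fit(X, Y)

knn_pred = knn.predict(X[:5])
dnn_pred = dnn.predict(X[:5])
```

With inverse-distance weighting, querying a point identical to a training sample returns that sample's target exactly, which is one reason kNN performs so well when the training base densely covers the input space.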
Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate . DNN also shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when ran on our GPU is higher with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once. \n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. 
The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the behavior of kNN appears to markedly decrease as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table1}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). 
ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n", "Descriptive_question1": "What is the RMSE value for Ca concentration using the DNN emulator with a training set size of 400,000 in table_2?", "Descriptive_question2": "What is the Q2 value for Si concentration using the kNN emulator with a training set size of 100,000 in table_2?", "Reasoning_question1": "How does the performance of the 
DNN emulator compare to the kNN emulator in terms of RMSE for Ca concentration across different training set sizes in table_2?", "Reasoning_question2": "What trend can be observed in the Q2 values for Si concentration as the training set size decreases for the kNN emulator in table_2?", "Descriptive_answer1": "4.50e-5", "Descriptive_answer2": "0.9991", "Reasoning_answer1": "Let's analyze the RMSE values for Ca concentration across different training set sizes for both DNN and kNN emulators from table_2. For DNN, the RMSE values are: 4.50e-5 (400,000), 5.06e-5 (100,000), 4.57e-5 (10,000), 8.02e-5 (1,000), and 47.1e-5 (100). For kNN, the RMSE values are: 2.52e-5 (400,000), 6.27e-5 (100,000), 14.1e-5 (10,000), 43.9e-5 (1,000), and 96.4e-5 (100). Comparing the two, kNN shows a lower RMSE at the largest training set size of 400,000 (2.52e-5 vs. 4.50e-5 for DNN), indicating better performance at this size. However, as the training set size decreases, DNN consistently maintains lower RMSE values compared to kNN. For instance, at 100,000, DNN's RMSE is 5.06e-5 while kNN's is 6.27e-5; at 10,000, DNN's is 4.57e-5 compared to kNN's 14.1e-5; at 1,000, DNN's is 8.02e-5 vs. kNN's 43.9e-5; and at 100, DNN's is 47.1e-5 compared to kNN's 96.4e-5. This indicates that DNN is more robust to reductions in training set size, showing less degradation in performance compared to kNN, which experiences a steeper increase in RMSE as the training set size decreases. Overall, while kNN performs better with the largest dataset, DNN outperforms kNN in terms of maintaining lower RMSE for Ca concentration across smaller training set sizes.", "Reasoning_answer2": "Examining the Q2 values for Si concentration using the kNN emulator across different training set sizes in table_2, we observe the following values: 0.9999 (400,000), 0.9991 (100,000), 0.9908 (10,000), 0.7180 (1,000), and 0.8468 (100). 
Analyzing the trend, there is a general decline in Q2 values as the training set size decreases, indicating a reduction in the emulator's predictive accuracy. Starting from a near-perfect Q2 of 0.9999 at 400,000, the value slightly drops to 0.9991 at 100,000, showing minimal loss in performance. However, a more noticeable decrease occurs at 10,000 with a Q2 of 0.9908, suggesting a moderate decline in accuracy. The most significant drop is observed at 1,000, where Q2 falls to 0.7180, reflecting a substantial loss in predictive power. Interestingly, at the smallest training set size of 100, the Q2 value slightly increases to 0.8468, though it remains much lower than the values at larger training sizes. This anomaly might suggest variability or noise in the data at very small sample sizes, but the overall trend is clear: as the training set size decreases, the Q2 values for Si concentration with the kNN emulator generally decrease, indicating worsening performance in terms of the coefficient of determination." }, { "paper_id": "2107.07598.json", "table_id": "table_3", "table_content": "\\begin{table}[h!]\n\t\\caption{Speedups offered by the KNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the used machine learning method for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size. 
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to an hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & TC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\\\n\t\t\tDNN & DIF & 61 $\\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\\\n\t\t\tkNN & DIF & 61 $\\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table3}\n\\end{table}", "caption": "Speedups offered by the KNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the used machine learning method for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size. 
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to an hypothetical situation where the geochemical calculations incur zero computational cost.", "label": "table3", "section_info": "3 Results\n\\section{Results}\n\\label{results}\n\n\\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol). \n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are firstly trained using a set of 400,000 test examples for both. This training set is obtained by randomly sampling the two-dimensional input space by latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$ to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$ are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts of $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $x_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. 
As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our used Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 sample were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input, $X$ and output $Y$, domains. This because total amounts and concentrations of the involved components typically cover many orders of magnitudes (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate . DNN also shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. 
Regarding speedup and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when ran on our GPU is higher with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once. \n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). 
In contrast, the behavior of kNN appears to markedly decrease as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table1}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). 
ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. 
As written above, the domain sizes are 61 $\\times$ 61 and 121 $\\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \\ref{fig3} and \\ref{fig4} present time series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\\rm py}$-kNN) and DNN-based (HPx$_{\\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN both achieve quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \\ref{fig5} - \\ref{fig6} provide more insights into the HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected so as to be representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \\ref{fig5}). For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \\ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \\ref{fig6}d - f). Nevertheless, the Si concentration remains globally well predicted. Furthermore, Figures \\ref{fig7} - \\ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \\ref{fig5} - \\ref{fig6}. 
The original solid amount profiles are overall well approximated by HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN for the advection-dispersion case (Figure \\ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \\ref{fig8}), the same kind of mismatch is observed for the solid amounts of H emulated by HPx$_{\\rm py}$-kNN, while the profiles emulated by HPx$_{\\rm py}$-DNN show somewhat larger discrepancies. Though we decided to present raw emulation results, we stress that some, if not all, of the observed artifacts could likely be smoothed out by post-filtering, such as median filtering.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure3.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig3}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure4.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 
4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig4}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure5.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig5}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure6.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig6}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure7.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. 
RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig7}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure8.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig8}\n\\end{figure}\n\nThe speedups associated with the considered problem are detailed in Table \\ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal. Indeed, the DNN speedups overall represent 85 \\% to 95 \\% of the maximum possible speedups (that is, the speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \\% - 65 \\% of the corresponding maximum speedups. 
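This speedup accounting follows a simple Amdahl-style argument: only the geochemistry share of the runtime is accelerated, so the maximum possible speedup is reached in the limit where geochemistry becomes free. A minimal sketch (function names are illustrative, not part of the HPx$_{\rm py}$ code):

```python
def overall_speedup(t_transport, t_chem, accel):
    """Runtime speedup when only the geochemistry part (t_chem) is
    accelerated by a factor `accel`; transport time is unchanged."""
    return (t_transport + t_chem) / (t_transport + t_chem / accel)

def max_speedup(t_transport, t_chem):
    """Limiting case accel -> infinity: geochemistry at zero cost."""
    return (t_transport + t_chem) / t_transport

# Example: if geochemistry takes 90 % of a run, the ceiling is 10x, and a
# 300x-faster emulator would already reach about 97 % of that ceiling,
# before any coupling/communication overhead between codes is accounted for.
```

That the observed ratios (57 - 65 \% for kNN, 85 - 95 \% for the DNN) fall below this idealized bound is consistent with the coupling overhead discussed hereafter.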
As detailed in section \\ref{train_res1}, the kNN and DNN implementations used are found to be 300 and 4000 times faster, respectively, than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent, say, 90 \\% (kNN) or 99 \\% (DNN) of the maximum possible ones. A large part of the gap between achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the machine learning method used for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size. 
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & TC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\\\n\t\t\tDNN & DIF & 61 $\\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\\\n\t\t\tkNN & DIF & 61 $\\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table3}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsection{Al-C-Ca-S-Si Problem}\n\nFor this second cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Al, C, Ca, S, Si, H and O aqueous concentrations (mol/kgw) from the (input) total amounts of Al, C, Ca, S and Si (mol). Here we focus on advective-dispersive transport only, while, as for cement system 1, the considered domain sizes are 61 $\\times$ 61 and 121 $\\times$ 121. Moreover, the simulation time period is set to 6 years. \n\nAs mentioned earlier, for this higher-dimensional problem it is observed that the scikit-learn kNN implementation used becomes prohibitively slow compared to HPx$_{\\rm{4C}}$. We found that, to get good emulation accuracy, the kNN training base needs to contain 1,000,000 samples or more. 
This training-base size, together with a 5-dimensional search space, leads to an HPx$_{\\rm py}$-kNN reactive transport simulation time that is comparable to that of HPx$_{\\rm{4C}}$. Therefore, for this second cement system we built a custom kNN regressor around another kNN implementation contained in the FAISS package \\citep{faiss2017}. The FAISS variant we use allows for GPU computing and is much faster than scikit-learn for this cement system, but is slightly less accurate due to the use of an approximate rather than exact nearest neighbor search \\citep[see][for details]{faiss2017}.\n\n\\subsubsection{Training the DNN}\n\\label{train_res2}\n\nBuilding a good training set to perform a kNN search and learn the weights and biases of our DNN turned out to be a complicated task in this case. This is because, to make useful kNN predictions and/or learn a useful DNN, the training set must be sufficiently representative of the geochemical conditions encountered during the reactive transport simulation one wishes to perform with HPx$_{\\rm py}$-kNN and the trained HPx$_{\\rm py}$-DNN. In contrast to cement system 1, creating the training set by sampling the $X$-space with a controlled randomness between predefined lower and upper bounds did not prove successful. We tried that strategy by drawing as many as 4,000,000 5-dimensional $\\textbf{x}$ vectors from the $X$-space using a Sobol low-discrepancy sequence \\citep[][]{Sobol1967, Joe-Kuo2003}. Such a low-discrepancy sampling scheme covers the 5-dimensional hypercube more uniformly than LHS. Despite good performance on the test set (not shown), the resulting DNN accuracy in reactive transport mode was never deemed satisfactory. In other words, no satisfactory ``global\" or ``universal\" DNN emulator could be devised for this cement system. 
This is probably caused by the fact that, for this problem, the input (5 total amounts) and output (7 aqueous concentrations) spaces are quite nonlinearly related and both cover 6 to 10 orders of magnitude depending on the considered element. Therefore, we resorted to the alternative training strategy detailed below. The latter basically tries to grasp the complex correlations and higher-order dependencies that exist between the elements of $\\textbf{x}$ (total amounts, input space) for a given reactive transport simulation setup, in order to produce a training set that honors these between-input relationships.\n\n\\begin{itemize}\n\t\\item Perform a ``cheap\" full reactive transport simulation under the transport conditions and geochemistry of interest and collect the resulting $\\textbf{x}$-$\\textbf{y}$ pairs of examples (for the considered grid nodes and time steps). Computational demand controls what domain size and simulation time period can be used for this cheap calculation. We used a modest 16 $\\times$ 16 domain and a simulation time period of 10 years. The associated HPx$_{\\rm{4C}}$ runtime is 180 s.\n\t\n\t\\item Fit a kernel density estimator (KDE) with a Gaussian kernel to the collected $\\textbf{x}$ vectors (encapsulated in the $\\textbf{X}$ array) and generate a fixed number of new input vectors, $\\textbf{x}_{KDE}$. Then run PHREEQC for the $\\textbf{X}_{KDE}$ set to get the corresponding output set, $\\textbf{Y}_{KDE}$. Now apply the correction for porosity described in section \\ref{train_res1} to the $\\textbf{X}_{KDE}$ set and form the training set by merging the ensemble of $\\textbf{x}$-$\\textbf{y}$ pairs with that of the $\\textbf{x}_{KDE}$-$\\textbf{y}_{KDE}$ pairs. The number of unique examples produced by the considered cheap HPx$_{\\rm{4C}}$ simulations varied between 10,000 and 50,000. 
The KDE-based enrichment of this dataset was deemed necessary to provide more input variability, thereby avoiding overfitting of the trained DNN and improving the kNN accuracy, while still honoring the complex between-input relationships. The number of KDE-generated samples was set so as to obtain a total training set size of 1,000,000 examples. A key component of the approach is the bandwidth parameter of the Gaussian KDE kernel, which controls how much the KDE-generated samples depart from the original ensemble. After limited trial and error, we fixed the kernel bandwidth to 0.0025 for the considered case studies.\n\t\n\\end{itemize}\n\nThe scatter plots in Figure \\ref{fig9} illustrate our training set creation procedure. The orange dots depict the pairwise relationships observed between the 5 elements of $\\textbf{x}$ in the cheap simulation. The turquoise and red dots in Figure \\ref{fig9} represent the KDE-generated samples with the selected bandwidth, before and after applying the correction for porosity, respectively. Training of the DNN is achieved using the ensemble of original and KDE-corrected input points. We refer to this kind of dataset as RT-based, since it is based on a full, albeit cheap, RT simulation. 
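Sampling from a Gaussian KDE, as in the enrichment step above, amounts to picking one of the collected $\textbf{x}$ vectors at random and perturbing it with Gaussian noise whose standard deviation equals the kernel bandwidth. The following NumPy-only sketch illustrates the principle (it assumes an isotropic kernel on suitably scaled inputs and is not the exact implementation used here):

```python
import numpy as np

def kde_enrich(X, n_new, bandwidth=0.0025, seed=0):
    """Draw n_new samples from a Gaussian KDE fitted to the rows of X.

    Drawing from a Gaussian KDE is equivalent to: pick an original point
    uniformly at random, then add isotropic Gaussian noise whose
    standard deviation is the kernel bandwidth.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)  # random parent points
    return X[idx] + rng.normal(scale=bandwidth, size=(n_new, X.shape[1]))
```

The generated $\textbf{X}_{KDE}$ samples would then be run through PHREEQC to obtain $\textbf{Y}_{KDE}$ and merged with the original pairs, as described above.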
Furthermore, we refer to the obtained DNN and kNN emulators as ``local\" emulators, since, as opposed to the emulators constructed for cement system 1, the current emulators are only valid for the input conditions encapsulated in the RT-based training set.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1.5cm}\\includegraphics[width=50pc]{Figure9.png}\n\t\\caption{Scatter plots of the complex relationships between the five considered inputs for (1) the computationally cheap RT simulation of cement system 2 (orange dots) performed with the original HPx code, (2) the corresponding sampled points by kernel density estimation (KDE) using the selected bandwidth (turquoise dots) and (3) the same sampled KDE points after porosity correction (red dots).}\n\t\\label{fig9}\n\\end{figure}\n\nTraining performance of the DNN emulator is presented in Figure \\ref{fig10}. As for cement system 1, about 90\\% of the available data was used for the actual training of the DNN while the remaining 10\\% were used as a validation set to control overfitting. Lastly, performance is evaluated for both the trained DNN and kNN emulators using an independent test set of 10,000 examples. Overall, the accuracy of our ``local\" DNN emulator for this RT-based dataset is rather high, with $Q_2$ values always greater than or equal to 0.998. The performance of the corresponding local kNN emulator is equally good (not shown). Regarding speedup, for this problem single-threaded PHREEQC achieves about 210 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU, while the GPU-based DNN and kNN emulators are both about 3000 times faster when predicting the 10,000 test points at once.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure10.png}\n\t\\caption{1-1 plots of local DNN emulation performance obtained for system 2 when the local DNN is trained using 1,000,000 samples. 
The $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig10}\n\\end{figure}\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob2res}\nOur ``local\" DNN performs rather well when applied to the 61 $\\times$ 61 grid size and a time period of 6 years (Figures \\ref{fig11} - \\ref{fig12}). This holds for all components but C, for which some localized deviations appear between original and emulated 2D concentration profiles towards the end of the simulation period (Figure \\ref{fig12}). When applied to the 121 $\\times$ 121 grid, HPx$_{\\rm py}$-DNN produces additional discrepancies for O, Si and Al towards the end of the simulation period (Figures \\ref{fig13} - \\ref{fig14}). Yet most of the observed artifacts could probably be smoothed out by using post-filtering. The associated speedups are listed in Table \\ref{table4}. These speedups are larger than those obtained for cement system 1, with values between 8 and 9 when evaluated against HPx$_{\\rm{4C}}$. These speedups represent about 85 \\% to 90 \\% of the maximum possible speedup (Table \\ref{table4}). Overall, these findings indicate that, for the considered problem, our RT-based training of a local DNN only works if the training set is sufficiently representative of the particular geochemical conditions encountered in the computationally demanding simulations, which is arguably not easy to achieve. This limitation is further discussed in section \\ref{discussion}.\n\nWe note a more uniform behavior for HPx$_{\\rm py}$-kNN across grid sizes than for HPx$_{\\rm py}$-DNN. 
Here the results for the 121 $\\times$ 121 grid are only slightly less accurate than those associated with the 61 $\\times$ 61 grid (see Figures \\ref{fig15} - \\ref{fig16}, where for brevity we only show concentration profiles for the C, Al and S components). Furthermore, whenever observed, the discrepancies between original and emulated profiles are more regularly scattered than for HPx$_{\\rm py}$-DNN. Note also that herein too, post-filtering could likely smooth out a large part of these deviations. In addition, owing to the use of a GPU to achieve the kNN calculations, the speedups provided by HPx$_{\\rm py}$-kNN are as large as those provided by HPx$_{\\rm py}$-DNN (Table \\ref{table4}).\n\nWith respect to the emulated solid amounts, the HPx$_{\\rm py}$-DNN results look visually good for the 61 $\\times$ 61 grid. This is shown in Figure \\ref{fig17} for the C, Al and S chemical components, while emulation of the H, O, Ca and Si chemical components is globally of similar quality (not shown). Nevertheless, significant deviations appear towards the end of the simulation period for every chemical component (see Figure \\ref{fig18} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components shows the same level of mismatch). The HPx$_{\\rm py}$-kNN predictions are also fairly accurate for the 61 $\\times$ 61 grid (see Figure \\ref{fig19} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components exhibits a globally similar quality) while some discrepancies show up at the end of the simulation (Figure \\ref{fig20}). However, the mismatch is less pronounced than for HPx$_{\\rm py}$-DNN. 
Overall, HPx$_{\\rm py}$-kNN appears to be somewhat more robust than HPx$_{\\rm py}$-DNN for this cement system, while providing the same (large) speedup.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure11.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig11}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure12.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig12}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure13.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. 
The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig13}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure14.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig14}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure15.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig15}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure16.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig16}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure17.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. 
RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig17}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure18.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig18}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure19.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig19}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure20.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). 
The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig20}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the machine learning method used for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & BC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table4}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol). 
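Throughout this section, emulation accuracy is reported in terms of the RMSE and the $Q_2$ coefficient of determination between original and emulated test data. For reference, these standard metrics can be computed as follows (a straightforward sketch of the usual definitions):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between original and emulated values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def q2(y_true, y_pred):
    """Coefficient of determination, evaluated on test (not training) data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

A perfect emulator gives RMSE $= 0$ and $Q_2 = 1$; $Q_2$ decreases towards (and below) zero as predictions degrade.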
\n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are both first trained using the same set of 400,000 examples. This training set is obtained by sampling the two-dimensional input space with Latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$, to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts of $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. Doing so, it turns out that about 20 \\% of the post-corrected $\\textbf{x}_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, Euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). 
The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input ($X$) and output ($Y$) domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with a standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate. The DNN nonetheless shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup, as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU is higher, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once.
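The log-space setup and the stated kNN defaults ($k = 5$, Euclidean distance, inverse-distance interpolation) can be sketched with scikit-learn, whose `weights="distance"` option implements inverse-distance interpolation (the Euclidean metric is the default). The data below are synthetic stand-ins; the actual training pairs come from PHREEQC runs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: totals X (n, 2), concentrations Y (n, 4),
# all strictly positive so the log-transform is well defined.
rng = np.random.default_rng(0)
X = 10 ** rng.uniform(-8, -2, size=(1000, 2))
Y = np.hstack([X, X.sum(axis=1, keepdims=True), X.mean(axis=1, keepdims=True)])

# Work in log-space (values span many orders of magnitude), then standardize.
sx, sy = StandardScaler(), StandardScaler()
Xl = sx.fit_transform(np.log10(X))
Yl = sy.fit_transform(np.log10(Y))

# Paper's stated kNN defaults: k = 5, Euclidean metric,
# inverse-distance interpolation ("distance" weights in scikit-learn).
knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(Xl, Yl)

def predict_conc(x_new):
    """Map new total amounts to emulated concentrations."""
    yl = knn.predict(sx.transform(np.log10(x_new)))
    return 10 ** sy.inverse_transform(yl)
```

With inverse-distance weights, querying an exact training point returns its stored target, so the emulator interpolates between, rather than smooths over, the PHREEQC samples.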
\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It is seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the performance of kNN markedly degrades as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table2}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data.
Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 
0.9944 & 74.6 $\\times$ 10$^{-6}$ & 0.7180\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{2}$ & 96.4 $\\times$ 10$^{-5}$ & 0.9729 & 55.0 $\\times$ 10$^{-6}$ & 0.8468\\\\\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\label{table2}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. As written above, the domain sizes are both 61 $\\times$ 61 and 121 $\\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \\ref{fig3} and \\ref{fig4} present time series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\\rm py}$-kNN) and DNN-based (HPx$_{\\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN both achieve quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \\ref{fig5} - \\ref{fig6} provide more insights into the HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected so as to be well representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \\ref{fig5}).
For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \\ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \\ref{fig6}d - f). Notwithstanding, the Si concentration remains globally well predicted. Furthermore, Figures \\ref{fig7} - \\ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \\ref{fig5} - \\ref{fig6}. The original solid amount profiles are overall well approximated by HPx$_{\\rm py}$-kNN and HPx$_{\\rm py}$-DNN for the advection-dispersion case (Figure \\ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \\ref{fig8}), the same kind of mismatch is observed for the emulated solid amounts of H by HPx$_{\\rm py}$-kNN while the emulated profiles by HPx$_{\\rm py}$-DNN show some more discrepancies. Though we decided to present raw emulation results, we would like to stress that some if not all of the observed artifacts could likely be smoothed out by using some post-filtering such as median filtering.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure3.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$.
The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig3}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure4.png}\n\t\\caption{Time series of original (RTM, solid green lines) and HPx$_{\\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\\left[x,y\\right]$ locations (in cm). Obs. 1: $\\left[0.5, 2.5\\right]$, Obs. 2: $\\left[1, 2\\right]$, Obs. 3: $\\left[2, 2\\right]$, Obs. 4: $\\left[1, 1\\right]$, Obs. 5: $\\left[2, 1\\right]$. The results for the 121 $\\times$ 121 grid size are rather similar.}\n\t\\label{fig4}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure5.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig5}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure6.png}\n\t\\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case.
RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig6}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure7.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61. The results for the 121 $\\times$ 121 grid are rather similar.}\n\t\\label{fig7}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure8.png}\n\t\\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{solid}}$, $Si^{\\rm{solid}}$, $H^{\\rm{solid}}$, and $O^{\\rm{solid}}$, respectively.
The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig8}\n\\end{figure}\n\nThe speedups associated with the considered problem are detailed in Table \\ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal. Indeed, the DNN speedups overall represent 85 \\% to 95 \\% of the maximum possible speedups (that is, speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \\% - 65 \\% of the corresponding maximum speedups. As detailed in section \\ref{train_res1}, the used kNN and DNN implementations are found to be 300 and 4000 times faster, respectively, than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent say 90 \\% (kNN) or 99 \\% (DNN) of the maximum possible ones. A large part of the gaps between achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the used machine learning method for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size.
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & TC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\\\n\t\t\tDNN & DIF & 61 $\\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\\\n\t\t\tkNN & DIF & 61 $\\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table3}\n\\end{table}\n\n\\FloatBarrier\n\n\\subsection{Al-C-Ca-S-Si Problem}\n\nFor this second cement system, the emulation problem
consists of predicting at each time step of the RT simulation the (output) Al, C, Ca, S, Si, H and O aqueous concentrations (mol/kgw) from the (input) total amounts of Al, C, Ca, S and Si (mol). Here we focus on advective-dispersive transport only while, as for cement system 1, the considered domain sizes are 61 $\\times$ 61 and 121 $\\times$ 121. Moreover, the simulation time period is set to 6 years. \n\nAs mentioned earlier, for this higher-dimensional problem it is observed that the used scikit-learn kNN implementation becomes prohibitively slow compared to HPx$_{\\rm{4C}}$. We found that, to get good emulation accuracy, the kNN training base needs to contain 1,000,000 samples (or more). This training base size, together with a 5-dimensional search space, leads to an HPx$_{\\rm py}$-kNN reactive transport simulation time that is comparable to that of HPx$_{\\rm{4C}}$. Therefore, for this second cement system we built a custom kNN regressor around another kNN implementation contained in the FAISS package \\citep{faiss2017}. The FAISS variant we used allows for GPU computing and is much faster than scikit-learn for this cement system, but is slightly less accurate due to the use of an approximate rather than exact nearest neighbor search \\citep[see][for details]{faiss2017}.\n\n\\subsubsection{Training the DNN}\n\\label{train_res2}\n\nBuilding a good training set to perform a kNN search and learn the weights and biases of our DNN turned out to be a complicated task in this case. This is because, to make useful kNN predictions and/or learn a useful DNN, the training set must be sufficiently representative of the geochemical conditions encountered during the reactive transport simulation one wishes to perform with HPx$_{\\rm py}$-kNN and the trained HPx$_{\\rm py}$-DNN. In contrast to cement system 1, creating the training set by sampling the $X$-space with a controlled randomness between predefined lower and upper bounds did not prove successful.
We tried that strategy by drawing as many as 4,000,000 5-dimensional $\\textbf{x}$ vectors from the $X$-space using a Sobol low-discrepancy sequence \\citep[][]{Sobol1967, Joe-Kuo2003}. Such a low-discrepancy sampling scheme covers the 5-dimensional hypercube more uniformly than LHS. Despite a good performance on the test set (not shown), the resulting DNN accuracy in reactive transport mode was never deemed satisfactory. In other words, no satisfactory ``global\" or ``universal\" DNN emulator could be devised for this cement system. This is probably caused by the fact that, for this problem, the input (5 total amounts) and output (7 aqueous concentrations) spaces are quite nonlinearly related and both cover 6 to 10 orders of magnitude depending on the considered element. Therefore, we resorted to the alternative training strategy detailed below. The latter basically tries to grasp the complex correlations and higher-order dependencies that exist between the elements of $\\textbf{x}$ (total amounts, input space) for a given reactive transport simulation setup, in order to produce a training set that honors these between-input relationships.\n\n\\begin{itemize}\n\t\\item Perform a ``cheap\" full reactive transport simulation under the transport conditions and geochemistry of interest and collect the resulting $\\textbf{x}$-$\\textbf{y}$ pairs of examples (for the considered grid nodes and time steps). Computational demand controls what domain size and simulation time period can be used for this cheap calculation. We used a modest 16 $\\times$ 16 domain and a simulation time period of 10 years. The associated HPx$_{\\rm{4C}}$ runtime is 180 s.\n\t\n\t\\item Fit a kernel density estimator (KDE) with a Gaussian kernel to the collected $\\textbf{x}$ vectors (encapsulated in the $\\textbf{X}$ array) and generate a fixed number of new input vectors, $\\textbf{x}_{KDE}$. Then run PHREEQC for the $\\textbf{X}_{KDE}$ set to get the corresponding output set, $\\textbf{Y}_{KDE}$.
Now apply the correction for porosity described in section \\ref{train_res1} to the $\\textbf{X}_{KDE}$ set and form the training set by merging the ensemble of $\\textbf{x}$-$\\textbf{y}$ pairs with that of the $\\textbf{x}_{KDE}$-$\\textbf{y}_{KDE}$ pairs. The number of unique examples produced by the considered cheap HPx$_{\\rm{4C}}$ simulations varied between 10,000 and 50,000. The KDE-based enrichment of this dataset was deemed necessary to provide more input variability, thereby avoiding overfitting of the trained DNN and improving the kNN accuracy, while still honoring the complex between-input relationships. The number of KDE-generated samples was set so as to obtain a total training set size of 1,000,000 examples. A key component of the approach is the bandwidth parameter of the Gaussian KDE kernel, which controls how much the KDE-generated samples depart from the original ensemble. After limited trial and error, we fixed the kernel bandwidth to 0.0025 for the considered case studies.\n\t\n\\end{itemize}\n\nThe scatter plots in Figure \\ref{fig9} illustrate our training set creation procedure. The orange dots depict the pairwise relationships observed between the 5 elements of $\\textbf{x}$ in the cheap simulation. The turquoise and red dots in Figure \\ref{fig9} represent the KDE-generated samples with the selected bandwidth, before and after applying the correction for porosity, respectively. Training of the DNN is achieved using the ensemble of original and KDE-corrected input points. We refer to this kind of dataset as RT-based, since it is based on a full, albeit cheap, RT simulation.
Furthermore, we refer to the obtained DNN and kNN emulators as ``local\" emulators, since, as opposed to the emulators constructed for cement system 1, the current emulators are only valid for the input conditions encapsulated in the RT-based training set.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1.5cm}\\includegraphics[width=50pc]{Figure9.png}\n\t\\caption{Scatter plots of the complex relationships between the five considered inputs for (1) the computationally cheap RT simulation of cement system 2 (orange dots) performed with the original HPx code, (2) the corresponding sampled points by kernel density estimation (KDE) using the selected bandwidth (turquoise dots) and (3) the same sampled KDE points after porosity correction (red dots).}\n\t\\label{fig9}\n\\end{figure}\n\nTraining performance of the DNN emulator is presented in Figure \\ref{fig10}. As for cement system 1, about 90\\% of the available data was used for the actual training of the DNN while the remaining 10\\% were used as a validation set to control overfitting. Lastly, performance is evaluated for both the trained DNN and kNN emulators using an independent test set of 10,000 examples. Overall, the accuracy of our ``local\" DNN emulator for this RT-inspired dataset is rather high, with $Q_2$ values always greater than or equal to 0.998. Training performance of the corresponding local kNN emulator is equally good (not shown). Regarding speedup, for this problem single-threaded PHREEQC achieves about 210 geochemical calculations per second on our used Intel\\textsuperscript{\\textregistered} i7 CPU while the GPU-based DNN and kNN emulators are both about 3000 times faster when predicting the 10,000 test points at once.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure10.png}\n\t\\caption{1-1 plots of local DNN emulation performance obtained for system 2 when the local DNN is trained using 1,000,000 samples.
The $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig10}\n\\end{figure}\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob2res}\nOur ``local\" DNN performs rather well when applied to the 61 $\\times$ 61 grid size and a time period of 6 years (Figures \\ref{fig11} - \\ref{fig12}). This holds for all components except C, for which some localized deviations appear between original and emulated 2D concentration profiles towards the end of the simulation period (Figure \\ref{fig12}). When applied to the 121 $\\times$ 121 grid, HPx$_{\\rm py}$-DNN produces additional discrepancies for O, Si and Al towards the end of the simulation period (Figures \\ref{fig13} - \\ref{fig14}). Yet most of the observed artifacts could probably be smoothed out by using post-filtering. The associated speedups are listed in Table \\ref{table4}. These speedups are larger than those obtained for cement system 1, with values between 8 and 9 when evaluated against HPx$_{\\rm{4C}}$. These speedups represent about 85 \\% to 90 \\% of the maximum possible speedup (Table \\ref{table3}). Overall, these findings indicate that for the considered problem, our RT-based training of a local DNN only works if the training set is sufficiently representative of the particular geochemical conditions encountered in the computationally demanding simulations, which is arguably not easy to achieve. This limitation is further discussed in section \\ref{discussion}.\n\nWe note a more uniform behavior for HPx$_{\\rm py}$-kNN across grid sizes than for HPx$_{\\rm py}$-DNN. 
Here the results for the 121 $\\times$ 121 grid are only slightly less accurate than those associated with the 61 $\\times$ 61 grid (see Figures \\ref{fig15} - \\ref{fig16} where for brevity we only show concentration profiles for the C, Al and S components). Furthermore, when observed, the discrepancies between original and emulated profiles are more regularly scattered than for HPx$_{\\rm py}$-DNN. Note also that herein too, post-filtering could likely smooth out a large part of these deviations. In addition, owing to the use of a GPU to achieve the kNN calculations, the speedups provided by HPx$_{\\rm py}$-kNN are as large as those provided by HPx$_{\\rm py}$-DNN (Table \\ref{table4}).\n\nWith respect to the emulated solid amounts, the HPx$_{\\rm py}$-DNN results look visually good for the 61 $\\times$ 61 grid. This is shown in Figure \\ref{fig17} for the C, Al and S chemical components, while emulation of the H, O, Ca and Si chemical components is globally of similar quality (not shown). Nevertheless, significant deviations appear towards the end of the simulation period for every chemical component (see Figure \\ref{fig18} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components shows the same level of mismatch). The HPx$_{\\rm py}$-kNN predictions are also fairly accurate for the 61 $\\times$ 61 grid (see Figure \\ref{fig19} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components exhibits a globally similar quality) while some discrepancies show up at the end of the simulation (Figure \\ref{fig20}). However, the mismatch is less pronounced than for HPx$_{\\rm py}$-DNN. 
Overall, HPx$_{\\rm py}$-kNN appears to be somewhat more robust than HPx$_{\\rm py}$-DNN for this cement system, while providing the same (large) speedup.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure11.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig11}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure12.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig12}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure13.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. 
The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig13}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure14.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig14}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure15.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig15}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure16.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig16}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure17.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. 
RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig17}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure18.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig18}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure19.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig19}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure20.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). 
The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig20}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the used machine learning method for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & BC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table4}\n\\end{table}\n\n\\FloatBarrier\n\n", "Descriptive_question1": "What is the speedup value for the DNN emulator under advection-dispersion conditions with a grid size of 61 x 61 in table_3?", "Descriptive_question2": "What is the maximum possible speedup for HPx_4C under diffusion conditions with a grid size of 61 x 61 in table_3?", "Reasoning_question1": "Why might the achieved speedup for the kNN emulator be significantly lower than the maximum possible speedup in table_3, particularly for advection-dispersion conditions with a grid size of 121 x 121?", "Reasoning_question2": "How does the performance of the DNN emulator compare to the kNN emulator in terms of speedup efficiency across different transport conditions and grid sizes in table_3?", "Descriptive_answer1": "6.8", "Descriptive_answer2": "4.9", "Reasoning_answer1": "The achieved speedup for the kNN emulator is significantly lower than the maximum possible speedup, especially for advection-dispersion conditions with a grid size of 121 x 121, due to several contributing factors. 
First, the kNN emulator operates on a single CPU using the scikit-learn implementation, which inherently limits its computational efficiency compared to GPU-based solutions. In contrast, the maximum possible speedup assumes zero computational cost for geochemical calculations, an ideal scenario not achievable in practice. For the specific case of grid size 121 x 121 under advection-dispersion conditions, the achieved speedup is only 3.4 compared to a maximum possible speedup of 5.2 for HPx_4C. This gap can be attributed to the increased computational demand with larger grid sizes, where the kNN emulator struggles to scale efficiently. Additionally, as mentioned in the text, a significant part of the performance gap likely arises from the overhead of data communication and exchange between the main C/C++ code and the Python-based kNN emulator, which becomes more pronounced with larger datasets and complex simulations. These factors collectively reduce the achieved speedup to approximately 65% of the maximum possible speedup, far below the expected potential based on standalone kNN prediction speedups (300 times faster than single-threaded PHREEQC for 10,000 points).", "Reasoning_answer2": "The performance of the DNN emulator consistently surpasses that of the kNN emulator in terms of speedup efficiency across different transport conditions and grid sizes in table_3. Let's analyze this step-by-step. First, under advection-dispersion (ADV) conditions with a grid size of 61 x 61, the DNN emulator achieves a speedup of 6.8 (HPx_4C) compared to kNN's 5.0, representing approximately 88% of the maximum possible speedup (7.7) for DNN versus 65% for kNN. Similarly, for HPx_1C, DNN achieves 24.5 (86% of max 28.5) versus kNN's 18.0 (63% of max 28.5). Next, for ADV conditions with a larger grid size of 121 x 121, DNN's speedup is 5.0 (96% of max 5.2) compared to kNN's 3.4 (65% of max 5.2) for HPx_4C, and 17.0 (94% of max 18.0) versus kNN's 11.6 (64% of max 18.0) for HPx_1C. 
Under diffusion (DIF) conditions with a grid size of 61 x 61, DNN achieves a speedup of 4.2 (86% of max 4.9) compared to kNN's 2.8 (57% of max 4.9) for HPx_4C, and 16.2 (85% of max 19.1) versus kNN's 10.8 (57% of max 19.1) for HPx_1C. The trend is clear: DNN consistently achieves 85-96% of the maximum possible speedup across all scenarios, while kNN lags at 57-65%. This disparity is largely due to DNN's utilization of GPU acceleration, which significantly enhances computational efficiency, whereas kNN's single-CPU implementation and associated data transfer overheads limit its performance. Thus, DNN demonstrates superior speedup efficiency across varying conditions and grid sizes." }, { "paper_id": "2107.07598.json", "table_id": "table_4", "table_content": "\\begin{table}[h!]\n\t\\caption{Speedups offered by the KNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the used machine learning method for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. 
The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HP$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to an hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & BC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table4}\n\\end{table}", "caption": "Speedups offered by the KNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the used machine learning method for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HP$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to an hypothetical situation where the geochemical calculations incur zero computational cost.", "label": "table4", "section_info": "3 Results\n\\section{Results}\n\\label{results}\n\n\\subsection{Ca-Si Problem}\nFor this first cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Ca, Si, H and O aqueous concentrations (mol/kg of water or mol/kgw) from the (input) total amounts of Ca and Si (mol). 
\n\n\\subsubsection{Training the emulators}\n\\label{train_res1}\nHere the kNN and DNN emulators are first trained using a set of 400,000 training examples for both. This training set is obtained by randomly sampling the two-dimensional input space by Latin hypercube sampling (LHS) between $\\left[0,0\\right]$ and $\\rm{\\left[Ca^{tot}_{max},Si^{tot}_{max}\\right]}$, and running PHREEQC for each input sample, $\\rm{\\textbf{x}_i = \\left[Ca^{tot}_i, Si^{tot}_i\\right]}$ to get the corresponding output vectors, $\\rm{\\textbf{y}_i = \\left[Ca^{conc}_i,Si^{conc}_i,H^{conc}_i,O^{conc}_i\\right]}$. The upper bounds, $\\rm{Ca^{tot}_{max}}$ and $\\rm{Si^{tot}_{max}}$, are defined based on a cheap full RT simulation with advective-dispersive transport using a small 1D domain of 51 nodes. It is worth noting that the total amounts of $\\textbf{x}_i$, corresponding to the PHREEQC-simulated concentrations, $\\textbf{y}_i$, have to be corrected for the different amount of water between the training set and the transport simulations. In doing so, it turns out that about 20 \\% of the post-corrected $x_i$ values exceed their pre-defined upper bounds and these excessively large values need to be filtered out. Creating the 400,000 training examples thus required about 500,000 PHREEQC runs. As stated earlier, for this problem single-threaded PHREEQC performs about 670 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU.\n\nWith respect to kNN, the tuning parameters are the number of neighbors, $k$, the type of distance measure, and the interpolation technique. We simply used the default settings: $k = 5$, Euclidean distance and inverse-distance interpolation. Regarding training of the DNN, the 400,000 samples were split between the training set itself (90 \\% of the data) and a validation set (10 \\% of the data). 
The latter serves to monitor the evolution of the selected mean squared error loss function on samples that are not used for training, thereby detecting potential overfitting. If the validation loss stops decreasing before the fixed number of epochs has been completed, then training is stopped. Importantly, the emulation is achieved in log-space for both the input ($X$) and output ($Y$) domains. This is because total amounts and concentrations of the involved components typically cover many orders of magnitude (up to 10 orders or more). Using a DNN also requires some form of data normalization or standardization. Here both the $\\rm{log\\left(\\textbf{x}_i\\right)}$ and $\\rm{log\\left(\\textbf{y}_i\\right)}$ vectors are standardized around 0 with a standard deviation of 1.\n\nFigure \\ref{fig1} illustrates the trained emulators' performance for geochemical predictions using an independent test set that comprises 10,000 test examples. Both kNN (Figures \\ref{fig1}a - d) and DNN (Figures \\ref{fig1}e - h) appear to be rather accurate. DNN, however, shows a slight degradation for the larger concentration values (Figures \\ref{fig1}e - h). The latter is likely due to the combination of a small proportion of large concentration values in the training set with the log-transformation that implicitly pushes the DNN to try harder to fit the smaller concentrations during training. Regarding speedup and as written earlier, for this setup the single-threaded kNN method is 300 times faster than single-threaded PHREEQC for predicting the 10,000 concentration vectors all at once. The computational savings allowed by the DNN emulator when run on our GPU are higher, with a speedup as large as 4000 for predicting the same 10,000 concentration vectors all at once. 
\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure1.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 400,000 samples and the DNN is trained using the same 400,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig1}\n\\end{figure}\n\nTo test the sensitivity of the emulators' performance to the training set size, training was also performed using reduced training sets comprising 100,000, 10,000, 1000 and 100 samples, respectively. It is seen that the DNN performance achieved when using 10,000 training samples is virtually the same as that obtained when using 400,000 training samples (Figure \\ref{fig2} and Table \\ref{table2}). It is only for training sets smaller than 1000 samples that the DNN performance starts to degrade significantly (Table \\ref{table2}). In contrast, the performance of kNN degrades markedly as the training set gets smaller (Figure \\ref{fig2} and Table \\ref{table2}).\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{0cm}\\includegraphics[width=35pc]{Figure2.png}\n\t\\caption{1-1 plots of the kNN (subfigures (a) - (d)) and DNN (subfigures (e) - (h)) emulators' performance obtained for system 1 when the kNN training base contains 10,000 samples and the DNN is trained using the same 10,000 samples. Here ``true\" means the original PHREEQC-simulated data and ``predicted\" denotes the emulated (that is, kNN-simulated and DNN-simulated) data. 
Hence, the $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\label{fig2}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Performance of the DNN and kNN emulators for cement system 1 and different training set sizes. For brevity, only the results for Ca$^{conc}$ and Si$^{conc}$ are shown. The units are mol per kg of water (mol/kgw). ML refers to the type of emulator, TR signifies the size of the training set, and the RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & \\\\\n\t\t\tML & TR & RMSE - Ca$^{conc}$ & $Q_2$ - Ca$^{conc}$ & RMSE - Si$^{conc}$ & $Q_2$ - Si$^{conc}$\\\\\n\t\t\tDNN & 4 $\\times$ 10$^{5}$ & 4.50 $\\times$ 10$^{-5}$ & 0.9999 & 1.21 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{5}$ & 5.06 $\\times$ 10$^{-5}$ & 0.9999 & 1.30 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{4}$ & 4.57 $\\times$ 10$^{-5}$ & 0.9999 & 1.48 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{3}$ & 8.02 $\\times$ 10$^{-5}$ & 0.9998 & 3.31 $\\times$ 10$^{-6}$ & 0.9994\\\\\n\t\t\tDNN & 1 $\\times$ 10$^{2}$ & 47.1 $\\times$ 10$^{-5}$ & 0.9935 & 9.02 $\\times$ 10$^{-6}$ & 0.9959\\\\\n\t\t\tkNN & 4 $\\times$ 10$^{5}$ & 2.52 $\\times$ 10$^{-5}$ & 1.0000 & 1.59 $\\times$ 10$^{-6}$ & 0.9999\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{5}$ & 6.27 $\\times$ 10$^{-5}$ & 0.9999 & 4.20 $\\times$ 10$^{-6}$ & 0.9991\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{4}$ & 14.1 $\\times$ 10$^{-5}$ & 0.9994 & 13.5 $\\times$ 10$^{-6}$ & 0.9908\\\\\n\t\t\tkNN & 1 $\\times$ 10$^{3}$ & 43.9 $\\times$ 10$^{-5}$ & 
0.9944 & 74.6 $\times$ 10$^{-6}$ & 0.7180\\\n\t\t\tkNN & 1 $\times$ 10$^{2}$ & 96.4 $\times$ 10$^{-5}$ & 0.9729 & 55.0 $\times$ 10$^{-6}$ & 0.8468\\\n\t\t\t\hline\n\t\t\end{tabular}\n\t\end{center}\n\t\label{table2}\n\end{table}\n\n\FloatBarrier\n\n\subsubsection{Reactive transport simulation}\n\label{prob12Dres}\n\nThis section focuses on reactive transport simulations with HPx$_{\rm py}$ within cement system 1, under both advective-dispersive and diffusive transport conditions. As written above, the domain sizes are 61 $\times$ 61 and 121 $\times$ 121 for the advection-dispersion case and, because of computational constraints, solely 61 $\times$ 61 for the diffusion case. In addition, the simulation time period is 2 years for the advection-dispersion case and 1 year for the diffusion case. Figures \ref{fig3} and \ref{fig4} present time series of original and emulated Ca, Si, H and O concentrations at 5 locations within the 2D domain for advective-dispersive transport conditions, for both our kNN-based (HPx$_{\rm py}$-kNN) and DNN-based (HPx$_{\rm py}$-DNN) reactive transport codes. It is seen that HPx$_{\rm py}$-kNN and HPx$_{\rm py}$-DNN both achieve quite good simulation accuracy. Also, the results for the diffusive transport case are of similarly good quality (not shown). Figures \ref{fig5} - \ref{fig6} provide more insights into the HPx$_{\rm py}$-kNN and HPx$_{\rm py}$-DNN performances by displaying 2D Ca, Si, H and O concentration profiles at a given time. For each experiment and chemical component, this time is selected so as to be well representative of the simulated dynamics. It is observed that the original and emulated images are visually almost indistinguishable for the advection-dispersion case (Figure \ref{fig5}). 
For the diffusion case, the emulators also perform quite well for Ca, H and O (Figures \ref{fig6}a - c, g - i and j - l), while some slight to moderate discrepancies appear at the concentration front for Si (Figures \ref{fig6}d - f). Notwithstanding, the Si concentration remains globally well predicted. Furthermore, Figures \ref{fig7} - \ref{fig8} present the original and emulated 2D solid amount profiles corresponding to Figures \ref{fig5} - \ref{fig6}. The original solid amount profiles are overall well approximated by HPx$_{\rm py}$-kNN and HPx$_{\rm py}$-DNN for the advection-dispersion case (Figure \ref{fig7}), even though some mismatch appears at the border of the fully depleted zone for the H component. As for the diffusion case (Figure \ref{fig8}), the same kind of mismatch is observed for the solid amounts of H emulated by HPx$_{\rm py}$-kNN, while the profiles emulated by HPx$_{\rm py}$-DNN show somewhat larger discrepancies. Though we decided to present raw emulation results, we would like to stress that some if not all of the observed artifacts could likely be smoothed out by some post-filtering such as median filtering.\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure3.png}\n\t\caption{Time series of original (RTM, solid green lines) and HPx$_{\rm py}$-kNN emulated (TM+kNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\left[x,y\right]$ locations (in cm). Obs. 1: $\left[0.5, 2.5\right]$, Obs. 2: $\left[1, 2\right]$, Obs. 3: $\left[2, 2\right]$, Obs. 4: $\left[1, 1\right]$, Obs. 5: $\left[2, 1\right]$. 
The results for the 121 $\times$ 121 grid size are rather similar.}\n\t\label{fig3}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure4.png}\n\t\caption{Time series of original (RTM, solid green lines) and HPx$_{\rm py}$-DNN emulated (TM+DNN, dashed orange lines) concentrations (mol/kg) of Ca, H, O and Si at selected observation points for cement system 1 and advective-dispersive transport. Obs. 1 - 5 denote the selected observation points, with the following $\left[x,y\right]$ locations (in cm). Obs. 1: $\left[0.5, 2.5\right]$, Obs. 2: $\left[1, 2\right]$, Obs. 3: $\left[2, 2\right]$, Obs. 4: $\left[1, 1\right]$, Obs. 5: $\left[2, 1\right]$. The results for the 121 $\times$ 121 grid size are rather similar.}\n\t\label{fig4}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure5.png}\n\t\caption{2D concentration profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\rm{conc}}$, $Si^{\rm{conc}}$, $H^{\rm{conc}}$, and $O^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61. The results for the 121 $\times$ 121 grid are rather similar.}\n\t\label{fig5}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure6.png}\n\t\caption{2D concentration profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. 
RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\rm{conc}}$, $Si^{\rm{conc}}$, $H^{\rm{conc}}$, and $O^{\rm{conc}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig6}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure7.png}\n\t\caption{2D solid amount profiles obtained for cement system 1 at the end of the 2-year simulation performed for the advection-dispersion case. RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\rm{solid}}$, $Si^{\rm{solid}}$, $H^{\rm{solid}}$, and $O^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61. The results for the 121 $\times$ 121 grid are rather similar.}\n\t\label{fig7}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure8.png}\n\t\caption{2D solid amount profiles obtained for cement system 1 at the final time step of the 1-year simulation performed for the diffusion case. RTM means the original HPx$_{\rm{4C}}$ model, TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN) and TM+DNN signifies the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\rm{solid}}$, $Si^{\rm{solid}}$, $H^{\rm{solid}}$, and $O^{\rm{solid}}$, respectively. 
The considered grid size is 61 $\times$ 61.}\n\t\label{fig8}\n\end{figure}\n\nThe speedups associated with the considered problem are detailed in Table \ref{table3}. It is noted that the GPU-based DNN emulator allows for a speedup that is close to optimal. Indeed, the DNN speedups overall represent 85 \% to 95 \% of the maximum possible speedups (that is, the speedups that would be obtained if the geochemical calculations came at no cost at all). The speedups associated with single-threaded kNN remain substantial but only amount to 57 \% - 65 \% of the corresponding maximum speedups. As detailed in section \ref{train_res1}, the used kNN and DNN implementations are found to be 300 and 4000 times faster, respectively, than single-threaded PHREEQC when predicting 10,000 points all at once for this geochemical system. Based on these numbers, one could have expected the achieved speedups to represent say 90 \% (kNN) or 99 \% (DNN) of the maximum possible ones. A large part of the gap between achieved and maximum possible speedups is thus likely caused by the time required for communicating and exchanging data between the main C/C++ code and the Python-based emulators.\n\n\begin{table}[h!]\n\t\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\rm py}$ for the reactive transport simulations considered for cement system 1. The HPx$_{\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\rm{1C}}$ calculations are performed on a single CPU. The kNN predictions are performed on a single CPU using the scikit-learn implementation while the DNN predictions make use of our GPU. ML signifies the machine learning method used for emulation, TC denotes transport conditions (ADV: advection-dispersion, DIF: diffusion) and GS is the grid size. 
The maximum possible speedups associated with HPx$_{\rm{4C}}$ and HPx$_{\rm{1C}}$, Max SP HPx$_{\rm{4C}}$ and Max SP HPx$_{\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\begin{center}\n\t\t\begin{tabular}{cccccccc}\n\t\t\t\hline\n\t\t\t& & & & & & & \\\n\t\t\tML & TC & GS & HPx$_{\rm{4C}}$ time (s) & SP HPx$_{\rm{4C}}$ & Max SP HPx$_{\rm{4C}}$ & SP HPx$_{\rm{1C}}$ & Max SP HPx$_{\rm{1C}}$ \\\n\t\t\tDNN & ADV & 61 $\times$ 61 & 3189 & 6.8 & 7.7 & 24.5 & 28.5 \\\n\t\t\tkNN & ADV & 61 $\times$ 61 & 3189 & 5.0 & 7.7 & 18.0 & 28.5 \\\n\t\t\tDNN & ADV & 121 $\times$ 121 & 23,337 & 5.0 & 5.2 & 17.0 & 18.0 \\\n\t\t\tkNN & ADV & 121 $\times$ 121 & 23,337 & 3.4 & 5.2 & 11.6 & 18.0 \\\n\t\t\tDNN & DIF & 61 $\times$ 61 & 25,448 & 4.2 & 4.9 & 16.2 & 19.1 \\\n\t\t\tkNN & DIF & 61 $\times$ 61 & 25,448 & 2.8 & 4.9 & 10.8 & 19.1 \\\n\t\t\t\hline\n\t\t\end{tabular}\n\t\end{center}\n\n\t\label{table3}\n\end{table}\n\n\FloatBarrier\n\n\subsection{Al-C-Ca-S-Si Problem}\n\nFor this second cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Al, C, Ca, S, Si, H and O aqueous concentrations (mol/kgw) from the (input) total amounts of Al, C, Ca, S and Si (mol). Here we focus on advective-dispersive transport only; as for cement system 1, the considered domain sizes are 61 $\times$ 61 and 121 $\times$ 121. Moreover, the simulation time period is set to 6 years. \n\nAs mentioned earlier, for this higher-dimensional problem it is observed that the scikit-learn kNN implementation used becomes prohibitively slow compared to HPx$_{\rm{4C}}$. We found that, to get good emulation accuracy, the kNN training base needs to contain 1,000,000 samples (or more). 
This training base's size together with a 5-dimensional search space leads to an HPx$_{\rm py}$-kNN reactive transport simulation time that is comparable to that of HPx$_{\rm{4C}}$. Therefore, for this second cement system we built a custom kNN regressor around another kNN implementation contained in the FAISS package \citep{faiss2017}. The FAISS variant we used allows for GPU computing and is much faster than scikit-learn for this cement system, but is slightly less accurate due to the use of an approximate rather than exact nearest neighbor search \citep[see][for details]{faiss2017}.\n\n\subsubsection{Training the DNN}\n\label{train_res2}\n\nBuilding a good training set to perform a kNN search and learn the weights and biases of our DNN turned out to be a complicated task in this case. This is because, to make useful kNN predictions and/or learn a useful DNN, the training set must be sufficiently representative of the geochemical conditions encountered during the reactive transport simulation one wishes to perform with HPx$_{\rm py}$-kNN and the trained HPx$_{\rm py}$-DNN. In contrast to cement system 1, creating the training set by sampling the $X$-space with a controlled randomness between predefined lower and upper bounds did not prove successful. We tried that strategy by drawing as many as 4,000,000 5-dimensional $\textbf{x}$ vectors from the $X$-space using a Sobol low-discrepancy sequence \citep[][]{Sobol1967, Joe-Kuo2003}. 
This is probably caused by the fact for this problem, the input (5 total amounts) and output (7 aqueous concentrations) spaces are quite nonlinearly related and both cover 6 to 10 orders of magnitudes depending on the considered element. Therefore, we resorted to the alternative training strategy detailed below. The latter basically tries to grasp the complex correlations and higher-order dependencies that exist between the elements of $\\textbf{x}$ (total amounts, input space) for a given reactive transport simulation setup, in order to produce a training set that honors these between-input relationships.\n\n\\begin{itemize}\n\t\\item Perform a ``cheap\" full reactive transport simulation under the transport conditions and geochemistry of interest and collect the resulting $\\textbf{x}$ -$\\textbf{y}$ pairs of examples (for the considered grid nodes and time steps). Computational demand controls what domain size and simulation time period can be used for this cheap calculation. We used a modest 16 $\\times$ 16 domain and a simulation time period of 10 years. The associated HPx$_{\\rm{4C}}$ runtimes is 180 s.\n\t\n\t\\item Fit a kernel density estimator (KDE) with a Gaussian kernel to the collected $\\textbf{x}$ vectors (encapsulated in the $\\textbf{X}$ array) and generate a fixed number of new input vectors, $\\textbf{x}_{KDE}$. Then run PHREEQC for the $\\textbf{X}_{KDE}$ set to get the corresponding output set, $\\textbf{Y}_{KDE}$. Now apply the correction for porosity described in section \\ref{train_res1} to the $\\textbf{X}_{KDE}$ set and form the training set by merging the ensemble of $\\textbf{x}$-$\\textbf{y}$ pairs with that of the $\\textbf{x}_{KDE}$-$\\textbf{y}_{KDE}$ pairs. The number of produced unique examples by the considered cheap HPx$_{\\rm{4C}}$ simulations varied between 10,000 and 50,000. 
The KDE-based enrichment of this dataset was deemed necessary to provide more input variability, thereby avoiding overfitting of the trained DNN and improving the kNN accuracy, while still honoring the complex between-input relationships. The number of KDE-generated samples was set so as to obtain a total training set size of 1,000,000 examples. A key component of the approach is the bandwidth parameter of the Gaussian KDE kernel, which controls how much the KDE-generated samples depart from the original ensemble. After limited trial and error, we fixed the kernel bandwidth to 0.0025 for the considered case studies.\n\t\n\end{itemize}\n\nThe scatter plots in Figure \ref{fig9} illustrate our training set creation procedure. The orange dots depict the pairwise relationships observed between the 5 elements of $\textbf{x}$ in the cheap simulation. The turquoise and red dots in Figure \ref{fig9} represent the KDE-generated samples with the selected bandwidth, before and after applying the correction for porosity, respectively. Training of the DNN is achieved using the ensemble of original and KDE-corrected input points. We refer to this kind of dataset as RT-based, since it is based on a full, albeit cheap, RT simulation. 
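The KDE-based enrichment step of this training set creation procedure can be sketched as follows. This is a hedged illustration: the synthetic correlated input ensemble, the 0.05 bandwidth, and the target size are placeholders (the actual study used a 0.0025 bandwidth on real RT-derived inputs), and the PHREEQC evaluation and porosity correction of the generated inputs are only indicated in comments.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)

# Synthetic stand-in for the x-vectors (total amounts, suitably transformed)
# collected from a cheap, coarse-grid RT simulation; the real ensembles held
# 10,000-50,000 unique examples with strong between-input correlations,
# mimicked here by a random linear mixing of independent normals.
X_cheap = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))

# Fit a Gaussian KDE to the collected inputs. The bandwidth controls how far
# the generated samples may depart from the original ensemble.
kde = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(X_cheap)

# Enrich the training base up to a target size with KDE-generated inputs.
target_size = 10_000
X_kde = kde.sample(target_size - len(X_cheap), random_state=0)

# In the actual workflow the new inputs would be porosity-corrected and run
# through PHREEQC to obtain matching outputs; here we only merge the inputs.
X_train = np.vstack([X_cheap, X_kde])
```

Because each KDE sample is a data point perturbed by narrow Gaussian noise, the enriched set adds variability while staying close to the between-input relationships of the original ensemble.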
Furthermore, we refer to the obtained DNN and kNN emulators as ``local\" emulators, since, as opposed to the emulators constructed for cement system 1, the current emulators are only valid for the input conditions encapsulated in the RT-based training set.\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1.5cm}\includegraphics[width=50pc]{Figure9.png}\n\t\caption{Scatter plots of the complex relationships between the five considered inputs for (1) the computationally cheap RT simulation of cement system 2 (orange dots) performed with the original HPx code, (2) the corresponding sampled points by kernel density estimation (KDE) using the selected bandwidth (turquoise dots) and (3) the same sampled KDE points after porosity correction (red dots).}\n\t\label{fig9}\n\end{figure}\n\nThe test performance of the DNN emulator is presented in Figure \ref{fig10}. As for cement system 1, about 90\% of the available data was used for the actual training of the DNN, while the remaining 10\% were used as a validation set to control overfitting. Lastly, performance is evaluated for both the trained DNN and kNN emulators using an independent test set of 10,000 examples. Overall, the accuracy of our ``local\" DNN emulator for this RT-inspired dataset is rather high, with $Q_2$ values always greater than or equal to 0.998. The test performance of the corresponding local kNN emulator is equally good (not shown). Regarding speedup, for this problem single-threaded PHREEQC achieves about 210 geochemical calculations per second on our Intel\textsuperscript{\textregistered} i7 CPU, while the GPU-based DNN and kNN emulators are both about 3000 times faster when predicting the 10,000 test points at once.\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure10.png}\n\t\caption{1-1 plots of local DNN emulation performance obtained for system 2 when the local DNN is trained using 1,000,000 samples. 
The $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square-error and coefficient of determination in testing mode, respectively, between the original and emulated 10,000 test data points.}\n\t\label{fig10}\n\end{figure}\n\n\subsubsection{Reactive transport simulation}\n\label{prob2res}\nOur ``local\" DNN performs rather well when applied to the 61 $\times$ 61 grid size and a time period of 6 years (Figures \ref{fig11} - \ref{fig12}). This holds for all components but C, for which some localized deviations appear between original and emulated 2D concentration profiles towards the end of the simulation period (Figure \ref{fig12}). When applied to the 121 $\times$ 121 grid, HPx$_{\rm py}$-DNN produces additional discrepancies for O, Si and Al towards the end of the simulation period (Figures \ref{fig13} - \ref{fig14}). Yet most of the observed artifacts could probably be smoothed out by using post-filtering. The associated speedups are listed in Table \ref{table4}. These speedups are larger than those obtained for cement system 1, with values between 8 and 9 when evaluated against HPx$_{\rm{4C}}$. These speedups represent about 85 \% to 90 \% of the maximum possible speedups (Table \ref{table4}). Overall, these findings indicate that for the considered problem, our RT-based training of a local DNN only works if the training set is sufficiently representative of the particular geochemical conditions encountered in the computationally demanding simulations, which is arguably not easy to achieve. This limitation is further discussed in section \ref{discussion}.\n\nWe note a more uniform behavior for HPx$_{\rm py}$-kNN across grid sizes than for HPx$_{\rm py}$-DNN. 
Here the results for the 121 $\times$ 121 grid are only slightly less accurate than those associated with the 61 $\times$ 61 grid (see Figures \ref{fig15} - \ref{fig16}, where for brevity we only show concentration profiles for the C, Al and S components). Furthermore, whenever observed, the discrepancies between original and emulated profiles are more regularly scattered than for HPx$_{\rm py}$-DNN. Note also that herein too, post-filtering could likely smooth out a large part of these deviations. In addition, owing to the use of a GPU to achieve the kNN calculations, the speedups provided by HPx$_{\rm py}$-kNN are as large as those provided by HPx$_{\rm py}$-DNN (Table \ref{table4}).\n\nWith respect to the emulated solid amounts, the HPx$_{\rm py}$-DNN results look visually good for the 61 $\times$ 61 grid. This is shown in Figure \ref{fig17} for the C, Al and S chemical components, while emulation of the H, O, Ca and Si chemical components is globally of similar quality (not shown). Nevertheless, significant deviations appear for the 121 $\times$ 121 grid towards the end of the simulation period for every chemical component (see Figure \ref{fig18} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components shows the same level of mismatch). The HPx$_{\rm py}$-kNN predictions are also fairly accurate for the 61 $\times$ 61 grid (see Figure \ref{fig19} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components exhibits a globally similar quality) while some discrepancies show up at the end of the simulation for the 121 $\times$ 121 grid (Figure \ref{fig20}). However, the mismatch is less pronounced than for HPx$_{\rm py}$-DNN. 
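The median post-filtering mentioned above as a possible remedy for isolated emulation artifacts can be sketched as follows; this is a toy illustration on a synthetic 61 $\times$ 61 field, where the sigmoid front shape and the salt-and-pepper artifact model are our assumptions, not the actual emulator errors.

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(2)

# Synthetic 61 x 61 "emulated concentration" field: a smooth front plus a few
# isolated pixel artifacts, loosely mimicking deviations at a reaction front.
yy, xx = np.meshgrid(np.linspace(0, 1, 61), np.linspace(0, 1, 61), indexing="ij")
field = 1.0 / (1.0 + np.exp((xx - 0.5) / 0.05))  # smooth sigmoid front
noisy = field.copy()
spots = rng.integers(0, 61, size=(30, 2))
noisy[spots[:, 0], spots[:, 1]] += rng.uniform(0.5, 1.0, size=30)  # speckle

# A 3 x 3 median filter removes isolated outliers while largely preserving
# the front itself (unlike a mean filter, which would smear the outliers).
smoothed = median_filter(noisy, size=3)

err_before = float(np.abs(noisy - field).max())
err_after = float(np.abs(smoothed - field).max())
```

The median is robust to a minority of outliers inside each window, which is why this kind of filter suits isolated single-pixel artifacts better than linear smoothing.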
Overall, HPx$_{\\rm py}$-kNN appears to be somewhat more robust than HPx$_{\\rm py}$-DNN for this cement system, while providing the same (large) speedup.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure11.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig11}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure12.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig12}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure13.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. 
The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig13}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure14.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig14}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure15.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig15}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure16.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig16}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure17.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. 
RTM means the original HPx$_{\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to third row present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig17}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure18.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\rm py}$-DNN). The first to third row present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 121 $\times$ 121.}\n\t\label{fig18}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure19.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN). The first to third row present profiles for $C^{\rm{solid}}$, $Al^{\rm{solid}}$, and $S^{\rm{solid}}$, respectively. The considered grid size is 61 $\times$ 61.}\n\t\label{fig19}\n\end{figure}\n\n\begin{figure}[h!]\n\t\noindent\hspace{-1cm}\includegraphics[width=45pc]{Figure20.png}\n\t\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\rm py}$-kNN). 
The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig20}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the KNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the used machine learning method for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HP$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to an hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & BC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table4}\n\\end{table}\n\n\\FloatBarrier\n\n3.2 Al-C-Ca-S-Si Problem\n\\subsection{Al-C-Ca-S-Si Problem}\n\nFor this second cement system, the emulation problem consists of predicting at each time step of the RT simulation the (output) Al, C, Ca, S, Si, H and O aqueous concentrations (mol/kgw) from the (input) total amounts of Al, C, Ca, S and Si (mol). 
Here we focus on advective-dispersive transport only while similarly as for cement system 1, the considered domain sizes are 61 $\\times$ 61 and 121 $\\times$ 121. Moreover, the simulation time period is set to 6 years. \n\nAs mentioned earlier, for this higher-dimensional problem it is observed that the used scikit-learn kNN implementation becomes prohibitively slow compared to HPx$_{\\rm{4C}}$. We found that to get a good emulation accuracy the kNN training base needs to contain 1,000,000 samples (or more). This training base's size together with a 5-dimensional search space leads to an HPx$_{\\rm py}$-kNN reactive transport simulation time that is comparable to that of HPx$_{\\rm{4C}}$. Therefore, for this second cement system we built a custom kNN regressor around another kNN implementation contained in the FAISS package \\citep{faiss2017}. Our used FAISS variant allows for GPU computing and is much faster than scikit-learn for this cement system, but is slightly less accurate due to the use of an approximate rather than exact nearest neighbor search \\citep[see][for details]{faiss2017}.\n\n\\subsubsection{Training the DNN}\n\\label{train_res2}\n\nBuilding a good training set to perform a kNN search and learn the weights and biases of our DNN turned out to be a complicated task in this case. This because to make useful kNN predictions and/or learn an useful DNN, the training set must be sufficiently representative of the geochemical conditions encountered during the reactive transport simulation one wish to perform with HPx$_{\\rm py}$-kNN and the trained HPx$_{\\rm py}$-DNN. In contrast to cement system 1, creating the training set by sampling the $X$-space with a controlled randomness between predefined lower and upper bounds did not prove successful. We tried that strategy by drawing as much as 4,000,000 5-dimensional $\\textbf{x}$ vectors from the $X$-space using a Sobol low-discrepancy sequence \\citep[][]{Sobol1967, Joe-Kuo2003}. 
Such a low-discrepancy sampling scheme covers the 5-dimensional hypercube more uniformly than LHS. Despite a good performance on the test set (not shown), the resulting DNN accuracy in reactive transport mode was never deemed satisfactory. In other words, no satisfactory ``global\" or ``universal\" DNN emulator could be devised for this cement system. This is probably caused by the fact that, for this problem, the input (5 total amounts) and output (7 aqueous concentrations) spaces are quite nonlinearly related and both cover 6 to 10 orders of magnitude, depending on the considered element. Therefore, we resorted to the alternative training strategy detailed below. The latter tries to capture the complex correlations and higher-order dependencies that exist between the elements of $\\textbf{x}$ (total amounts, input space) for a given reactive transport simulation setup, in order to produce a training set that honors these between-input relationships.\n\n\\begin{itemize}\n\t\\item Perform a ``cheap\" full reactive transport simulation under the transport conditions and geochemistry of interest and collect the resulting $\\textbf{x}$-$\\textbf{y}$ pairs of examples (for the considered grid nodes and time steps). Computational demand controls what domain size and simulation time period can be used for this cheap calculation. We used a modest 16 $\\times$ 16 domain and a simulation time period of 10 years. The associated HPx$_{\\rm{4C}}$ runtime is 180 s.\n\t\n\t\\item Fit a kernel density estimator (KDE) with a Gaussian kernel to the collected $\\textbf{x}$ vectors (encapsulated in the $\\textbf{X}$ array) and generate a fixed number of new input vectors, $\\textbf{x}_{KDE}$. Then run PHREEQC for the $\\textbf{X}_{KDE}$ set to get the corresponding output set, $\\textbf{Y}_{KDE}$. 
Now apply the correction for porosity described in section \\ref{train_res1} to the $\\textbf{X}_{KDE}$ set and form the training set by merging the ensemble of $\\textbf{x}$-$\\textbf{y}$ pairs with that of the $\\textbf{x}_{KDE}$-$\\textbf{y}_{KDE}$ pairs. The number of unique examples produced by the considered cheap HPx$_{\\rm{4C}}$ simulations varied between 10,000 and 50,000. The KDE-based enrichment of this dataset was deemed necessary to provide more input variability, thereby avoiding overfitting of the trained DNN and improving the kNN accuracy, while still honoring the complex between-input relationships. The number of KDE-generated samples was set so as to obtain a total training set size of 1,000,000 examples. A key component of the approach is the bandwidth parameter of the Gaussian KDE kernel, which controls how much the KDE-generated samples depart from the original ensemble. After limited trial and error, we fixed the kernel bandwidth to 0.0025 for the considered case studies.\n\t\n\\end{itemize}\n\nThe scatter plots in Figure \\ref{fig9} illustrate our training set creation procedure. The orange dots depict the pairwise relationships observed between the 5 elements of $\\textbf{x}$ in the cheap simulation. The cyan and red dots in Figure \\ref{fig9} represent the KDE-generated samples with the selected bandwidth, before and after applying the correction for porosity, respectively. Training of the DNN is achieved using the ensemble of original and KDE-corrected input points. We refer to this kind of dataset as RT-based, since it derives from a full, albeit cheap, RT simulation. 
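Sampling new inputs from a Gaussian KDE fitted to the collected $\textbf{x}$ vectors amounts to a smoothed bootstrap: pick a stored vector at random and jitter each coordinate with zero-mean Gaussian noise whose standard deviation is the bandwidth. The sketch below captures only this step (input normalization and the subsequent porosity correction are omitted, and the function name is ours, not the paper's):

```python
import random

def kde_enrich(X, n_samples, bandwidth=0.0025, seed=0):
    """Smoothed-bootstrap sampling from a Gaussian KDE fitted to the rows of X.

    A larger bandwidth lets the generated points stray further from the
    original ensemble, while the resampling of whole vectors preserves the
    between-input relationships present in the collected data.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x = rng.choice(X)  # pick one collected input vector
        samples.append([xi + rng.gauss(0.0, bandwidth) for xi in x])
    return samples
```

The corresponding outputs $\textbf{Y}_{KDE}$ are then obtained by running PHREEQC on these generated inputs, as in the second bullet above.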
Furthermore, we refer to the obtained DNN and kNN emulators as ``local\" emulators since, as opposed to the emulators constructed for cement system 1, the current emulators are only valid for the input conditions encapsulated in the RT-based training set.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1.5cm}\\includegraphics[width=50pc]{Figure9.png}\n\t\\caption{Scatter plots of the complex relationships between the five considered inputs for (1) the computationally cheap RT simulation of cement system 2 (orange dots) performed with the original HPx code, (2) the corresponding points sampled by kernel density estimation (KDE) using the selected bandwidth (turquoise dots) and (3) the same sampled KDE points after porosity correction (red dots).}\n\t\\label{fig9}\n\\end{figure}\n\nTraining performance of the DNN emulator is presented in Figure \\ref{fig10}. As for cement system 1, about 90\\% of the available data were used for the actual training of the DNN, while the remaining 10\\% served as a validation set to control overfitting. Lastly, performance is evaluated for both the trained DNN and kNN emulators using an independent test set of 10,000 examples. Overall, the accuracy of our ``local\" DNN emulator for this RT-based dataset is rather high, with $Q_2$ values always greater than or equal to 0.998. Training performance of the corresponding local kNN emulator is equally good (not shown). Regarding speedup, for this problem single-threaded PHREEQC achieves about 210 geochemical calculations per second on our Intel\\textsuperscript{\\textregistered} i7 CPU, while the GPU-based DNN and kNN emulators are both about 3,000 times faster when predicting the 10,000 test points at once.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure10.png}\n\t\\caption{1-1 plots of local DNN emulation performance obtained for system 2 when the local DNN is trained using 1,000,000 samples. 
The $x$-axis and $y$-axis present the original and emulated 10,000 independent test data points, respectively. The RMSE and $Q_2$ coefficient denote the root-mean-square error and coefficient of determination, respectively, computed in testing mode between the original and emulated test data points.}\n\t\\label{fig10}\n\\end{figure}\n\n\\subsubsection{Reactive transport simulation}\n\\label{prob2res}\nOur ``local\" DNN performs rather well when applied to the 61 $\\times$ 61 grid and a time period of 6 years (Figures \\ref{fig11} - \\ref{fig12}). This holds for all components but C, for which some localized deviations appear between the original and emulated 2D concentration profiles towards the end of the simulation period (Figure \\ref{fig12}). When applied to the 121 $\\times$ 121 grid, HPx$_{\\rm py}$-DNN produces additional discrepancies for O, Si and Al towards the end of the simulation period (Figures \\ref{fig13} - \\ref{fig14}). Yet most of the observed artifacts could probably be smoothed out by post-filtering. The associated speedups are listed in Table \\ref{table4}. These speedups are larger than those obtained for cement system 1, with values between 8 and 9 when evaluated against HPx$_{\\rm{4C}}$. These speedups represent about 85 \\% to 90 \\% of the maximum possible speedup (Table \\ref{table4}). Overall, these findings indicate that for the considered problem, our RT-based training of a local DNN only works if the training set is sufficiently representative of the particular geochemical conditions encountered in the computationally demanding simulations, which is arguably not easy to achieve. This limitation is further discussed in section \\ref{discussion}.\n\nWe note a more uniform behavior for HPx$_{\\rm py}$-kNN across grid sizes than for HPx$_{\\rm py}$-DNN. 
Here the results for the 121 $\\times$ 121 grid are only slightly less accurate than those associated with the 61 $\\times$ 61 grid (see Figures \\ref{fig15} - \\ref{fig16}, where for brevity we only show concentration profiles for the C, Al and S components). Furthermore, where discrepancies between the original and emulated profiles occur, they are more regularly scattered than for HPx$_{\\rm py}$-DNN. Note also that herein too, post-filtering could likely smooth out a large part of these deviations. In addition, owing to the use of a GPU to perform the kNN calculations, the speedups provided by HPx$_{\\rm py}$-kNN are as large as those provided by HPx$_{\\rm py}$-DNN (Table \\ref{table4}).\n\nWith respect to the emulated solid amounts, the HPx$_{\\rm py}$-DNN results look visually good for the 61 $\\times$ 61 grid. This is shown in Figure \\ref{fig17} for the C, Al and S chemical components, while emulation of the H, O, Ca and Si chemical components is globally of similar quality (not shown). Nevertheless, for the 121 $\\times$ 121 grid significant deviations appear towards the end of the simulation period for every chemical component (see Figure \\ref{fig18} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components shows the same level of mismatch). The HPx$_{\\rm py}$-kNN predictions are also fairly accurate for the 61 $\\times$ 61 grid (see Figure \\ref{fig19} for the C, Al and S chemical components; emulation of the H, O, Ca and Si chemical components exhibits a globally similar quality), while some discrepancies show up at the end of the simulation (Figure \\ref{fig20}). However, the mismatch is less pronounced than for HPx$_{\\rm py}$-DNN. 
Overall, HPx$_{\\rm py}$-kNN appears to be somewhat more robust than HPx$_{\\rm py}$-DNN for this cement system, while providing the same (large) speedup.\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure11.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig11}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure12.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig12}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure13.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to fourth row present profiles for $Ca^{\\rm{conc}}$, $Si^{\\rm{conc}}$, $H^{\\rm{conc}}$, and $O^{\\rm{conc}}$, respectively. 
The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig13}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure14.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig14}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure15.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig15}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure16.png}\n\t\\caption{2D concentration profiles obtained for cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{conc}}$, $Al^{\\rm{conc}}$, and $S^{\\rm{conc}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig16}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure17.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. 
RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig17}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure18.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+DNN denotes the Hydrus transport model coupled with our DNN geochemical emulator (HPx$_{\\rm py}$-DNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig18}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure19.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 61 $\\times$ 61.}\n\t\\label{fig19}\n\\end{figure}\n\n\\begin{figure}[h!]\n\t\\noindent\\hspace{-1cm}\\includegraphics[width=45pc]{Figure20.png}\n\t\\caption{2D solid amount profiles of the C, Al and S chemical components of cement system 2 after 2 and 6 (final time step) years. RTM means the original HPx$_{\\rm{4C}}$ model and TM+kNN denotes the Hydrus transport model coupled with our kNN geochemical emulator (HPx$_{\\rm py}$-kNN). 
The first to third row present profiles for $C^{\\rm{solid}}$, $Al^{\\rm{solid}}$, and $S^{\\rm{solid}}$, respectively. The considered grid size is 121 $\\times$ 121.}\n\t\\label{fig20}\n\\end{figure}\n\n\\begin{table}[h!]\n\t\\caption{Speedups offered by the kNN and DNN emulators in HPx$_{\\rm py}$ for the reactive transport simulations considered for cement system 2. The HPx$_{\\rm{4C}}$ calculations involve the parallelization of PHREEQC over our 4 CPUs. The HPx$_{\\rm{1C}}$ calculations are performed on a single CPU. Both the kNN and DNN predictions make use of a GPU. ML signifies the used machine learning method for emulation, BC denotes the type of flow boundary conditions and GS is the grid size. The maximum possible speedups associated with HPx$_{\\rm{4C}}$ and HPx$_{\\rm{1C}}$, Max SP HPx$_{\\rm{4C}}$ and Max SP HPx$_{\\rm{1C}}$, correspond to a hypothetical situation where the geochemical calculations incur zero computational cost.}\n\n\t\\begin{center}\n\t\t\\begin{tabular}{cccccccc}\n\t\t\t\\hline\n\t\t\t& & & & & & & \\\\\n\t\t\tML & BC & GS & HPx$_{\\rm{4C}}$ time (s) & SP HPx$_{\\rm{4C}}$ & Max SP HPx$_{\\rm{4C}}$ & SP HPx$_{\\rm{1C}}$ & Max SP HPx$_{\\rm{1C}}$ \\\\\n\t\t\tDNN & ADV & 61 $\\times$ 61 & 21,415 & 8.2 & 9.0 & 30.3 & 33.1\\\\\n\t\t\tkNN & ADV & 61 $\\times$ 61 & 21,415 & 7.9 & 9.0 & 28.9 & 33.1\\\\\n\t\t\tDNN & ADV & 121 $\\times$ 121 & 199,841 & 8.2 & 9.5 & 29.9 & 32.8\\\\\n\t\t\tkNN & ADV & 121 $\\times$ 121 & 199,841 & 8.5 & 9.5 & 31.3 & 32.8\\\\\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\n\t\\label{table4}\n\\end{table}\n\n\\FloatBarrier\n\n
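The relation between the reported and maximum possible speedups in Table \ref{table4} follows Amdahl-style bookkeeping: only the geochemical step is accelerated, so the remaining transport and coupling costs bound the attainable gain. The helper below is our reading of the table's definitions, not code from the paper, and the variable names are ours:

```python
def speedup(total_time, geochem_time, emulator_time):
    """Speedup of an emulated run over the original simulation.

    total_time:    runtime of the original simulation
    geochem_time:  portion of total_time spent in PHREEQC calls
    emulator_time: cost of the emulator calls that replace PHREEQC
    """
    rest = total_time - geochem_time  # transport + coupling, not accelerated
    actual = total_time / (rest + emulator_time)
    maximum = total_time / rest       # emulator_time -> 0 (zero-cost geochemistry)
    return actual, maximum
```

With geochemistry taking roughly 89\% of an HPx$_{\rm{4C}}$ run, the maximum speedup is about 9, consistent with the 61 $\times$ 61 entries of Table \ref{table4}.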
This pattern holds for the 121 x 121 grid as well, though kNN slightly outperforms DNN in HPx_4C with 8.5 versus 8.2. A possible reason is that the kNN implementation, even with GPU support via FAISS, might involve an approximate nearest neighbor search which could be less optimized compared to the highly parallelized matrix operations in DNNs on GPUs. Additionally, as mentioned in the context, kNN required a custom implementation to handle the higher-dimensional problem of cement system 2, which might introduce some overhead or slight inaccuracies compared to the DNN's more streamlined training and prediction process. Thus, while both methods leverage GPU acceleration, the DNN architecture might inherently handle batch predictions more efficiently, leading to marginally higher speedups in most cases.", "Reasoning_answer2": "The grid size appears to have a nuanced impact on the speedup performance of both DNN and kNN emulators when comparing results in table_4. Observing the data, for a grid size of 61 x 61, DNN achieves a speedup of 8.2 for HPx_4C and 30.3 for HPx_1C, while kNN achieves 7.9 and 28.9, respectively. For the larger grid size of 121 x 121, DNN maintains a speedup of 8.2 for HPx_4C and drops slightly to 29.9 for HPx_1C, whereas kNN shows a small increase to 8.5 for HPx_4C and a rise to 31.3 for HPx_1C. This suggests that increasing grid size does not significantly degrade speedup performance for either emulator in the HPx_4C context, with kNN even showing a slight improvement. However, for HPx_1C, DNN experiences a minor decrease while kNN gains a bit, indicating that kNN might scale slightly better with grid size in single-CPU comparisons. A potential explanation is that larger grid sizes increase the computational load, but since both emulators use GPU acceleration, the parallel processing capability mitigates much of the additional cost. 
Additionally, the context mentions that data communication overhead between the main code and Python-based emulators could play a role, and this overhead might proportionally be less significant for kNN at larger grid sizes due to its implementation. Meanwhile, the maximum possible speedups (Max SP) increase slightly with grid size for HPx_4C (from 9.0 to 9.5), suggesting that the potential for speedup grows as grid size increases, though actual speedups don't fully reach this maximum due to implementation and communication bottlenecks. Overall, grid size impacts speedup subtly, with both emulators maintaining relatively stable performance, likely due to GPU efficiency, but kNN shows a slight edge in improvement at larger scales." }, { "paper_id": "1908.06407.json", "table_id": "table_1", "table_content": "\\begin{table}[]\n\t\\caption{Description of features on the heat map.}\n\t\\label{heatmap_table}\n\t\\begin{center}\n\t\t\\begin{tabular}{|p{0.6cm}|p{7.4cm}|}\n\t\t\t\\hline\n\t\t\t\\textbf{Name} & \\textbf{Description} \\\\\\hline\n\t\t\tSkill & Player’s skill in CS:GO. 0 or 1 (low or high) \\\\\\hline\n\t\t\taxn & Active movement along x-axis. Intensity of rectilinear motion to the right/left. \\\\\\hline\n\t\t\tayn & Active movement along y-axis.Intensity of rectilinear motion to/from table. \\\\\\hline\n\t\t\tazn & Active movement along z-axis. Intensity of rectilinear motion up/down. \\\\\\hline\n\t\t\tgxn & Active rotation on x-axis. Frequency of approaching to/distancing from a monitor. \\\\\\hline\n\t\t\tgyn & Active rotation on y-axis. Intensity of swaying to the right/left side of a chair. \\\\\\hline\n\t\t\tgzn & Active rotation on z-axis. Intensity of rotations on a vertical axis. \\\\\\hline\n\t\t\tlb & Portion of time when player leans to the back of a chair. \\\\\\hline\n\t\t\taxo & Intensity of subtle rectilinear oscillations parallel to table. \\\\\\hline\n\t\t\tayo & Intensity of subtle rectilinear oscillations to/from table. 
\\\\\\hline\n\t\t\tazo & Intensity of subtle rectilinear oscillations up/down. \\\\\\hline\n\t\t\tgxo & Intensity of subtle approaching to/distancing from a monitor. \\\\\\hline\n\t\t\tgyo & Intensity of subtle swaying to the right/left side of a chair. \\\\\\hline\n\t\t\tgzo & Intensity of subtle rotations on a vertical axis. \\\\\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\\end{table}", "caption": "Description of features on the heat map.", "label": "heatmap_table", "section_info": "4 Machine Learning\n\\section{Machine Learning}\\label{Machine Learning}\n\n\\subsection{Training Data}\nWe asked 19 participants (9 professional athletes and 10 amateur players) to estimate their skill against the low/high scale. After encoding the low skill to \\textit{0} and the high skill to \\textit{1} we got the binary target for machine learning models.\n\nFrom preprocessing step (see Section III-B) we have 13 features. Correlations between them and the target are represented in Fig.~\\ref{heatmap}. The detailed description is provided in Table~\\ref{heatmap_table}.\n\n\\begin{figure}[!bt]\n\t\\centerline{\\includegraphics[width=\\linewidth]{pic/heatmap.png}}\n\t\\caption{Heat map for features.}\n\t\\label{heatmap}\n\\end{figure}\n\n\n\\begin{table}[]\n\t\\caption{Description of features on the heat map.}\n\t\\label{heatmap_table}\n\t\\begin{center}\n\t\t\\begin{tabular}{|p{0.6cm}|p{7.4cm}|}\n\t\t\t\\hline\n\t\t\t\\textbf{Name} & \\textbf{Description} \\\\\\hline\n\t\t\tSkill & Player’s skill in CS:GO. 0 or 1 (low or high) \\\\\\hline\n\t\t\taxn & Active movement along x-axis. Intensity of rectilinear motion to the right/left. \\\\\\hline\n\t\t\tayn & Active movement along y-axis.Intensity of rectilinear motion to/from table. \\\\\\hline\n\t\t\tazn & Active movement along z-axis. Intensity of rectilinear motion up/down. \\\\\\hline\n\t\t\tgxn & Active rotation on x-axis. Frequency of approaching to/distancing from a monitor. \\\\\\hline\n\t\t\tgyn & Active rotation on y-axis. 
Intensity of swaying to the right/left side of a chair. \\\\\\hline\n\t\t\tgzn & Active rotation on z-axis. Intensity of rotations on a vertical axis. \\\\\\hline\n\t\t\tlb & Portion of time when player leans to the back of a chair. \\\\\\hline\n\t\t\taxo & Intensity of subtle rectilinear oscillations parallel to table. \\\\\\hline\n\t\t\tayo & Intensity of subtle rectilinear oscillations to/from table. \\\\\\hline\n\t\t\tazo & Intensity of subtle rectilinear oscillations up/down. \\\\\\hline\n\t\t\tgxo & Intensity of subtle approaching to/distancing from a monitor. \\\\\\hline\n\t\t\tgyo & Intensity of subtle swaying to the right/left side of a chair. \\\\\\hline\n\t\t\tgzo & Intensity of subtle rotations on a vertical axis. \\\\\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\\end{table}\n\nWe observe that the professional athletes perform fewer active movements during the game (rows/columns 2-7). Less intensive active movements could be connected with higher concentration on the game. On the other hand, some subtle motions (rows/columns 9-14) are more characteristic of the athletes. These include the rotational motion on the y-axis and the rectilinear motion along the x-axis. The first reflects how much the person sways from right to left; the second means that the person moves in a rectilinear way from right to left. Leaning on the back of the chair is more representative of the amateur players.\n\nIn order to increase the amount of data we divided the log of each player into 3-minute sessions. As a result, our dataset includes 154 sessions from 19 persons. 
We have fitted several machine learning models based on 13 features to predict the player skill.\n\n\\subsection{Machine Learning Algorithms}\\label{ml_algorithms}\n\\subsubsection{Logistic regression}\n\nThis classifier uses the logistic function to convert the margin into a probability, maximizing the likelihood function, \n$L$, \ngiven the observations \\cite{ng_lectures}:\n\n\\begin{equation}\nL(\\theta) = \\text{Pr} (Y|X; \\theta) = \\prod_i \\text{Pr}(y_i | x_i; \\theta),\n\\label{lr}\n\\end{equation}\nwhere $X$ is the design matrix, $Y$ is the vector of targets, and $\\theta$ is the vector of model parameters. The probability of belonging to class 1 is determined by the sigmoid function \\eqref{sigmoid}:\n\n\\begin{equation}\n\\text{Pr}(y_i | x_i; \\theta) = \\frac1{1 + e^{-\\theta^\\top x_i}}.\n\\label{sigmoid}\n\\end{equation}\n\nIn our experiment the main advantages of logistic regression are its stability, easy interpretation and good approximation.\n\n\\subsubsection{Support Vector Machine (SVM)}\n\nWe also used an SVM classifier with soft margins \\cite{svm_classic}. 
This method tries to separate the classes by a hyperplane so that the gap between them is as large as possible:\n\n\\begin{align}\n\\label{eqn:eqlabel}\n\\begin{split}\n\\min_{w, \\xi}~& \\frac12 w^\\top w + \\frac\\gamma{2}\\sum_{i=1}^n \\xi_i,\n\\\\\n\\text{subject to~}&y_i(w^\\top x_i + b) \\geq 1 - \\xi_i,\\\\\n& \\xi_i \\geq 0, i \\in 1 \\dots n,\n\\end{split}\n\\end{align}\nwhere $x_i$ is a feature vector, $y_i$ is a scalar target, $w$ and $b$ are the normal vector and offset determining the separating hyperplane, $\\xi_i$ is a slack variable, and $\\gamma$ is the parameter that determines the tradeoff between the maximum margin and the minimum classification error.\n\nThe main advantage of this method is, again, stability due to maximization of the gap between different classes.\n\n\\subsubsection{Nearest neighbors}\n\nFor a given input, the $k$-nearest neighbors classifier searches for the $k$ nearest neighbors from the training set in the feature space and returns the most popular label among them \\cite{cover1967nearest}. In our problem we set the number of neighbours to 5, which provides the maximum ROC AUC.\n\nIn our case this simple algorithm can predict an unknown player's performance by taking into account the performance of similar players.\n\n\\subsubsection{Random Forest}\n\nA random forest is a classifier represented by an ensemble of tree-structured classifiers. A random forest can handle complex dependencies in the data, but this may lead to overfitting. According to our experiments, the optimal maximum tree depth in our problem is 4. \n\nFor our problem a random forest can learn logical rules to distinguish high-skilled from low-skilled players and capture more complex patterns in their behaviour.\n\n\n4.1 Training Data\n\\subsection{Training Data}\nWe asked 19 participants (9 professional athletes and 10 amateur players) to estimate their skill against the low/high scale. 
After encoding low skill as \\textit{0} and high skill as \\textit{1} we obtained the binary target for machine learning models.\n\nFrom the preprocessing step (see Section III-B) we have 13 features. Correlations between them and the target are represented in Fig.~\\ref{heatmap}. The detailed description is provided in Table~\\ref{heatmap_table}.\n\n\\begin{figure}[!bt]\n\t\\centerline{\\includegraphics[width=\\linewidth]{pic/heatmap.png}}\n\t\\caption{Heat map for features.}\n\t\\label{heatmap}\n\\end{figure}\n\n\n\\begin{table}[]\n\t\\caption{Description of features on the heat map.}\n\t\\label{heatmap_table}\n\t\\begin{center}\n\t\t\\begin{tabular}{|p{0.6cm}|p{7.4cm}|}\n\t\t\t\\hline\n\t\t\t\\textbf{Name} & \\textbf{Description} \\\\\\hline\n\t\t\tSkill & Player’s skill in CS:GO. 0 or 1 (low or high) \\\\\\hline\n\t\t\taxn & Active movement along x-axis. Intensity of rectilinear motion to the right/left. \\\\\\hline\n\t\t\tayn & Active movement along y-axis. Intensity of rectilinear motion to/from table. \\\\\\hline\n\t\t\tazn & Active movement along z-axis. Intensity of rectilinear motion up/down. \\\\\\hline\n\t\t\tgxn & Active rotation on x-axis. Frequency of approaching to/distancing from a monitor. \\\\\\hline\n\t\t\tgyn & Active rotation on y-axis. Intensity of swaying to the right/left side of a chair. \\\\\\hline\n\t\t\tgzn & Active rotation on z-axis. Intensity of rotations on a vertical axis. \\\\\\hline\n\t\t\tlb & Portion of time when player leans to the back of a chair. \\\\\\hline\n\t\t\taxo & Intensity of subtle rectilinear oscillations parallel to table. \\\\\\hline\n\t\t\tayo & Intensity of subtle rectilinear oscillations to/from table. \\\\\\hline\n\t\t\tazo & Intensity of subtle rectilinear oscillations up/down. \\\\\\hline\n\t\t\tgxo & Intensity of subtle approaching to/distancing from a monitor. \\\\\\hline\n\t\t\tgyo & Intensity of subtle swaying to the right/left side of a chair. 
\\\\\\hline\n\t\t\tgzo & Intensity of subtle rotations on a vertical axis. \\\\\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\\end{table}\n\nWe observe that the professional athletes perform fewer active movements during the game (rows/columns 2-7). Less intensive active movements could be connected with higher concentration on the game. On the other hand, some subtle motions (rows/columns 9-14) are more characteristic of the athletes. These include the rotational motion on the y-axis and the rectilinear motion along the x-axis. The first reflects how much the person sways from right to left; the second means that the person moves in a rectilinear way from right to left. Leaning on the back of the chair is more representative of the amateur players.\n\nIn order to increase the amount of data we divided the log of each player into 3-minute sessions. As a result, our dataset includes 154 sessions from 19 persons. We have fitted several machine learning models based on 13 features to predict the player skill.\n\n", "Descriptive_question1": "What is the feature name for active movement along the x-axis in table_1?", "Descriptive_question2": "What does the feature 'lb' represent in table_1?", "Reasoning_question1": "Based on the descriptions in table_1, why might professional athletes exhibit less active movement compared to amateur players?", "Reasoning_question2": "According to the feature descriptions in table_1, how do subtle motions differ between professional and amateur players, and what might this indicate about their gameplay focus?", "Descriptive_answer1": "axn", "Descriptive_answer2": "leaning back", "Reasoning_answer1": "Professional athletes might exhibit less active movement compared to amateur players because they are likely more focused and concentrated on the game. As noted in the context, less intensive active movements (features like axn, ayn, azn, gxn, gyn, gzn) could be connected with higher concentration. 
This suggests that professionals minimize unnecessary physical actions to maintain focus on strategic and tactical aspects of gameplay, whereas amateurs might display more restlessness or lack of focus, leading to increased active movements.", "Reasoning_answer2": "According to the context and feature descriptions in table_1, subtle motions (features like axo, ayo, azo, gxo, gyo, gzo) are more characteristic of professional athletes, particularly rotational motion on the y-axis (gyo, subtle swaying right/left) and rectilinear motion along the x-axis (axo, subtle oscillations parallel to the table). In contrast, leaning back (lb) is more representative of amateur players. This difference might indicate that professionals maintain a poised, alert posture with subtle adjustments to stay engaged and responsive during gameplay, reflecting a higher level of focus and readiness. Amateurs, by leaning back more, may exhibit a more relaxed or less engaged posture, potentially indicating lower concentration or situational awareness." 
}, { "paper_id": "1908.06407.json", "table_id": "table_2", "table_content": "\\begin{table}[!bt]\n\t\\caption{Models performance on predicting the player skill: a comparative study.}\n\t\\label{table_scores}\n\t\\begin{center}\n\t\t\\begin{tabular}{|l|l|l|}\n\t\t\\hline\n\t\t\\textbf{Method} & \\textbf{AUC, mean} & \\textbf{AUC, std} \\\\\\hline\n\t\tLogistic Regression & 0.85 & 0.14 \\\\\\hline\n\t\tSupport Vector Machine & \\textbf{0.86} & 0.13 \\\\\\hline\n\t\tKNN, 5 neighbours & 0.80 & 0.13 \\\\\\hline\n\t\tRandom Forest, depth 4 & 0.82 & 0.16 \\\\\\hline \n\t\t\\end{tabular}\n\t\\end{center}\n\\end{table}", "caption": "Models performance on predicting the player skill: a comparative study.", "label": "table_scores", "section_info": "5 Evaluation\n\\section{Evaluation}\\label{Evaluation}\n\nWe cannot use sessions from the same player for both the training and testing stages, because our model should be able to predict the performance for new participants.\nThus, to estimate the model performance correctly we trained the models described in \\ref{ml_algorithms} on all people except 4-5 out of 19 and then validated on them. For more stable results the scores were calculated 100 times and averaged.\n\nWe used the ROC AUC score \\cite{fan2006understanding} as the evaluation metric, as it nicely represents how well the classes are separated by a model. The maximum possible value is 1, while random guessing gets 0.5. Scores for different algorithms are shown in Table~\\ref{table_scores}. Linear models, such as Logistic Regression and SVM, perform better than KNN and Random Forest. 
This can be explained by the real dependence between the player skill and the extracted features being close to linear.\nBesides, it is possible that KNN and Random Forest are too complex for our problem, as they tend to overfit on small datasets.\n\nThe mean ROC AUC score for all of the algorithms is greater than or equal to 0.8, which means that eSport athlete performance can be successfully predicted by machine learning models. As an illustrative example, the ROC curve for Logistic Regression is shown in Fig.~\\ref{roc_auc}.\n\n\\begin{table}[!bt]\n\t\\caption{Models performance on predicting the player skill: a comparative study.}\n\t\\label{table_scores}\n\t\\begin{center}\n\t\t\\begin{tabular}{|l|l|l|}\n\t\t\\hline\n\t\t\\textbf{Method} & \\textbf{AUC, mean} & \\textbf{AUC, std} \\\\\\hline\n\t\tLogistic Regression & 0.85 & 0.14 \\\\\\hline\n\t\tSupport Vector Machine & \\textbf{0.86} & 0.13 \\\\\\hline\n\t\tKNN, 5 neighbours & 0.80 & 0.13 \\\\\\hline\n\t\tRandom Forest, depth 4 & 0.82 & 0.16 \\\\\\hline \n\t\t\\end{tabular}\n\t\\end{center}\n\\end{table}\n\n\nTo figure out which characteristics define the player performance we used the coefficients of the logistic regression. This is a stable and reliable estimation of feature importance. Positive coefficients correspond to high-skilled players' behavior, while negative coefficients are typical of low-skilled players' behavior. A large absolute value means that a feature is more important. Fig.~\\ref{feature_importance} demonstrates that the motion to the right and to the left is the most characteristic of professional athletes. 
Active rotations on the chair are also important factors, but are typically characteristic of the amateur players.\nThese results are consistent with the summary of the heat map shown in Fig.~\\ref{heatmap}.\n\n\\begin{figure}[!tb]\n\t\\centerline{\\includegraphics[width=\\linewidth]{pic/roc_auc_curve.png}}\n\t\\caption{ROC AUC curve for logistic regression.}\n\t\\label{roc_auc}\n\\end{figure}\n\n\\begin{figure*}[!hbt]\n\t\\centerline{\\includegraphics[width=\\linewidth]{pic/feature_importance.png}}\n\t\\caption{Feature importance as coefficients in the logistic regression.}\n\t\\label{feature_importance}\n\\end{figure*}\n\n\n \n\n\n\n\n\n\n\n", "Descriptive_question1": "Which model has the highest mean AUC score in table_2?", "Descriptive_question2": "What is the mean AUC score for KNN with 5 neighbors in table_2?", "Reasoning_question1": "Why might linear models like Logistic Regression and Support Vector Machine outperform KNN and Random Forest in table_2?", "Reasoning_question2": "What does the standard deviation of AUC scores in table_2 suggest about the stability of the models' performance?", "Descriptive_answer1": "Support Vector Machine", "Descriptive_answer2": "0.80", "Reasoning_answer1": "Linear models like Logistic Regression and Support Vector Machine may outperform KNN and Random Forest because the relationship between player skill and the extracted features might be closer to a linear dependency. As mentioned in the context, linear models seem to better capture this relationship. Additionally, KNN and Random Forest could be too complex for the problem, potentially leading to overfitting, especially with small datasets. 
This complexity makes it harder for them to generalize compared to the simpler assumptions of linear models, which appear to align well with the data structure in this experiment.", "Reasoning_answer2": "The standard deviation of AUC scores in the table provides insight into the consistency or stability of each model's performance across multiple runs. A lower standard deviation, such as 0.13 for Support Vector Machine and KNN, suggests that the model's performance is more consistent and less variable across the 100 averaged runs, indicating greater reliability. Conversely, a higher standard deviation, like 0.16 for Random Forest, implies more variability in performance, which could mean the model is less stable or more sensitive to the specific data splits or conditions during training and validation. This variability can impact the trust in the model's predictions for new data." }, { "paper_id": "1704.00974.json", "table_id": "table_1", "table_content": "\\begin{table*}[t]\n\\caption{\\label{tab:comparison}Comparison of scaling laws with principal quantum number $n$ for Rydberg atoms and Rydberg excitons.}\n\\begin{ruledtabular}\n\\begin{tabular}{ccc}\n & Rydberg atoms & Rydberg excitons \\\\ \\hline\n\\emph{Zero field} & & \\\\\nMultiplet splitting due to quantum defect & $\\propto n^{-3}$ (except of hydrogen) & $\\propto n^{-3}$ \\\\ \\hline \n\\emph{Electric field} & & \\\\ \nPolarizability & $\\propto n^{-7}$ ($\\propto n^{-6}$ for hydrogen) & $\\propto n^{-7}$ \\\\\nResonance field of states from multiplets $n$ and $n+1$&$\\propto n^{-5}$&$\\propto n^{-5}$\\\\ \nAnticrossing energy at first resonance & $\\propto n^{-4}$ & $\\propto n^{-4}$ \\\\\nIonization voltage & $\\propto n^{-4}$ & $\\propto n^{-4}$ \\\\\n\\hline\n\\emph{Magnetic field} & & \\\\ \nCrossover field to magnetoexciton & $\\propto n^{-3}$ & $\\propto n^{-3}$ \\\\\nResonance field of states from multiplets $n$ and $n+1$&$\\propto n^{-6}$&$\\propto n^{-4}$\\\\ 
\n\\end{tabular}\n\\end{ruledtabular}\n\\end{table*}", "caption": "\\label{tab:comparison}Comparison of scaling laws with principal quantum number $n$ for Rydberg atoms and Rydberg excitons.", "label": "tab:comparison", "section_info": "4 Conclusion\n\\section{Conclusion}\\label{sec:concl}\n\nIn summary, we have studied the scaling of several characteristic quantities with the principal quantum number in the series of Rydberg excitons in cuprous oxide\nand drawn the comparison to Rydberg atoms. The parameters considered are related to the energy range covered by states at zero field due to finite quantum defects and to resonant field strengths in external fields. They are compiled in Table \\ref{tab:comparison}. The comparison shows that for most of the considered parameters the scaling laws are identical, even though there are differences in absolute magnitude due to the strikingly different Rydberg energy. Despite the same scaling, the origin of the parameter may be quite different, as for the zero field multiplet splitting: For atoms the deviation from a $1/r$ potential leads to this splitting described by the quantum defect, while for excitons the splitting originates mostly from the complex valence band structure deviating from a quadratic dispersion law. \n\nWhen an external field is applied, identical scaling laws hold for atoms and excitons for all parameters, but in a magnetic field we find differences between the two systems, mostly related to the different optical selection rules that lead to a different scaling law for the resonance field strength of the states from the multiplets with principal quantum numbers $n$ and $n+1$. 
However, even when the same scaling law is found, as for the energy splitting at the resonance, the physical origin might be quite different, since in the exciton case its size is determined by the crystal-specific parts of the Hamiltonian, in contrast to the hydrogen case.\n\nAt first sight the exciton level spectrum appears to be very complex, particularly under field application. Here, the demonstrated scaling laws are particularly appealing as they represent some generally valid rules for the exciton spectrum which help to develop systematics in the diverse-looking spectra. The scaling laws may also be transferred in a similar form to other semiconductors.\n\n\n\\acknowledgements\nWe gratefully acknowledge the support of this project by the Deutsche Forschungsgemeinschaft (DFG) and the Russian Foundation for Basic Research in the frame of TRR 160 (projects A1 and 15-52-12012) and by the DFG in the SPP GiRyd. MMG is grateful to the Dynasty foundation and RF President grant MD-1555.2017.2 for partial support. MB acknowledges support by RF Government Grant No. 14.Z50.31.0021.\n\n\\appendix\n\n", "Descriptive_question1": "What is the label of the table that compares scaling laws for Rydberg atoms and Rydberg excitons in table_1?", "Descriptive_question2": "Which quantum number is used for scaling comparisons in table_1?", "Reasoning_question1": "Why might the scaling laws for Rydberg atoms and Rydberg excitons in table_1 be identical despite differences in absolute magnitude?", "Reasoning_question2": "How do the differences in optical selection rules affect the scaling law for resonance field strength between states with principal quantum numbers n and n+1 in table_1?", "Descriptive_answer1": "tab:comparison", "Descriptive_answer2": "n", "Reasoning_answer1": "The scaling laws for Rydberg atoms and Rydberg excitons in table_1 might be identical despite differences in absolute magnitude due to the fundamental similarities in the physical principles governing both systems. 
Although the absolute values differ because of the strikingly different Rydberg energy, as mentioned in the context, the scaling with the principal quantum number n follows the same mathematical form for most parameters. For instance, both systems exhibit scaling laws like ∝ n^{-3} for multiplet splitting at zero field and ∝ n^{-7} for polarizability in an electric field. This suggests that the underlying quantum mechanical framework, particularly how the principal quantum number influences energy levels and interactions, is consistent across both systems, even if the specific energy scales or physical origins (like quantum defects in atoms versus valence band structure in excitons) differ.", "Reasoning_answer2": "The differences in optical selection rules affect the scaling law for resonance field strength between states with principal quantum numbers n and n+1 by leading to distinct scaling behaviors in magnetic fields for Rydberg atoms and Rydberg excitons. As observed in table_1, the resonance field strength scales as ∝ n^{-6} for Rydberg atoms, while it scales as ∝ n^{-4} for Rydberg excitons. This discrepancy arises because optical selection rules determine which transitions between energy states are allowed, influencing how magnetic fields interact with the states in each system. The context provided indicates that these differences in selection rules are a primary reason for the variation in scaling laws, highlighting how the specific electronic structure and interaction with external fields differ between the two systems, even when other parameters might scale similarly." }, { "paper_id": "2007.05911.json", "table_id": "table_1", "table_content": "\\begin{table*}[!t]\n \\centering\n \\caption{Statistics of the datasets. 
``$\\#$'' means the number of the corresponding items.}\n    \\resizebox{.75\\linewidth}{!}{\n    \\begin{tabular}{llllll}\n    \\toprule\n    \\multirow{2}[4]{*}{Dataset} & \\multirow{2}[4]{*}{Shared (\\#)} & \\multicolumn{2}{c}{Source Domain} & \\multicolumn{2}{c}{Target Domain} \\\\\n\\cmidrule{3-6}          &       & Unshared (\\#) & \\#Feedback & Unshared (\\#) & \\#Feedback \\\\\n    \\midrule\n    TC$\\rightarrow$IQI & Item (5,568) & User (35,398) & 314,621 & User (19,999) & 78,429 \\\\\n    ML$\\rightarrow$NF & Item (5,565) & User (30,279) & 11,555,621 & User (11,498) & 199,765 \\\\\n    MO$\\rightarrow$MU & User (27,898) & Item (15,465) & 7,366,992 & Item (14,521) & 3,784,331 \\\\\n    MU$\\rightarrow$BO & User (27,898) & Item (14,521) & 3,784,331 & Item (15,774) & 1,936,754 \\\\\n    \\bottomrule\n    \\end{tabular}\n    }\n  \\label{tab:dataset}\n\\end{table*}", "caption": "Statistics of the datasets. ``$\\#$'' means the number of the corresponding items.", "label": "tab:dataset", "section_info": "4 Experiments\n\\section{Experiments}\n\\begin{table*}[!t]\n  \\centering\n  \\caption{Statistics of the datasets. 
``$\\#$'' means the number of the corresponding items.}\n    \\resizebox{.75\\linewidth}{!}{\n    \\begin{tabular}{llllll}\n    \\toprule\n    \\multirow{2}[4]{*}{Dataset} & \\multirow{2}[4]{*}{Shared (\\#)} & \\multicolumn{2}{c}{Source Domain} & \\multicolumn{2}{c}{Target Domain} \\\\\n\\cmidrule{3-6}          &       & Unshared (\\#) & \\#Feedback & Unshared (\\#) & \\#Feedback \\\\\n    \\midrule\n    TC$\\rightarrow$IQI & Item (5,568) & User (35,398) & 314,621 & User (19,999) & 78,429 \\\\\n    ML$\\rightarrow$NF & Item (5,565) & User (30,279) & 11,555,621 & User (11,498) & 199,765 \\\\\n    MO$\\rightarrow$MU & User (27,898) & Item (15,465) & 7,366,992 & Item (14,521) & 3,784,331 \\\\\n    MU$\\rightarrow$BO & User (27,898) & Item (14,521) & 3,784,331 & Item (15,774) & 1,936,754 \\\\\n    \\bottomrule\n    \\end{tabular}\n    }\n  \\label{tab:dataset}\n\\end{table*}\nIn this section, we perform experiments to evaluate the proposed model and framework against various baselines on real-world datasets. \nWe first introduce the datasets, evaluation protocol, implementation details and baseline methods of our experiments. Finally, we present our experimental results and analysis.\n\n\\subsection{Datasets}\nWe utilize four pairs of frequently used real-world datasets, comprising two pairs of \\textbf{user-shared} datasets and two pairs of \\textbf{item-shared} datasets. \nFor all datasets, we only use the user IDs, item IDs and their implicit feedback information.\nFor simplicity, we intentionally transform the rating data into binary (1/0, indicating whether a user has interacted with an item or not) to fit the problem setting of implicit feedback following \\cite{gao2019natr}.\nThe statistics of the four dataset pairs are listed in Table \\ref{tab:dataset}.\n\\begin{itemize}\n\\item \\textbf{TC$\\rightarrow$IQI} \\cite{yan2019tciqi} are from two mainstream video websites Tencent (TC)\\footnote{https://v.qq.com} and iQIYI (IQI)\\footnote{https://www.iqiyi.com} in China. 
\nThere are many overlapping items (movies) between the two websites. \nWe take TC and IQI as the source and target domains, respectively. \nWe obtained the processed dataset pair directly from \\cite{yan2019tciqi}.\n\n\\item \\textbf{ML$\\rightarrow$NF}\\footnote{https://grouplens.org/datasets/movielens}$^,$\\footnote{https://www.kaggle.com/laowingkin/netflix-movie-recommendation/data} are from two popular movie recommendation platforms MovieLens and Netflix, in which there are many overlapping items (movies). \nWe take MovieLens (ML) as the source domain and Netflix (NF) as the target domain. We identify identical movies by their names (case insensitive) and years to avoid misidentifications as far as possible, a data processing method similar to that of \\cite{gao2019natr}.\n\n\\item \\textbf{MO$\\rightarrow$MU} are from the famous social network platform Douban\\footnote{https://www.douban.com\\label{douban}} in China. Overlapped users have feedback on both Movie (MO) and Music (MU).\nWe take MO as the source domain and MU as the target domain.\n\n\\item \\textbf{MU$\\rightarrow$BO} are also from the famous social network platform Douban\\textsuperscript{\\ref{douban}} in China. 
Overlapped users have feedback on both Music (MU) and Book (BO).\nWe take MU as the source domain and the BO as the target domain.\n\\end{itemize}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on single domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{6}[3]{*}{IQI} & NCF & 0.1545$\\pm$0.0029 & 0.5004$\\pm$0.0039 & 0.9153$\\pm$0.0015 & 0.2986$\\pm$0.0020 & 0.4185$\\pm$0.0088 & 0.4575 \\\\\n & GCN & 0.0877$\\pm$0.0040 & 0.4747$\\pm$0.0233 & 0.6620$\\pm$0.0323 & 0.2937$\\pm$0.0116 & 0.3361$\\pm$0.0137 & 0.3708 \\\\\n & GAT & 0.1497$\\pm$0.0545 & \\textbf{0.5878$\\pm$0.0765} & 0.9589$\\pm$0.0100 & 0.3359$\\pm$0.0797 & 0.4368$\\pm$0.0632 & 0.4938 \\\\\n & GraphSAGE-mean & 0.0912$\\pm$0.0243 & 0.5671$\\pm$0.0388 & 0.9618$\\pm$0.0013 & 0.3145$\\pm$0.0298 & 0.3943$\\pm$0.0234 & 0.4658 \\\\\n & GraphSAGE-pooling & 0.1122$\\pm$0.0217 & 0.5796$\\pm$0.0522 & 0.9508$\\pm$0.0041 & 0.3083$\\pm$0.0346 & 0.3956$\\pm$0.0231 & 0.4693 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.1591$\\pm$0.0278} & 0.5821$\\pm$0.0486 & \\textbf{0.9671$\\pm$0.0060} & \\textbf{0.3376$\\pm$0.0315} & \\textbf{0.4391$\\pm$0.0228} & \\textbf{0.4970 } \\\\\n\\midrule\n \\multirow{6}[3]{*}{NF} & NCF & 0.2102$\\pm$0.0038 & 0.5840$\\pm$0.004 & 0.8706$\\pm$0.0025 & 0.3804$\\pm$0.0036 & 0.4446$\\pm$0.0034 & 0.4980 \\\\\n & GCN & 0.1048$\\pm$0.0141 & 0.1688$\\pm$0.0141 & 0.4981$\\pm$0.0212 & 0.1328$\\pm$0.0144 & 0.2009$\\pm$0.0159 & 0.2211 \\\\\n & GAT & 0.1918$\\pm$0.0045 & 0.5564$\\pm$0.0027 & 0.9028$\\pm$0.0030 & 0.3554$\\pm$0.0021 & 0.4318$\\pm$0.0026 & 0.4876 \\\\\n & GraphSAGE-mean & 0.1920$\\pm$0.0053 & 0.5525$\\pm$0.0008 & 0.8874$\\pm$0.0025 & 0.3542$\\pm$0.0025 & 0.4280$\\pm$0.0030 & 0.4828 \\\\\n & GraphSAGE-pooling & 0.2059$\\pm$0.0027 & 
0.6054$\\pm$0.0034 & \\textbf{0.9217$\\pm$0.0014} & 0.3906$\\pm$0.0027 & \\textbf{0.4696$\\pm$0.0023} & \\textbf{0.5186 } \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2140$\\pm$0.0042} & \\textbf{0.6077$\\pm$0.0131} & 0.9184$\\pm$0.0054 & \\textbf{0.3918$\\pm$0.0072} & 0.4613$\\pm$0.0055 & \\textbf{0.5186 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{MU} & NCF & 0.2046$\\pm$0.0043 & 0.6078$\\pm$0.0026 & \\textbf{0.9590$\\pm$0.0007} & 0.3835$\\pm$0.0036 & \\textbf{0.5093$\\pm$0.0031} & 0.5328 \\\\\n & GCN & 0.1594$\\pm$0.0002 & 0.4984$\\pm$0.0019 & 0.7589$\\pm$0.0034 & 0.2946$\\pm$0.0006 & 0.3981$\\pm$0.0008 & 0.4219 \\\\\n & GAT & 0.2335$\\pm$0.0159 & 0.6833$\\pm$0.0072 & 0.9545$\\pm$0.0005 & 0.4463$\\pm$0.0128 & 0.5002$\\pm$0.0112 & 0.5636 \\\\\n & GraphSAGE-mean & 0.1927$\\pm$0.0121 & 0.5923$\\pm$0.0196 & 0.8901$\\pm$0.0220 & 0.3742$\\pm$0.0161 & 0.4406$\\pm$0.0167 & 0.4980 \\\\\n & GraphSAGE-pooling & 0.2215$\\pm$0.0193 & 0.6210$\\pm$0.0190 & 0.9484$\\pm$0.0026 & 0.4145$\\pm$0.0208 & 0.4965$\\pm$0.0171 & 0.5404 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2399$\\pm$0.0026} & \\textbf{0.6887$\\pm$0.0009} & 0.9507$\\pm$0.0028 & \\textbf{0.4470$\\pm$0.0011} & 0.5055$\\pm$0.0028 & \\textbf{0.5664 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{BO} & NCF & 0.2567$\\pm$0.0081 & 0.6733$\\pm$0.007 & 0.9422$\\pm$0.0024 & 0.4558$\\pm$0.0081 & 0.5164$\\pm$0.007 & 0.5689 \\\\\n & GCN & 0.1899$\\pm$0.0004 & 0.5007$\\pm$0.0017 & 0.6991$\\pm$0.001 & 0.3558$\\pm$0.0002 & 0.3900$\\pm$0.0002 & 0.4271 \\\\\n & GAT & 0.2805$\\pm$0.0258 & 0.7034$\\pm$0.0365 & 0.9369$\\pm$0.0202 & \\textbf{0.4776$\\pm$0.0321} & 0.5303$\\pm$0.0286 & 0.5857 \\\\\n & GraphSAGE-mean & 0.2137$\\pm$0.0009 & 0.6036$\\pm$0.0007 & 0.8741$\\pm$0.0022 & 0.3920$\\pm$0.0007 & 0.4525$\\pm$0.001 & 0.5072 \\\\\n & GraphSAGE-pooling & 0.2716$\\pm$0.0148 & 0.6987$\\pm$0.0143 & 0.9351$\\pm$0.0051 & 0.4653$\\pm$0.0155 & 0.5166$\\pm$0.0136 & 0.5775 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2867$\\pm$0.005} & 
\\textbf{0.7055$\\pm$0.0063} & \\textbf{0.9431$\\pm$0.0042} & 0.4757$\\pm$0.0061 & \\textbf{0.5392$\\pm$0.0058} & \\textbf{0.5900 } \\\\\n    \\bottomrule\n    \\end{tabular}\n    }\n  \\label{tab:single}\n\\end{table*}\n\n\\subsection{Evaluation Protocol}\nFollowing existing works \\cite{he2017neural,hu2019hybrid}, we adopt the Leave-One-Out (LOO) evaluation.\nWe randomly sample one interaction for each user for the validation and test sets, respectively.\nWe also follow the common strategy \\cite{hu2019hybrid,gao2019natr} to randomly sample 99 unobserved (negative) items for each user and then evaluate how well the model can rank the test item against these negative ones. \nThen, we adopt two standard metrics, \\textbf{HR@K} and \\textbf{NDCG@K}, which are widely used in recommendation \\cite{gao2019natr,hu2019hybrid,he2017neural,wang2018tem,ding2018improving}, to evaluate the ranking performance of each method. The HR@K is computed as follows:\n\\begin{eqnarray}\nHR@K=\\frac{1}{|U|}\\sum_{u\\in\\setU} I(p_u\\leq K),\n\\end{eqnarray}\nwhere $p_u$ is the hit position for the user $u$'s test item, and $I(\\cdot)$ is the indicator function.\nThe NDCG@K is computed as follows:\n\\begin{eqnarray}\nNDCG@K=\\frac{1}{|U|}\\sum_{u\\in\\setU} \\frac{\\log 2}{\\log (p_u+1)}.\n\\end{eqnarray}\nWe report HR@K and NDCG@K with K = 1, 10 and 50.\nThe larger the value, the better the performance for all the evaluation metrics.\nFor all experiments, we report the metrics with \\textbf{95\\%} \\textbf{confidence intervals} on five runs.\n\n\\subsection{Implementation Details}\nIf a user has feedback on an item, there is an edge between the user node and the item node.\nThus, we construct the feedback graph $G$ utilized in our experiments.\n\nFor the single domain recommendation task, we perform experiments on the four target domain datasets (i.e., IQI, NF, MU, BO).\nFor all datasets we use: embedding dimension $k=32$, neighbor sampling threshold $\\delta=30$ with two \\modela~layers, negative 
sampling ratio $\\gamma=8$, mini-batch size of 256 and learning rate of 0.001. \nWe also apply dropout with probability 0.4.\n\nFor the cross-domain recommendation task, we perform experiments on the four pairs of cross-domain datasets.\nFor all datasets we use: embedding dimension $k=16$, neighbor sampling threshold $\\delta=10$ with one \\modela~layer, negative sampling ratio $\\gamma=8$,\ntunable hyper-parameter $\\alpha=0.7$ to control the relative strength of the terms in Equation (\\ref{equ:lossst}),\nmini-batch size of 256 and learning rate of 0.001. \nWe also apply dropout with probability 0.4.\n\nThese values and the hyper-parameters of all baselines are chosen via a grid search on the IQI validation set.\nWe do not perform any dataset-specific tuning except early stopping on validation sets.\nAll models are implemented using TensorFlow\\footnote{https://www.tensorflow.org} and trained on a GTX 1080 Ti GPU.\nTraining is performed via stochastic gradient descent over shuffled mini-batches with the Adam \\cite{kingma2014adam} update rule.\n\n\\subsection{Baseline Methods}\nWe construct three groups of experiments to demonstrate the effectiveness of the proposed model and framework.\n\\subsubsection{Single Domain Recommendation}\nWe compare the proposed \\modela~model with the following baseline models.\n\\begin{itemize}\n\\item \\textbf{NCF}~\\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for recommendation tasks with implicit feedback.\nWe use one of its variants, also known as Generalized Matrix Factorization (GMF).\n\\item \\textbf{GCN}~\\cite{kipf2016gcn}: The vanilla GCN learns latent node representations based on the first-order approximation of spectral graph convolutions. \n\\item \\textbf{GAT}~\\cite{velivckovic2017gat}: It applies the attention mechanism to learn different weights for aggregating node features from neighbors. 
\n\\item \\textbf{GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node’s local neighborhood with the mean aggregator.\n\\item \\textbf{GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node’s local neighborhood with the pooling aggregator.\n\\end{itemize}\nFor GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling, we apply the inner product on the user and item node representations as the output.\n\n\\subsubsection{Cross-Domain Recommendation}\nWe compare the proposed \\modelb~model with the following baseline models.\n\\begin{itemize}\n\\item \\textbf{CST}~\\cite{pan2010cst}: Coordinate System Transfer (CST) assumes that both users and items are overlapped and adds\ntwo regularization terms to its objective function. Here, we adapt CST to our datasets by retaining only the single-side (i.e., user-side or item-side) regularization term.\n\\item \\textbf{CD-NCF}~\\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for single domain recommendation tasks with implicit feedback. Here, we adapt it to our cross-domain recommendation task by sharing the overlapped user or item embeddings.\n\\item \\textbf{EMCDR}~\\cite{man2017emcdr}: This is an embedding and mapping framework for cross-domain recommendation.\nThe framework consists of three stages (Latent Factor Model, Latent Space Mapping and Cross-domain Recommendation) and is not an end-to-end method.\n\\item \\textbf{EATNN}~\\cite{chen2019eatnn}: This is the state-of-the-art solution for cross-domain recommendation tasks. 
By introducing attention mechanisms, the model automatically assigns a personalized transfer scheme to each user.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation}\nWe apply the proposed cross-domain framework to other baseline GNN models.\n\\begin{itemize}\n\\item \\textbf{CD-GCN}~\\cite{kipf2016gcn}: It applies the proposed general framework to the GCN as described in Section \\ref{sec:general}. \n\\item \\textbf{CD-GAT}~\\cite{velivckovic2017gat}: It applies the proposed general framework to the GAT. \n\\item \\textbf{CD-GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-mean. \n\\item \\textbf{CD-GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-pooling. \n\\end{itemize}\n\n\\subsection{Performance Comparison}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on cross-domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{5}[3]{*}{TC$\\rightarrow$IQI} & CST & 0.1948$\\pm$0.0039 & \\textbf{0.6678$\\pm$0.0136} & 0.9455$\\pm$0.0028 & 0.4178$\\pm$0.0099 & 0.4858$\\pm$0.0030 & 0.5423 \\\\\n & CD-NCF & 0.1701$\\pm$0.0314 & 0.5408$\\pm$0.0445 & 0.8702$\\pm$0.0402 & 0.3392$\\pm$0.0411 & 0.4131$\\pm$0.0396 & 0.4667 \\\\\n & EMCDR & 0.2058$\\pm$0.0239 & 0.3962$\\pm$0.0628 & 0.7438$\\pm$0.0436 & 0.2897$\\pm$0.0394 & 0.3640$\\pm$0.0358 & 0.3999 \\\\\n & EATNN & 0.1959$\\pm$0.0102 & 0.6473$\\pm$0.0089 & 0.9314$\\pm$0.0026 & 0.4103$\\pm$0.0100 & 0.4906$\\pm$0.0087 & 0.5351 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2105$\\pm$0.0089} & 0.6536$\\pm$0.0159 & \\textbf{0.9758$\\pm$0.0088} & \\textbf{0.4222$\\pm$0.0108} & \\textbf{0.4963$\\pm$0.0080} & \\textbf{0.5517 } \\\\\n\\midrule\n 
\\multirow{5}[3]{*}{ML$\\rightarrow$NF} & CST & 0.1878$\\pm$0.0058 & 0.5413$\\pm$0.0024 & 0.8551$\\pm$0.0007 & 0.3486$\\pm$0.0015 & 0.4178$\\pm$0.0023 & 0.4701 \\\\\n & CD-NCF & 0.1997$\\pm$0.0260 & 0.5540$\\pm$0.0457 & 0.8539$\\pm$0.0246 & 0.3600$\\pm$0.0353 & 0.4266$\\pm$0.0310 & 0.4788 \\\\\n & EMCDR & 0.0968$\\pm$0.0260 & 0.3406$\\pm$0.0240 & 0.6522$\\pm$0.0730 & 0.2027$\\pm$0.0170 & 0.2708$\\pm$0.0070 & 0.3126 \\\\\n & EATNN & 0.2103$\\pm$0.0018 & 0.5892$\\pm$0.0038 & 0.8745$\\pm$0.0016 & 0.3835$\\pm$0.0015 & 0.4472$\\pm$0.0013 & 0.5009 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2243$\\pm$0.0047} & \\textbf{0.6247$\\pm$0.0069} & \\textbf{0.9228$\\pm$0.0033} & \\textbf{0.4062$\\pm$0.0055} & \\textbf{0.4732$\\pm$0.0043} & \\textbf{0.5302 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MO$\\rightarrow$MU} & CST & 0.2378$\\pm$0.0085 & 0.5934$\\pm$0.0024 & 0.9051$\\pm$0.0073 & 0.3986$\\pm$0.0115 & 0.4775$\\pm$0.0035 & 0.5225 \\\\\n & CD-NCF & 0.2599$\\pm$0.0200 & 0.7232$\\pm$0.0430 & 0.9480$\\pm$0.0261 & 0.4747$\\pm$0.0315 & 0.5281$\\pm$0.0281 & 0.5868 \\\\\n & EMCDR & 0.2290$\\pm$0.0290 & 0.5610$\\pm$0.0703 & 0.8430$\\pm$0.0560 & 0.3834$\\pm$0.0320 & 0.4234$\\pm$0.0410 & 0.4880 \\\\\n & EATNN & 0.2680$\\pm$0.0021 & 0.7253$\\pm$0.0035 & 0.9457$\\pm$0.0026 & \\textbf{0.4881$\\pm$0.0013} & 0.5282$\\pm$0.0014 & 0.5911 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2728$\\pm$0.0054} & \\textbf{0.7314$\\pm$0.0072} & \\textbf{0.9671$\\pm$0.002} & 0.4851$\\pm$0.0060 & \\textbf{0.5389$\\pm$0.0049} & \\textbf{0.5991 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MU$\\rightarrow$BO} & CST & 0.2524$\\pm$0.0089 & 0.6973$\\pm$0.0102 & 0.9355$\\pm$0.0098 & 0.4575$\\pm$0.0105 & 0.5143$\\pm$0.0068 & 0.5714 \\\\\n & CD-NCF & 0.2770$\\pm$0.0158 & 0.7184$\\pm$0.0332 & \\textbf{0.9472$\\pm$0.0261} & 0.4841$\\pm$0.0215 & 0.5334$\\pm$0.0836 & 0.5920 \\\\\n & EMCDR & 0.2004$\\pm$0.2972 & 0.4864$\\pm$0.5881 & 0.7612$\\pm$0.4115 & 0.3324$\\pm$0.4423 & 0.3920$\\pm$0.4082 & 0.4345 
\\\\\n & EATNN & 0.2731$\\pm$0.0015 & 0.7064$\\pm$0.0036 & 0.9277$\\pm$0.0026 & 0.4634$\\pm$0.0013 & 0.5070$\\pm$0.0017 & 0.5755 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2978$\\pm$0.0481} & \\textbf{0.7267$\\pm$0.0688} & 0.9424$\\pm$0.0295 & \\textbf{0.4872$\\pm$0.0609} & \\textbf{0.5502$\\pm$0.0523} & \\textbf{0.6009 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:cross}\n\\end{table*}\n\\begin{figure*}[!t]\n\\begin{center}\n\\includegraphics[width=\\linewidth]{framework}\n\\caption{The HR@K results of the general cross-domain framework on 4 (datasets) $\\times$ 10 (models) = 40 tasks.}\n\\label{general_framework}\n\\end{center}\n\\end{figure*}\n\\subsubsection{Single Domain Recommendation Task}\nWe demonstrate the effectiveness of our \\modela~on four target domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and BO are presented in Table \\ref{tab:single}. \nFrom these results, we have the following observations.\n\\begin{itemize}\n\t\\item[-] Among these GNN baselines, the GCN performs acceptably on multiple datasets. \n\tThe GraphSAGE-mean improves on the GCN by introducing the mean aggregator to aggregate messages from each node's local neighborhood. \n\tThe GraphSAGE-pooling achieves a further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies an element-wise max-pooling operation to the neighbor messages transformed by a fully-connected neural network. \n\tThe GAT obtains a further performance improvement by assigning different learnable weights to neighbor messages.\n\t\\item[-] NCF also obtains competitive recommendation performance, which further explains why simple collaborative filtering methods are widely used in recommender systems. 
\n\tOn most tasks, our \\modela~outperforms the NCF, which demonstrates that graph-structured data are useful for recommender systems.\n \\item[-] Our \\modela~obtains the best performance on most datasets, outperforming the GNN baselines on the majority of metric-dataset pairs.\n Besides, although the improvement of the \\modela~compared with the GAT is marginal on a few metrics and datasets, the \\textbf{Average} values achieved by the \\modela~are better on all four datasets, which indicates that the \\modela~has better generalization performance than the GAT.\n\\end{itemize}\nThe essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.\nOur \\modela~aggregates local neighbor messages via high-order feature interactions.\nTherefore, the \\modela~can achieve better performance and is more suitable for recommendation tasks.\nOverall, these improvements indicate that\nour \\modela~can effectively integrate neighbor messages to generate more effective node representations and is more suitable for graph-structured data. \n\n\\subsubsection{Cross-Domain Recommendation Task}\nWe also demonstrate the effectiveness of our \\modelb~on four pairs of cross-domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K are presented in Table \\ref{tab:cross}. 
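The LOO ranking metrics defined in the evaluation protocol above can be sketched in a few lines. This is a minimal illustration only, assuming hit positions $p_u$ are 1-indexed ranks of each test item among the test item plus the 99 sampled negatives:

```python
import math

def hr_at_k(hit_positions, k):
    """HR@K: fraction of users whose test item is ranked within the top K."""
    return sum(p <= k for p in hit_positions) / len(hit_positions)

def ndcg_at_k(hit_positions, k):
    """NDCG@K with one relevant item per user: log(2)/log(p_u + 1) when the
    test item appears in the top K, and 0 otherwise."""
    return sum(math.log(2) / math.log(p + 1)
               for p in hit_positions if p <= k) / len(hit_positions)

# Toy example: three users whose test items rank at positions 1, 3 and 20
# among 1 positive + 99 sampled negatives.
positions = [1, 3, 20]
print(hr_at_k(positions, 10))    # 2 of the 3 users hit the top 10
print(ndcg_at_k(positions, 10))  # position 1 scores 1.0, position 3 scores 0.5
```

Averaging these per-user scores over five runs then yields the confidence intervals reported in the tables.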
\nFrom these results, we have the following findings.\n\\begin{itemize}\n \\item[-] The collaborative-filtering-based CD-NCF still obtains competitive recommendation performance by sharing the embeddings of overlapped users or items, and it improves on the recommendation performance of CST on all datasets except TC$\\rightarrow$IQI.\n We conjecture that collaborative filtering methods need a lot of data to obtain good performance, while TC$\\rightarrow$IQI has less feedback data.\n It also demonstrates that collaborative filtering is indeed a simple and efficient method in recommender systems.\n\t\\item[-] EMCDR is not an end-to-end method, and its poor performance may result from the accumulation of errors at each step.\n\t\\item[-] EATNN is the state-of-the-art cross-domain recommendation baseline, and it achieves nearly the best results across multiple datasets among these baselines.\n\t\\item[-] By utilizing the graph topology, our \\modelb~improves the recommendation performance compared with the various baseline methods.\n\tThis demonstrates that the proposed cross-domain framework combined with the proposed \\modela~is more suitable for the graph-structured data in cross-domain recommendation.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation Task}\nOur cross-domain framework is a general framework\nthat can be applied upon various existing GNN models. \nHere we apply the cross-domain framework\nto GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling. \nTo verify that our cross-domain framework is applicable to various GNN models,\nwe conduct experiments on 40 tasks (4 dataset pairs $\\times$ 10 models). \nThe results are shown in Figure \\ref{general_framework}. 
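The sharing scheme that the framework adds on top of a base GNN can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: plain Python lists stand in for tensors, `base_gnn` is a hypothetical callable for any base encoder (GCN, GAT, GraphSAGE, ...), and the final concatenation of the domain-specific output with the shared initialized input follows the ablation-study description (the CD-*-base variants drop it):

```python
def encode_domain(base_gnn, shared_emb, node_ids):
    """Encode one domain's nodes: overlapped nodes read their initial
    embedding from a table shared across the two domains, a domain-specific
    base GNN refines it, and the result is concatenated with the shared
    input (the base ablation variant would return `hidden` alone)."""
    init = [shared_emb[n] for n in node_ids]      # domain-shared initialization
    hidden = base_gnn(init)                       # domain-specific GNN layers
    return [h + e for h, e in zip(hidden, init)]  # list "+" acts as concat here

# Toy usage with an identity "GNN" over 2-d embeddings:
shared = {0: [0.1, 0.2], 1: [0.3, 0.4]}
reps = encode_domain(lambda xs: xs, shared, [0, 1])
print(reps[0])  # [0.1, 0.2, 0.1, 0.2]
```

Because only the shared embedding table couples the two domains, any GNN encoder can be dropped in, which is what the 40-task comparison exercises.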
The red lines are the baselines that only use the target training set to train the model, as also shown in Table \\ref{tab:single}, and the blue lines are the cross-domain models that apply the general cross-domain framework.\nFrom the results, we have the following findings:\n\\begin{itemize}\n \\item[-] On most tasks, our cross-domain framework effectively improves the performance of the single-domain models, which also demonstrates that the cross-domain framework can be applied upon various existing GNN models.\n \\item[-] The improvement on GCN is larger than that on the other four GNN models. The main reason might be that the single-domain GCN is significantly weaker than the other, improved GNN models, as shown in Table \\ref{tab:single}, so the improvement brought by the cross-domain framework to the other GNN models is relatively smaller than that on GCN.\n \\item[-] The performance of GraphSAGE-mean and GraphSAGE-pooling is unsatisfying on several datasets; the reason might be that the mean and pooling aggregators are too simple and their fewer shared parameters make them difficult to train coordinately across the two domains.\n\\end{itemize}\n\nOverall, we observe that the performance improvement of the cross-domain framework is significant and that it improves the performance of base GNN models on different datasets,\nwhich shows that the cross-domain framework is compatible with many GNN models.\n\n\\subsection{Ablation Study}\n\\begin{table*}[!t]\n \\centering\n \\caption{Results of ablation study on cross-domain recommendation task based on \\modelb. 
``*'' indicates that the improvement is statistically significant with the p-value $<$ 0.05 on independent samples t-tests.}\n \\resizebox{.7\\linewidth}{!}{\n \\begin{tabular}{c|ccc|ccc}\n \\toprule\n Model & HR@1 & HR@10 & HR@50 & HR@1 & HR@10 & HR@50 \\\\\n \\midrule\n & \\multicolumn{3}{c|}{TC$\\rightarrow$IQI} & \\multicolumn{3}{c}{MO$\\rightarrow$MU} \\\\\n CD-GFM-base & 0.1681 & 0.5914 & 0.9362 & 0.2445 & 0.6989 & 0.9054 \\\\\n \\modelb & \\textbf{0.2105*} & \\textbf{0.6536*} & \\textbf{0.9758*} & \\textbf{0.2728*} & \\textbf{0.7314*} & \\textbf{0.9671*} \\\\\n \\midrule\n & \\multicolumn{3}{c|}{ML$\\rightarrow$NF} & \\multicolumn{3}{c}{MU$\\rightarrow$BO} \\\\\n CD-GFM-base & 0.2178 & 0.6196 & 0.9182 & 0.2756 & 0.6963 & 0.9395 \\\\\n \\modelb & \\textbf{0.2243*} & \\textbf{0.6247*} & \\textbf{0.9228} & \\textbf{0.2978*} & \\textbf{0.7267*} & \\textbf{0.9424} \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:ablation}\n\\end{table*}\nMoreover, to understand the contribution of the shared node initialization in \\modelb,\nwe construct ablation experiments over \\textbf{CD-GFM-base} and \\modelb~on the four pairs of datasets.\n\\textbf{CD-GFM-base} only uses the domain-specific node representations $\\h_{n_s}$ and $\\h_{n_t}$ output directly from the \\modela~and does not concatenate the initialized input as in Equations (\\ref{concat1}) and (\\ref{concat2}), i.e., \n$\\n_s=\\h_{n_s},\n\\n_t=\\h_{n_t}.\n$\nThe results are presented in Table \\ref{tab:ablation}.\nWe conduct independent samples t-tests, and a p-value $<$ 0.05 indicates\nthat the improvement of \\modelb~over the \\textbf{CD-GFM-base} is statistically significant.\nThe improvement demonstrates that the \\modelb~model can efficiently take advantage of the domain-shared and domain-specific node representations simultaneously and obtain the best performance on all datasets, which indicates that both representations matter for cross-domain recommendation 
performance.\n\n\n\n4.1 Datasets\n\\subsection{Datasets}\nWe utilize four pairs of frequently used real-world datasets, which contain two pairs of \\textbf{user-shared} datasets and two pairs of \\textbf{item-shared} datasets. \nFor all datasets, we only use the user IDs, item IDs and their implicit feedback information.\nFor simplicity, we intentionally transform the rating data into binary (1/0, indicating whether a user has interacted with an item or not) to fit the problem setting of implicit feedback, following \\cite{gao2019natr}.\nThe statistics of the four pairs of datasets are listed in Table \\ref{tab:dataset}.\n\\begin{itemize}\n\\item \\textbf{TC$\\rightarrow$IQI} \\cite{yan2019tciqi} are from two mainstream video websites, Tencent (TC)\\footnote{https://v.qq.com} and iQIYI (IQI)\\footnote{https://www.iqiyi.com}, in China. \nThere are a lot of overlapped items (movies) on the two websites. \nWe take TC and IQI as the source and target domains, respectively. \nWe obtained the processed dataset pair directly from \\cite{yan2019tciqi}.\n\n\\item \\textbf{ML$\\rightarrow$NF}\\footnote{https://grouplens.org/datasets/movielens}$^,$\\footnote{https://www.kaggle.com/laowingkin/netflix-movie-recommendation/data} are from two popular movie recommendation platforms, MovieLens and Netflix, in which there are a lot of overlapped items (movies). \nWe take MovieLens (ML) as the source domain and Netflix (NF) as the target domain. We identify identical movies by their names (case insensitive) and years to avoid misidentifications as much as possible, a data processing method similar to that of \\cite{gao2019natr}.\n\n\\item \\textbf{MO$\\rightarrow$MU} are from the famous social network platform Douban\\footnote{https://www.douban.com\\label{douban}} in China. 
Overlapped users have feedback on both Movie (MO) and Music (MU).\nWe take MO as the source domain and the MU as the target domain.\n\n\\item \\textbf{MU$\\rightarrow$BO} are also from the famous social network platform Douban\\textsuperscript{\\ref{douban}} in China. Overlapped users have feedback on both Music (MU) and Book (BO).\nWe take MU as the source domain and the BO as the target domain.\n\\end{itemize}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on single domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{6}[3]{*}{IQI} & NCF & 0.1545$\\pm$0.0029 & 0.5004$\\pm$0.0039 & 0.9153$\\pm$0.0015 & 0.2986$\\pm$0.0020 & 0.4185$\\pm$0.0088 & 0.4575 \\\\\n & GCN & 0.0877$\\pm$0.0040 & 0.4747$\\pm$0.0233 & 0.6620$\\pm$0.0323 & 0.2937$\\pm$0.0116 & 0.3361$\\pm$0.0137 & 0.3708 \\\\\n & GAT & 0.1497$\\pm$0.0545 & \\textbf{0.5878$\\pm$0.0765} & 0.9589$\\pm$0.0100 & 0.3359$\\pm$0.0797 & 0.4368$\\pm$0.0632 & 0.4938 \\\\\n & GraphSAGE-mean & 0.0912$\\pm$0.0243 & 0.5671$\\pm$0.0388 & 0.9618$\\pm$0.0013 & 0.3145$\\pm$0.0298 & 0.3943$\\pm$0.0234 & 0.4658 \\\\\n & GraphSAGE-pooling & 0.1122$\\pm$0.0217 & 0.5796$\\pm$0.0522 & 0.9508$\\pm$0.0041 & 0.3083$\\pm$0.0346 & 0.3956$\\pm$0.0231 & 0.4693 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.1591$\\pm$0.0278} & 0.5821$\\pm$0.0486 & \\textbf{0.9671$\\pm$0.0060} & \\textbf{0.3376$\\pm$0.0315} & \\textbf{0.4391$\\pm$0.0228} & \\textbf{0.4970 } \\\\\n\\midrule\n \\multirow{6}[3]{*}{NF} & NCF & 0.2102$\\pm$0.0038 & 0.5840$\\pm$0.004 & 0.8706$\\pm$0.0025 & 0.3804$\\pm$0.0036 & 0.4446$\\pm$0.0034 & 0.4980 \\\\\n & GCN & 0.1048$\\pm$0.0141 & 0.1688$\\pm$0.0141 & 0.4981$\\pm$0.0212 & 0.1328$\\pm$0.0144 & 0.2009$\\pm$0.0159 & 0.2211 \\\\\n & GAT & 0.1918$\\pm$0.0045 & 
0.5564$\\pm$0.0027 & 0.9028$\\pm$0.0030 & 0.3554$\\pm$0.0021 & 0.4318$\\pm$0.0026 & 0.4876 \\\\\n & GraphSAGE-mean & 0.1920$\\pm$0.0053 & 0.5525$\\pm$0.0008 & 0.8874$\\pm$0.0025 & 0.3542$\\pm$0.0025 & 0.4280$\\pm$0.0030 & 0.4828 \\\\\n & GraphSAGE-pooling & 0.2059$\\pm$0.0027 & 0.6054$\\pm$0.0034 & \\textbf{0.9217$\\pm$0.0014} & 0.3906$\\pm$0.0027 & \\textbf{0.4696$\\pm$0.0023} & \\textbf{0.5186 } \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2140$\\pm$0.0042} & \\textbf{0.6077$\\pm$0.0131} & 0.9184$\\pm$0.0054 & \\textbf{0.3918$\\pm$0.0072} & 0.4613$\\pm$0.0055 & \\textbf{0.5186 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{MU} & NCF & 0.2046$\\pm$0.0043 & 0.6078$\\pm$0.0026 & \\textbf{0.9590$\\pm$0.0007} & 0.3835$\\pm$0.0036 & \\textbf{0.5093$\\pm$0.0031} & 0.5328 \\\\\n & GCN & 0.1594$\\pm$0.0002 & 0.4984$\\pm$0.0019 & 0.7589$\\pm$0.0034 & 0.2946$\\pm$0.0006 & 0.3981$\\pm$0.0008 & 0.4219 \\\\\n & GAT & 0.2335$\\pm$0.0159 & 0.6833$\\pm$0.0072 & 0.9545$\\pm$0.0005 & 0.4463$\\pm$0.0128 & 0.5002$\\pm$0.0112 & 0.5636 \\\\\n & GraphSAGE-mean & 0.1927$\\pm$0.0121 & 0.5923$\\pm$0.0196 & 0.8901$\\pm$0.0220 & 0.3742$\\pm$0.0161 & 0.4406$\\pm$0.0167 & 0.4980 \\\\\n & GraphSAGE-pooling & 0.2215$\\pm$0.0193 & 0.6210$\\pm$0.0190 & 0.9484$\\pm$0.0026 & 0.4145$\\pm$0.0208 & 0.4965$\\pm$0.0171 & 0.5404 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2399$\\pm$0.0026} & \\textbf{0.6887$\\pm$0.0009} & 0.9507$\\pm$0.0028 & \\textbf{0.4470$\\pm$0.0011} & 0.5055$\\pm$0.0028 & \\textbf{0.5664 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{BO} & NCF & 0.2567$\\pm$0.0081 & 0.6733$\\pm$0.007 & 0.9422$\\pm$0.0024 & 0.4558$\\pm$0.0081 & 0.5164$\\pm$0.007 & 0.5689 \\\\\n & GCN & 0.1899$\\pm$0.0004 & 0.5007$\\pm$0.0017 & 0.6991$\\pm$0.001 & 0.3558$\\pm$0.0002 & 0.3900$\\pm$0.0002 & 0.4271 \\\\\n & GAT & 0.2805$\\pm$0.0258 & 0.7034$\\pm$0.0365 & 0.9369$\\pm$0.0202 & \\textbf{0.4776$\\pm$0.0321} & 0.5303$\\pm$0.0286 & 0.5857 \\\\\n & GraphSAGE-mean & 0.2137$\\pm$0.0009 & 0.6036$\\pm$0.0007 & 
0.8741$\\pm$0.0022 & 0.3920$\\pm$0.0007 & 0.4525$\\pm$0.001 & 0.5072 \\\\\n & GraphSAGE-pooling & 0.2716$\\pm$0.0148 & 0.6987$\\pm$0.0143 & 0.9351$\\pm$0.0051 & 0.4653$\\pm$0.0155 & 0.5166$\\pm$0.0136 & 0.5775 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2867$\\pm$0.005} & \\textbf{0.7055$\\pm$0.0063} & \\textbf{0.9431$\\pm$0.0042} & 0.4757$\\pm$0.0061 & \\textbf{0.5392$\\pm$0.0058} & \\textbf{0.5900 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:single}\n\\end{table*}\n\n", "Descriptive_question1": "What is the number of shared items for the TC→IQI dataset in table_1?", "Descriptive_question2": "How many unshared users are there in the source domain for the ML→NF dataset in table_1?", "Reasoning_question1": "Which dataset pair in table_1 shows the largest difference in feedback numbers between the source and target domains, and what might this imply about user behavior across domains?", "Reasoning_question2": "In table_1, why might the MO→MU dataset pair have significantly higher feedback numbers compared to the TC→IQI pair, considering their shared entity types?", "Descriptive_answer1": "5,568", "Descriptive_answer2": "30,279", "Reasoning_answer1": "The dataset pair ML→NF shows the largest difference in feedback numbers between the source and target domains, with the source domain (MovieLens) having 11,555,621 feedback instances compared to the target domain (Netflix) with 199,765 feedback instances, a difference of approximately 11,355,856. This significant disparity could imply several things about user behavior across domains. First, it might suggest that users in the source domain (MovieLens) are more active or engaged in providing feedback, possibly due to differences in platform design, user interface, or incentives for interaction. Second, the MovieLens platform may have a longer history or a larger user base contributing to feedback accumulation over time. 
Lastly, the nature of content or recommendation algorithms on MovieLens might encourage more interactions compared to Netflix, where user behavior might be more passive or focused on consumption rather than feedback.", "Reasoning_answer2": "The MO→MU dataset pair from Douban has significantly higher feedback numbers (7,366,992 in the source domain and 3,784,331 in the target domain) compared to the TC→IQI pair (314,621 in the source domain and 78,429 in the target domain). This difference can be attributed to several factors related to the shared entity types and platform characteristics. Since MO→MU shares users across Movie (MO) and Music (MU) domains on the Douban platform, the feedback may reflect a more integrated user experience where users are active across multiple content types within the same social network ecosystem, leading to higher engagement. In contrast, TC→IQI shares items (movies) across two separate video platforms (Tencent and iQIYI), which might result in fragmented user bases and less consistent feedback due to users being split between platforms with potentially different interaction patterns. Additionally, Douban's focus on social networking and community-driven content might encourage more user feedback compared to the primarily content consumption-focused platforms like Tencent and iQIYI." 
}, { "paper_id": "2007.05911.json", "table_id": "table_2", "table_content": "\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on single domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{6}[3]{*}{IQI} & NCF & 0.1545$\\pm$0.0029 & 0.5004$\\pm$0.0039 & 0.9153$\\pm$0.0015 & 0.2986$\\pm$0.0020 & 0.4185$\\pm$0.0088 & 0.4575 \\\\\n & GCN & 0.0877$\\pm$0.0040 & 0.4747$\\pm$0.0233 & 0.6620$\\pm$0.0323 & 0.2937$\\pm$0.0116 & 0.3361$\\pm$0.0137 & 0.3708 \\\\\n & GAT & 0.1497$\\pm$0.0545 & \\textbf{0.5878$\\pm$0.0765} & 0.9589$\\pm$0.0100 & 0.3359$\\pm$0.0797 & 0.4368$\\pm$0.0632 & 0.4938 \\\\\n & GraphSAGE-mean & 0.0912$\\pm$0.0243 & 0.5671$\\pm$0.0388 & 0.9618$\\pm$0.0013 & 0.3145$\\pm$0.0298 & 0.3943$\\pm$0.0234 & 0.4658 \\\\\n & GraphSAGE-pooling & 0.1122$\\pm$0.0217 & 0.5796$\\pm$0.0522 & 0.9508$\\pm$0.0041 & 0.3083$\\pm$0.0346 & 0.3956$\\pm$0.0231 & 0.4693 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.1591$\\pm$0.0278} & 0.5821$\\pm$0.0486 & \\textbf{0.9671$\\pm$0.0060} & \\textbf{0.3376$\\pm$0.0315} & \\textbf{0.4391$\\pm$0.0228} & \\textbf{0.4970 } \\\\\n\\midrule\n \\multirow{6}[3]{*}{NF} & NCF & 0.2102$\\pm$0.0038 & 0.5840$\\pm$0.004 & 0.8706$\\pm$0.0025 & 0.3804$\\pm$0.0036 & 0.4446$\\pm$0.0034 & 0.4980 \\\\\n & GCN & 0.1048$\\pm$0.0141 & 0.1688$\\pm$0.0141 & 0.4981$\\pm$0.0212 & 0.1328$\\pm$0.0144 & 0.2009$\\pm$0.0159 & 0.2211 \\\\\n & GAT & 0.1918$\\pm$0.0045 & 0.5564$\\pm$0.0027 & 0.9028$\\pm$0.0030 & 0.3554$\\pm$0.0021 & 0.4318$\\pm$0.0026 & 0.4876 \\\\\n & GraphSAGE-mean & 0.1920$\\pm$0.0053 & 0.5525$\\pm$0.0008 & 0.8874$\\pm$0.0025 & 0.3542$\\pm$0.0025 & 0.4280$\\pm$0.0030 & 0.4828 \\\\\n & GraphSAGE-pooling & 0.2059$\\pm$0.0027 & 0.6054$\\pm$0.0034 & \\textbf{0.9217$\\pm$0.0014} & 0.3906$\\pm$0.0027 & 
\\textbf{0.4696$\\pm$0.0023} & \\textbf{0.5186 } \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2140$\\pm$0.0042} & \\textbf{0.6077$\\pm$0.0131} & 0.9184$\\pm$0.0054 & \\textbf{0.3918$\\pm$0.0072} & 0.4613$\\pm$0.0055 & \\textbf{0.5186 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{MU} & NCF & 0.2046$\\pm$0.0043 & 0.6078$\\pm$0.0026 & \\textbf{0.9590$\\pm$0.0007} & 0.3835$\\pm$0.0036 & \\textbf{0.5093$\\pm$0.0031} & 0.5328 \\\\\n & GCN & 0.1594$\\pm$0.0002 & 0.4984$\\pm$0.0019 & 0.7589$\\pm$0.0034 & 0.2946$\\pm$0.0006 & 0.3981$\\pm$0.0008 & 0.4219 \\\\\n & GAT & 0.2335$\\pm$0.0159 & 0.6833$\\pm$0.0072 & 0.9545$\\pm$0.0005 & 0.4463$\\pm$0.0128 & 0.5002$\\pm$0.0112 & 0.5636 \\\\\n & GraphSAGE-mean & 0.1927$\\pm$0.0121 & 0.5923$\\pm$0.0196 & 0.8901$\\pm$0.0220 & 0.3742$\\pm$0.0161 & 0.4406$\\pm$0.0167 & 0.4980 \\\\\n & GraphSAGE-pooling & 0.2215$\\pm$0.0193 & 0.6210$\\pm$0.0190 & 0.9484$\\pm$0.0026 & 0.4145$\\pm$0.0208 & 0.4965$\\pm$0.0171 & 0.5404 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2399$\\pm$0.0026} & \\textbf{0.6887$\\pm$0.0009} & 0.9507$\\pm$0.0028 & \\textbf{0.4470$\\pm$0.0011} & 0.5055$\\pm$0.0028 & \\textbf{0.5664 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{BO} & NCF & 0.2567$\\pm$0.0081 & 0.6733$\\pm$0.007 & 0.9422$\\pm$0.0024 & 0.4558$\\pm$0.0081 & 0.5164$\\pm$0.007 & 0.5689 \\\\\n & GCN & 0.1899$\\pm$0.0004 & 0.5007$\\pm$0.0017 & 0.6991$\\pm$0.001 & 0.3558$\\pm$0.0002 & 0.3900$\\pm$0.0002 & 0.4271 \\\\\n & GAT & 0.2805$\\pm$0.0258 & 0.7034$\\pm$0.0365 & 0.9369$\\pm$0.0202 & \\textbf{0.4776$\\pm$0.0321} & 0.5303$\\pm$0.0286 & 0.5857 \\\\\n & GraphSAGE-mean & 0.2137$\\pm$0.0009 & 0.6036$\\pm$0.0007 & 0.8741$\\pm$0.0022 & 0.3920$\\pm$0.0007 & 0.4525$\\pm$0.001 & 0.5072 \\\\\n & GraphSAGE-pooling & 0.2716$\\pm$0.0148 & 0.6987$\\pm$0.0143 & 0.9351$\\pm$0.0051 & 0.4653$\\pm$0.0155 & 0.5166$\\pm$0.0136 & 0.5775 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2867$\\pm$0.005} & \\textbf{0.7055$\\pm$0.0063} & \\textbf{0.9431$\\pm$0.0042} & 
0.4757$\\pm$0.0061 & \\textbf{0.5392$\\pm$0.0058} & \\textbf{0.5900 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:single}\n\\end{table*}", "caption": "The experimental results evaluated by HR@K and NDCG@K on single domain recommendation task with 95\\% confidence intervals.", "label": "tab:single", "section_info": "4 Experiments\n\\section{Experiments}\n\\begin{table*}[!t]\n \\centering\n \\caption{Statistics of the datasets. ``$\\#$'' means the number of the corresponding items.}\n \\resizebox{.75\\linewidth}{!}{\n \\begin{tabular}{llllll}\n \\toprule\n \\multirow{2}[4]{*}{Dataset} & \\multirow{2}[4]{*}{Shared (\\#)} & \\multicolumn{2}{c}{Source Domain} & \\multicolumn{2}{c}{Target Domain} \\\\\n\\cmidrule{3-6} & & Unshared (\\#) & \\#Feedback & Unshared (\\#) & \\#Feedback \\\\\n \\midrule\n TC$\\rightarrow$IQI & Item (5,568) & User (35,398) & 314,621 & User (19,999) & 78,429 \\\\\n ML$\\rightarrow$NF & Item (5,565) & User (30,279) & 11,555,621 & User (11,498) & 199,765 \\\\\n MO$\\rightarrow$MU & User (27,898) & Item (15,465) & 7,366,992 & Item (14,521) & 3,784,331 \\\\\n MU$\\rightarrow$BO & User (27,898) & Item (14,521) & 3,784,331 & Item (15,774) & 1,936,754 \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:dataset}\n\\end{table*}\nIn this section, we perform experiments to evaluate the proposed model and framework against various baselines on real-world datasets. \nWe first introduce the datasets, evaluation protocol, implementation details and baseline methods of our experiments. Finally, we present our experimental results and analysis.\n\n\\subsection{Datasets}\nWe utilize four pairs of frequently used real-world datasets, which contain two pairs of \\textbf{user-shared} datasets and two pairs of \\textbf{item-shared} datasets. 
\nFor all datasets, we only use the user IDs, item IDs and their implicit feedback information.\nFor simplicity, we intentionally transform the rating data into binary (1/0, indicating whether a user has interacted with an item or not) to fit the problem setting of implicit feedback, following \\cite{gao2019natr}.\nThe statistics of the four pairs of datasets are listed in Table \\ref{tab:dataset}.\n\\begin{itemize}\n\\item \\textbf{TC$\\rightarrow$IQI} \\cite{yan2019tciqi} are from two mainstream video websites, Tencent (TC)\\footnote{https://v.qq.com} and iQIYI (IQI)\\footnote{https://www.iqiyi.com}, in China. \nThere are a lot of overlapped items (movies) on the two websites. \nWe take TC and IQI as the source and target domains, respectively. \nWe obtained the processed dataset pair directly from \\cite{yan2019tciqi}.\n\n\\item \\textbf{ML$\\rightarrow$NF}\\footnote{https://grouplens.org/datasets/movielens}$^,$\\footnote{https://www.kaggle.com/laowingkin/netflix-movie-recommendation/data} are from two popular movie recommendation platforms, MovieLens and Netflix, in which there are a lot of overlapped items (movies). \nWe take MovieLens (ML) as the source domain and Netflix (NF) as the target domain. We identify identical movies by their names (case insensitive) and years to avoid misidentifications as much as possible, a data processing method similar to that of \\cite{gao2019natr}.\n\n\\item \\textbf{MO$\\rightarrow$MU} are from the famous social network platform Douban\\footnote{https://www.douban.com\\label{douban}} in China. Overlapped users have feedback on both Movie (MO) and Music (MU).\nWe take MO as the source domain and MU as the target domain.\n\n\\item \\textbf{MU$\\rightarrow$BO} are also from the famous social network platform Douban\\textsuperscript{\\ref{douban}} in China. 
Overlapped users have feedback on both Music (MU) and Book (BO).\nWe take MU as the source domain and BO as the target domain.\n\\end{itemize}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on the single domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{6}[3]{*}{IQI} & NCF & 0.1545$\\pm$0.0029 & 0.5004$\\pm$0.0039 & 0.9153$\\pm$0.0015 & 0.2986$\\pm$0.0020 & 0.4185$\\pm$0.0088 & 0.4575 \\\\\n & GCN & 0.0877$\\pm$0.0040 & 0.4747$\\pm$0.0233 & 0.6620$\\pm$0.0323 & 0.2937$\\pm$0.0116 & 0.3361$\\pm$0.0137 & 0.3708 \\\\\n & GAT & 0.1497$\\pm$0.0545 & \\textbf{0.5878$\\pm$0.0765} & 0.9589$\\pm$0.0100 & 0.3359$\\pm$0.0797 & 0.4368$\\pm$0.0632 & 0.4938 \\\\\n & GraphSAGE-mean & 0.0912$\\pm$0.0243 & 0.5671$\\pm$0.0388 & 0.9618$\\pm$0.0013 & 0.3145$\\pm$0.0298 & 0.3943$\\pm$0.0234 & 0.4658 \\\\\n & GraphSAGE-pooling & 0.1122$\\pm$0.0217 & 0.5796$\\pm$0.0522 & 0.9508$\\pm$0.0041 & 0.3083$\\pm$0.0346 & 0.3956$\\pm$0.0231 & 0.4693 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.1591$\\pm$0.0278} & 0.5821$\\pm$0.0486 & \\textbf{0.9671$\\pm$0.0060} & \\textbf{0.3376$\\pm$0.0315} & \\textbf{0.4391$\\pm$0.0228} & \\textbf{0.4970 } \\\\\n\\midrule\n \\multirow{6}[3]{*}{NF} & NCF & 0.2102$\\pm$0.0038 & 0.5840$\\pm$0.004 & 0.8706$\\pm$0.0025 & 0.3804$\\pm$0.0036 & 0.4446$\\pm$0.0034 & 0.4980 \\\\\n & GCN & 0.1048$\\pm$0.0141 & 0.1688$\\pm$0.0141 & 0.4981$\\pm$0.0212 & 0.1328$\\pm$0.0144 & 0.2009$\\pm$0.0159 & 0.2211 \\\\\n & GAT & 0.1918$\\pm$0.0045 & 0.5564$\\pm$0.0027 & 0.9028$\\pm$0.0030 & 0.3554$\\pm$0.0021 & 0.4318$\\pm$0.0026 & 0.4876 \\\\\n & GraphSAGE-mean & 0.1920$\\pm$0.0053 & 0.5525$\\pm$0.0008 & 0.8874$\\pm$0.0025 & 0.3542$\\pm$0.0025 & 0.4280$\\pm$0.0030 & 0.4828 \\\\\n & GraphSAGE-pooling & 0.2059$\\pm$0.0027 & 
0.6054$\\pm$0.0034 & \\textbf{0.9217$\\pm$0.0014} & 0.3906$\\pm$0.0027 & \\textbf{0.4696$\\pm$0.0023} & \\textbf{0.5186 } \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2140$\\pm$0.0042} & \\textbf{0.6077$\\pm$0.0131} & 0.9184$\\pm$0.0054 & \\textbf{0.3918$\\pm$0.0072} & 0.4613$\\pm$0.0055 & \\textbf{0.5186 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{MU} & NCF & 0.2046$\\pm$0.0043 & 0.6078$\\pm$0.0026 & \\textbf{0.9590$\\pm$0.0007} & 0.3835$\\pm$0.0036 & \\textbf{0.5093$\\pm$0.0031} & 0.5328 \\\\\n & GCN & 0.1594$\\pm$0.0002 & 0.4984$\\pm$0.0019 & 0.7589$\\pm$0.0034 & 0.2946$\\pm$0.0006 & 0.3981$\\pm$0.0008 & 0.4219 \\\\\n & GAT & 0.2335$\\pm$0.0159 & 0.6833$\\pm$0.0072 & 0.9545$\\pm$0.0005 & 0.4463$\\pm$0.0128 & 0.5002$\\pm$0.0112 & 0.5636 \\\\\n & GraphSAGE-mean & 0.1927$\\pm$0.0121 & 0.5923$\\pm$0.0196 & 0.8901$\\pm$0.0220 & 0.3742$\\pm$0.0161 & 0.4406$\\pm$0.0167 & 0.4980 \\\\\n & GraphSAGE-pooling & 0.2215$\\pm$0.0193 & 0.6210$\\pm$0.0190 & 0.9484$\\pm$0.0026 & 0.4145$\\pm$0.0208 & 0.4965$\\pm$0.0171 & 0.5404 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2399$\\pm$0.0026} & \\textbf{0.6887$\\pm$0.0009} & 0.9507$\\pm$0.0028 & \\textbf{0.4470$\\pm$0.0011} & 0.5055$\\pm$0.0028 & \\textbf{0.5664 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{BO} & NCF & 0.2567$\\pm$0.0081 & 0.6733$\\pm$0.007 & 0.9422$\\pm$0.0024 & 0.4558$\\pm$0.0081 & 0.5164$\\pm$0.007 & 0.5689 \\\\\n & GCN & 0.1899$\\pm$0.0004 & 0.5007$\\pm$0.0017 & 0.6991$\\pm$0.001 & 0.3558$\\pm$0.0002 & 0.3900$\\pm$0.0002 & 0.4271 \\\\\n & GAT & 0.2805$\\pm$0.0258 & 0.7034$\\pm$0.0365 & 0.9369$\\pm$0.0202 & \\textbf{0.4776$\\pm$0.0321} & 0.5303$\\pm$0.0286 & 0.5857 \\\\\n & GraphSAGE-mean & 0.2137$\\pm$0.0009 & 0.6036$\\pm$0.0007 & 0.8741$\\pm$0.0022 & 0.3920$\\pm$0.0007 & 0.4525$\\pm$0.001 & 0.5072 \\\\\n & GraphSAGE-pooling & 0.2716$\\pm$0.0148 & 0.6987$\\pm$0.0143 & 0.9351$\\pm$0.0051 & 0.4653$\\pm$0.0155 & 0.5166$\\pm$0.0136 & 0.5775 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2867$\\pm$0.005} & 
\\textbf{0.7055$\\pm$0.0063} & \\textbf{0.9431$\\pm$0.0042} & 0.4757$\\pm$0.0061 & \\textbf{0.5392$\\pm$0.0058} & \\textbf{0.5900 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:single}\n\\end{table*}\n\n\\subsection{Evaluation Protocol}\nFollowing existing works \\cite{he2017neural,hu2019hybrid}, we adopt the Leave-One-Out (LOO) evaluation.\nFor each user, we randomly sample one interaction for the validation set and one for the test set.\nWe also follow the common strategy \\cite{hu2019hybrid,gao2019natr} of randomly sampling 99 unobserved (negative) items for each user and then evaluating how well the model ranks the test item against these negative ones. \nThen, we adopt two standard metrics, \\textbf{HR@K} and \\textbf{NDCG@K}, which are widely used in recommendation \\cite{gao2019natr,hu2019hybrid,he2017neural,wang2018tem,ding2018improving}, to evaluate the ranking performance of each method. The HR@K is computed as follows:\n\\begin{eqnarray}\nHR@K=\\frac{1}{|U|}\\sum_{u\\in\\setU} I(p_u\\leq K),\n\\end{eqnarray}\nwhere $p_u$ is the hit position of user $u$'s test item, and $I(\\cdot)$ is the indicator function.\nThe NDCG@K is computed as follows:\n\\begin{eqnarray}\nNDCG@K=\\frac{1}{|U|}\\sum_{u\\in\\setU} I(p_u\\leq K)\\frac{\\log 2}{\\log (p_u+1)},\n\\end{eqnarray}\nso that test items ranked outside the top $K$ contribute zero.\nWe report HR@K and NDCG@K with K = 1, 10 and 50.\nThe larger the value, the better the performance for all the evaluation metrics.\nFor all experiments, we report the metrics with \\textbf{95\\%} \\textbf{confidence intervals} over five runs.\n\n\\subsection{Implementation Details}\nIf a user has feedback on an item, there is an edge between the user node and the item node.\nThus, we construct the feedback graph $G$ utilized in our experiments.\n\nFor the single domain recommendation task, we perform experiments on the four target domain datasets (i.e., IQI, NF, MU, BO).\nFor all datasets we use: embedding dimension $k=32$, neighbor sampling threshold $\\delta=30$ with two \\modela~layers, negative 
sampling ratio $\\gamma=8$, mini-batch size of 256 and learning rate of 0.001. \nWe also apply dropout with probability 0.4.\n\nFor the cross-domain recommendation task, we perform experiments on the four pairs of cross-domain datasets.\nFor all datasets we use: embedding dimension $k=16$, neighbor sampling threshold $\\delta=10$ with one \\modela~layer, negative sampling ratio $\\gamma=8$,\ntunable hyper-parameter $\\alpha=0.7$ to control the relative strength of the terms in Equation (\\ref{equ:lossst}),\nmini-batch size of 256 and learning rate of 0.001. \nWe also apply dropout with probability 0.4.\n\nAll these values and the hyper-parameters of all baselines are chosen via a grid search on the IQI validation set.\nWe do not perform any dataset-specific tuning except early stopping on validation sets.\nAll models are implemented using TensorFlow\\footnote{https://www.tensorflow.org} and trained on a GTX 1080Ti GPU.\nTraining is performed through stochastic gradient descent over shuffled mini-batches with the Adam \\cite{kingma2014adam} update rule.\n\n\\subsection{Baseline Methods}\nWe construct three groups of experiments to demonstrate the effectiveness of the proposed model and framework.\n\\subsubsection{Single Domain Recommendation}\nWe compare the proposed \\modela~model with the following baseline models.\n\\begin{itemize}\n\\item \\textbf{NCF}~\\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for recommendation tasks with implicit feedback.\nWe use the variant of NCF known as Generalized Matrix Factorization (GMF).\n\\item \\textbf{GCN}~\\cite{kipf2016gcn}: The vanilla GCN learns latent node representations based on the first-order approximation of spectral graph convolutions. \n\\item \\textbf{GAT}~\\cite{velivckovic2017gat}: It applies the attention mechanism to learn different weights for aggregating node features from neighbors. 
\n\\item \\textbf{GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node’s local neighborhood by the mean aggregator.\n\\item \\textbf{GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node’s local neighborhood by the pooling aggregator.\n\\end{itemize}\nFor GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling, we apply the inner product on the user and item node representations as the output.\n\n\\subsubsection{Cross-Domain Recommendation}\nWe compare the proposed \\modelb~model with the following baseline models.\n\\begin{itemize}\n\\item \\textbf{CST}~\\cite{pan2010cst}: Coordinate System Transfer (CST) assumes that both users and items are overlapped and adds\ntwo regularization terms to its objective function. Here, we adapt CST to our datasets by retaining only the single-side (i.e., user-side or item-side) regularization term.\n\\item \\textbf{CD-NCF}~\\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for single domain recommendation tasks with implicit feedback. Here, we adapt it to our cross-domain recommendation task by sharing the overlapped user or item embeddings.\n\\item \\textbf{EMCDR}~\\cite{man2017emcdr}: This is an embedding and mapping framework for cross-domain recommendation.\nThe framework contains three stages (Latent Factor Model, Latent Space Mapping and Cross-domain Recommendation), and it is not an end-to-end method.\n\\item \\textbf{EATNN}~\\cite{chen2019eatnn}: This is the state-of-the-art solution for cross-domain recommendation tasks. 
By introducing attention mechanisms, the model automatically assigns a personalized transfer scheme to each user.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation}\nWe apply the proposed cross-domain framework to other baseline GNN models.\n\\begin{itemize}\n\\item \\textbf{CD-GCN}~\\cite{kipf2016gcn}: It applies the proposed general framework to the GCN as described in Section \\ref{sec:general}. \n\\item \\textbf{CD-GAT}~\\cite{velivckovic2017gat}: It applies the proposed general framework to the GAT. \n\\item \\textbf{CD-GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-mean. \n\\item \\textbf{CD-GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-pooling. \n\\end{itemize}\n\n\\subsection{Performance Comparison}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on the cross-domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{5}[3]{*}{TC$\\rightarrow$IQI} & CST & 0.1948$\\pm$0.0039 & \\textbf{0.6678$\\pm$0.0136} & 0.9455$\\pm$0.0028 & 0.4178$\\pm$0.0099 & 0.4858$\\pm$0.0030 & 0.5423 \\\\\n & CD-NCF & 0.1701$\\pm$0.0314 & 0.5408$\\pm$0.0445 & 0.8702$\\pm$0.0402 & 0.3392$\\pm$0.0411 & 0.4131$\\pm$0.0396 & 0.4667 \\\\\n & EMCDR & 0.2058$\\pm$0.0239 & 0.3962$\\pm$0.0628 & 0.7438$\\pm$0.0436 & 0.2897$\\pm$0.0394 & 0.3640$\\pm$0.0358 & 0.3999 \\\\\n & EATNN & 0.1959$\\pm$0.0102 & 0.6473$\\pm$0.0089 & 0.9314$\\pm$0.0026 & 0.4103$\\pm$0.0100 & 0.4906$\\pm$0.0087 & 0.5351 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2105$\\pm$0.0089} & 0.6536$\\pm$0.0159 & \\textbf{0.9758$\\pm$0.0088} & \\textbf{0.4222$\\pm$0.0108} & \\textbf{0.4963$\\pm$0.0080} & \\textbf{0.5517 } \\\\\n\\midrule\n 
\\multirow{5}[3]{*}{ML$\\rightarrow$NF} & CST & 0.1878$\\pm$0.0058 & 0.5413$\\pm$0.0024 & 0.8551$\\pm$0.0007 & 0.3486$\\pm$0.0015 & 0.4178$\\pm$0.0023 & 0.4701 \\\\\n & CD-NCF & 0.1997$\\pm$0.0260 & 0.5540$\\pm$0.0457 & 0.8539$\\pm$0.0246 & 0.3600$\\pm$0.0353 & 0.4266$\\pm$0.0310 & 0.4788 \\\\\n & EMCDR & 0.0968$\\pm$0.0260 & 0.3406$\\pm$0.0240 & 0.6522$\\pm$0.0730 & 0.2027$\\pm$0.0170 & 0.2708$\\pm$0.0070 & 0.3126 \\\\\n & EATNN & 0.2103$\\pm$0.0018 & 0.5892$\\pm$0.0038 & 0.8745$\\pm$0.0016 & 0.3835$\\pm$0.0015 & 0.4472$\\pm$0.0013 & 0.5009 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2243$\\pm$0.0047} & \\textbf{0.6247$\\pm$0.0069} & \\textbf{0.9228$\\pm$0.0033} & \\textbf{0.4062$\\pm$0.0055} & \\textbf{0.4732$\\pm$0.0043} & \\textbf{0.5302 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MO$\\rightarrow$MU} & CST & 0.2378$\\pm$0.0085 & 0.5934$\\pm$0.0024 & 0.9051$\\pm$0.0073 & 0.3986$\\pm$0.0115 & 0.4775$\\pm$0.0035 & 0.5225 \\\\\n & CD-NCF & 0.2599$\\pm$0.0200 & 0.7232$\\pm$0.0430 & 0.9480$\\pm$0.0261 & 0.4747$\\pm$0.0315 & 0.5281$\\pm$0.0281 & 0.5868 \\\\\n & EMCDR & 0.2290$\\pm$0.0290 & 0.5610$\\pm$0.0703 & 0.8430$\\pm$0.0560 & 0.3834$\\pm$0.0320 & 0.4234$\\pm$0.0410 & 0.4880 \\\\\n & EATNN & 0.2680$\\pm$0.0021 & 0.7253$\\pm$0.0035 & 0.9457$\\pm$0.0026 & \\textbf{0.4881$\\pm$0.0013} & 0.5282$\\pm$0.0014 & 0.5911 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2728$\\pm$0.0054} & \\textbf{0.7314$\\pm$0.0072} & \\textbf{0.9671$\\pm$0.002} & 0.4851$\\pm$0.0060 & \\textbf{0.5389$\\pm$0.0049} & \\textbf{0.5991 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MU$\\rightarrow$BO} & CST & 0.2524$\\pm$0.0089 & 0.6973$\\pm$0.0102 & 0.9355$\\pm$0.0098 & 0.4575$\\pm$0.0105 & 0.5143$\\pm$0.0068 & 0.5714 \\\\\n & CD-NCF & 0.2770$\\pm$0.0158 & 0.7184$\\pm$0.0332 & \\textbf{0.9472$\\pm$0.0261} & 0.4841$\\pm$0.0215 & 0.5334$\\pm$0.0836 & 0.5920 \\\\\n & EMCDR & 0.2004$\\pm$0.2972 & 0.4864$\\pm$0.5881 & 0.7612$\\pm$0.4115 & 0.3324$\\pm$0.4423 & 0.3920$\\pm$0.4082 & 0.4345 
\\\\\n & EATNN & 0.2731$\\pm$0.0015 & 0.7064$\\pm$0.0036 & 0.9277$\\pm$0.0026 & 0.4634$\\pm$0.0013 & 0.5070$\\pm$0.0017 & 0.5755 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2978$\\pm$0.0481} & \\textbf{0.7267$\\pm$0.0688} & 0.9424$\\pm$0.0295 & \\textbf{0.4872$\\pm$0.0609} & \\textbf{0.5502$\\pm$0.0523} & \\textbf{0.6009 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:cross}\n\\end{table*}\n\\begin{figure*}[!t]\n\\begin{center}\n\\includegraphics[width=\\linewidth]{framework}\n\\caption{The HR@K results of the general cross-domain framework on 4 (datasets) $\\times$ 10 (models) = 40 tasks.}\n\\label{general_framework}\n\\end{center}\n\\end{figure*}\n\\subsubsection{Single Domain Recommendation Task}\nWe demonstrate the effectiveness of our \\modela~on four target domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and BO are presented in Table \\ref{tab:single}. \nFrom these results, we have the following observations.\n\\begin{itemize}\n\t\\item[-] Among the GNN baselines, the GCN achieves acceptable performance on multiple datasets. \n\tThe GraphSAGE-mean improves on GCN by introducing the mean aggregator to aggregate messages from each node's local neighborhood. \n\tThe GraphSAGE-pooling achieves further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies the element-wise max-pooling operation to the neighbor messages transformed by a fully-connected neural network. \n\tThe GAT obtains a further performance improvement by assigning different learnable weights to neighbor messages.\n\t\\item[-] NCF also obtains competitive recommendation performance, which further explains why simple collaborative filtering methods are widely used in recommender systems. 
\n\tOn most tasks, our \\modela~outperforms the NCF, which demonstrates that graph-structured data are useful for recommender systems.\n \\item[-] Our \\modela~obtains the best or near-best performance on all datasets and outperforms the GNN baselines on most metrics.\n Besides, although the improvement of the \\modela~over the GAT is marginal on a few metrics and datasets, the \\textbf{Average} values of the \\modela~are better on all four datasets, which indicates that the \\modela~has better generalization performance than the GAT.\n\\end{itemize}\nThe essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.\nOur \\modela~aggregates local neighbor messages via high-order feature interactions.\nTherefore, the \\modela~can achieve better performance and is more suitable for recommendation tasks.\nOverall, these improvements indicate that our \\modela~can effectively integrate neighbor messages to generate more effective node representations and is more suitable when handling graph-structured data. \n\n\\subsubsection{Cross-Domain Recommendation Task}\nWe also demonstrate the effectiveness of our \\modelb~on the four pairs of cross-domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K are presented in Table \\ref{tab:cross}. 
\nFrom these results, we have the following findings.\n\\begin{itemize}\n \\item[-] The collaborative-filtering-based CD-NCF still obtains competitive recommendation performance by sharing the embeddings of overlapped users or items, and it improves on the recommendation performance of CST on all datasets except TC$\\rightarrow$IQI.\n We conjecture that collaborative filtering methods need a lot of data to obtain good performance, while the TC$\\rightarrow$IQI pair has less feedback data.\n This also demonstrates that collaborative filtering is indeed a simple and efficient method in recommender systems.\n\t\\item[-] EMCDR is not an end-to-end method, and its poor performance may result from the accumulation of errors at each step.\n\t\\item[-] EATNN is the state-of-the-art cross-domain recommendation baseline, and it achieves nearly the best results across multiple datasets among these baselines.\n\t\\item[-] By utilizing the graph topology, our \\modelb~improves the recommendation performance compared with these methods.\n\tThis demonstrates that the proposed cross-domain framework combined with the proposed \\modela~is more suitable for the graph-structured data in cross-domain recommendation.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation Task}\nOur cross-domain framework is a general framework that can be applied to various existing GNN models. \nHere we apply the cross-domain framework to GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling. \nTo verify that our cross-domain framework is applicable to various GNN models, we conduct experiments on 40 tasks ($4\\times10=40$: 4 pairs of datasets, 10 models). \nThe results are shown in Figure \\ref{general_framework}. 
The red lines are the baselines, which only use the target training set to train the model (also shown in Table \\ref{tab:single}), and the blue lines are the cross-domain models, which apply the general cross-domain framework.\nFrom the results, we have the following findings:\n\\begin{itemize}\n \\item[-] On most tasks, our cross-domain framework effectively improves the performance of the single domain models, which also demonstrates that the cross-domain framework can be applied to various existing GNN models.\n \\item[-] The improvement on GCN is larger than on the other four GNN models. The main reason might be that the single domain GCN is significantly weaker than the other, improved GNN models, as shown in Table \\ref{tab:single}, so the improvement brought by the cross-domain framework to the other GNN models is relatively smaller than for GCN.\n \\item[-] The performance of GraphSAGE-mean and GraphSAGE-pooling is unsatisfactory on several datasets; the reason might be that the mean and pooling aggregators are too simple, and their fewer shared parameters make them difficult to train coordinately across the two domains.\n\\end{itemize}\n\nOverall, we observe that the performance improvement brought by the cross-domain framework is significant and that it improves the performance of the base GNN models on different datasets,\nwhich proves that the cross-domain framework is compatible with many GNN models.\n\n\\subsection{Ablation Study}\n\\begin{table*}[!t]\n \\centering\n \\caption{Results of ablation study on cross-domain recommendation task based on \\modelb. 
``*'' indicates that the improvement is statistically significant with the p-value $<$ 0.05 on independent samples t-tests.}\n \\resizebox{.7\\linewidth}{!}{\n \\begin{tabular}{c|ccc|ccc}\n \\toprule\n Model & HR@1 & HR@10 & HR@50 & HR@1 & HR@10 & HR@50 \\\\\n \\midrule\n & \\multicolumn{3}{c|}{TC$\\rightarrow$IQI} & \\multicolumn{3}{c}{MO$\\rightarrow$MU} \\\\\n CD-GFM-base & 0.1681 & 0.5914 & 0.9362 & 0.2445 & 0.6989 & 0.9054 \\\\\n \\modelb & \\textbf{0.2105*} & \\textbf{0.6536*} & \\textbf{0.9758*} & \\textbf{0.2728*} & \\textbf{0.7314*} & \\textbf{0.9671*} \\\\\n \\midrule\n & \\multicolumn{3}{c|}{ML$\\rightarrow$NF} & \\multicolumn{3}{c}{MU$\\rightarrow$BO} \\\\\n CD-GFM-base & 0.2178 & 0.6196 & 0.9182 & 0.2756 & 0.6963 & 0.9395 \\\\\n \\modelb & \\textbf{0.2243*} & \\textbf{0.6247*} & \\textbf{0.9228} & \\textbf{0.2978*} & \\textbf{0.7267*} & \\textbf{0.9424} \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:ablation}\n\\end{table*}\nMoreover, to understand the contribution of the shared node initialization in \\modelb,\nwe construct ablation experiments over \\textbf{CD-GFM-base} and \\modelb~on the four pairs of datasets.\n\\textbf{CD-GFM-base} only uses the domain-specific node representations $\\h_{n_s}$ and $\\h_{n_t}$ output directly from the \\modela~and does not concatenate the initialized input as in Equations (\\ref{concat1}) and (\\ref{concat2}), i.e., \n$\\n_s=\\h_{n_s},\n\\n_t=\\h_{n_t}.\n$\nThe results are presented in Table \\ref{tab:ablation}.\nWe conduct independent samples t-tests, and a p-value $<$ 0.05 indicates\nthat the improvement of \\modelb~over \\textbf{CD-GFM-base} is statistically significant.\nThe improvement demonstrates that the \\modelb~model can efficiently take advantage of the domain-shared and domain-specific node representations simultaneously and obtains the best performance on all datasets, which indicates that both representations matter for the cross-domain recommendation 
performance.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n4.5 Performance Comparison\n\\subsection{Performance Comparison}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on cross-domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NGDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{5}[3]{*}{TC$\\rightarrow$IQI} & CST & 0.1948$\\pm$0.0039 & \\textbf{0.6678$\\pm$0.0136} & 0.9455$\\pm$0.0028 & 0.4178$\\pm$0.0099 & 0.4858$\\pm$0.0030 & 0.5423 \\\\\n & CD-NCF & 0.1701$\\pm$0.0314 & 0.5408$\\pm$0.0445 & 0.8702$\\pm$0.0402 & 0.3392$\\pm$0.0411 & 0.4131$\\pm$0.0396 & 0.4667 \\\\\n & EMCDR & 0.2058$\\pm$0.0239 & 0.3962$\\pm$0.0628 & 0.7438$\\pm$0.0436 & 0.2897$\\pm$0.0394 & 0.3640$\\pm$0.0358 & 0.3999 \\\\\n & EATNN & 0.1959$\\pm$0.0102 & 0.6473$\\pm$0.0089 & 0.9314$\\pm$0.0026 & 0.4103$\\pm$0.0100 & 0.4906$\\pm$0.0087 & 0.5351 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2105$\\pm$0.0089} & 0.6536$\\pm$0.0159 & \\textbf{0.9758$\\pm$0.0088} & \\textbf{0.4222$\\pm$0.0108} & \\textbf{0.4963$\\pm$0.0080} & \\textbf{0.5517 } \\\\\n\\midrule\n \\multirow{5}[3]{*}{ML$\\rightarrow$NF} & CST & 0.1878$\\pm$0.0058 & 0.5413$\\pm$0.0024 & 0.8551$\\pm$0.0007 & 0.3486$\\pm$0.0015 & 0.4178$\\pm$0.0023 & 0.4701 \\\\\n & CD-NCF & 0.1997$\\pm$0.0260 & 0.5540$\\pm$0.0457 & 0.8539$\\pm$0.0246 & 0.3600$\\pm$0.0353 & 0.4266$\\pm$0.0310 & 0.4788 \\\\\n & EMCDR & 0.0968$\\pm$0.0260 & 0.3406$\\pm$0.0240 & 0.6522$\\pm$0.0730 & 0.2027$\\pm$0.0170 & 0.2708$\\pm$0.0070 & 0.3126 \\\\\n & EATNN & 0.2103$\\pm$0.0018 & 0.5892$\\pm$0.0038 & 0.8745$\\pm$0.0016 & 0.3835$\\pm$0.0015 & 0.4472$\\pm$0.0013 & 0.5009 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & 
\\textbf{0.2243$\\pm$0.0047} & \\textbf{0.6247$\\pm$0.0069} & \\textbf{0.9228$\\pm$0.0033} & \\textbf{0.4062$\\pm$0.0055} & \\textbf{0.4732$\\pm$0.0043} & \\textbf{0.5302 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MO$\\rightarrow$MU} & CST & 0.2378$\\pm$0.0085 & 0.5934$\\pm$0.0024 & 0.9051$\\pm$0.0073 & 0.3986$\\pm$0.0115 & 0.4775$\\pm$0.0035 & 0.5225 \\\\\n & CD-NCF & 0.2599$\\pm$0.0200 & 0.7232$\\pm$0.0430 & 0.9480$\\pm$0.0261 & 0.4747$\\pm$0.0315 & 0.5281$\\pm$0.0281 & 0.5868 \\\\\n & EMCDR & 0.2290$\\pm$0.0290 & 0.5610$\\pm$0.0703 & 0.8430$\\pm$0.0560 & 0.3834$\\pm$0.0320 & 0.4234$\\pm$0.0410 & 0.4880 \\\\\n & EATNN & 0.2680$\\pm$0.0021 & 0.7253$\\pm$0.0035 & 0.9457$\\pm$0.0026 & \\textbf{0.4881$\\pm$0.0013} & 0.5282$\\pm$0.0014 & 0.5911 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2728$\\pm$0.0054} & \\textbf{0.7314$\\pm$0.0072} & \\textbf{0.9671$\\pm$0.002} & 0.4851$\\pm$0.0060 & \\textbf{0.5389$\\pm$0.0049} & \\textbf{0.5991 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MU$\\rightarrow$BO} & CST & 0.2524$\\pm$0.0089 & 0.6973$\\pm$0.0102 & 0.9355$\\pm$0.0098 & 0.4575$\\pm$0.0105 & 0.5143$\\pm$0.0068 & 0.5714 \\\\\n & CD-NCF & 0.2770$\\pm$0.0158 & 0.7184$\\pm$0.0332 & \\textbf{0.9472$\\pm$0.0261} & 0.4841$\\pm$0.0215 & 0.5334$\\pm$0.0836 & 0.5920 \\\\\n & EMCDR & 0.2004$\\pm$0.2972 & 0.4864$\\pm$0.5881 & 0.7612$\\pm$0.4115 & 0.3324$\\pm$0.4423 & 0.3920$\\pm$0.4082 & 0.4345 \\\\\n & EATNN & 0.2731$\\pm$0.0015 & 0.7064$\\pm$0.0036 & 0.9277$\\pm$0.0026 & 0.4634$\\pm$0.0013 & 0.5070$\\pm$0.0017 & 0.5755 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2978$\\pm$0.0481} & \\textbf{0.7267$\\pm$0.0688} & 0.9424$\\pm$0.0295 & \\textbf{0.4872$\\pm$0.0609} & \\textbf{0.5502$\\pm$0.0523} & \\textbf{0.6009 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:cross}\n\\end{table*}\n\\begin{figure*}[!t]\n\\begin{center}\n\\includegraphics[width=\\linewidth]{framework}\n\\caption{The HR@K results of the general cross-domain framework on 4 (datastes) $\\times$ 10 
(models) = 40 tasks.}\n\\label{general_framework}\n\\end{center}\n\\end{figure*}\n\\subsubsection{Single Domain Recommendation Task}\nWe demonstrate the effectiveness of our \\modela~on four target domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and Bo are presented in Table \\ref{tab:single}. \nFrom these results, we have the following insightful observations.\n\\begin{itemize}\n\t\\item[-] Among these GNN baselines, the GCN has acceptable performances on multiple datasets. \n\tThe GraphSAGE-mean improves the results comparing with GCN via introducing the mean aggregator to aggregate messages from each node's local neighborhood. \n\tThe GraphSAGE-pooling achieves further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies the element-wise max-pooling operation on the transformed neighbor messages through a fully-connected neural network. \n\tThe GAT obtains further performance improvement via assigning different learnable weights to neighbor messages.\n\t\\item[-] NCF also obtains competitive recommendation performance, which further validates why the simple collaborative filtering methods can be widely used in recommender systems. \n\tOn most tasks, our \\modela~outperforms the NCF, which demonstrates the graph-structured data are useful for recommender systems.\n \\item[-] Our \\modela~almost obtains the best performance on multiple datasets. 
It outperforms the GNN baselines on multi-pair metrics.\n Besides, although the improvement of the \\modela~ compared with the GAT is marginal on a few metrics and datasets, the \\textbf{Average} values of these metrics of the \\modela~are better on all four datasets, which indicates that the \\modela~has better generalization performance than the GAT.\n\\end{itemize}\nThe essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.\nOur \\modela~aggregates local neighbor messages via high-order feature interactions.\nTherefore, the \\modela~ can achieve better performance and is more suitable on recommendation tasks.\nOverall, these improvements indicate the fact that \nour \\modela~can effectively integrate neighbor messages to generate more effective node representations and is more suitable when confronting the graph-structured data. \n\n\\subsubsection{Cross-Domain Recommendation Task}\nWe also demonstrate the effectiveness of our \\modelb~on four pairs cross-domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K are presented in Table \\ref{tab:cross}. 
\nFrom these results, we have the following findings.\n\\begin{itemize}\n \\item[-] The collaborative filtering based CD-NCF still obtains competitive recommendation performance via sharing the embedding of overlapped users or items, and it improves the recommendation performance of the CST on all datasets except the TC$\\rightarrow$IQI.\n We conjecture that collaborative filtering methods need a lot of data to obtain good performance, while the TC$\\rightarrow$IQI have less feedback data.\n It also demonstrates that collaborative filtering is indeed a simple and efficient method in recommender systems.\n\t\\item[-] EMCDR is not an end-to-end method, and the poor performance may result from the accumulation of errors at each step.\n\t\\item[-] EATNN is the state-of-the-art cross-domain recommendation baseline, and it achieves nearly the best results across multiple datasets among these baselines.\n\t\\item[-] By utilizing the graph topology, our \\modelb~ improves the recommendation performance compared with various methods.\n\tIt demonstrates that the proposed cross-domain framework combined with the proposed \\modela~ is more suitable for the graph-structured data in cross-domain recommendation.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation Task}\nOur cross-domain framework is a general framework\nthat can be applied upon various existing GNN models. \nHere we apply the cross-domain framework\nto GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling. \nIn order to prove that our cross domain framework is applicable to various GNN models. \nWe conduct experiments on 40 tasks ($4\\times10=40$, 4 pairs datasets, 10 models). \nThe results are shown in Figure \\ref{general_framework}. 
The red lines are the baselines which only use the target training set to train model, also shown in Table \\ref{tab:single}, and the blue lines are the cross-domain models which applied the general cross-domain framework.\nFrom the results, we have the following findings:\n\\begin{itemize}\n \\item[-] On most tasks, our cross-domain framework is effective to improve the performance of the single domain models which also demonstrates the cross-domain framework can be applied upon various existing GNN models.\n \\item[-] The improvement on GCN is larger than the other four GNN models. The main reason might be that the single domain GCN is significantly weaker than other improved GNN models as showed in Table \\ref{tab:single}, so the improvement of other GNN models brought by the cross-domain framework is relatively less than GCN.\n \\item[-] The performance of the GraphSAGE-mean and GraphSAGE-pooling is unsatisfying on several datasets, the reason might be that the mean and pooling aggregators are too simple and fewer shared parameters make them difficult to coordinately train in two domains.\n\\end{itemize}\n\nOverall, we observe that the performance improvement of the cross-domain framework is significant and it is able to improve the performance of base GNN models on different datasets,\nwhich proves that the cross-domain framework is compatible with many GNN models.\n\n4.5.1 Single Domain Recommendation Task\n\\subsubsection{Single Domain Recommendation Task}\nWe demonstrate the effectiveness of our \\modela~on four target domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and Bo are presented in Table \\ref{tab:single}. \nFrom these results, we have the following insightful observations.\n\\begin{itemize}\n\t\\item[-] Among these GNN baselines, the GCN has acceptable performances on multiple datasets. 
\n\tGraphSAGE-mean improves on GCN by introducing the mean aggregator to aggregate messages from each node's local neighborhood. \n\tGraphSAGE-pooling achieves a further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies an element-wise max-pooling operation to the neighbor messages after transforming them with a fully-connected neural network. \n\tGAT obtains a further performance improvement by assigning different learnable weights to neighbor messages.\n\t\\item[-] NCF also obtains competitive recommendation performance, which further explains why simple collaborative filtering methods are so widely used in recommender systems. \n\tOn most tasks, our \\modela~outperforms NCF, which demonstrates that graph-structured data are useful for recommender systems.\n \\item[-] Our \\modela~obtains the best performance on almost all datasets, outperforming the GNN baselines on most metrics.\n Besides, although the improvement of the \\modela~over the GAT is marginal on a few metrics and datasets, the \\textbf{Average} values of these metrics are better for the \\modela~on all four datasets, which indicates that the \\modela~generalizes better than the GAT.\n\\end{itemize}\nThe essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.\nOur \\modela~aggregates local neighbor messages via high-order feature interactions.\nTherefore, the \\modela~achieves better performance and is better suited to recommendation tasks.\nOverall, these improvements indicate that\nour \\modela~can effectively integrate neighbor messages into more effective node representations and is better suited to graph-structured data. 
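The observations above contrast the mean and pooling aggregators of GraphSAGE. A minimal sketch of the two aggregators, assuming 8-dimensional messages, a single fully-connected layer with ReLU, and randomly initialized weights (none of these sizes or choices come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_aggregator(neighbor_msgs):
    # GraphSAGE-mean: element-wise average of the neighbor messages.
    return neighbor_msgs.mean(axis=0)

def pooling_aggregator(neighbor_msgs, W, b):
    # GraphSAGE-pooling: transform each neighbor message with a
    # fully-connected layer (here with ReLU), then take the
    # element-wise max over neighbors.
    transformed = np.maximum(neighbor_msgs @ W + b, 0.0)
    return transformed.max(axis=0)

# Toy example: 5 neighbors, each with an 8-dimensional message.
msgs = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 8))
b = np.zeros(8)

h_mean = mean_aggregator(msgs)           # shape (8,)
h_pool = pooling_aggregator(msgs, W, b)  # shape (8,)
```

The pooling aggregator is the more expressive of the two in this sketch: the learned transformation lets the max-pooling pick out salient neighbor features, which is consistent with the observation that GraphSAGE-pooling improves over GraphSAGE-mean.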
\n\n4.5.3 General Cross-Domain Recommendation Task\n\\subsubsection{General Cross-Domain Recommendation Task}\nOur cross-domain framework is a general framework\nthat can be applied upon various existing GNN models. \nHere we apply the cross-domain framework\nto GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling. \nTo show that our cross-domain framework is applicable to various GNN models,\nwe conduct experiments on 40 tasks ($4\\times10=40$; 4 dataset pairs, 10 models). \nThe results are shown in Figure \\ref{general_framework}. The red lines are the baselines that use only the target training set for training, also shown in Table \\ref{tab:single}, and the blue lines are the cross-domain models that apply the general cross-domain framework.\nFrom the results, we have the following findings:\n\\begin{itemize}\n \\item[-] On most tasks, our cross-domain framework effectively improves the performance of the single-domain models, which also demonstrates that the cross-domain framework can be applied upon various existing GNN models.\n \\item[-] The improvement on GCN is larger than on the other four GNN models. 
The main reason might be that single-domain GCN is significantly weaker than the other, improved GNN models, as shown in Table \\ref{tab:single}, so the improvement that the cross-domain framework brings to the other GNN models is relatively smaller than for GCN.\n \\item[-] The performance of GraphSAGE-mean and GraphSAGE-pooling is unsatisfying on several datasets; the reason might be that the mean and pooling aggregators are too simple, and their fewer shared parameters make them difficult to train coordinately in the two domains.\n\\end{itemize}\n\nOverall, we observe that the performance improvement from the cross-domain framework is significant and that it improves the performance of base GNN models on different datasets,\nwhich shows that the cross-domain framework is compatible with many GNN models.\n\n", "Descriptive_question1": "What is the HR@1 value for the GFM model on the IQI dataset in table_2?", "Descriptive_question2": "Which model achieved the highest HR@10 score on the NF dataset in table_2?", "Reasoning_question1": "Why might the GFM model consistently outperform other models across most metrics on different datasets in table_2?", "Reasoning_question2": "How does the performance of GraphSAGE-pooling compare to GraphSAGE-mean across the datasets in table_2, and what could be the reasons for any observed differences?", "Descriptive_answer1": "0.1591", "Descriptive_answer2": "GFM", "Reasoning_answer1": "The GFM model likely outperforms other models across most metrics on different datasets in table_2 due to its ability to effectively integrate neighbor messages through high-order feature interactions. Unlike other baseline models like NCF, GCN, GAT, and GraphSAGE variants, GFM seems to be designed to leverage graph-structured data more efficiently. This is critical in recommendation tasks where finding similarities through local neighbor nodes can significantly enhance performance. 
As noted in the observations, GFM almost always achieves the best performance on multiple datasets, surpassing GNN baselines on various metrics. Additionally, its consistently higher Average values across all datasets suggest better generalization performance compared to models like GAT, which might excel only on specific metrics. This indicates that GFM's architecture is more suited to capture complex patterns in user-item interactions inherent in the graph structure, leading to improved recommendation accuracy across diverse datasets like IQI, NF, MU, and BO.", "Reasoning_answer2": "In comparing GraphSAGE-pooling to GraphSAGE-mean across the datasets in table_2, we observe that GraphSAGE-pooling generally performs better on most metrics. For instance, on the IQI dataset, GraphSAGE-pooling has an HR@1 of 0.1122 compared to GraphSAGE-mean's 0.0912, and a higher Average score of 0.4693 versus 0.4658. On the NF dataset, GraphSAGE-pooling achieves an HR@50 of 0.9217 (the highest) and an Average of 0.5186, outperforming GraphSAGE-mean's HR@50 of 0.8874 and Average of 0.4828. This trend continues across MU and BO datasets, where GraphSAGE-pooling consistently shows higher scores, such as an Average of 0.5404 versus 0.4980 on MU. The likely reason for this difference, as suggested by the text, is the use of a more complex pooling aggregator in GraphSAGE-pooling, which applies element-wise max-pooling on transformed neighbor messages through a fully-connected neural network. This allows for better feature aggregation compared to the simpler mean aggregator used in GraphSAGE-mean, which merely averages neighbor messages. The enhanced aggregation mechanism in GraphSAGE-pooling likely captures more nuanced patterns in the data, leading to improved performance across various metrics and datasets." 
}, { "paper_id": "2007.05911.json", "table_id": "table_3", "table_content": "\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on cross-domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NGDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{5}[3]{*}{TC$\\rightarrow$IQI} & CST & 0.1948$\\pm$0.0039 & \\textbf{0.6678$\\pm$0.0136} & 0.9455$\\pm$0.0028 & 0.4178$\\pm$0.0099 & 0.4858$\\pm$0.0030 & 0.5423 \\\\\n & CD-NCF & 0.1701$\\pm$0.0314 & 0.5408$\\pm$0.0445 & 0.8702$\\pm$0.0402 & 0.3392$\\pm$0.0411 & 0.4131$\\pm$0.0396 & 0.4667 \\\\\n & EMCDR & 0.2058$\\pm$0.0239 & 0.3962$\\pm$0.0628 & 0.7438$\\pm$0.0436 & 0.2897$\\pm$0.0394 & 0.3640$\\pm$0.0358 & 0.3999 \\\\\n & EATNN & 0.1959$\\pm$0.0102 & 0.6473$\\pm$0.0089 & 0.9314$\\pm$0.0026 & 0.4103$\\pm$0.0100 & 0.4906$\\pm$0.0087 & 0.5351 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2105$\\pm$0.0089} & 0.6536$\\pm$0.0159 & \\textbf{0.9758$\\pm$0.0088} & \\textbf{0.4222$\\pm$0.0108} & \\textbf{0.4963$\\pm$0.0080} & \\textbf{0.5517 } \\\\\n\\midrule\n \\multirow{5}[3]{*}{ML$\\rightarrow$NF} & CST & 0.1878$\\pm$0.0058 & 0.5413$\\pm$0.0024 & 0.8551$\\pm$0.0007 & 0.3486$\\pm$0.0015 & 0.4178$\\pm$0.0023 & 0.4701 \\\\\n & CD-NCF & 0.1997$\\pm$0.0260 & 0.5540$\\pm$0.0457 & 0.8539$\\pm$0.0246 & 0.3600$\\pm$0.0353 & 0.4266$\\pm$0.0310 & 0.4788 \\\\\n & EMCDR & 0.0968$\\pm$0.0260 & 0.3406$\\pm$0.0240 & 0.6522$\\pm$0.0730 & 0.2027$\\pm$0.0170 & 0.2708$\\pm$0.0070 & 0.3126 \\\\\n & EATNN & 0.2103$\\pm$0.0018 & 0.5892$\\pm$0.0038 & 0.8745$\\pm$0.0016 & 0.3835$\\pm$0.0015 & 0.4472$\\pm$0.0013 & 0.5009 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2243$\\pm$0.0047} & \\textbf{0.6247$\\pm$0.0069} & \\textbf{0.9228$\\pm$0.0033} & \\textbf{0.4062$\\pm$0.0055} & \\textbf{0.4732$\\pm$0.0043} & \\textbf{0.5302 } \\\\\n \\midrule\n 
\\multirow{5}[3]{*}{MO$\\rightarrow$MU} & CST & 0.2378$\\pm$0.0085 & 0.5934$\\pm$0.0024 & 0.9051$\\pm$0.0073 & 0.3986$\\pm$0.0115 & 0.4775$\\pm$0.0035 & 0.5225 \\\\\n & CD-NCF & 0.2599$\\pm$0.0200 & 0.7232$\\pm$0.0430 & 0.9480$\\pm$0.0261 & 0.4747$\\pm$0.0315 & 0.5281$\\pm$0.0281 & 0.5868 \\\\\n & EMCDR & 0.2290$\\pm$0.0290 & 0.5610$\\pm$0.0703 & 0.8430$\\pm$0.0560 & 0.3834$\\pm$0.0320 & 0.4234$\\pm$0.0410 & 0.4880 \\\\\n & EATNN & 0.2680$\\pm$0.0021 & 0.7253$\\pm$0.0035 & 0.9457$\\pm$0.0026 & \\textbf{0.4881$\\pm$0.0013} & 0.5282$\\pm$0.0014 & 0.5911 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2728$\\pm$0.0054} & \\textbf{0.7314$\\pm$0.0072} & \\textbf{0.9671$\\pm$0.002} & 0.4851$\\pm$0.0060 & \\textbf{0.5389$\\pm$0.0049} & \\textbf{0.5991 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MU$\\rightarrow$BO} & CST & 0.2524$\\pm$0.0089 & 0.6973$\\pm$0.0102 & 0.9355$\\pm$0.0098 & 0.4575$\\pm$0.0105 & 0.5143$\\pm$0.0068 & 0.5714 \\\\\n & CD-NCF & 0.2770$\\pm$0.0158 & 0.7184$\\pm$0.0332 & \\textbf{0.9472$\\pm$0.0261} & 0.4841$\\pm$0.0215 & 0.5334$\\pm$0.0836 & 0.5920 \\\\\n & EMCDR & 0.2004$\\pm$0.2972 & 0.4864$\\pm$0.5881 & 0.7612$\\pm$0.4115 & 0.3324$\\pm$0.4423 & 0.3920$\\pm$0.4082 & 0.4345 \\\\\n & EATNN & 0.2731$\\pm$0.0015 & 0.7064$\\pm$0.0036 & 0.9277$\\pm$0.0026 & 0.4634$\\pm$0.0013 & 0.5070$\\pm$0.0017 & 0.5755 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2978$\\pm$0.0481} & \\textbf{0.7267$\\pm$0.0688} & 0.9424$\\pm$0.0295 & \\textbf{0.4872$\\pm$0.0609} & \\textbf{0.5502$\\pm$0.0523} & \\textbf{0.6009 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:cross}\n\\end{table*}", "caption": "The experimental results evaluated by HR@K and NDCG@K on cross-domain recommendation task with 95\\% confidence intervals.", "label": "tab:cross", "section_info": "4 Experiments\n\\section{Experiments}\n\\begin{table*}[!t]\n \\centering\n \\caption{Statistics of the datasets. 
``$\\#$'' means the number of the corresponding items.}\n \\resizebox{.75\\linewidth}{!}{\n \\begin{tabular}{llllll}\n \\toprule\n \\multirow{2}[4]{*}{Dataset} & \\multirow{2}[4]{*}{Shared (\\#)} & \\multicolumn{2}{c}{Source Domain} & \\multicolumn{2}{c}{Target Domain} \\\\\n\\cmidrule{3-6} & & Unshared (\\#) & \\#Feedback & Unshared (\\#) & \\#Feedback \\\\\n \\midrule\n TC$\\rightarrow$IQI & Item (5,568) & User (35,398) & 314,621 & User (19,999) & 78,429 \\\\\n ML$\\rightarrow$NF & Item (5,565) & User (30,279) & 11,555,621 & User (11,498) & 199,765 \\\\\n MO$\\rightarrow$MU & User (27,898) & Item (15,465) & 7,366,992 & Item (14,521) & 3,784,331 \\\\\n MU$\\rightarrow$BO & User (27,898) & Item (14,521) & 3,784,331 & Item (15,774) & 1,936,754 \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:dataset}\n\\end{table*}\nIn this section, we perform experiments to evaluate the proposed model and framework against various baselines on real-world datasets. \nWe first introduce the datasets, evaluation protocol, implementation details and baseline methods of our experiments. Finally, we present our experimental results and analysis.\n\n\\subsection{Datasets}\nWe utilize four pairs of frequently used real-world datasets: two pairs of \\textbf{user-shared} datasets and two pairs of \\textbf{item-shared} datasets. \nFor all datasets, we only use the user IDs, item IDs and their implicit feedback information.\nFor simplicity, we transform the rating data into binary form (1/0, indicating whether a user has interacted with an item or not) to fit the implicit-feedback problem setting, following \\cite{gao2019natr}.\nThe statistics of the four dataset pairs are listed in Table \\ref{tab:dataset}.\n\\begin{itemize}\n\\item \\textbf{TC$\\rightarrow$IQI} \\cite{yan2019tciqi} are from two mainstream video websites, Tencent (TC)\\footnote{https://v.qq.com} and iQIYI (IQI)\\footnote{https://www.iqiyi.com}, in China. 
\nThere are a lot of overlapped items (movies) on the two websites. \nWe take TC and IQI as the source and target domains, respectively. \nWe obtained the processed dataset pair directly from \\cite{yan2019tciqi}.\n\n\\item \\textbf{ML$\\rightarrow$NF}\\footnote{https://grouplens.org/datasets/movielens}$^,$\\footnote{https://www.kaggle.com/laowingkin/netflix-movie-recommendation/data} are from two popular movie recommendation platforms, MovieLens and Netflix, between which there are a lot of overlapped items (movies). \nWe take MovieLens (ML) as the source domain and Netflix (NF) as the target domain. We match identical movies by their names (case-insensitive) and years to avoid misidentification as far as possible, a data processing method similar to that of \\cite{gao2019natr}.\n\n\\item \\textbf{MO$\\rightarrow$MU} are from the famous social network platform Douban\\footnote{https://www.douban.com\\label{douban}} in China. Overlapped users have feedback on both Movie (MO) and Music (MU).\nWe take MO as the source domain and MU as the target domain.\n\n\\item \\textbf{MU$\\rightarrow$BO} are also from the famous social network platform Douban\\textsuperscript{\\ref{douban}} in China. 
Overlapped users have feedback on both Music (MU) and Book (BO).\nWe take MU as the source domain and the BO as the target domain.\n\\end{itemize}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on single domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{6}[3]{*}{IQI} & NCF & 0.1545$\\pm$0.0029 & 0.5004$\\pm$0.0039 & 0.9153$\\pm$0.0015 & 0.2986$\\pm$0.0020 & 0.4185$\\pm$0.0088 & 0.4575 \\\\\n & GCN & 0.0877$\\pm$0.0040 & 0.4747$\\pm$0.0233 & 0.6620$\\pm$0.0323 & 0.2937$\\pm$0.0116 & 0.3361$\\pm$0.0137 & 0.3708 \\\\\n & GAT & 0.1497$\\pm$0.0545 & \\textbf{0.5878$\\pm$0.0765} & 0.9589$\\pm$0.0100 & 0.3359$\\pm$0.0797 & 0.4368$\\pm$0.0632 & 0.4938 \\\\\n & GraphSAGE-mean & 0.0912$\\pm$0.0243 & 0.5671$\\pm$0.0388 & 0.9618$\\pm$0.0013 & 0.3145$\\pm$0.0298 & 0.3943$\\pm$0.0234 & 0.4658 \\\\\n & GraphSAGE-pooling & 0.1122$\\pm$0.0217 & 0.5796$\\pm$0.0522 & 0.9508$\\pm$0.0041 & 0.3083$\\pm$0.0346 & 0.3956$\\pm$0.0231 & 0.4693 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.1591$\\pm$0.0278} & 0.5821$\\pm$0.0486 & \\textbf{0.9671$\\pm$0.0060} & \\textbf{0.3376$\\pm$0.0315} & \\textbf{0.4391$\\pm$0.0228} & \\textbf{0.4970 } \\\\\n\\midrule\n \\multirow{6}[3]{*}{NF} & NCF & 0.2102$\\pm$0.0038 & 0.5840$\\pm$0.004 & 0.8706$\\pm$0.0025 & 0.3804$\\pm$0.0036 & 0.4446$\\pm$0.0034 & 0.4980 \\\\\n & GCN & 0.1048$\\pm$0.0141 & 0.1688$\\pm$0.0141 & 0.4981$\\pm$0.0212 & 0.1328$\\pm$0.0144 & 0.2009$\\pm$0.0159 & 0.2211 \\\\\n & GAT & 0.1918$\\pm$0.0045 & 0.5564$\\pm$0.0027 & 0.9028$\\pm$0.0030 & 0.3554$\\pm$0.0021 & 0.4318$\\pm$0.0026 & 0.4876 \\\\\n & GraphSAGE-mean & 0.1920$\\pm$0.0053 & 0.5525$\\pm$0.0008 & 0.8874$\\pm$0.0025 & 0.3542$\\pm$0.0025 & 0.4280$\\pm$0.0030 & 0.4828 \\\\\n & GraphSAGE-pooling & 0.2059$\\pm$0.0027 & 
0.6054$\\pm$0.0034 & \\textbf{0.9217$\\pm$0.0014} & 0.3906$\\pm$0.0027 & \\textbf{0.4696$\\pm$0.0023} & \\textbf{0.5186 } \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2140$\\pm$0.0042} & \\textbf{0.6077$\\pm$0.0131} & 0.9184$\\pm$0.0054 & \\textbf{0.3918$\\pm$0.0072} & 0.4613$\\pm$0.0055 & \\textbf{0.5186 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{MU} & NCF & 0.2046$\\pm$0.0043 & 0.6078$\\pm$0.0026 & \\textbf{0.9590$\\pm$0.0007} & 0.3835$\\pm$0.0036 & \\textbf{0.5093$\\pm$0.0031} & 0.5328 \\\\\n & GCN & 0.1594$\\pm$0.0002 & 0.4984$\\pm$0.0019 & 0.7589$\\pm$0.0034 & 0.2946$\\pm$0.0006 & 0.3981$\\pm$0.0008 & 0.4219 \\\\\n & GAT & 0.2335$\\pm$0.0159 & 0.6833$\\pm$0.0072 & 0.9545$\\pm$0.0005 & 0.4463$\\pm$0.0128 & 0.5002$\\pm$0.0112 & 0.5636 \\\\\n & GraphSAGE-mean & 0.1927$\\pm$0.0121 & 0.5923$\\pm$0.0196 & 0.8901$\\pm$0.0220 & 0.3742$\\pm$0.0161 & 0.4406$\\pm$0.0167 & 0.4980 \\\\\n & GraphSAGE-pooling & 0.2215$\\pm$0.0193 & 0.6210$\\pm$0.0190 & 0.9484$\\pm$0.0026 & 0.4145$\\pm$0.0208 & 0.4965$\\pm$0.0171 & 0.5404 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2399$\\pm$0.0026} & \\textbf{0.6887$\\pm$0.0009} & 0.9507$\\pm$0.0028 & \\textbf{0.4470$\\pm$0.0011} & 0.5055$\\pm$0.0028 & \\textbf{0.5664 } \\\\\n \\midrule\n \\multirow{6}[3]{*}{BO} & NCF & 0.2567$\\pm$0.0081 & 0.6733$\\pm$0.007 & 0.9422$\\pm$0.0024 & 0.4558$\\pm$0.0081 & 0.5164$\\pm$0.007 & 0.5689 \\\\\n & GCN & 0.1899$\\pm$0.0004 & 0.5007$\\pm$0.0017 & 0.6991$\\pm$0.001 & 0.3558$\\pm$0.0002 & 0.3900$\\pm$0.0002 & 0.4271 \\\\\n & GAT & 0.2805$\\pm$0.0258 & 0.7034$\\pm$0.0365 & 0.9369$\\pm$0.0202 & \\textbf{0.4776$\\pm$0.0321} & 0.5303$\\pm$0.0286 & 0.5857 \\\\\n & GraphSAGE-mean & 0.2137$\\pm$0.0009 & 0.6036$\\pm$0.0007 & 0.8741$\\pm$0.0022 & 0.3920$\\pm$0.0007 & 0.4525$\\pm$0.001 & 0.5072 \\\\\n & GraphSAGE-pooling & 0.2716$\\pm$0.0148 & 0.6987$\\pm$0.0143 & 0.9351$\\pm$0.0051 & 0.4653$\\pm$0.0155 & 0.5166$\\pm$0.0136 & 0.5775 \\\\\n\\cmidrule{2-8} & \\textbf{GFM} & \\textbf{0.2867$\\pm$0.005} & 
\\textbf{0.7055$\\pm$0.0063} & \\textbf{0.9431$\\pm$0.0042} & 0.4757$\\pm$0.0061 & \\textbf{0.5392$\\pm$0.0058} & \\textbf{0.5900 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:single}\n\\end{table*}\n\n\\subsection{Evaluation Protocol}\nFollowing existing works \\cite{he2017neural,hu2019hybrid}, we adopt the Leave-One-Out (LOO) evaluation.\nFor each user, we randomly sample one interaction for the validation set and one for the test set.\nWe also follow the common strategy \\cite{hu2019hybrid,gao2019natr} of randomly sampling 99 unobserved (negative) items for each user and then evaluating how well the model ranks the test item against these negative ones. \nThen, we adopt two standard metrics, \\textbf{HR@K} and \\textbf{NDCG@K}, which are widely used in recommendation \\cite{gao2019natr,hu2019hybrid,he2017neural,wang2018tem,ding2018improving}, to evaluate the ranking performance of each method. The HR@K is computed as follows:\n\\begin{eqnarray}\nHR@K=\\frac{1}{|U|}\\sum_{u\\in\\setU} I(p_u\\leq K),\n\\end{eqnarray}\nwhere $p_u$ is the hit position of user $u$'s test item, and $I(\\cdot)$ is the indicator function.\nThe NDCG@K is computed as follows:\n\\begin{eqnarray}\nNDCG@K=\\frac{1}{|U|}\\sum_{u\\in\\setU} I(p_u\\leq K)\\frac{\\log 2}{\\log (p_u+1)}.\n\\end{eqnarray}\nWe report HR@K and NDCG@K with K = 1, 10 and 50.\nFor all evaluation metrics, larger values indicate better performance.\nFor all experiments, we report the metrics with \\textbf{95\\%} \\textbf{confidence intervals} over five runs.\n\n\\subsection{Implementation Details}\nIf a user has feedback on an item, there is an edge between the user node and the item node.\nIn this way, we construct the feedback graph $G$ utilized in our experiments.\n\nFor the single-domain recommendation task, we perform experiments on the four target domain datasets (i.e., IQI, NF, MU, BO).\nFor all datasets we use: embedding dimension $k=32$, neighbor sampling threshold $\\delta=30$ with two \\modela~layers, negative sampling ratio $\\gamma=8$, mini-batch size of 256 and learning rate of 0.001. \nWe also use dropout with probability 0.4.\n\nFor the cross-domain recommendation task, we perform experiments on the four pairs of cross-domain datasets.\nFor all datasets we use: embedding dimension $k=16$, neighbor sampling threshold $\\delta=10$ with one \\modela~layer, negative sampling ratio $\\gamma=8$,\na tunable hyper-parameter $\\alpha=0.7$ to control the relative strength of the terms in Equation (\\ref{equ:lossst}),\nmini-batch size of 256 and learning rate of 0.001. \nWe also use dropout with probability 0.4.\n\nThese values and the hyper-parameters of all baselines are chosen via a grid search on the IQI validation set.\nWe do not perform any dataset-specific tuning except early stopping on the validation sets.\nAll models are implemented using TensorFlow\\footnote{https://www.tensorflow.org} and trained on a GTX 1080 Ti GPU.\nTraining is performed via stochastic gradient descent over shuffled mini-batches with the Adam \\cite{kingma2014adam} update rule.\n\n\\subsection{Baseline Methods}\nWe construct three groups of experiments to demonstrate the effectiveness of the proposed model and framework.\n\\subsubsection{Single Domain Recommendation}\nWe compare the proposed \\modela~model with the following baseline models.\n\\begin{itemize}\n\\item \\textbf{NCF}~\\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for recommendation tasks with implicit feedback.\nWe use one of the variants of NCF, also called Generalized Matrix Factorization (GMF).\n\\item \\textbf{GCN}~\\cite{kipf2016gcn}: The vanilla GCN learns latent node representations based on the first-order approximation of spectral graph convolutions. \n\\item \\textbf{GAT}~\\cite{velivckovic2017gat}: It applies the attention mechanism to learn different weights for aggregating node features from neighbors. 
\n\\item \\textbf{GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node’s local neighborhood with the mean aggregator.\n\\item \\textbf{GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node’s local neighborhood with the pooling aggregator.\n\\end{itemize}\nFor GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling, we apply the inner product on the user and item node representations as the output.\n\n\\subsubsection{Cross-Domain Recommendation}\nWe compare the proposed \\modelb~model with the following baseline models.\n\\begin{itemize}\n\\item \\textbf{CST}~\\cite{pan2010cst}: Coordinate System Transfer (CST) assumes that both users and items are overlapped and adds\ntwo regularization terms to its objective function. Here, we adapt CST to our datasets by retaining only the single-side (i.e., user-side or item-side) regularization term.\n\\item \\textbf{CD-NCF}~\\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for single-domain recommendation tasks with implicit feedback. Here, we adapt it to our cross-domain recommendation task by sharing the overlapped user or item embeddings.\n\\item \\textbf{EMCDR}~\\cite{man2017emcdr}: This is an embedding-and-mapping framework for cross-domain recommendation.\nThe framework consists of a latent factor model, latent space mapping and cross-domain recommendation, and it is not an end-to-end method.\n\\item \\textbf{EATNN}~\\cite{chen2019eatnn}: This is the state-of-the-art solution for cross-domain recommendation tasks. 
By introducing attention mechanisms, the model automatically assigns a personalized transfer scheme for each user.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation}\nWe apply the proposed cross-domain framework to other baseline GNN models.\n\\begin{itemize}\n\\item \\textbf{CD-GCN}~\\cite{kipf2016gcn}: It applies the proposed general framework to the GCN as described in Section \\ref{sec:general}. \n\\item \\textbf{CD-GAT}~\\cite{velivckovic2017gat}: It applies the proposed general framework to the GAT. \n\\item \\textbf{CD-GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-mean. \n\\item \\textbf{CD-GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-pooling. \n\\end{itemize}\n\n\\subsection{Performance Comparison}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on cross-domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NGDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{5}[3]{*}{TC$\\rightarrow$IQI} & CST & 0.1948$\\pm$0.0039 & \\textbf{0.6678$\\pm$0.0136} & 0.9455$\\pm$0.0028 & 0.4178$\\pm$0.0099 & 0.4858$\\pm$0.0030 & 0.5423 \\\\\n & CD-NCF & 0.1701$\\pm$0.0314 & 0.5408$\\pm$0.0445 & 0.8702$\\pm$0.0402 & 0.3392$\\pm$0.0411 & 0.4131$\\pm$0.0396 & 0.4667 \\\\\n & EMCDR & 0.2058$\\pm$0.0239 & 0.3962$\\pm$0.0628 & 0.7438$\\pm$0.0436 & 0.2897$\\pm$0.0394 & 0.3640$\\pm$0.0358 & 0.3999 \\\\\n & EATNN & 0.1959$\\pm$0.0102 & 0.6473$\\pm$0.0089 & 0.9314$\\pm$0.0026 & 0.4103$\\pm$0.0100 & 0.4906$\\pm$0.0087 & 0.5351 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2105$\\pm$0.0089} & 0.6536$\\pm$0.0159 & \\textbf{0.9758$\\pm$0.0088} & \\textbf{0.4222$\\pm$0.0108} & \\textbf{0.4963$\\pm$0.0080} & \\textbf{0.5517 } \\\\\n\\midrule\n 
\\multirow{5}[3]{*}{ML$\\rightarrow$NF} & CST & 0.1878$\\pm$0.0058 & 0.5413$\\pm$0.0024 & 0.8551$\\pm$0.0007 & 0.3486$\\pm$0.0015 & 0.4178$\\pm$0.0023 & 0.4701 \\\\\n & CD-NCF & 0.1997$\\pm$0.0260 & 0.5540$\\pm$0.0457 & 0.8539$\\pm$0.0246 & 0.3600$\\pm$0.0353 & 0.4266$\\pm$0.0310 & 0.4788 \\\\\n & EMCDR & 0.0968$\\pm$0.0260 & 0.3406$\\pm$0.0240 & 0.6522$\\pm$0.0730 & 0.2027$\\pm$0.0170 & 0.2708$\\pm$0.0070 & 0.3126 \\\\\n & EATNN & 0.2103$\\pm$0.0018 & 0.5892$\\pm$0.0038 & 0.8745$\\pm$0.0016 & 0.3835$\\pm$0.0015 & 0.4472$\\pm$0.0013 & 0.5009 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2243$\\pm$0.0047} & \\textbf{0.6247$\\pm$0.0069} & \\textbf{0.9228$\\pm$0.0033} & \\textbf{0.4062$\\pm$0.0055} & \\textbf{0.4732$\\pm$0.0043} & \\textbf{0.5302 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MO$\\rightarrow$MU} & CST & 0.2378$\\pm$0.0085 & 0.5934$\\pm$0.0024 & 0.9051$\\pm$0.0073 & 0.3986$\\pm$0.0115 & 0.4775$\\pm$0.0035 & 0.5225 \\\\\n & CD-NCF & 0.2599$\\pm$0.0200 & 0.7232$\\pm$0.0430 & 0.9480$\\pm$0.0261 & 0.4747$\\pm$0.0315 & 0.5281$\\pm$0.0281 & 0.5868 \\\\\n & EMCDR & 0.2290$\\pm$0.0290 & 0.5610$\\pm$0.0703 & 0.8430$\\pm$0.0560 & 0.3834$\\pm$0.0320 & 0.4234$\\pm$0.0410 & 0.4880 \\\\\n & EATNN & 0.2680$\\pm$0.0021 & 0.7253$\\pm$0.0035 & 0.9457$\\pm$0.0026 & \\textbf{0.4881$\\pm$0.0013} & 0.5282$\\pm$0.0014 & 0.5911 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2728$\\pm$0.0054} & \\textbf{0.7314$\\pm$0.0072} & \\textbf{0.9671$\\pm$0.002} & 0.4851$\\pm$0.0060 & \\textbf{0.5389$\\pm$0.0049} & \\textbf{0.5991 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MU$\\rightarrow$BO} & CST & 0.2524$\\pm$0.0089 & 0.6973$\\pm$0.0102 & 0.9355$\\pm$0.0098 & 0.4575$\\pm$0.0105 & 0.5143$\\pm$0.0068 & 0.5714 \\\\\n & CD-NCF & 0.2770$\\pm$0.0158 & 0.7184$\\pm$0.0332 & \\textbf{0.9472$\\pm$0.0261} & 0.4841$\\pm$0.0215 & 0.5334$\\pm$0.0836 & 0.5920 \\\\\n & EMCDR & 0.2004$\\pm$0.2972 & 0.4864$\\pm$0.5881 & 0.7612$\\pm$0.4115 & 0.3324$\\pm$0.4423 & 0.3920$\\pm$0.4082 & 0.4345 
\\\\\n & EATNN & 0.2731$\\pm$0.0015 & 0.7064$\\pm$0.0036 & 0.9277$\\pm$0.0026 & 0.4634$\\pm$0.0013 & 0.5070$\\pm$0.0017 & 0.5755 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2978$\\pm$0.0481} & \\textbf{0.7267$\\pm$0.0688} & 0.9424$\\pm$0.0295 & \\textbf{0.4872$\\pm$0.0609} & \\textbf{0.5502$\\pm$0.0523} & \\textbf{0.6009 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:cross}\n\\end{table*}\n\\begin{figure*}[!t]\n\\begin{center}\n\\includegraphics[width=\\linewidth]{framework}\n\\caption{The HR@K results of the general cross-domain framework on 4 (datasets) $\\times$ 10 (models) = 40 tasks.}\n\\label{general_framework}\n\\end{center}\n\\end{figure*}\n\\subsubsection{Single Domain Recommendation Task}\nWe demonstrate the effectiveness of our \\modela~on four target domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and BO are presented in Table \\ref{tab:single}. \nFrom these results, we have the following insightful observations.\n\\begin{itemize}\n\t\\item[-] Among these GNN baselines, the GCN achieves acceptable performance on multiple datasets. \n\tGraphSAGE-mean improves on GCN by introducing the mean aggregator to aggregate messages from each node's local neighborhood. \n\tGraphSAGE-pooling achieves a further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies an element-wise max-pooling operation to the neighbor messages after transforming them with a fully-connected neural network. \n\tGAT obtains a further performance improvement by assigning different learnable weights to neighbor messages.\n\t\\item[-] NCF also obtains competitive recommendation performance, which further explains why simple collaborative filtering methods are so widely used in recommender systems. 
\n\tOn most tasks, our \\modela~outperforms NCF, which demonstrates that graph-structured data are useful for recommender systems.\n \\item[-] Our \\modela~obtains the best performance on almost all datasets, outperforming the GNN baselines on most metrics.\n Besides, although the improvement of the \\modela~over the GAT is marginal on a few metrics and datasets, the \\textbf{Average} values of these metrics are better for the \\modela~on all four datasets, which indicates that the \\modela~generalizes better than the GAT.\n\\end{itemize}\nThe essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.\nOur \\modela~aggregates local neighbor messages via high-order feature interactions.\nTherefore, the \\modela~achieves better performance and is better suited to recommendation tasks.\nOverall, these improvements indicate that\nour \\modela~can effectively integrate neighbor messages into more effective node representations and is better suited to graph-structured data. \n\n\\subsubsection{Cross-Domain Recommendation Task}\nWe also demonstrate the effectiveness of our \\modelb~on the four pairs of cross-domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K are presented in Table \\ref{tab:cross}. 
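The HR@K and NDCG@K definitions from the evaluation protocol can be made concrete with a short sketch. Assumptions not stated in the paper: `ranks` holds the 1-indexed position $p_u$ of each user's test item among the 100 ranked candidates, each user has exactly one relevant item (so the ideal DCG is 1), and the $I(p_u \leq K)$ cutoff applies to NDCG@K as well:

```python
import math

def hr_at_k(ranks, k):
    # HR@K = (1/|U|) * sum over users of I(p_u <= K)
    return sum(p <= k for p in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    # NDCG@K with one relevant item per user:
    # (1/|U|) * sum over users of I(p_u <= K) * log(2) / log(p_u + 1)
    return sum(math.log(2) / math.log(p + 1) for p in ranks if p <= k) / len(ranks)

# Toy ranks for four users (rank 1 = the test item beat all 99 negatives).
ranks = [1, 3, 12, 60]
hr10 = hr_at_k(ranks, 10)      # 0.5: two of the four users hit the top 10
ndcg10 = ndcg_at_k(ranks, 10)  # approx. 0.375 = (1.0 + 0.5) / 4
```

Note how NDCG@K rewards early hits: the user whose test item ranks first contributes 1.0, while the rank-3 hit contributes only 0.5 within the same top-10 cutoff.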
\nFrom these results, we have the following findings.\n\\begin{itemize}\n \\item[-] The collaborative-filtering-based CD-NCF still obtains competitive recommendation performance by sharing the embeddings of overlapped users or items, and it improves on the recommendation performance of CST on all datasets except TC$\\rightarrow$IQI.\n We conjecture that collaborative filtering methods need a large amount of data to perform well, while TC$\\rightarrow$IQI has less feedback data.\n This also demonstrates that collaborative filtering is indeed a simple and efficient method in recommender systems.\n\t\\item[-] EMCDR is not an end-to-end method, and its poor performance may result from the accumulation of errors at each step.\n\t\\item[-] EATNN is the state-of-the-art cross-domain recommendation baseline, and it achieves nearly the best results across multiple datasets among these baselines.\n\t\\item[-] By utilizing the graph topology, our \\modelb~improves the recommendation performance compared with these methods.\n\tThis demonstrates that the proposed cross-domain framework combined with the proposed \\modela~is better suited to the graph-structured data in cross-domain recommendation.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation Task}\nOur cross-domain framework is a general framework\nthat can be applied upon various existing GNN models. \nHere we apply the cross-domain framework\nto GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling. \nTo show that our cross-domain framework is applicable to various GNN models,\nwe conduct experiments on 40 tasks ($4\\times10=40$; 4 dataset pairs, 10 models). \nThe results are shown in Figure \\ref{general_framework}. 
The red lines are the baselines which only use the target training set to train model, also shown in Table \\ref{tab:single}, and the blue lines are the cross-domain models which applied the general cross-domain framework.\nFrom the results, we have the following findings:\n\\begin{itemize}\n \\item[-] On most tasks, our cross-domain framework is effective to improve the performance of the single domain models which also demonstrates the cross-domain framework can be applied upon various existing GNN models.\n \\item[-] The improvement on GCN is larger than the other four GNN models. The main reason might be that the single domain GCN is significantly weaker than other improved GNN models as showed in Table \\ref{tab:single}, so the improvement of other GNN models brought by the cross-domain framework is relatively less than GCN.\n \\item[-] The performance of the GraphSAGE-mean and GraphSAGE-pooling is unsatisfying on several datasets, the reason might be that the mean and pooling aggregators are too simple and fewer shared parameters make them difficult to coordinately train in two domains.\n\\end{itemize}\n\nOverall, we observe that the performance improvement of the cross-domain framework is significant and it is able to improve the performance of base GNN models on different datasets,\nwhich proves that the cross-domain framework is compatible with many GNN models.\n\n\\subsection{Ablation Study}\n\\begin{table*}[!t]\n \\centering\n \\caption{Results of ablation study on cross-domain recommendation task based on \\modelb. 
``*'' indicates that the improvement is statistically significant with the p-value $<$ 0.05 on independent samples t-tests.}
 \resizebox{.7\linewidth}{!}{
 \begin{tabular}{c|ccc|ccc}
 \toprule
 Model & HR@1 & HR@10 & HR@50 & HR@1 & HR@10 & HR@50 \\
 \midrule
 & \multicolumn{3}{c|}{TC$\rightarrow$IQI} & \multicolumn{3}{c}{MO$\rightarrow$MU} \\
 CD-GFM-base & 0.1681 & 0.5914 & 0.9362 & 0.2445 & 0.6989 & 0.9054 \\
 \modelb & \textbf{0.2105*} & \textbf{0.6536*} & \textbf{0.9758*} & \textbf{0.2728*} & \textbf{0.7314*} & \textbf{0.9671*} \\
 \midrule
 & \multicolumn{3}{c|}{ML$\rightarrow$NF} & \multicolumn{3}{c}{MU$\rightarrow$BO} \\
 CD-GFM-base & 0.2178 & 0.6196 & 0.9182 & 0.2756 & 0.6963 & 0.9395 \\
 \modelb & \textbf{0.2243*} & \textbf{0.6247*} & \textbf{0.9228} & \textbf{0.2978*} & \textbf{0.7267*} & \textbf{0.9424} \\
 \bottomrule
 \end{tabular}
 }
 \label{tab:ablation}
\end{table*}
Moreover, to understand the contribution of the shared node initialization in \modelb, we construct ablation experiments over \textbf{CD-GFM-base} and \modelb~on the four dataset pairs.
\textbf{CD-GFM-base} only uses the domain-specific node representations $\h_{n_s}$ and $\h_{n_t}$ output directly from the \modela~and does not concatenate the initialized inputs in Equations (\ref{concat1}) and (\ref{concat2}), i.e.,
$\n_s=\h_{n_s}, \n_t=\h_{n_t}.$
The results are presented in Table \ref{tab:ablation}.
We conduct independent samples t-tests, and a p-value $<$ 0.05 indicates that the improvement of \modelb~over \textbf{CD-GFM-base} is statistically significant.
This improvement demonstrates that \modelb~can efficiently take advantage of the domain-shared and domain-specific node representations simultaneously and obtains the best performance on all datasets, which indicates that both representations matter for cross-domain recommendation
performance.

\subsection{Performance Comparison}
\begin{table*}[!t]
 \centering
 \caption{The experimental results evaluated by HR@K and NDCG@K on the cross-domain recommendation task with 95\% confidence intervals.}
 \resizebox{1\linewidth}{!}{
 \begin{tabular}{clcccccc}
 \toprule
 Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \textbf{Average} \\
 \midrule
 \multirow{5}[3]{*}{TC$\rightarrow$IQI} & CST & 0.1948$\pm$0.0039 & \textbf{0.6678$\pm$0.0136} & 0.9455$\pm$0.0028 & 0.4178$\pm$0.0099 & 0.4858$\pm$0.0030 & 0.5423 \\
 & CD-NCF & 0.1701$\pm$0.0314 & 0.5408$\pm$0.0445 & 0.8702$\pm$0.0402 & 0.3392$\pm$0.0411 & 0.4131$\pm$0.0396 & 0.4667 \\
 & EMCDR & 0.2058$\pm$0.0239 & 0.3962$\pm$0.0628 & 0.7438$\pm$0.0436 & 0.2897$\pm$0.0394 & 0.3640$\pm$0.0358 & 0.3999 \\
 & EATNN & 0.1959$\pm$0.0102 & 0.6473$\pm$0.0089 & 0.9314$\pm$0.0026 & 0.4103$\pm$0.0100 & 0.4906$\pm$0.0087 & 0.5351 \\
\cmidrule{2-8} & \textbf{CD-GFM} & \textbf{0.2105$\pm$0.0089} & 0.6536$\pm$0.0159 & \textbf{0.9758$\pm$0.0088} & \textbf{0.4222$\pm$0.0108} & \textbf{0.4963$\pm$0.0080} & \textbf{0.5517 } \\
\midrule
 \multirow{5}[3]{*}{ML$\rightarrow$NF} & CST & 0.1878$\pm$0.0058 & 0.5413$\pm$0.0024 & 0.8551$\pm$0.0007 & 0.3486$\pm$0.0015 & 0.4178$\pm$0.0023 & 0.4701 \\
 & CD-NCF & 0.1997$\pm$0.0260 & 0.5540$\pm$0.0457 & 0.8539$\pm$0.0246 & 0.3600$\pm$0.0353 & 0.4266$\pm$0.0310 & 0.4788 \\
 & EMCDR & 0.0968$\pm$0.0260 & 0.3406$\pm$0.0240 & 0.6522$\pm$0.0730 & 0.2027$\pm$0.0170 & 0.2708$\pm$0.0070 & 0.3126 \\
 & EATNN & 0.2103$\pm$0.0018 & 0.5892$\pm$0.0038 & 0.8745$\pm$0.0016 & 0.3835$\pm$0.0015 & 0.4472$\pm$0.0013 & 0.5009 \\
\cmidrule{2-8} & \textbf{CD-GFM} &
\textbf{0.2243$\pm$0.0047} & \textbf{0.6247$\pm$0.0069} & \textbf{0.9228$\pm$0.0033} & \textbf{0.4062$\pm$0.0055} & \textbf{0.4732$\pm$0.0043} & \textbf{0.5302 } \\
 \midrule
 \multirow{5}[3]{*}{MO$\rightarrow$MU} & CST & 0.2378$\pm$0.0085 & 0.5934$\pm$0.0024 & 0.9051$\pm$0.0073 & 0.3986$\pm$0.0115 & 0.4775$\pm$0.0035 & 0.5225 \\
 & CD-NCF & 0.2599$\pm$0.0200 & 0.7232$\pm$0.0430 & 0.9480$\pm$0.0261 & 0.4747$\pm$0.0315 & 0.5281$\pm$0.0281 & 0.5868 \\
 & EMCDR & 0.2290$\pm$0.0290 & 0.5610$\pm$0.0703 & 0.8430$\pm$0.0560 & 0.3834$\pm$0.0320 & 0.4234$\pm$0.0410 & 0.4880 \\
 & EATNN & 0.2680$\pm$0.0021 & 0.7253$\pm$0.0035 & 0.9457$\pm$0.0026 & \textbf{0.4881$\pm$0.0013} & 0.5282$\pm$0.0014 & 0.5911 \\
\cmidrule{2-8} & \textbf{CD-GFM} & \textbf{0.2728$\pm$0.0054} & \textbf{0.7314$\pm$0.0072} & \textbf{0.9671$\pm$0.002} & 0.4851$\pm$0.0060 & \textbf{0.5389$\pm$0.0049} & \textbf{0.5991 } \\
 \midrule
 \multirow{5}[3]{*}{MU$\rightarrow$BO} & CST & 0.2524$\pm$0.0089 & 0.6973$\pm$0.0102 & 0.9355$\pm$0.0098 & 0.4575$\pm$0.0105 & 0.5143$\pm$0.0068 & 0.5714 \\
 & CD-NCF & 0.2770$\pm$0.0158 & 0.7184$\pm$0.0332 & \textbf{0.9472$\pm$0.0261} & 0.4841$\pm$0.0215 & 0.5334$\pm$0.0836 & 0.5920 \\
 & EMCDR & 0.2004$\pm$0.2972 & 0.4864$\pm$0.5881 & 0.7612$\pm$0.4115 & 0.3324$\pm$0.4423 & 0.3920$\pm$0.4082 & 0.4345 \\
 & EATNN & 0.2731$\pm$0.0015 & 0.7064$\pm$0.0036 & 0.9277$\pm$0.0026 & 0.4634$\pm$0.0013 & 0.5070$\pm$0.0017 & 0.5755 \\
\cmidrule{2-8} & \textbf{CD-GFM} & \textbf{0.2978$\pm$0.0481} & \textbf{0.7267$\pm$0.0688} & 0.9424$\pm$0.0295 & \textbf{0.4872$\pm$0.0609} & \textbf{0.5502$\pm$0.0523} & \textbf{0.6009 } \\
 \bottomrule
 \end{tabular}
 }
 \label{tab:cross}
\end{table*}
\begin{figure*}[!t]
\begin{center}
\includegraphics[width=\linewidth]{framework}
\caption{The HR@K results of the general cross-domain framework on 4 (datasets) $\times$ 10
(models) = 40 tasks.}
\label{general_framework}
\end{center}
\end{figure*}
\subsubsection{Single Domain Recommendation Task}
We demonstrate the effectiveness of our \modela~on four target domain datasets.
The experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and BO are presented in Table \ref{tab:single}.
From these results, we have the following observations.
\begin{itemize}
	\item[-] Among the GNN baselines, the GCN achieves acceptable performance on multiple datasets.
	The GraphSAGE-mean improves on GCN by introducing the mean aggregator to aggregate messages from each node's local neighborhood.
	The GraphSAGE-pooling achieves further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies an element-wise max-pooling operation to the neighbor messages transformed through a fully-connected neural network.
	The GAT obtains a further performance improvement by assigning different learnable weights to neighbor messages.
	\item[-] NCF also obtains competitive recommendation performance, which further explains why simple collaborative filtering methods are widely used in recommender systems.
	On most tasks, our \modela~outperforms the NCF, which demonstrates that graph-structured data are useful for recommender systems.
 \item[-] Our \modela~obtains nearly the best performance on multiple datasets.
It outperforms the GNN baselines on most metrics.
Although the improvement of the \modela~over the GAT is marginal on a few metrics and datasets, the \textbf{Average} values of the \modela~are better on all four datasets, which indicates that the \modela~has better generalization performance than the GAT.
\end{itemize}
The essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.
Our \modela~aggregates local neighbor messages via high-order feature interactions.
Therefore, the \modela~can achieve better performance and is well suited to recommendation tasks.
Overall, these improvements indicate that our \modela~can effectively integrate neighbor messages to generate more effective node representations and is better suited to graph-structured data.

\subsubsection{Cross-Domain Recommendation Task}
We also demonstrate the effectiveness of our \modelb~on four pairs of cross-domain datasets.
The experimental results evaluated by HR@K and NDCG@K are presented in Table \ref{tab:cross}.
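For reference, the two ranking metrics reported throughout these tables can be sketched as below. Under the leave-one-out protocol described in the Evaluation Protocol, `ranks` holds each user's 1-based hit position $p_u$ among the test item plus 99 sampled negatives (the list here is hypothetical), and the top-$K$ truncation is made explicit:

```python
import math

def hr_at_k(ranks, k):
    # HR@K: fraction of users whose held-out test item ranks within the top K.
    return sum(1 for p in ranks if p <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    # NDCG@K with a single relevant item per user: a hit at 1-based
    # position p_u contributes log(2)/log(p_u + 1); a miss contributes 0.
    return sum(math.log(2) / math.log(p + 1) for p in ranks if p <= k) / len(ranks)

ranks = [1, 3, 12, 60, 7]   # hypothetical hit positions for five users
print(hr_at_k(ranks, 10))   # -> 0.6 (three of the five users hit the top 10)
```

Larger values are better for both metrics; unlike HR@K, NDCG@K additionally rewards hits that appear closer to the top of the ranked list.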
From these results, we have the following findings.
\begin{itemize}
 \item[-] The collaborative-filtering-based CD-NCF still obtains competitive recommendation performance by sharing the embeddings of overlapped users or items, and it outperforms CST on all datasets except TC$\rightarrow$IQI.
 We conjecture that collaborative filtering methods need a lot of data to perform well, while TC$\rightarrow$IQI has less feedback data.
 This also demonstrates that collaborative filtering is indeed a simple and efficient approach in recommender systems.
	\item[-] EMCDR is not an end-to-end method, and its poor performance may result from the accumulation of errors at each step.
	\item[-] EATNN is the state-of-the-art cross-domain recommendation baseline, and it achieves nearly the best results across multiple datasets among these baselines.
	\item[-] By utilizing the graph topology, our \modelb~improves the recommendation performance compared with these methods.
	This demonstrates that the proposed cross-domain framework combined with the proposed \modela~is better suited to graph-structured data in cross-domain recommendation.
\end{itemize}

\subsubsection{General Cross-Domain Recommendation Task}
Our cross-domain framework is a general framework that can be applied on top of various existing GNN models.
Here we apply the cross-domain framework to GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling.
To verify that our cross-domain framework is applicable to various GNN models, we conduct experiments on 40 tasks ($4\times10=40$: 4 dataset pairs, 10 models).
The results are shown in Figure \ref{general_framework}.
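The base-model-agnostic joint training behind this framework can be sketched as follows. This is a minimal numpy sketch under stated assumptions: `mean_aggregate` is a hypothetical stand-in for whichever base GNN layer is plugged in, `domain_loss` is a BPR-style placeholder, and all names and shapes are illustrative. Only the shared embedding table for overlapped nodes and the $\alpha$-weighted sum of the two domain losses reflect the framework itself (the implementation details report $\alpha=0.7$):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_aggregate(node_emb, neighbor_embs):
    # Stand-in for any base GNN layer (GCN, GAT, GraphSAGE, ...):
    # the framework only assumes some neighbor-aggregation function.
    return 0.5 * (node_emb + neighbor_embs.mean(axis=0))

k = 16                                   # embedding dimension
shared_users = rng.normal(size=(5, k))   # ONE table, reused by both domains
items_src = rng.normal(size=(8, k))      # domain-specific item embeddings
items_tgt = rng.normal(size=(6, k))

def domain_loss(user_emb, item_embs, pos, negs):
    # BPR-style placeholder: push the positive item's score above the negatives'.
    s_pos = user_emb @ item_embs[pos]
    s_neg = user_emb @ item_embs[negs].T
    return np.log1p(np.exp(-(s_pos - s_neg))).mean()

alpha = 0.7  # weight balancing the target- and source-domain losses
u_tgt = mean_aggregate(shared_users[0], items_tgt[:3])  # the same shared row ...
u_src = mean_aggregate(shared_users[0], items_src[:3])  # ... feeds BOTH domains
loss = alpha * domain_loss(u_tgt, items_tgt, 0, [1, 2]) + \
       (1 - alpha) * domain_loss(u_src, items_src, 0, [1, 2])
```

Because the overlapped users' rows are shared, gradients from both domain losses would update the same parameters, which is how knowledge transfers between domains; swapping `mean_aggregate` for another aggregator yields the CD-GCN/CD-GAT/CD-GraphSAGE variants evaluated here.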
The red lines are the baselines that only use the target training set to train the model, also shown in Table \ref{tab:single}, and the blue lines are the cross-domain models trained with the general cross-domain framework.
From the results, we have the following findings:
\begin{itemize}
 \item[-] On most tasks, our cross-domain framework effectively improves the performance of the single-domain models, which also demonstrates that the cross-domain framework can be applied on top of various existing GNN models.
 \item[-] The improvement on GCN is larger than on the other four GNN models. The main reason might be that the single-domain GCN is significantly weaker than the other, improved GNN models, as shown in Table \ref{tab:single}, so the improvement brought by the cross-domain framework to the other GNN models is relatively smaller than for GCN.
 \item[-] The performance of GraphSAGE-mean and GraphSAGE-pooling is unsatisfying on several datasets. The reason might be that the mean and pooling aggregators are too simple, and fewer shared parameters make them difficult to train coordinately across the two domains.
\end{itemize}

Overall, we observe that the performance improvement of the cross-domain framework is significant and that it is able to improve the performance of base GNN models on different datasets, which proves that the cross-domain framework is compatible with many GNN models.
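The significance testing used for claims like these (independent samples t-tests with p-value $<$ 0.05, as in the ablation study) can be sketched as below. The per-run scores are hypothetical; in practice one would use `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also returns the p-value:

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a, b):
    # Welch's two-sample t statistic (independent samples, unequal variances).
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical HR@10 scores over five runs for a single-domain base model
# and its cross-domain variant.
base = [0.5914, 0.5890, 0.5931, 0.5902, 0.5921]
full = [0.6536, 0.6510, 0.6550, 0.6529, 0.6541]

t = welch_t(full, base)  # a large positive t favours the cross-domain variant
```

The statistic is then compared against the critical value for the appropriate degrees of freedom to obtain the p-value; five runs per setting, as reported here, is the sample size entering the test.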
\section{Experiments}
\begin{table*}[!t]
 \centering
 \caption{Statistics of the datasets. ``$\#$'' means the number of the corresponding items.}
 \resizebox{.75\linewidth}{!}{
 \begin{tabular}{llllll}
 \toprule
 \multirow{2}[4]{*}{Dataset} & \multirow{2}[4]{*}{Shared (\#)} & \multicolumn{2}{c}{Source Domain} & \multicolumn{2}{c}{Target Domain} \\
\cmidrule{3-6} & & Unshared (\#) & \#Feedback & Unshared (\#) & \#Feedback \\
 \midrule
 TC$\rightarrow$IQI & Item (5,568) & User (35,398) & 314,621 & User (19,999) & 78,429 \\
 ML$\rightarrow$NF & Item (5,565) & User (30,279) & 11,555,621 & User (11,498) & 199,765 \\
 MO$\rightarrow$MU & User (27,898) & Item (15,465) & 7,366,992 & Item (14,521) & 3,784,331 \\
 MU$\rightarrow$BO & User (27,898) & Item (14,521) & 3,784,331 & Item (15,774) & 1,936,754 \\
 \bottomrule
 \end{tabular}
 }
 \label{tab:dataset}
\end{table*}
In this section, we perform experiments to evaluate the proposed model and framework against various baselines on real-world datasets.
We first introduce the datasets, evaluation protocol, implementation details and baseline methods of our experiments. Then we present our experimental results and analysis.

\subsection{Datasets}
We utilize four pairs of frequently used real-world datasets: two \textbf{user-shared} pairs and two \textbf{item-shared} pairs.
For all datasets, we only use the user IDs, item IDs and their implicit feedback information.
For simplicity, we intentionally transform the rating data into binary form (1/0, indicating whether a user has interacted with an item or not) to fit the problem setting of implicit feedback, following \cite{gao2019natr}.
The statistics of the four dataset pairs are listed in Table \ref{tab:dataset}.
\begin{itemize}
\item \textbf{TC$\rightarrow$IQI} \cite{yan2019tciqi} come from two mainstream video websites in China, Tencent (TC)\footnote{https://v.qq.com} and iQIYI (IQI)\footnote{https://www.iqiyi.com}.
There are a lot of overlapped items (movies) between the two websites.
We take TC and IQI as the source and target domains, respectively.
We obtained the processed dataset pair directly from \cite{yan2019tciqi}.

\item \textbf{ML$\rightarrow$NF}\footnote{https://grouplens.org/datasets/movielens}$^,$\footnote{https://www.kaggle.com/laowingkin/netflix-movie-recommendation/data} come from two popular movie recommendation platforms, MovieLens and Netflix, which share a lot of overlapped items (movies).
We take MovieLens (ML) as the source domain and Netflix (NF) as the target domain. We identify the same movies by their names (case insensitive) and years to avoid misidentifications as much as possible, following a data processing method similar to \cite{gao2019natr}.

\item \textbf{MO$\rightarrow$MU} come from the famous social network platform Douban\footnote{https://www.douban.com\label{douban}} in China. Overlapped users have feedback on both Movie (MO) and Music (MU).
We take MO as the source domain and MU as the target domain.

\item \textbf{MU$\rightarrow$BO} also come from the famous social network platform Douban\textsuperscript{\ref{douban}} in China.
Overlapped users have feedback on both Music (MU) and Book (BO).
We take MU as the source domain and BO as the target domain.
\end{itemize}
\begin{table*}[!t]
 \centering
 \caption{The experimental results evaluated by HR@K and NDCG@K on the single-domain recommendation task with 95\% confidence intervals.}
 \resizebox{1\linewidth}{!}{
 \begin{tabular}{clcccccc}
 \toprule
 Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \textbf{Average} \\
 \midrule
 \multirow{6}[3]{*}{IQI} & NCF & 0.1545$\pm$0.0029 & 0.5004$\pm$0.0039 & 0.9153$\pm$0.0015 & 0.2986$\pm$0.0020 & 0.4185$\pm$0.0088 & 0.4575 \\
 & GCN & 0.0877$\pm$0.0040 & 0.4747$\pm$0.0233 & 0.6620$\pm$0.0323 & 0.2937$\pm$0.0116 & 0.3361$\pm$0.0137 & 0.3708 \\
 & GAT & 0.1497$\pm$0.0545 & \textbf{0.5878$\pm$0.0765} & 0.9589$\pm$0.0100 & 0.3359$\pm$0.0797 & 0.4368$\pm$0.0632 & 0.4938 \\
 & GraphSAGE-mean & 0.0912$\pm$0.0243 & 0.5671$\pm$0.0388 & 0.9618$\pm$0.0013 & 0.3145$\pm$0.0298 & 0.3943$\pm$0.0234 & 0.4658 \\
 & GraphSAGE-pooling & 0.1122$\pm$0.0217 & 0.5796$\pm$0.0522 & 0.9508$\pm$0.0041 & 0.3083$\pm$0.0346 & 0.3956$\pm$0.0231 & 0.4693 \\
\cmidrule{2-8} & \textbf{GFM} & \textbf{0.1591$\pm$0.0278} & 0.5821$\pm$0.0486 & \textbf{0.9671$\pm$0.0060} & \textbf{0.3376$\pm$0.0315} & \textbf{0.4391$\pm$0.0228} & \textbf{0.4970 } \\
\midrule
 \multirow{6}[3]{*}{NF} & NCF & 0.2102$\pm$0.0038 & 0.5840$\pm$0.004 & 0.8706$\pm$0.0025 & 0.3804$\pm$0.0036 & 0.4446$\pm$0.0034 & 0.4980 \\
 & GCN & 0.1048$\pm$0.0141 & 0.1688$\pm$0.0141 & 0.4981$\pm$0.0212 & 0.1328$\pm$0.0144 & 0.2009$\pm$0.0159 & 0.2211 \\
 & GAT & 0.1918$\pm$0.0045 & 0.5564$\pm$0.0027 & 0.9028$\pm$0.0030 & 0.3554$\pm$0.0021 & 0.4318$\pm$0.0026 & 0.4876 \\
 & GraphSAGE-mean & 0.1920$\pm$0.0053 & 0.5525$\pm$0.0008 & 0.8874$\pm$0.0025 & 0.3542$\pm$0.0025 & 0.4280$\pm$0.0030 & 0.4828 \\
 & GraphSAGE-pooling & 0.2059$\pm$0.0027 &
0.6054$\pm$0.0034 & \textbf{0.9217$\pm$0.0014} & 0.3906$\pm$0.0027 & \textbf{0.4696$\pm$0.0023} & \textbf{0.5186 } \\
\cmidrule{2-8} & \textbf{GFM} & \textbf{0.2140$\pm$0.0042} & \textbf{0.6077$\pm$0.0131} & 0.9184$\pm$0.0054 & \textbf{0.3918$\pm$0.0072} & 0.4613$\pm$0.0055 & \textbf{0.5186 } \\
 \midrule
 \multirow{6}[3]{*}{MU} & NCF & 0.2046$\pm$0.0043 & 0.6078$\pm$0.0026 & \textbf{0.9590$\pm$0.0007} & 0.3835$\pm$0.0036 & \textbf{0.5093$\pm$0.0031} & 0.5328 \\
 & GCN & 0.1594$\pm$0.0002 & 0.4984$\pm$0.0019 & 0.7589$\pm$0.0034 & 0.2946$\pm$0.0006 & 0.3981$\pm$0.0008 & 0.4219 \\
 & GAT & 0.2335$\pm$0.0159 & 0.6833$\pm$0.0072 & 0.9545$\pm$0.0005 & 0.4463$\pm$0.0128 & 0.5002$\pm$0.0112 & 0.5636 \\
 & GraphSAGE-mean & 0.1927$\pm$0.0121 & 0.5923$\pm$0.0196 & 0.8901$\pm$0.0220 & 0.3742$\pm$0.0161 & 0.4406$\pm$0.0167 & 0.4980 \\
 & GraphSAGE-pooling & 0.2215$\pm$0.0193 & 0.6210$\pm$0.0190 & 0.9484$\pm$0.0026 & 0.4145$\pm$0.0208 & 0.4965$\pm$0.0171 & 0.5404 \\
\cmidrule{2-8} & \textbf{GFM} & \textbf{0.2399$\pm$0.0026} & \textbf{0.6887$\pm$0.0009} & 0.9507$\pm$0.0028 & \textbf{0.4470$\pm$0.0011} & 0.5055$\pm$0.0028 & \textbf{0.5664 } \\
 \midrule
 \multirow{6}[3]{*}{BO} & NCF & 0.2567$\pm$0.0081 & 0.6733$\pm$0.007 & 0.9422$\pm$0.0024 & 0.4558$\pm$0.0081 & 0.5164$\pm$0.007 & 0.5689 \\
 & GCN & 0.1899$\pm$0.0004 & 0.5007$\pm$0.0017 & 0.6991$\pm$0.001 & 0.3558$\pm$0.0002 & 0.3900$\pm$0.0002 & 0.4271 \\
 & GAT & 0.2805$\pm$0.0258 & 0.7034$\pm$0.0365 & 0.9369$\pm$0.0202 & \textbf{0.4776$\pm$0.0321} & 0.5303$\pm$0.0286 & 0.5857 \\
 & GraphSAGE-mean & 0.2137$\pm$0.0009 & 0.6036$\pm$0.0007 & 0.8741$\pm$0.0022 & 0.3920$\pm$0.0007 & 0.4525$\pm$0.001 & 0.5072 \\
 & GraphSAGE-pooling & 0.2716$\pm$0.0148 & 0.6987$\pm$0.0143 & 0.9351$\pm$0.0051 & 0.4653$\pm$0.0155 & 0.5166$\pm$0.0136 & 0.5775 \\
\cmidrule{2-8} & \textbf{GFM} & \textbf{0.2867$\pm$0.005} &
\textbf{0.7055$\pm$0.0063} & \textbf{0.9431$\pm$0.0042} & 0.4757$\pm$0.0061 & \textbf{0.5392$\pm$0.0058} & \textbf{0.5900 } \\
 \bottomrule
 \end{tabular}
 }
 \label{tab:single}
\end{table*}

\subsection{Evaluation Protocol}
Following existing works \cite{he2017neural,hu2019hybrid}, we adopt the Leave-One-Out (LOO) evaluation.
We randomly sample one interaction per user for the validation set and one for the test set.
We also follow the common strategy \cite{hu2019hybrid,gao2019natr} of randomly sampling 99 unobserved (negative) items for each user and then evaluating how well the model ranks the test item against these negative ones.
Then, we adopt two standard metrics, \textbf{HR@K} and \textbf{NDCG@K}, which are widely used in recommendation \cite{gao2019natr,hu2019hybrid,he2017neural,wang2018tem,ding2018improving}, to evaluate the ranking performance of each method. The HR@K is computed as follows:
\begin{eqnarray}
HR@K=\frac{1}{|U|}\sum_{u\in\setU} I(p_u\leq K),
\end{eqnarray}
where $p_u$ is the hit position of user $u$'s test item, and $I(\cdot)$ is the indicator function.
The NDCG@K is computed as follows:
\begin{eqnarray}
NDCG@K=\frac{1}{|U|}\sum_{u\in\setU} I(p_u\leq K)\,\frac{\log 2}{\log (p_u+1)}.
\end{eqnarray}
We report HR@K and NDCG@K with K = 1, 10 and 50.
The larger the value, the better the performance for all the evaluation metrics.
For all experiments, we report the metrics with \textbf{95\%} \textbf{confidence intervals} over five runs.

\subsection{Implementation Details}
If a user has feedback on an item, there is an edge between the user node and the item node.
Thus, we construct the feedback graph $G$ utilized in our experiments.

For the single-domain recommendation task, we perform experiments on the four target domain datasets (i.e., IQI, NF, MU, BO).
For all datasets we use: embedding dimension $k=32$, neighbor sampling threshold $\delta=30$ with two \modela~layers, negative
sampling ratio $\gamma=8$, mini-batch size of 256 and learning rate of 0.001.
We also apply dropout with probability 0.4.

For the cross-domain recommendation task, we perform experiments on the four pairs of cross-domain datasets.
For all datasets we use: embedding dimension $k=16$, neighbor sampling threshold $\delta=10$ with one \modela~layer, negative sampling ratio $\gamma=8$,
tunable hyper-parameter $\alpha=0.7$ to control the different strengths in Equation (\ref{equ:lossst}),
mini-batch size of 256 and learning rate of 0.001.
We also apply dropout with probability 0.4.

All these values and the hyper-parameters of all baselines are chosen via a grid search on the IQI validation set.
We do not perform any dataset-specific tuning except early stopping on validation sets.
All models are implemented using TensorFlow\footnote{https://www.tensorflow.org} and trained on a GTX 1080ti GPU.
Training is performed through stochastic gradient descent over shuffled mini-batches with the Adam \cite{kingma2014adam} update rule.

\subsection{Baseline Methods}
We construct three groups of experiments to demonstrate the effectiveness of the proposed model and framework.
\subsubsection{Single Domain Recommendation}
We compare the proposed \modela~model with the following baseline models.
\begin{itemize}
\item \textbf{NCF}~\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for recommendation tasks with implicit feedback.
We use one of the variants of NCF, which is also called Generalized Matrix Factorization (GMF).
\item \textbf{GCN}~\cite{kipf2016gcn}: The vanilla GCN learns latent node representations based on the first-order approximation of spectral graph convolutions.
\item \textbf{GAT}~\cite{velivckovic2017gat}: It applies the attention mechanism to learn different weights for aggregating node features from neighbors.
\item \textbf{GraphSAGE-mean}~\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node's local neighborhood with the mean aggregator.
\item \textbf{GraphSAGE-pooling}~\cite{hamilton2017graphsage}: It learns to aggregate node messages from a node's local neighborhood with the pooling aggregator.
\end{itemize}
For GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling, we apply the inner product on the user and item node representations as the output.

\subsubsection{Cross-Domain Recommendation}
We compare the proposed \modelb~model with the following baseline models.
\begin{itemize}
\item \textbf{CST}~\cite{pan2010cst}: Coordinate System Transfer (CST) assumes that both users and items are overlapped and adds two regularization terms to its objective function. Here, we adapt CST to our datasets by retaining only the single-side (i.e., user-side or item-side) regularization term.
\item \textbf{CD-NCF}~\cite{he2017neural}: Neural Collaborative Filtering (NCF) is the state-of-the-art solution for single-domain recommendation tasks with implicit feedback. Here, we adapt it to our cross-domain recommendation task by sharing the overlapped user or item embeddings.
\item \textbf{EMCDR}~\cite{man2017emcdr}: This is an embedding and mapping framework for cross-domain recommendation.
The framework consists of a Latent Factor Model, Latent Space Mapping and Cross-Domain Recommendation, and it is not an end-to-end method.
\item \textbf{EATNN}~\cite{chen2019eatnn}: This is the state-of-the-art solution for cross-domain recommendation tasks.
By introducing attention mechanisms, the model automatically assigns a personalized transfer scheme for each user.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation}\nWe apply the proposed cross-domain framework to other baseline GNN models.\n\\begin{itemize}\n\\item \\textbf{CD-GCN}~\\cite{kipf2016gcn}: It applies the proposed general framework to the GCN as described in Section \\ref{sec:general}. \n\\item \\textbf{CD-GAT}~\\cite{velivckovic2017gat}: It applies the proposed general framework to the GAT. \n\\item \\textbf{CD-GraphSAGE-mean}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-mean. \n\\item \\textbf{CD-GraphSAGE-pooling}~\\cite{hamilton2017graphsage}: It applies the proposed general framework to the GraphSAGE-pooling. \n\\end{itemize}\n\n\\subsection{Performance Comparison}\n\\begin{table*}[!t]\n \\centering\n \\caption{The experimental results evaluated by HR@K and NDCG@K on the cross-domain recommendation task with 95\\% confidence intervals.}\n \\resizebox{1\\linewidth}{!}{\n \\begin{tabular}{clcccccc}\n \\toprule\n Dataset & Model & HR(NDCG)@1 & HR@10 & HR@50 & NDCG@10 & NDCG@50 & \\textbf{Average} \\\\\n \\midrule\n \\multirow{5}[3]{*}{TC$\\rightarrow$IQI} & CST & 0.1948$\\pm$0.0039 & \\textbf{0.6678$\\pm$0.0136} & 0.9455$\\pm$0.0028 & 0.4178$\\pm$0.0099 & 0.4858$\\pm$0.0030 & 0.5423 \\\\\n & CD-NCF & 0.1701$\\pm$0.0314 & 0.5408$\\pm$0.0445 & 0.8702$\\pm$0.0402 & 0.3392$\\pm$0.0411 & 0.4131$\\pm$0.0396 & 0.4667 \\\\\n & EMCDR & 0.2058$\\pm$0.0239 & 0.3962$\\pm$0.0628 & 0.7438$\\pm$0.0436 & 0.2897$\\pm$0.0394 & 0.3640$\\pm$0.0358 & 0.3999 \\\\\n & EATNN & 0.1959$\\pm$0.0102 & 0.6473$\\pm$0.0089 & 0.9314$\\pm$0.0026 & 0.4103$\\pm$0.0100 & 0.4906$\\pm$0.0087 & 0.5351 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2105$\\pm$0.0089} & 0.6536$\\pm$0.0159 & \\textbf{0.9758$\\pm$0.0088} & \\textbf{0.4222$\\pm$0.0108} & \\textbf{0.4963$\\pm$0.0080} & \\textbf{0.5517 } \\\\\n\\midrule\n 
\\multirow{5}[3]{*}{ML$\\rightarrow$NF} & CST & 0.1878$\\pm$0.0058 & 0.5413$\\pm$0.0024 & 0.8551$\\pm$0.0007 & 0.3486$\\pm$0.0015 & 0.4178$\\pm$0.0023 & 0.4701 \\\\\n & CD-NCF & 0.1997$\\pm$0.0260 & 0.5540$\\pm$0.0457 & 0.8539$\\pm$0.0246 & 0.3600$\\pm$0.0353 & 0.4266$\\pm$0.0310 & 0.4788 \\\\\n & EMCDR & 0.0968$\\pm$0.0260 & 0.3406$\\pm$0.0240 & 0.6522$\\pm$0.0730 & 0.2027$\\pm$0.0170 & 0.2708$\\pm$0.0070 & 0.3126 \\\\\n & EATNN & 0.2103$\\pm$0.0018 & 0.5892$\\pm$0.0038 & 0.8745$\\pm$0.0016 & 0.3835$\\pm$0.0015 & 0.4472$\\pm$0.0013 & 0.5009 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2243$\\pm$0.0047} & \\textbf{0.6247$\\pm$0.0069} & \\textbf{0.9228$\\pm$0.0033} & \\textbf{0.4062$\\pm$0.0055} & \\textbf{0.4732$\\pm$0.0043} & \\textbf{0.5302 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MO$\\rightarrow$MU} & CST & 0.2378$\\pm$0.0085 & 0.5934$\\pm$0.0024 & 0.9051$\\pm$0.0073 & 0.3986$\\pm$0.0115 & 0.4775$\\pm$0.0035 & 0.5225 \\\\\n & CD-NCF & 0.2599$\\pm$0.0200 & 0.7232$\\pm$0.0430 & 0.9480$\\pm$0.0261 & 0.4747$\\pm$0.0315 & 0.5281$\\pm$0.0281 & 0.5868 \\\\\n & EMCDR & 0.2290$\\pm$0.0290 & 0.5610$\\pm$0.0703 & 0.8430$\\pm$0.0560 & 0.3834$\\pm$0.0320 & 0.4234$\\pm$0.0410 & 0.4880 \\\\\n & EATNN & 0.2680$\\pm$0.0021 & 0.7253$\\pm$0.0035 & 0.9457$\\pm$0.0026 & \\textbf{0.4881$\\pm$0.0013} & 0.5282$\\pm$0.0014 & 0.5911 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2728$\\pm$0.0054} & \\textbf{0.7314$\\pm$0.0072} & \\textbf{0.9671$\\pm$0.002} & 0.4851$\\pm$0.0060 & \\textbf{0.5389$\\pm$0.0049} & \\textbf{0.5991 } \\\\\n \\midrule\n \\multirow{5}[3]{*}{MU$\\rightarrow$BO} & CST & 0.2524$\\pm$0.0089 & 0.6973$\\pm$0.0102 & 0.9355$\\pm$0.0098 & 0.4575$\\pm$0.0105 & 0.5143$\\pm$0.0068 & 0.5714 \\\\\n & CD-NCF & 0.2770$\\pm$0.0158 & 0.7184$\\pm$0.0332 & \\textbf{0.9472$\\pm$0.0261} & 0.4841$\\pm$0.0215 & 0.5334$\\pm$0.0836 & 0.5920 \\\\\n & EMCDR & 0.2004$\\pm$0.2972 & 0.4864$\\pm$0.5881 & 0.7612$\\pm$0.4115 & 0.3324$\\pm$0.4423 & 0.3920$\\pm$0.4082 & 0.4345 
\\\\\n & EATNN & 0.2731$\\pm$0.0015 & 0.7064$\\pm$0.0036 & 0.9277$\\pm$0.0026 & 0.4634$\\pm$0.0013 & 0.5070$\\pm$0.0017 & 0.5755 \\\\\n\\cmidrule{2-8} & \\textbf{CD-GFM} & \\textbf{0.2978$\\pm$0.0481} & \\textbf{0.7267$\\pm$0.0688} & 0.9424$\\pm$0.0295 & \\textbf{0.4872$\\pm$0.0609} & \\textbf{0.5502$\\pm$0.0523} & \\textbf{0.6009 } \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:cross}\n\\end{table*}\n\\begin{figure*}[!t]\n\\begin{center}\n\\includegraphics[width=\\linewidth]{framework}\n\\caption{The HR@K results of the general cross-domain framework on 4 (datasets) $\\times$ 10 (models) = 40 tasks.}\n\\label{general_framework}\n\\end{center}\n\\end{figure*}\n\\subsubsection{Single Domain Recommendation Task}\nWe demonstrate the effectiveness of our \\modela~on four target domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K on IQI, NF, MU and BO are presented in Table \\ref{tab:single}. \nFrom these results, we make the following observations.\n\\begin{itemize}\n\t\\item[-] Among the GNN baselines, the GCN achieves acceptable performance on multiple datasets. \n\tThe GraphSAGE-mean improves on GCN by introducing the mean aggregator to aggregate messages from each node's local neighborhood. \n\tThe GraphSAGE-pooling achieves further improvement over GraphSAGE-mean by replacing the mean aggregator with the more complex pooling aggregator, which applies the element-wise max-pooling operation on the neighbor messages transformed through a fully-connected neural network. \n\tThe GAT obtains further performance improvement by assigning different learnable weights to neighbor messages.\n\t\\item[-] NCF also obtains competitive recommendation performance, which further explains why simple collaborative filtering methods are widely used in recommender systems. 
\n\tOn most tasks, our \\modela~outperforms the NCF, which demonstrates that graph-structured data are useful for recommender systems.\n \\item[-] Our \\modela~obtains the best performance on almost all datasets, outperforming the GNN baselines on most metrics.\n Besides, although the improvement of the \\modela~over the GAT is marginal on a few metrics and datasets, the \\textbf{Average} values of these metrics are better for the \\modela~on all four datasets, which indicates that the \\modela~generalizes better than the GAT.\n\\end{itemize}\nThe essence of recommender systems is to find similarity, and local neighbor nodes often contain such similarity.\nOur \\modela~aggregates local neighbor messages via high-order feature interactions.\nTherefore, the \\modela~can achieve better performance and is more suitable for recommendation tasks.\nOverall, these improvements indicate that\nour \\modela~can effectively integrate neighbor messages to generate more effective node representations and is better suited to graph-structured data. \n\n\\subsubsection{Cross-Domain Recommendation Task}\nWe also demonstrate the effectiveness of our \\modelb~on the four pairs of cross-domain datasets.\nThe experimental results evaluated by HR@K and NDCG@K are presented in Table \\ref{tab:cross}. 
\nFrom these results, we have the following findings.\n\\begin{itemize}\n \\item[-] The collaborative-filtering-based CD-NCF still obtains competitive recommendation performance by sharing the embeddings of overlapped users or items, and it outperforms the CST on all datasets except TC$\\rightarrow$IQI.\n We conjecture that collaborative filtering methods need a lot of data to obtain good performance, while TC$\\rightarrow$IQI has less feedback data.\n This also demonstrates that collaborative filtering is indeed a simple and efficient method in recommender systems.\n\t\\item[-] EMCDR is not an end-to-end method, and its poor performance may result from the accumulation of errors at each step.\n\t\\item[-] EATNN is the state-of-the-art cross-domain recommendation baseline, and it achieves nearly the best results across multiple datasets among these baselines.\n\t\\item[-] By utilizing the graph topology, our \\modelb~improves the recommendation performance compared with the various baseline methods.\n\tThis demonstrates that the proposed cross-domain framework combined with the proposed \\modela~is more suitable for the graph-structured data in cross-domain recommendation.\n\\end{itemize}\n\n\\subsubsection{General Cross-Domain Recommendation Task}\nOur cross-domain framework is a general framework\nthat can be applied to various existing GNN models. \nHere we apply the cross-domain framework\nto GCN, GAT, GraphSAGE-mean and GraphSAGE-pooling. \nTo verify that our cross-domain framework is applicable to various GNN models,\nwe conduct experiments on 40 tasks ($4\\times10=40$: 4 pairs of datasets, 10 models). \nThe results are shown in Figure \\ref{general_framework}. 
The red lines are the baselines, which only use the target training set to train the model (also shown in Table \\ref{tab:single}), and the blue lines are the cross-domain models that apply the general cross-domain framework.\nFrom the results, we have the following findings:\n\\begin{itemize}\n \\item[-] On most tasks, our cross-domain framework effectively improves the performance of the single-domain models, which also demonstrates that the cross-domain framework can be applied to various existing GNN models.\n \\item[-] The improvement on GCN is larger than on the other four GNN models. The main reason might be that the single-domain GCN is significantly weaker than the other, improved GNN models, as shown in Table \\ref{tab:single}, so the improvement brought by the cross-domain framework to the other GNN models is relatively smaller than for GCN.\n \\item[-] The performance of GraphSAGE-mean and GraphSAGE-pooling is unsatisfying on several datasets; the reason might be that the mean and pooling aggregators are too simple, and their fewer shared parameters make them difficult to train coordinately across the two domains.\n\\end{itemize}\n\nOverall, we observe that the performance improvement of the cross-domain framework is significant and that it improves the performance of base GNN models on different datasets,\nwhich shows that the cross-domain framework is compatible with many GNN models.\n\n\\subsection{Ablation Study}\n\\begin{table*}[!t]\n \\centering\n \\caption{Results of the ablation study on the cross-domain recommendation task based on \\modelb. 
``*'' indicates that the improvement is statistically significant with a p-value $<$ 0.05 on independent samples t-tests.}\n \\resizebox{.7\\linewidth}{!}{\n \\begin{tabular}{c|ccc|ccc}\n \\toprule\n Model & HR@1 & HR@10 & HR@50 & HR@1 & HR@10 & HR@50 \\\\\n \\midrule\n & \\multicolumn{3}{c|}{TC$\\rightarrow$IQI} & \\multicolumn{3}{c}{MO$\\rightarrow$MU} \\\\\n CD-GFM-base & 0.1681 & 0.5914 & 0.9362 & 0.2445 & 0.6989 & 0.9054 \\\\\n \\modelb & \\textbf{0.2105*} & \\textbf{0.6536*} & \\textbf{0.9758*} & \\textbf{0.2728*} & \\textbf{0.7314*} & \\textbf{0.9671*} \\\\\n \\midrule\n & \\multicolumn{3}{c|}{ML$\\rightarrow$NF} & \\multicolumn{3}{c}{MU$\\rightarrow$BO} \\\\\n CD-GFM-base & 0.2178 & 0.6196 & 0.9182 & 0.2756 & 0.6963 & 0.9395 \\\\\n \\modelb & \\textbf{0.2243*} & \\textbf{0.6247*} & \\textbf{0.9228} & \\textbf{0.2978*} & \\textbf{0.7267*} & \\textbf{0.9424} \\\\\n \\bottomrule\n \\end{tabular}\n }\n \\label{tab:ablation}\n\\end{table*}\nMoreover, to understand the contribution of the shared node initialization in \\modelb,\nwe conduct ablation experiments on \\textbf{CD-GFM-base} and \\modelb~on the four pairs of datasets.\n\\textbf{CD-GFM-base} only uses the domain-specific node representations $\\h_{n_s}$ and $\\h_{n_t}$ output directly from the \\modela~and does not concatenate the initialized input in Equations (\\ref{concat1}) and (\\ref{concat2}), i.e., \n$\\n_s=\\h_{n_s},\n\\n_t=\\h_{n_t}.\n$\nThe results are presented in Table \\ref{tab:ablation}.\nWe conduct independent samples t-tests, and a p-value $<$ 0.05 indicates\nthat the improvement of \\modelb~over \\textbf{CD-GFM-base} is statistically significant.\nThe improvement demonstrates that the \\modelb~model can efficiently take advantage of the domain-shared and domain-specific node representations simultaneously and obtains the best performance on all datasets, which indicates that both representations matter for the cross-domain recommendation 
performance.\n\n\n", "Descriptive_question1": "What is the HR@1 value for CD-GFM-base in the TC→IQI dataset in table_4?", "Descriptive_question2": "Which model has the highest HR@50 value in the MO→MU dataset in table_4?", "Reasoning_question1": "Why does the 'modelb' show statistically significant improvement over CD-GFM-base across most metrics in table_4?", "Reasoning_question2": "What can be inferred about the effectiveness of shared node initialization in cross-domain recommendation tasks based on the performance differences between 'modelb' and CD-GFM-base in table_4?", "Descriptive_answer1": "0.1681", "Descriptive_answer2": "modelb", "Reasoning_answer1": "The statistically significant improvement of 'modelb' over CD-GFM-base across most metrics in the table is likely due to its ability to utilize both domain-shared and domain-specific node representations. Starting with the table data, 'modelb' consistently achieves higher HR@K values across datasets like TC→IQI (e.g., HR@1: 0.2105 vs. 0.1681) and MO→MU (e.g., HR@10: 0.7314 vs. 0.6989), with most improvements marked by '*' indicating statistical significance (p-value < 0.05). This suggests that the difference is not due to random variation. Considering the context of the ablation study, 'modelb' incorporates shared node initialization, which likely enhances its capability to capture cross-domain similarities while maintaining domain-specific nuances. 
In contrast, CD-GFM-base relies solely on domain-specific representations without this concatenation. Therefore, the integration of shared information appears to provide 'modelb' with a more robust framework for recommendation tasks, leading to better performance.", "Reasoning_answer2": "The effectiveness of shared node initialization in cross-domain recommendation tasks can be inferred by comparing the performance differences between 'modelb' and CD-GFM-base. First, observe that 'modelb' outperforms CD-GFM-base across all datasets and metrics in the table, such as HR@1 in MU→BO (0.2978 vs. 0.2756) and HR@50 in TC→IQI (0.9758 vs. 0.9362), with many results showing statistical significance (p-value < 0.05). This consistent improvement suggests that shared node initialization, as implemented in 'modelb', plays a critical role. According to the table caption and context, shared node initialization allows 'modelb' to combine domain-shared and domain-specific representations, unlike CD-GFM-base which only uses domain-specific ones. This likely enables 'modelb' to leverage common patterns across domains, enhancing recommendation accuracy. Hence, it can be inferred that shared node initialization is a key factor in improving cross-domain recommendation performance by facilitating better knowledge transfer between domains." 
}, { "paper_id": "2302.00973.json", "table_id": "table_1", "table_content": "\\begin{table}[ht] \n\t\\centering\n\t\\scriptsize \n\t\\caption{Description of the DraWritePD dataset}\n\t\\setlength{\\tabcolsep}{1.5mm}{\n\t\\begin{tabular}{| c || c| c | c| c | c | c|}\n\t\t\\hline \n\t\tParameter & azimuth &\taltitude & pressure & timestamp & x-Axis & y-Axis\\\\\n\t\t\\hline\n\t\tNotation & $a$ & $l$ & $p$ & $t$ & $x$ & $y$\\\\\n\t\t\\hline\n\t\tQuantitation & rad & rad & psi & sec & mm & mm\\\\\n\t\t\\hline\n\t\\end{tabular}}\n\t\\label{tab:dataset description}\n\\end{table}", "caption": "Description of the DraWritePD dataset", "label": "tab:dataset description", "section_info": "2 Material\n\\section{Material} \\label{sec:dataset}\n\nIn this research, a single data set is considered. The data set, hereafter referred to as DraWritePD, was acquired from $49$ participants, with a mean age of $74.1$ years and a similar gender distribution. Within the group of patients with PD, the age deviation was approximately $3.35$ years, while within the group of subjects with HC, the age deviation was $4.55$ years, making both groups very similar. \n\nData acquisition was performed with an iPad Pro $9.7$ inch ($2016$) equipped with an Apple Pencil. As shown in Fig.\\ref{fig:drawing curves}, participants were asked to draw by mimicking the reference pattern. During this process, the iPad Pro scanned the Apple Pencil signal at $240$ points per second. 
As shown in Table \\ref{tab:dataset description}, for each scan, the device captures six time sequence parameters: azimuth ($a$); altitude ($l$); pressure ($p$); timestamp ($t$); x-Axis ($x$); y-Axis ($y$).\n\n\\renewcommand{\\arraystretch}{1.5}\n\\begin{table}[ht] \n\t\\centering\n\t\\scriptsize \n\t\\caption{Description of the DraWritePD dataset}\n\t\\setlength{\\tabcolsep}{1.5mm}{\n\t\\begin{tabular}{| c || c| c | c| c | c | c|}\n\t\t\\hline \n\t\tParameter & azimuth &\taltitude & pressure & timestamp & x-Axis & y-Axis\\\\\n\t\t\\hline\n\t\tNotation & $a$ & $l$ & $p$ & $t$ & $x$ & $y$\\\\\n\t\t\\hline\n\t\tQuantitation & rad & rad & psi & sec & mm & mm\\\\\n\t\t\\hline\n\t\\end{tabular}}\n\t\\label{tab:dataset description}\n\\end{table}\n\nOriginally, the dLAST consisted of $3$ tests: \\textit{$\\Pi$}\\textit{$\\Lambda$}, \\textit{$\\Pi$} and \\textit{sin wave}, with $3$ exercises (\\textit{continue}, \\textit{copy} and \\textit{trace}) applied to each test. The general task was to complete a set of tests consisting of $9$ different drawing exercises. To optimise the testing procedure within the framework of the present contribution, only the \\textit{$\\Pi$} test was considered. Specifically, the graphical plots of the \\textit{$\\Pi$} test are shown in Fig. \\ref{fig:drawing curves}.\n\nThe data acquisition process was carried out under strict privacy law guidance. 
The Research Ethics Committee of the University of Tartu approved the study (No.$1275T-9$).\n\n\n", "Descriptive_question1": "What is the notation for pressure in table_1?", "Descriptive_question2": "What unit is used for timestamp in table_1?", "Reasoning_question1": "Why might the DraWritePD dataset in table_1 include both azimuth and altitude parameters for each scan?", "Reasoning_question2": "How could the difference in units of measurement for parameters in table_1 impact the analysis of the drawing data?", "Descriptive_answer1": "p", "Descriptive_answer2": "sec", "Reasoning_answer1": "The inclusion of both azimuth and altitude parameters in the DraWritePD dataset likely serves to capture the orientation of the Apple Pencil during the drawing process. Azimuth refers to the horizontal angle or direction of the pencil, while altitude refers to the vertical angle or tilt. By measuring both, the dataset can provide a comprehensive three-dimensional representation of the pencil's position at each scan. This dual measurement is crucial for understanding how participants hold and maneuver the pencil, which could reveal differences in motor control or technique, especially between groups like PD patients and healthy controls. Therefore, these parameters together help in analyzing the spatial dynamics of drawing behavior beyond just the x and y coordinates on the screen.", "Reasoning_answer2": "The difference in units of measurement for parameters in table_1, such as radians for azimuth and altitude, psi for pressure, seconds for timestamp, and millimeters for x and y axes, could significantly impact the analysis of the drawing data. First, these varied units reflect different physical properties, which means they cannot be directly compared without normalization or conversion to a common scale. For instance, comparing pencil tilt (in radians) with tip pressure (in psi) requires a transformation to understand their relative contributions to drawing behavior. 
Second, the choice of units affects the granularity and precision of data interpretation—millimeters provide fine spatial detail for position, while seconds for timestamp allow for temporal analysis of movement speed. If not accounted for, these differences could skew statistical analyses or machine learning models by giving undue weight to parameters with larger numerical ranges. Thus, analysts must standardize or appropriately weight these parameters to ensure a balanced and meaningful interpretation of the data." }, { "paper_id": "2302.00973.json", "table_id": "table_2", "table_content": "\\begin{table}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{The information of training set and testing set.}\n \\setlength{\\tabcolsep}{4mm}{\n\t\\begin{tabular}{|c||c|c|c|c|}\n\t\t\\hline \n\t \\multirow{2}*{} & \\multicolumn{2}{c|}{Training set} & \\multicolumn{2}{c|}{Testing set} \\\\\n\t \\cline{2-5}\n\t\t & HC & PD & HC & PD \\\\\n\t\t\\hline\n\t\t\\hline\n\t\tParticipant & 25 & 16 & 4 & 4 \\\\\n\t\t\\hline\n\t\tSequence set (S) & 80 & 51 & 15 & 11 \\\\\n\t\t\\hline\n\t\tPatch set (P) & 16166 & 16836 & 3670 & 3319 \\\\\n\t\t\\hline\n\t\\end{tabular}}\n\t\\label{tab:dataste}\n\\end{table}", "caption": "The information of training set and testing set.", "label": "tab:dataste", "section_info": "4 Experimental Results\n\\section{Experimental Results}\\label{sec:results}\n\nIn this section, we evaluate and analyse the performance of the proposed LSTM-CNN model using the DraWritePD data set. The model runs on a desktop PC with an Intel(R) Core(TM) $3.60$ GHz processor ($8$ CPUs), $32$ GB RAM, and an NVIDIA RTX $3070$Ti GPU with $8$ GB of memory. \n\n\\subsection{Dataset}\n\nThe DraWritePD set contains $157$ pieces of sequence data from $29$ subjects with HC and $20$ patients with PD. The raw sequence data needs to be cropped into patches before being fed to the LSTM-CNN model. The class imbalance problem may occur during the segmentation due to the different lengths of the sequence data. 
As shown in Fig. \\ref{fig:segmentation}, a nonuniform sampling strategy with varying stride size is adopted so that the number of generated patches in each class is the same. The statistics of the training and testing datasets are listed in Table \\ref{tab:dataste}.\n\n\\renewcommand{\\arraystretch}{1.5}\n\\begin{table}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{The information of training set and testing set.}\n \\setlength{\\tabcolsep}{4mm}{\n\t\\begin{tabular}{|c||c|c|c|c|}\n\t\t\\hline \n\t \\multirow{2}*{} & \\multicolumn{2}{c|}{Training set} & \\multicolumn{2}{c|}{Testing set} \\\\\n\t \\cline{2-5}\n\t\t & HC & PD & HC & PD \\\\\n\t\t\\hline\n\t\t\\hline\n\t\tParticipant & 25 & 16 & 4 & 4 \\\\\n\t\t\\hline\n\t\tSequence set (S) & 80 & 51 & 15 & 11 \\\\\n\t\t\\hline\n\t\tPatch set (P) & 16166 & 16836 & 3670 & 3319 \\\\\n\t\t\\hline\n\t\\end{tabular}}\n\t\\label{tab:dataste}\n\\end{table}\n\nDuring the training phase, the patch data set was randomly divided in an $8:2$ ratio into a training patch data set and a validation patch data set. During the testing phase, the proposed model was first evaluated on the testing patch data set and then applied to the testing sequence data set, where the predicted result of each raw sequence was determined by a majority vote over its patch-level predictions. To clarify, we denote the testing patch data set as $P$ and the raw sequence data set as $S$, and evaluate the performance of the proposed model independently in the two cases. \n\n\n\\subsection{Experimental Setup}\nIn order to fully exploit the performance of the proposed LSTM-CNN model, we use a cross-validation strategy to optimally choose the parameters. The Adam \\cite{kingma2014adam} optimiser is used to train the model, and the initial learning rate is set to $0.001$. Furthermore, the cross-entropy loss function is used for model fitting and the batch size is set to $64$. 
Training of the proposed model is completed in $200$ epochs, with the loss curves shown in Fig. \\ref{fig:train_val_curve}. We use the following metrics for evaluation: accuracy, precision, recall, specificity, $F_1$ score, and the Matthews correlation coefficient (MCC), where the latter has been adopted by many existing methods to describe the different aspects of the performance of a classifier \\cite{baldi2000assessing}. Once the training phase is completed, the model with the best fitness value is chosen for testing. Moreover, the length of the segmented patches and the choice of feature selection are also discussed to interpret their roles in determining the model performance. As shown in Fig.~\\ref{fig:parameter}, the model achieves the best classification result when the window size is $128$ and the ($v_x,v_y$) velocity characteristics are used. The model performance is eventually tested on both the original sequences dataset ($S$) and the segmented patches dataset ($P$).\n\n\n\\subsection{Quantitative Evaluation and Comparison}\n\n\\renewcommand{\\arraystretch}{1.5}\n\\begin{table*}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{Quantitative comparison of different classification methods.}\n \\setlength{\\tabcolsep}{3mm}{\n\t\\begin{tabular}{ | c || c || c | c | c | c | c | c |}\n \\hline\n \\multirow{2}*{Model} & \\multirow{2}*{Inference time (s)} & \\multicolumn{6}{c|}{Metric} \\\\\n\t\\cline{3-8}\n\t\t~ & ~ & Accuracy (P/S) & Precision (P/S) & Recall (P/S) & Specificity (P/S) & $F_1$ score (P/S) & MCC (P/S) \\\\\n\t\t\\hline\n \\hline\n LR & 0.034 & 0.8061 / 0.9231 & 0.8559 / 0.8462 & 0.8565 / \\textbf{1.00} & 0.7018 / 0.8667 & 0.8562 / 0.9167 & 0.5585 / 0.8563 \\\\\n \\hline\n SVM & 6.060 & 0.8371 / 0.8846 & 0.8657 / 0.7857 & 0.8977 / \\textbf{1.00} & 0.7119 / 0.8000 & 0.8814 / 0.8800 & 0.6229 / 0.7928 \\\\\n \\hline\n RF & 9.526 & 0.8339 / 0.8462 & 0.9015 / 0.8889 & 0.8412 / 0.7273 & 0.8088 / 0.9333 & 0.8729 / 0.8000 & 0.6368 / 0.6860 \\\\\n \\hline\n LGB & 0.161 & 0.7889 / 0.8077 & 
0.9183 / 0.8750 & 0.7538 / 0.6364 & 0.8613 / 0.9333 & 0.8280 / 0.7368 & 0.5800 / 0.6098 \\\\\n \\hline\n \n MLP & 4.095 & 0.8274 / 0.8846 & 0.8598 / 0.8333 & 0.8921 / 0.9091 & 0.6891 / 0.8667 & 0.8756 / 0.8696 & 0.5950 / 0.7688 \\\\\n \\hline\n AlexNet & 4.143 & 0.7872 / 0.8846 & 0.9093 / \\textbf{1.00} & 0.7606 / 0.7273 & 0.8425 / \\textbf{1.00} & 0.8284 / 0.8421 & 0.5662 / 0.7785 \\\\\n \\hline\n \\hline\n LSTM-CNN (Ours) & 4.212 & 0.7994 / \\textbf{0.9615} & 0.8883 / \\textbf{1.00} & 0.8039 / 0.9091 & 0.7900 / \\textbf{1.00} & 0.8440 / \\textbf{0.9524} & 0.5720 / \\textbf{0.9232} \\\\\n \\hline \n\t\\end{tabular}}\n\t\\label{tab:model structure}\n\\end{table*}\n\nWe provide a quantitative comparison to demonstrate the effectiveness and advantages of the proposed LSTM-CNN model. First, we compare it with some traditional machine learning (ML) classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and LightGBM (LGB) \\cite{ke2017lightgbm}. For each classifier, we use $10$-fold cross-validation and a grid search to optimise the parameters and to ensure the robustness of the results. The training and validation of all these ML classifiers are run using Python's scikit-learn library \\cite{pedregosa2011scikit}. As shown in Table \\ref{tab:model structure}, our model has obvious advantages in most classification metrics, with accuracy increased by $3.8$\\%, the $F_1$ score by $3.5$\\%, and the MCC by $6.6$\\%. Regarding efficiency, our method is $1.8$ seconds and $5.3$ seconds faster than the SVM and RF models, respectively, although it is slower than the optimised LR and LGB models.\n\nAdditionally, we compare the performance of different neural network models on this task. The multilayer perceptron (MLP) is the basic model, and its structure consists of two fully connected layers. 
AlexNet adopts a convolutional neural network structure similar to AlexNet\\cite{krizhevsky2012imagenet}, but to adapt to the size of the input data, the internal parameters are modified. In our work, to improve the efficiency of the LSTM-CNN, in addition to the $1$D convolution operation in the CNN block, an LSTM block is added, which contains a concatenation operation. Additionally, as shown in Fig.\\ref{fig:train_val_curve}, the average of multiple experimental results (N=$10$) is used as the model result. Finally, let us point out that our methods achieve optimal results in all metrics with the accuracy rate being $96.15\\%$, the $F_1$ score being $95.24\\%$, and the MCC being $92.32\\%$. Specifically, in the $S$ testing set, only one sequence data from the PD category is misclassified, and the remaining 25 sequence data are correctly classified.\n\nIn summary, the model proposed in this article could not only achieve high recognition accuracy, but also significantly simplify the structure of the model and improve the efficiency of deep learning models in the diagnosis of PD.\n\n\\begin{figure}[t]\n \\centering\n \\subfigure[]{\n \\includegraphics[width=0.22\\textwidth]{Figures/train-acc-loss.png}\n }\n \\subfigure[]{\n \\includegraphics[width=0.22\\textwidth]{Figures/val-acc-loss.png}\n }\n \\caption{The accuracy and loss curves of the LSTM-CNN model on the training set(a) and the validation set(b), respectively, where the solid curve represents the average of multiple experiments, and the shaded part represents the range of the results of multiple experiments (N=$10$).} \n\t\\label{fig:train_val_curve}\n\\end{figure}\n\n4.1 Dataset\n\\subsection{Dataset}\n\nThe DraWritePD set contains $157$ pieces of sequence data from $29$ subjects with HC and $20$ patients with PD. The raw sequence data needs to be cropped into patches before being fed to the LSTM-CNN model. 
The class imbalance problem may occur during the segmentation due to the different lengths of the sequence data. As shown in Fig. \\ref{fig:segmentation}, a nonuniform sampling strategy with a varying stride size is adopted so that the number of generated patches in each class is the same. The statistics of the training and testing datasets are listed in Table \\ref{tab:dataste}.\n\n\\renewcommand{\\arraystretch}{1.5}\n\\begin{table}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{The information of the training and testing sets.}\n \\setlength{\\tabcolsep}{4mm}{\n\t\\begin{tabular}{|c||c|c|c|c|}\n\t\t\\hline \n\t \\multirow{2}*{} & \\multicolumn{2}{c|}{Training set} & \\multicolumn{2}{c|}{Testing set} \\\\\n\t \\cline{2-5}\n\t\t & HC & PD & HC & PD \\\\\n\t\t\\hline\n\t\t\\hline\n\t\tParticipant & 25 & 16 & 4 & 4 \\\\\n\t\t\\hline\n\t\tSequence set (S) & 80 & 51 & 15 & 11 \\\\\n\t\t\\hline\n\t\tPatch set (P) & 16166 & 16836 & 3670 & 3319 \\\\\n\t\t\\hline\n\t\\end{tabular}}\n\t\\label{tab:dataste}\n\\end{table}\n\nDuring the training phase, the patch data set was randomly divided in an $8:2$ ratio into a training patch data set and a validation patch data set. During the testing phase, the proposed model was first evaluated on the testing patch data set and then applied to the testing sequence data set, where the predicted label of each raw sequence was determined by a majority vote over the predictions of its patches. To clarify, we denote the testing patch data set as $P$ and the testing raw sequence data set as $S$, and independently evaluate the performance of the proposed model on the two cases. 
\n\n\n", "Descriptive_question1": "How many participants are in the HC category for the Training set in table_2?", "Descriptive_question2": "What is the number of Patch set (P) for PD in the Testing set in table_2?", "Reasoning_question1": "Why might there be a significant difference in the number of Patch set (P) between HC and PD in the Training set of table_2 despite fewer participants in PD?", "Reasoning_question2": "What could be inferred about the data distribution strategy from the equal number of participants in HC and PD for the Testing set in table_2?", "Descriptive_answer1": "25", "Descriptive_answer2": "3319", "Reasoning_answer1": "The significant difference in the number of Patch set (P) between HC (16166) and PD (16836) in the Training set, despite fewer PD participants (16 compared to 25 HC), can be attributed to the data processing strategy. The context indicates that raw sequence data is cropped into patches, and a nonuniform sampling strategy with varying stride size is used to balance the number of patches per class. This suggests that for PD participants, either the sequence data was longer or the stride size was smaller, resulting in more patches being generated to address class imbalance. Therefore, even with fewer participants, the patch count for PD is slightly higher than for HC.", "Reasoning_answer2": "The equal number of participants in HC and PD (4 each) in the Testing set suggests a deliberate strategy to maintain balance during evaluation. This balance likely aims to ensure fair and unbiased performance assessment of the model by having an equal representation of both classes. By doing so, the testing phase avoids skewing results due to class imbalance, allowing for a more accurate comparison of the model's ability to distinguish between HC and PD. This intentional distribution reflects a focus on equitable evaluation as described in the experimental setup." 
}, { "paper_id": "2302.00973.json", "table_id": "table_3", "table_content": "\\begin{table*}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{Quantitative comparison of different classification methods.}\n \\setlength{\\tabcolsep}{3mm}{\n\t\\begin{tabular}{ | c || c || c | c | c | c | c | c |}\n \\hline\n \\multirow{2}*{Model} & \\multirow{2}*{Inference time (s)} & \\multicolumn{6}{c|}{Metric} \\\\\n\t\\cline{3-8}\n\t\t~ & ~ & Accuracy (P/S) & Precision (P/S) & Recall (P/S) & Specificity (P/S) & $F_1$ score (P/S) & MCC (P/S) \\\\\n\t\t\\hline\n \\hline\n LR & 0.034 & 0.8061 / 0.9231 & 0.8559 / 0.8462 & 0.8565 / \\textbf{1.00} & 0.7018 / 0.8667 & 0.8562 / 0.9167 & 0.5585 / 0.8563 \\\\\n \\hline\n SVM & 6.060 & 0.8371 / 0.8846 & 0.8657 / 0.7857 & 0.8977 / \\textbf{1.00} & 0.7119 / 0.8000 & 0.8814 / 0.8800 & 0.6229 / 0.7928 \\\\\n \\hline\n RF & 9.526 & 0.8339 / 0.8462 & 0.9015 / 0.8889 & 0.8412 / 0.7273 & 0.8088 / 0.9333 & 0.8729 / 0.8000 & 0.6368 / 0.6860 \\\\\n \\hline\n LGB & 0.161 & 0.7889 / 0.8077 & 0.9183 / 0.8750 & 0.7538 / 0.6364 & 0.8613 / 0.9333 & 0.8280 / 0.7368 & 0.5800 / 0.6098 \\\\\n \\hline\n \n MLP & 4.095 & 0.8274 / 0.8846 & 0.8598 / 0.8333 & 0.8921 / 0.9091 & 0.6891 / 0.8667 & 0.8756 / 0.8696 & 0.5950 / 0.7688 \\\\\n \\hline\n AlexNet & 4.143 & 0.7872 / 0.8846 & 0.9093 / \\textbf{1.00} & 0.7606 / 0.7273 & 0.8425 / \\textbf{1.00} & 0.8284 / 0.8421 & 0.5662 / 0.7785 \\\\\n \\hline\n \\hline\n LSTM-CNN (Ours) & 4.212 & 0.7994 / \\textbf{0.9615} & 0.8883 / \\textbf{1.00} & 0.8039 / 0.9091 & 0.7900 / \\textbf{1.00} & 0.8440 / \\textbf{0.9524} & 0.5720 / \\textbf{0.9232} \\\\\n \\hline \n\t\\end{tabular}}\n\t\\label{tab:model structure}\n\\end{table*}", "caption": "Quantitative comparison of different classification methods.", "label": "tab:model structure", "section_info": "4 Experimental Results\n\\section{Experimental Results}\\label{sec:results}\n\nIn this section, we evaluate and analyse the performance of the proposed LSTM-CNN model using the 
DraWritePD data set. The model runs on a desktop PC with an Intel(R) Core(TM) $3.60$ GHz CPU ($8$ CPUs), $32$ GB RAM, and an NVIDIA RTX $3070$Ti GPU with $8$ GB of memory. \n\n\\subsection{Dataset}\n\nThe DraWritePD set contains $157$ sequences from $29$ HC subjects and $20$ patients with PD. The raw sequence data need to be cropped into patches before being fed to the LSTM-CNN model. The class imbalance problem may occur during the segmentation due to the different lengths of the sequence data. As shown in Fig. \\ref{fig:segmentation}, a nonuniform sampling strategy with a varying stride size is adopted so that the number of generated patches in each class is the same. The statistics of the training and testing datasets are listed in Table \\ref{tab:dataste}.\n\n\\renewcommand{\\arraystretch}{1.5}\n\\begin{table}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{The information of the training and testing sets.}\n \\setlength{\\tabcolsep}{4mm}{\n\t\\begin{tabular}{|c||c|c|c|c|}\n\t\t\\hline \n\t \\multirow{2}*{} & \\multicolumn{2}{c|}{Training set} & \\multicolumn{2}{c|}{Testing set} \\\\\n\t \\cline{2-5}\n\t\t & HC & PD & HC & PD \\\\\n\t\t\\hline\n\t\t\\hline\n\t\tParticipant & 25 & 16 & 4 & 4 \\\\\n\t\t\\hline\n\t\tSequence set (S) & 80 & 51 & 15 & 11 \\\\\n\t\t\\hline\n\t\tPatch set (P) & 16166 & 16836 & 3670 & 3319 \\\\\n\t\t\\hline\n\t\\end{tabular}}\n\t\\label{tab:dataste}\n\\end{table}\n\nDuring the training phase, the patch data set was randomly divided in an $8:2$ ratio into a training patch data set and a validation patch data set. During the testing phase, the proposed model was first evaluated on the testing patch data set and then applied to the testing sequence data set, where the predicted label of each raw sequence was determined by a majority vote over the predictions of its patches. 
To clarify, we denote the testing patch data set as $P$ and the testing raw sequence data set as $S$, and independently evaluate the performance of the proposed model on the two cases. \n\n\n\\subsection{Experimental Setup}\nIn order to fully exploit the performance of the proposed LSTM-CNN model, we use a cross-validation strategy to optimally choose the parameters. The Adam \\cite{kingma2014adam} optimiser is used to train the model, and the initial learning rate is set to $0.001$. Furthermore, the cross-entropy loss function is used for model fitting and the batch size is set to $64$. The proposed model is trained for $200$ epochs, with the loss curve shown in Fig. \\ref{fig:train_val_curve}. We use the following metrics for evaluation: accuracy, precision, recall, specificity, $F_1$ score, and the Matthews correlation coefficient (MCC), the last of which has been adopted by many existing methods to describe different aspects of the performance of a classifier \\cite{baldi2000assessing}. Once the training phase is completed, the model with the best fitness value is chosen for testing. Moreover, the length of the segmented patches and the choice of features are also discussed to interpret their roles in determining the model performance. As shown in Fig.~\\ref{fig:parameter}, the model achieves the best classification result when the window size is $128$ and the ($v_x,v_y$) velocity features are used. 
The model performance is eventually tested on both the original sequences dataset ($S$) and the segmented patches dataset ($P$).\n\n\n\\subsection{Quantitative Evaluation and Comparison}\n\n\\renewcommand{\\arraystretch}{1.5}\n\\begin{table*}[ht]\n\t\\centering\n\t\\scriptsize \n\t\\caption{Quantitative comparison of different classification methods.}\n \\setlength{\\tabcolsep}{3mm}{\n\t\\begin{tabular}{ | c || c || c | c | c | c | c | c |}\n \\hline\n \\multirow{2}*{Model} & \\multirow{2}*{Inference time (s)} & \\multicolumn{6}{c|}{Metric} \\\\\n\t\\cline{3-8}\n\t\t~ & ~ & Accuracy (P/S) & Precision (P/S) & Recall (P/S) & Specificity (P/S) & $F_1$ score (P/S) & MCC (P/S) \\\\\n\t\t\\hline\n \\hline\n LR & 0.034 & 0.8061 / 0.9231 & 0.8559 / 0.8462 & 0.8565 / \\textbf{1.00} & 0.7018 / 0.8667 & 0.8562 / 0.9167 & 0.5585 / 0.8563 \\\\\n \\hline\n SVM & 6.060 & 0.8371 / 0.8846 & 0.8657 / 0.7857 & 0.8977 / \\textbf{1.00} & 0.7119 / 0.8000 & 0.8814 / 0.8800 & 0.6229 / 0.7928 \\\\\n \\hline\n RF & 9.526 & 0.8339 / 0.8462 & 0.9015 / 0.8889 & 0.8412 / 0.7273 & 0.8088 / 0.9333 & 0.8729 / 0.8000 & 0.6368 / 0.6860 \\\\\n \\hline\n LGB & 0.161 & 0.7889 / 0.8077 & 0.9183 / 0.8750 & 0.7538 / 0.6364 & 0.8613 / 0.9333 & 0.8280 / 0.7368 & 0.5800 / 0.6098 \\\\\n \\hline\n \n MLP & 4.095 & 0.8274 / 0.8846 & 0.8598 / 0.8333 & 0.8921 / 0.9091 & 0.6891 / 0.8667 & 0.8756 / 0.8696 & 0.5950 / 0.7688 \\\\\n \\hline\n AlexNet & 4.143 & 0.7872 / 0.8846 & 0.9093 / \\textbf{1.00} & 0.7606 / 0.7273 & 0.8425 / \\textbf{1.00} & 0.8284 / 0.8421 & 0.5662 / 0.7785 \\\\\n \\hline\n \\hline\n LSTM-CNN (Ours) & 4.212 & 0.7994 / \\textbf{0.9615} & 0.8883 / \\textbf{1.00} & 0.8039 / 0.9091 & 0.7900 / \\textbf{1.00} & 0.8440 / \\textbf{0.9524} & 0.5720 / \\textbf{0.9232} \\\\\n \\hline \n\t\\end{tabular}}\n\t\\label{tab:model structure}\n\\end{table*}\n\nWe provide a quantitative comparison to demonstrate the effectiveness and advantages of the proposed LSTM-CNN model. 
First, we compare it with several traditional machine learning (ML) classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and LightGBM (LGB) \\cite{ke2017lightgbm}. For each classifier, we use $10$-fold cross-validation and a grid search to optimise the parameters and ensure the robustness of the results. The training and validation of all these ML classifiers are carried out with Python's scikit-learn library \\cite{pedregosa2011scikit}. As shown in Table \\ref{tab:model structure}, our model shows clear advantages on most classification metrics, improving accuracy by $3.8$\\%, the $F_1$ score by $3.5$\\%, and the MCC by $6.6$\\%. Regarding efficiency, our method is $1.8$ seconds faster than the SVM model and $5.3$ seconds faster than the RF model, although it is slower than the optimised LR and LGB models.\n\nAdditionally, we compare the performance of different neural network models on this task. The multilayer perceptron (MLP) is the basic model; its structure consists of two fully connected layers. The AlexNet baseline adopts a convolutional neural network structure similar to the original AlexNet\\cite{krizhevsky2012imagenet}, with its internal parameters modified to fit the size of the input data. In our work, to improve the efficiency of the LSTM-CNN, an LSTM block containing a concatenation operation is added alongside the $1$D convolution operation in the CNN block. Additionally, as shown in Fig.\\ref{fig:train_val_curve}, the average of multiple experimental runs (N=$10$) is used as the model result. Finally, our method achieves the best results on all metrics, with an accuracy of $96.15\\%$, an $F_1$ score of $95.24\\%$, and an MCC of $92.32\\%$. 
Specifically, in the $S$ testing set, only one sequence from the PD category is misclassified, and the remaining $25$ sequences are correctly classified.\n\nIn summary, the model proposed in this article not only achieves high recognition accuracy but also significantly simplifies the model structure and improves the efficiency of deep learning models in the diagnosis of PD.\n\n\\begin{figure}[t]\n \\centering\n \\subfigure[]{\n \\includegraphics[width=0.22\\textwidth]{Figures/train-acc-loss.png}\n }\n \\subfigure[]{\n \\includegraphics[width=0.22\\textwidth]{Figures/val-acc-loss.png}\n }\n \\caption{The accuracy and loss curves of the LSTM-CNN model on the training set (a) and the validation set (b), respectively, where the solid curve represents the average of multiple experiments, and the shaded part represents the range of the results of multiple experiments (N=$10$).} \n\t\\label{fig:train_val_curve}\n\\end{figure}\n\n", "Descriptive_question1": "What is the accuracy of the LSTM-CNN model on the sequence dataset in table_3?", "Descriptive_question2": "Which model has the shortest inference time in table_3?", "Reasoning_question1": "How does the performance of the LSTM-CNN model compare to traditional machine learning models like SVM and RF in terms of 
accuracy on the sequence dataset in table_3?", "Reasoning_question2": "What can be inferred about the effectiveness of the LSTM-CNN model in diagnosing PD based on the metrics provided in table_3?", "Descriptive_answer1": "0.9615", "Descriptive_answer2": "LR", "Reasoning_answer1": "To compare the performance of the LSTM-CNN model to traditional machine learning models like SVM and RF in terms of accuracy on the sequence dataset (S) in table_3, I first look at the accuracy values provided. The LSTM-CNN model has an accuracy of 0.9615 on the sequence dataset, which is the highest among all models. In comparison, the SVM model has an accuracy of 0.8846, and the RF model has an accuracy of 0.8462. This indicates that LSTM-CNN outperforms SVM by 0.0769 (or 7.69 percentage points) and RF by 0.1153 (or 11.53 percentage points). The superior accuracy of LSTM-CNN suggests that it is better at correctly classifying the sequence data compared to these traditional models, likely due to its ability to capture temporal and spatial features through the combination of LSTM and CNN architectures.", "Reasoning_answer2": "To infer the effectiveness of the LSTM-CNN model in diagnosing Parkinson's Disease (PD) based on the metrics in table_3, I examine its performance across multiple metrics on the sequence dataset (S), as this reflects its application to real-world data. The LSTM-CNN model achieves the highest accuracy of 0.9615, meaning it correctly classifies 96.15% of the sequences. It also scores a perfect 1.00 in precision and specificity, indicating no false positives and perfect identification of negative cases (non-PD). The recall of 0.9091 shows it identifies most positive cases (PD) correctly, though not perfectly. The F1 score of 0.9524, which balances precision and recall, is the highest among all models, demonstrating robust overall performance. 
Lastly, the MCC of 0.9232, also the highest, confirms strong correlation between predicted and actual classifications, even in imbalanced datasets. Collectively, these metrics suggest that the LSTM-CNN model is highly effective for PD diagnosis, as it excels in distinguishing between PD and non-PD cases with minimal errors, particularly in avoiding false positives, which is critical in medical diagnostics." }, { "paper_id": "2107.07634.json", "table_id": "table_1", "table_content": "\\begin{table*}[t]\n \\caption{False reject ratios for structured evaluation set [$\\%$] at an operating point of 1 FA/100 hrs, and for take home evaluation set at an operating point of 100 FAs.}\n\n \\label{tab:FRRs}\n \\centering\n\n\\begin{tabular}{cccccc}\n \\toprule\n & MTL & Branch & Structured evaluation set& Take home evaluation set & Avg.\\\\\n \\midrule\n Phoneme classifier & &Phonetic& 20.26 & 27.72 & 23.99\\\\ \\midrule\n Conventional MTL \\cite{9053577}& \\checkmark &\\begin{tabular}[c]{@{}c@{}}Phonetic\\\\ Phrase\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}5.00 \\\\\\textbf{3.49}\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}14.11 \\\\10.11\\end{tabular} &\\begin{tabular}[c]{@{}c@{}}9.56 \\\\6.80\\end{tabular}\\\\ \\midrule\n BLSTM decoder & \\checkmark &\\begin{tabular}[c]{@{}c@{}}Phonetic\\\\ Phrase\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}5.02 \\\\4.76\\end{tabular} & \\begin{tabular}[c]{@{}c@{}}12.36 \\\\8.89\\end{tabular} &\\begin{tabular}[c]{@{}c@{}} 8.69\\\\6.83\\end{tabular}\\\\ \\midrule\n Cross attention decoder & \\checkmark &\\begin{tabular}[c]{@{}c@{}}Phonetic\\\\ Phrase\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}4.64 \\\\3.82\\end{tabular} & \\begin{tabular}[c]{@{}c@{}}13.21 \\\\\\textbf{8.17}\\end{tabular} &\\begin{tabular}[c]{@{}c@{}} 8.93\\\\\\textbf{6.00}\\end{tabular}\\\\\n \\bottomrule\n\\end{tabular}\n\n\\end{table*}", "caption": "False reject ratios for structured evaluation set [$\\%$] at an operating point of 1 FA/100 hrs, and for take home evaluation 
set at an operating point of 100 FAs.", "label": "tab:FRRs", "section_info": "4 Experimental evaluation\n\\section{Experimental evaluation}\n\\label{sec:exp}\n\n\n\nWe evaluated the effectiveness of the proposed approach on a KWS task, and compared its performance with a self-attention phonetic decoder with/without the conventional multi-task learning and a BLSTM decoder with the conventional multi-task learning. Although we used our internal datasets in experiments, our proposed approach is easily applicable to any public ASR and KWS datasets.\n\n\\subsection{Data}\nOur ASR training data consisted of approximately 3 million utterances of transcribed near-field speech signals recorded with devices such as smart phones. Then data augmentation was performed by convolving room impulse responses (RIRs) with speech signals. The RIRs were recorded in various rooms with smart speakers with six microphones. Additionally, echo residuals were added to the augmented data. As a result, we obtained approximately 9 million augmented utterances consisting of the near-field signals, simulated far-field signals, and simulated far-field signals with the echo residuals. The KWS data consisted of approximately $65k$ false triggers and $300k$ true triggers spoken by anonymous speakers, which were triggered by a reference voice triggering system. The audio signals were recorded with smart speakers. The KWS data were combined with the augmented ASR dataset, and utterances were randomly sampled from the combined dataset for mini-batch training.\n\nFor evaluation, we used two different datasets. The first is a \\emph{structured} dataset, where positive samples containing a keyword phrase were internally collected in controlled conditions from 100 participants, approximately evenly divided between males and females. Each subject spoke the keyword phrase followed by prompted voice commands to a smart speaker. 
The recordings were made in four different acoustic conditions: quiet, external noise from TV or kitchen appliances, music playing from the device at medium volume, and music playing at loud volume. 13000 such positive utterances were collected. For negative data, we used a set of 2000 hours of audio recordings which did not contain the keyword phrase by playing podcasts, audiobooks, TV play-back, etc. These negative audio samples were also recorded by the same smart speaker. The negative audio data allowed us to compute false accept (FA) per hour. The second dataset, called take home evaluation set, is a more realistic and challenging dataset collected at home by employees. Each of the 80 participants volunteered to use the smart speaker daily for two weeks. By applying extra audio logging on device and personal review by the user, audio below the usual on-device trigger threshold was collected. This setup allowed us to collect challenging negative data, which was similar to the keyword phrase. We collected 7896 positive and 20919 negative audio samples\\footnote{The amount of the dataset has been increased by additional participants compared to the evaluation dataset used in \\cite{adya2020hybrid} and \\cite{9053577}, so the result reported in this paper is not directly comparable.} for evaluation. This dataset allowed us to compute the absolute number of false accepts (FAs).\n\n\\subsection{Two-stage approach for efficient KWS}\n\\begin{figure}[t]\n \\centering\n\n\n \\includegraphics[scale=0.45]{TwoPassKWS_v2.png}\n \\caption{A two-stage approach for efficient KWS \\cite{gruenstein2017cascade,sigtia2018}. A 1st pass light-weight KWS system is always-on and takes streaming audio signals, where a DNN-HMM system is used to obtain a KWS score and an alignment for an audio segment containing a keyword. 
Once the 1st pass KWS score exceeds a threshold, the audio segment is passed to a bigger KWS model (so-called checker) and a KWS score is re-computed.}\n\n \\label{fig:TwoPass}\n\\end{figure}\n\nWe used a two-stage approach for efficient KWS \\cite{gruenstein2017cascade,sigtia2018} as shown in figure \\ref{fig:TwoPass}. A light-weight model was always-on and first detected candidate audio segments from streaming audio inputs. Once the segments were detected, a bigger model (so-called checker) was turned on and checked if the segments actually contained the keyword phrase or not. This two-stage approach greatly reduces compute cost and battery consumption on-device. For the 1st pass model, we used five layers of fully-connected neural networks with 64 hidden units as the acoustic model. We used 20 target classes for the acoustic model; 18 phoneme classes for the keyword, one for silence and one for other speech. We computed a 13-dimensional MFCC feature at a rate of 100 frames per second, and supplied 19 consecutive frames to the acoustic model. The confidence scores for KWS and alignments to extract audio segments were obtained using an HMM. Given keyword start and end times from the HMM alignment, we used $(start~time - 0.5)$ seconds and $(end~time + 0.3)$ seconds for segmentation to ensure that the segment contained the detected keyword portion. The 1st pass threshold was set to obtain approximately 21 FA/hr on the structured evaluation dataset. We used the same 1st pass system for all the experiments and evaluated the effectiveness of our proposed model as the checker.\n\n\n\\subsection{Model training}\nFor a baseline phoneme classifier, we used a self-attention based acoustic model. The model consisted of 6 layers of Transformer blocks, each of which had a multi-head self-attention layer with 256 hidden dimension and 4 heads, followed by a feedforward neural network with 1024 hidden units. 
Finally, outputs from the Transformer blocks were projected to 54-dimensional logits for phonetic and blank labels by a linear layer. The baseline model was trained with the CTC loss\\footnote{In \\cite{adya2020hybrid}, the vanilla Transformer decoder was also trained along with the self-attention encoder using cross entropy loss, and used as a regularizer during training. We omitted the regularization just because of simplicity in our experiments. The regularization can be applied to all the approaches in our experiments including the proposed approach.}. The same architecture was also used for the conventional multi-task learning \\cite{9053577} by splitting the last layer into 54 outputs for the phonetic CTC loss and three discriminative outputs for a positive class, a negative class and a blank label for the phrase level CTC loss. Regarding the proposed approach, we used the same self-attention phoneme classifier for the phonetic encoder. The cross attention decoder consisted of a Transformer decoder block (i.e., $P=1$) which had the same configuration as the Transformer blocks of the encoder except the cross attention block. The dimension of the query vector and the length of the query sequence were set at 256 and 4, respectively. The last linear layer projected the reshaped $1024 (256\\times4)$-dimensional vector to two logits for positive and negative classes. The encoder and the decoder were jointly trained using the phonetic CTC loss and the phrase level cross entropy loss (see Section \\ref{sec:proposed}). We also explored a BLSTM decoder by replacing the cross attention decoder by a layer of BLSTMs with 256 hidden units followed by a linear layer which processed a concatenated BLSTM outputs at the first and last frame to predict logits. The scaling factor $\\alpha$ in Eq. (\\ref{eq:MTL}) for the multi-task learning was experimentally set at $10$. $40$-dimensional log mel-filter bank features $\\pm$ 3 context frames were used as inputs. 
In addition, we sub-sampled the features once per three frames to reduce computational complexity.\n\n All models were trained using the Adam optimizer \\cite{kingma2014adam}. The learning rate was first increased linearly to $0.0008$ until epoch $2$, then linearly decayed to $0.00056$ until epoch $16$. Finally the learning rate was exponentially decreased until the last epoch which was set at $28$. We used 16 GPUs for training and the batch size was 128 at each GPU. \n\n\n\n\n\n\\subsection{Results}\n\n\\begin{figure}[t]\n \\centering\n\n\n \\includegraphics[width=\\linewidth]{det_curve_checker_edc_v2.png}\n \\caption{DET curves for structured evaluation set. The vertical dotted line indicates an operating point.}\n\n \\label{fig:det_edc}\n\\end{figure}\n\n\\begin{figure}[t]\n \\centering\n\n\n \\includegraphics[width=\\linewidth]{det_curve_checker_thk_v2.png}\n \\caption{DET curves for take home evaluation set. The vertical dotted line indicates an operating point.}\n\n \\label{fig:det_thk}\n\\end{figure}\n\n\\begin{table*}[t]\n \\caption{False reject ratios for structured evaluation set [$\\%$] at an operating point of 1 FA/100 hrs, and for take home evaluation set at an operating point of 100 FAs.}\n\n \\label{tab:FRRs}\n \\centering\n\n\\begin{tabular}{cccccc}\n \\toprule\n & MTL & Branch & Structured evaluation set& Take home evaluation set & Avg.\\\\\n \\midrule\n Phoneme classifier & &Phonetic& 20.26 & 27.72 & 23.99\\\\ \\midrule\n Conventional MTL \\cite{9053577}& \\checkmark &\\begin{tabular}[c]{@{}c@{}}Phonetic\\\\ Phrase\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}5.00 \\\\\\textbf{3.49}\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}14.11 \\\\10.11\\end{tabular} &\\begin{tabular}[c]{@{}c@{}}9.56 \\\\6.80\\end{tabular}\\\\ \\midrule\n BLSTM decoder & \\checkmark &\\begin{tabular}[c]{@{}c@{}}Phonetic\\\\ Phrase\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}5.02 \\\\4.76\\end{tabular} & \\begin{tabular}[c]{@{}c@{}}12.36 \\\\8.89\\end{tabular} 
&\\begin{tabular}[c]{@{}c@{}} 8.69\\\\6.83\\end{tabular}\\\\ \\midrule\n Cross attention decoder & \\checkmark &\\begin{tabular}[c]{@{}c@{}}Phonetic\\\\ Phrase\\end{tabular}& \\begin{tabular}[c]{@{}c@{}}4.64 \\\\3.82\\end{tabular} & \\begin{tabular}[c]{@{}c@{}}13.21 \\\\\\textbf{8.17}\\end{tabular} &\\begin{tabular}[c]{@{}c@{}} 8.93\\\\\\textbf{6.00}\\end{tabular}\\\\\n \\bottomrule\n\\end{tabular}\n\n\\end{table*}\n\n\n\nFigures \\ref{fig:det_edc} and \\ref{fig:det_thk} show detection error tradeoff (DET) curves for all models evaluated on the structured evaluation dataset and the take home evaluation dataset, respectively. The horizontal axis represents FA/hr for the structured dataset or the absolute number of FAs for the take home dataset. The vertical axis represents FRRs. Table \\ref{tab:FRRs} shows the FRRs obtained with the baseline and proposed models at the operating points. In the case of multi-task learning, results from both the phonetic and phrase branches are reported. First, multi-task learning significantly improved the FRRs compared to the phoneme classifier, which was trained only on the ASR data. This result shows the effectiveness of using both the ASR and the KWS data for KWS model training. Second, the phrase branch always yielded better results than the phonetic branch, presumably because the phrase branch was directly optimized for the target task. Note that although the performance of the phonetic branch was not as good as that of the phrase branch, the phonetic branch has the advantage of flexibility: the keyword phrase is configurable at test time.\n\n Lastly, the proposed cross attention decoder with the phrase branch yielded the best performance and achieved a $12\\%$ relative reduction in the FRRs compared to the conventional multi-task learning and the BLSTM decoder. 
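The $12\\%$ relative reduction quoted above can be reproduced from the phrase-branch averages in Table \\ref{tab:FRRs}; a quick arithmetic check (the helper name is ours, not part of the evaluation pipeline):

```python
def relative_reduction(baseline_frr, proposed_frr):
    # Fractional FRR reduction of the proposed model relative to a baseline.
    return (baseline_frr - proposed_frr) / baseline_frr

# Phrase-branch "Avg." FRRs from Table tab:FRRs.
conventional_mtl_avg = 6.80
blstm_decoder_avg = 6.83
cross_attention_avg = 6.00

vs_mtl = relative_reduction(conventional_mtl_avg, cross_attention_avg)
vs_blstm = relative_reduction(blstm_decoder_avg, cross_attention_avg)
```

Both ratios round to the reported $12\\%$.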
The cross attention decoder has another advantage over the BLSTM decoder, which is less training time and less runtime cost as reported in \\cite{adya2020hybrid}.\n\nEven though the proposed decoder can effectively learn from the KWS training data\\footnote{Cross validation loss with the conventional multi-task learning was $1.5\\times$ higher than the loss with the cross attention decoder.}, the proposed approach with the phrase branch did not outperform the conventional multi-task learning for the structured evaluation set. This performance degradation could be because of mismatched conditions/distributions between the KWS training data and the structured evaluation dataset that was recorded in the controlled conditions.\n\n", "Descriptive_question1": "What is the false reject ratio for the Phoneme classifier on the structured evaluation set in table_1?", "Descriptive_question2": "Which model achieved the lowest false reject ratio on the take home evaluation set with the phrase branch in table_1?", "Reasoning_question1": "Why might the cross attention decoder with the phrase branch have outperformed other models on the take home evaluation set in table_1?", "Reasoning_question2": "How does the performance of the phrase branch compare to the phonetic branch across different models in table_1, and what could explain this difference?", "Descriptive_answer1": "20.26", "Descriptive_answer2": "Cross attention decoder", "Reasoning_answer1": "The cross attention decoder with the phrase branch likely outperformed other models on the take home evaluation set due to its ability to effectively learn from the KWS training data, as indicated by lower cross-validation loss compared to conventional multi-task learning. Observing the table, it achieved the lowest false reject ratio of 8.17% at the operating point of 100 FAs, compared to 10.11% for conventional MTL and 8.89% for BLSTM decoder. This suggests that the cross attention mechanism might be better at capturing relevant features for keyword spotting in more realistic, challenging home environments. Additionally, the context mentions that the cross attention decoder has advantages in training time and runtime cost, which could indirectly contribute to better optimization and performance on varied datasets like the take home set.", "Reasoning_answer2": "Across different models in the table, the phrase branch consistently shows lower false reject ratios compared to the phonetic branch. 
For instance, in conventional MTL, the phrase branch achieves 3.49% versus 5.00% for phonetic on the structured set, and 10.11% versus 14.11% on the take home set. Similarly, for the BLSTM decoder, it's 4.76% versus 5.02% on structured, and 8.89% versus 12.36% on take home; and for the cross attention decoder, it's 3.82% versus 4.64% on structured, and 8.17% versus 13.21% on take home. This pattern indicates that the phrase branch is generally more effective. The likely reason, as suggested in the context, is that the phrase branch is directly optimized for the target task of keyword spotting, whereas the phonetic branch focuses on broader phoneme classification which may not align as closely with the specific goal. Additionally, the phonetic branch's flexibility in configuring keyword phrases at test time might come at the cost of precision in performance compared to the targeted optimization of the phrase branch." }, { "paper_id": "1812.04352.json", "table_id": "table_1", "table_content": "\\begin{table}[h]\n \\center\n \\begin{tabular}{@ { } lrrrrr @ { }}\n \\toprule\n Test case & $N$ & $\\#$Cores & Layer-serial & Layer-parallel & Speedup \\\\\n\t \\midrule\n\t Peaks example & 1024 & 256 & 4096 sec & 683 sec & 6.0 \\\\\n\t Indian Pines & 512 & 128 & 2623 min & 597 min & 4.4 \\\\\n\t MNIST & 512 & 128 & 619 min & 71 min & 8.5 \\\\\n\n \\bottomrule\n \\end{tabular}\n \\caption{Runtime speedup of simultaneous layer-parallel training over layer-serial training.}\n \\label{tab:OSspeedup}\n\\end{table}", "caption": "Runtime speedup of simultaneous layer-parallel training over layer-serial training.", "label": "tab:OSspeedup", "section_info": "5 Numerical Results\n\\section{Numerical Results}\n\\label{sec:numerics}\n\n\nWe investigate the computational benefits of the simultaneous layer-parallel training approach on three test cases. 
\nFor all test cases, our focus is on the ability to achieve speedup in training runtimes for very deep neural networks by introducing parallelism between the layers. It is likely, though not explored here, that greater combined speedups are possible by additionally using data-parallelism or parallelizing inside of each layer. Further studies are required to better understand the trade-off of distributing parallel work between layer-parallel and data-parallel.\n\n\n\\subsection{Test Cases}\n\n\\begin{enumerate}\n\\item Level set classification (\\textit{Peaks example}):\n\nAs a first step, we consider the test problem suggested in~\\cite{HaberRuthotto2017} for classifying grid points into five (non-convex) level sets of a smooth nonlinear function $f\\colon[-3,3]^2\\to\\R$ (Figure \\ref{fig:testcases:peaks}).\nThe training data set consists of $s=5000$ randomly chosen points $\\bfy_k \\in[-3,3]^2, \\, k=1,\\ldots, s,$ and {standard basis vectors} $c_k\\in\\R^5$ which represent the probability that a point $\\bfy_k$ belongs to level set $i\\in\\{1,2,3,4,5\\}$.\nThe goal is to train a network that predicts the correct level sets for new points in $[-3,3]^2$ (validation points). \n\nWe choose a ResNet architecture with {smoothed ReLU activation defined as}\n\\begin{equation}{\\sigma(x) = \\begin{cases}\n\t\t\\max\\{x,0\\}, & |x|>0.1\\\\\n\t\t\\frac{5}{2} x^2 + \\frac{1}{2} x + \\frac{1}{40}, & |x| \\leq 0.1 \n\t\\end{cases}.}\n\\end{equation}\n Also, we define the linear operations $\\bfK(\\cdot)$ at each layer to be a dense matrix representation of the weights $\\bftheta^n$. We choose a network depth of $T=5$ discretized with up to $N=2048$ layers and a network width of $8$ such that $\\bfu^n\\in\\R^8, \\forall \\,n=0, \\dots, N$. 
{In order to map the data set to the network width, we choose $L_{\\rm in}$ to be a dense $\\R^{8\\times 2}$ matrix whose entries are learned alongside the network parameters, followed by an initial application of the activation function.}\n\n\\item Hyperspectral image segmentation (\\textit{Indian Pines}):\n\nIn this test case, we consider a soil segmentation problem based on a hyperspectral image data set. The input data consists of hyperspectral bands over a single landscape in Indiana, US ({Indian Pines data set}~\\cite{indianpinesdata}), with $145\\times 145$ pixels. For each pixel, the data set contains $220$ spectral reflectance bands which represent different portions of the electromagnetic spectrum in the wavelength range $(0.4 - 2.5) \\cdot 10^{-6}$ m. \nThe goal is to train a network that assigns each pixel of the scene to one of $16$ class labels that represent the type of land-cover present at that pixel (such as alfalfa, corn, soybean, wheat, etc.); see Figure \\ref{fig:testcases:IP}.\n\nWe use the spectral bands of $s=1000$ randomly chosen pixel points, $\\bfy_k\\in\\R^{220}, \\, k=1,\\dots,s$, together with their corresponding class probability vectors $\\bfc_k\\in\\R^{16}$ (unit vectors) for training. The network architecture is a ResNet with smoothed ReLU activation (i.e. $\\sigma(x) = \\max\\{0,x\\}$, smoothed around zero), and we define the linear operations $\\bfK(\\cdot)$ at each layer to be a dense matrix representation of the weights $\\bftheta^n$. We choose a network depth of $T=20$ discretized with up to $N=2048$ layers and a network width of $220$ channels, corresponding to the 220 reflectance bands. {The initial operator $L_{\\rm in}$ is chosen to be the identity.}\n\n\\item MNIST image classification (\\textit{MNIST}):\n\nAs a final example, we consider the now classic MNIST~\\cite{Lecun1998Gradient} test case for classification of handwritten digits encoded in $28\\times 28$ grey scale images (Figure \\ref{fig:testcases:MNIST}). 
Our objective for this test case is to demonstrate the scalability of the layer-parallel approach over an increasing number of layers. While we obtain reasonable validation accuracy, the objective is not to develop an optimal ResNet to solve this problem. Further, we obtained the timings below with our own straightforward implementation of convolutions, to ensure compatible layer-to-layer propagators with XBraid for our initial tests. Future work will use a fast convolution library, which will provide a substantial speedup to both the serial and layer-parallel codes.\n\n{For the weak scaling runs below,} we use a ResNet architecture with $\\tanh$ activation and define internal layers by the linear operator $\\bfK(\\cdot)$ using $8$ convolution kernels of width $3${; we used similar architectures in~\\cite{HaberHolthamRuthotto2017,HaberRuthotto2017}}. This yields a weight tensor at each layer of size $\\mathbb{R}^{3\\times3\\times 8\\times 8}$. The parameters to be trained are {in} $\\mathbb{R}^{28\\times 28}$ at each layer. {The strong scaling training tests below used $4$ convolutional kernels to reduce memory requirements.} \nThe network is defined to have a depth of $T=5$ and is discretized with up to $N = 2048$ layers. 
{The initial operator $L_{\\rm in}$ is chosen to be the identity copied over the $8$ (or $4$) convolutional kernels.}\n\n\\begin{figure}\n\t\\center\n\t\\begin{subfigure}{0.3\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[height=3.4cm]{figures/peaks/Mesh2D_all_scale.png}\n\t\t\\caption{Peaks}\n\t\t\\label{fig:testcases:peaks}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.3\\textwidth}\n\t\\center\n\n\n\t\t\\includegraphics[height=3.4cm]{figures/hyperspectral/IP_sampleband_classes.png} \n\t\t\\caption{Indian Pines}\n\t\t\\label{fig:testcases:IP}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.3\\textwidth}\n\t\\center\n\t\t\\includegraphics[height=3.4cm]{figures/mnist/mnist.png}\n\t\t\\caption{MNIST}\n\t\t\\label{fig:testcases:MNIST}\n\t\\end{subfigure}\n\t\\caption{Classes of the Peaks example (test case 1), sample band and true classes of the Indian Pines data set (test case 2), and examples from the MNIST data set (test case 3).}\n\\end{figure}\n\\end{enumerate} \n\nThe Peaks and Indian Pines computations were performed on the RHRK cluster Elwetritsch II at TU Kaiserslautern. Elwetritsch II has 485 nodes based on Haswell (2x8 cores, 64GB) and Skylake (2x12 cores, 96GB) architectures. \nThe computations for the MNIST results were performed on the Skybridge capacity cluster at Sandia National Laboratories. Skybridge is a Cray containing 1848 nodes with two 8 core Intel 2.6 GHz Sandy Bridge processors, 64GB of RAM per node and an Infiniband interconnect. 
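All three networks share the same layer propagation: a forward-Euler ResNet step $\\bfu^{n+1} = \\bfu^n + h\\,\\sigma(\\bfK(\\bftheta^n)\\bfu^n)$ with step size $h = T/N$. A scalar toy sketch using the smoothed ReLU of the Peaks example (dense and convolutional operators, network widths, and $L_{\\rm in}$ are simplified away here):

```python
def smoothed_relu(x):
    # Peaks-example activation: quadratic blend on |x| <= 0.1, ReLU elsewhere.
    # The blend matches the ReLU in both value and slope at x = +/-0.1.
    if abs(x) <= 0.1:
        return 2.5 * x * x + 0.5 * x + 1.0 / 40.0
    return max(x, 0.0)

def resnet_forward(u0, layer_weights, depth_T):
    # Forward-Euler ResNet: u_{n+1} = u_n + h * sigma(theta_n * u_n), h = T/N.
    # A scalar theta_n stands in for the dense or convolutional operator K.
    h = depth_T / len(layer_weights)
    u = u0
    for theta in layer_weights:
        u = u + h * smoothed_relu(theta * u)
    return u
```

For positive activations and unit weights each step multiplies $u$ by $(1 + h)$, so $N=10$ layers of depth $T=1$ give $u \\approx 1.1^{10}$.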
\n{The source code is available online at \\cite{dnn_pint}}.\n\n\\subsection{Layer-Parallel Scaling and Performance Validation}\n\\label{sec:numerics:MGRIT}\nFirst, we investigate {the performance of the layer-parallel MGRIT propagation for one single objective function and gradient evaluation.} Here, we keep the network weights fixed and propagate a batch of examples of sizes $s=5000, 1000, 500$ for the Peaks, Indian Pines and MNIST test case, respectively, through the network.\nWe choose a coarsening factor of $c=4$ to set up a hierarchy of ever coarser layer-grids to employ the multigrid scheme. {This coarsening strategy did not encounter any stability issues for forward Euler on the coarser layer-grids.} \n\nFigure \\ref{fig:MGRITconvergence} shows the convergence history of the MGRIT iterations for two different problem sizes using $N=256$ and $N=2048$ layers. We monitor the relative drop of the state and adjoint residual norms and observe fast convergence for all test cases that is independent of the number of layers.\n{Note that the performed multigrid iterations themselves are not dependent on the number of cores used for parallelisation, making Figure \\ref{fig:MGRITconvergence} independent of the parallel distribution. We report scaling results varying the number of cores next.}\n\\begin{figure}\n\t\\center \n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center \n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/xbraid/convergence_standalone.pdf}\n\t\t\\caption{Peaks example}\n\t\\end{subfigure}\n\n\n\n\n\n\t\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center \n\t\t\\includegraphics[width=\\textwidth]{figures/mnist/convergence_standalone.pdf}\n\t\t\\caption{MNIST}\n\t\\end{subfigure}\n \\caption{Convergence history of MGRIT solving the state and adjoint equations for $N=256$ and $N=2048$ layers. 
The MGRIT scheme achieves fast convergence independent of the number of layers.\\protect\\footnotemark}\n\t\\label{fig:MGRITconvergence}\n\\end{figure}\n\\footnotetext{The corresponding figure for the Indian Pines test case shows the same quantitative behavior, and has hence been omitted here.}\n\n\n{We investigate scaling results for the layer-parallel MGRIT scheme and compare runtimes to conventional serial-in-layer forward- and backpropagation.}\nFigure \\ref{fig:weakscaling} presents a weak-scaling study for the layer-parallel MGRIT scheme. Here, we double the number of layers as well as the number of compute cores while keeping the ratio $N/\\#\\text{cores} = 4$ fixed, such that each compute unit processes $4$ layers. Runtimes are measured for one objective function and gradient evaluation, using a relative stopping criterion of $5$ orders of magnitude for the MGRIT residual norms. Note that the layer-serial data points have been added for comparison, even though they are executed on only one core. For the layer-serial propagation, doubling the number of layers leads to a doubling in runtime. \nThe layer-parallel MGRIT approach, however, yields nearly constant runtimes independent of the problem size. \nThe resulting speedups are reported in Table \\ref{tab:MGRITspeedup}. Since the layer-parallel MGRIT approach removes the linear runtime scaling of the conventional layer-serial propagation, the resulting speedups increase linearly with the problem size, yielding up to a factor of $16$x for the MNIST case using $2048$ layers and $512$ cores. Further speedup can be expected when considering ever more layers (and computational resources). 
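The speedup column of Table \\ref{tab:MGRITspeedup} is simply the ratio of layer-serial to layer-parallel runtime; a quick check on the MNIST rows reproduces the reported values, including the $16$x figure:

```python
def speedup(serial_sec, parallel_sec):
    # Speedup of layer-parallel gradient evaluation over layer-serial propagation.
    return serial_sec / parallel_sec

# MNIST rows of Table tab:MGRITspeedup: layers -> (serial s, parallel s).
mnist_runtimes = {
    256: (272.3, 79.5),
    512: (545.3, 113.3),
    1024: (1095.2, 104.0),
    2048: (2193.5, 137.3),
}
mnist_speedups = {n: round(speedup(s, p), 1)
                  for n, (s, p) in mnist_runtimes.items()}
```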
\n\n\n\\begin{figure}\n\t\\center\n\t\\begin{subfigure}{0.49\\textwidth}\n\t \\includegraphics[width=\\textwidth]{figures/hyperspectral/weakscaling_standalone.pdf}\n\t \\caption{Indian Pines}\n\t\\end{subfigure}\n \t\\begin{subfigure}{0.49\\textwidth}\n\t \\includegraphics[width=\\textwidth]{figures/mnist/weakscaling_standalone.pdf}\n\t \\caption{MNIST}\n\t\\end{subfigure}\n\t\\caption{Runtime comparison of a layer-parallel gradient evaluation with layer-serial forward- and backpropagation. The layer-parallel approach yields nearly constant runtimes for increasing problem sizes and computational resources.\\protect\\footnotemark}\n\t\\label{fig:weakscaling}\n\\end{figure}\n\\footnotetext{The corresponding figure for the Peaks test case shows the same quantitative behavior, and has hence been omitted here.}\n\n{\n\\begin{table}\n \\center\n \\begin{tabular}{@ { } lllrrr @ { }}\n \\toprule\n Test case & $\\#$Layers & $\\#$Cores & Serial & Parallel & Speedup\\\\\n\t \\midrule\n Peaks & \t256 & 64 & 1.8sec & 1.2sec & 1.5 \\\\\n & \t512 & 128 & 3.7sec & 1.5sec & 2.5 \\\\\n & \t1024 & 256 & 7.1sec & 1.6sec & 4.3 \\\\\n & \t2048 & 512 & 13.9sec & 1.8sec & 7.7 \\\\\n\t \\midrule\n\t Indian Pines & \t256 & 64 & 157.1sec & 77.6sec & 2.0 \\\\\n & \t512 & 128 & 311.6sec & 94.5sec & 3.3 \\\\\n & \t1024 & 256 & 624.0sec & 102.6sec & 6.1 \\\\\n & \t2048 & 512 & 1248.0sec & 120.6sec & 10.3 \\\\\n\t \\midrule\n\t MNIST & \t256 & 64 & 272.3sec & 79.5sec & 3.4 \\\\\n & \t512 & 128 & 545.3sec & 113.3sec & 4.8 \\\\\n & \t1024 & 256 & 1095.2sec & 104.0sec & 10.5 \\\\\n & \t2048 & 512 & 2193.5sec & 137.3sec & 16.0 \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Runtime and speedup of layer-parallel gradient evaluation over layer-serial propagation.}\n \\label{tab:MGRITspeedup}\n\\end{table}\n}\n\nA strong scaling study is presented in Figure \\ref{fig:strongscaling} for various numbers of layers. 
Here, we keep the problem sizes fixed and measure the time-to-solution for one gradient evaluation with MGRIT for increasing numbers of computational resources. It shows good strong scaling behavior for all test cases, independent of the numbers of layers. The cross over point where the layer-parallel MGRIT approach shows speedup over the layer-serial propagation is around $16$ cores for all cases. \n\\begin{figure}\n\t\\center \n\t\\begin{subfigure}{.49\\textwidth}\n\t\t\\center \n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/xbraid/strongscaling_standalone.pdf}\n\t\t\\caption{Peaks example}\n\t\\end{subfigure}\n\t\\begin{subfigure}{.49\\textwidth}\n\t\t\\center \n\t\t\\includegraphics[width=\\textwidth]{figures/hyperspectral/strongscaling_standalone.pdf}\n\t\t\\caption{Indian Pines}\n\t\\end{subfigure}\n\t\\caption{Strong scaling study for a layer-parallel gradient evaluation for various problem sizes from $N=256$ to $N=2048$ layers. {Corresponding serial runtimes are indicated by horizontal dashed lines.} The cross-over point where the layer-parallel approach yields speedup over the layer-serial propagation lies around $16$ cores.\\protect\\footnotemark}\n\t\\label{fig:strongscaling}\n\\end{figure}\n\n\\footnotetext{The corresponding figure for the MNIST test case shows the same quantitative behavior, and has hence been omitted here.}\n\n\n\n\\subsection{Simultaneous Layer-Parallel Training Validation}\n\\label{sec:numerics:oneshot}\nNext, we investigate the simultaneous layer-parallel training, using $m_1 = m_2 = 2$ layer-parallel MGRIT iterations in each outer training iteration (see Algorithm \\ref{alg:oneshot}). 
{The Hessian approximations $B_{\\bftheta}, B_{\\bfW}, B_{\\bfmu}$ are computed by successive limited-memory BFGS updates based on the current gradient $\\nabla_{(\\bftheta, \\bfW, \\bfmu)}J$.}\nWe compare runtimes of the simultaneous layer-parallel training with a conventional layer-serial training approach, choosing the same Hessian as well as the same initial network parameters for both approaches. However, we tune the optimization hyper-parameters (such as regularization parameters, stepsize selection, etc.) separately for both schemes, in order to find the best setting for either approach that reaches a prescribed validation accuracy with the fewest iterations and minimum runtime. \n\nFor the Peaks example, we train a network with $N=1024$ layers distributed onto $256$ compute cores, and for the Indian Pines data set and the MNIST case we choose $N=512$ layers distributed onto $128$ compute cores, giving $4$ layers per processor in all cases. \n Figure \\ref{fig:oneshot} plots the training history over iteration counts (top) as well as runtime (bottom). We validate from the top figures that both approaches reach comparable performance in terms of training result (optimization iteration counts, training loss and validation accuracy). Hence, reducing the accuracy of the inner multigrid iterations for solving the state and adjoint equations within a simultaneous training framework does not deteriorate the training behavior.\n But each iteration of the simultaneous layer-parallel approach is much faster than for the layer-serial approach, due to the layer-parallelization and the reduced state and adjoint accuracy. Therefore, the overall runtime for reaching that same final training result is reduced drastically (bottom figures). 
\n Runtime speedups are reported in Table \\ref{tab:OSspeedup}.\n\n\n While these results have been computed for selected fixed $N$, it is expected that the speedup scales linearly with increasing numbers of layers, similar to the observation in Table~\\ref{tab:MGRITspeedup}.\n\n\n\\begin{figure}[ht]\n\t\\center\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/oneshot/OSvsRef_iter_standalone.pdf}\n\t\t\\caption{Peaks: Training over iteration counts}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/hyperspectral/OSvsRef_iter_standalone.pdf}\n\t\t\\caption{Indian Pines: Training over iteration counts}\n\t\\end{subfigure}\n\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/oneshot/OSvsRef_time_standalone.pdf}\n\t\t\\caption{Peaks: Training over time}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/hyperspectral/OSvsRef_time_standalone.pdf}\n\t\t\\caption{Indian Pines: Training over time}\n\t\\end{subfigure}\n\t\\caption{Training loss (solid lines) and validation accuracy (dashed lines) over training iterations (top) and compute time (bottom). For the layer-parallel training, each core processes $4$ layers. 
The simultaneous layer-parallel approach reaches training results comparable to a layer-serial approach within much less computational time.\\protect\\footnotemark}\n\t\\label{fig:oneshot}\n\\end{figure}\n\\footnotetext{The corresponding figures for the MNIST test case show the same quantitative behavior, and have hence been omitted here.}\n\n\n\\begin{table}[h]\n \\center\n \\begin{tabular}{@ { } lrrrrr @ { }}\n \\toprule\n Test case & $N$ & $\\#$Cores & Layer-serial & Layer-parallel & Speedup \\\\\n\t \\midrule\n\t Peaks example & 1024 & 256 & 4096 sec & 683 sec & 6.0 \\\\\n\t Indian Pines & 512 & 128 & 2623 min & 597 min & 4.4 \\\\\n\t MNIST & 512 & 128 & 619 min & 71 min & 8.5 \\\\\n\n \\bottomrule\n \\end{tabular}\n \\caption{Runtime speedup of simultaneous layer-parallel training over layer-serial training.}\n \\label{tab:OSspeedup}\n\\end{table}\n\n\n\n\n", "Descriptive_question1": "What is the speedup value for the Peaks example in table_1?", "Descriptive_question2": "How many cores are used for the MNIST test case with 512 layers in table_1?", "Reasoning_question1": "Why might the speedup for MNIST be higher than for the Peaks example in table_1 despite using fewer layers?", "Reasoning_question2": "What trends can be observed in the speedup values across different test cases in table_1 as the number of cores increases?", "Descriptive_answer1": "6.0", "Descriptive_answer2": "128", "Reasoning_answer1": "To understand why MNIST has a higher speedup (8.5) compared to the Peaks example (6.0) despite using fewer layers (512 vs. 1024), we need to consider several factors beyond just the number of layers. First, the nature of the test cases differs significantly—MNIST involves image classification with convolutional operations, which may benefit more from layer-parallel training due to the structure of the data and network architecture.
In contrast, the Peaks example deals with level set classification, potentially involving different computational complexities per layer. Second, although MNIST uses fewer layers, the runtime for layer-serial training in MNIST (619 min) is much longer than for Peaks (4096 sec, or approximately 68 min), suggesting that each layer in MNIST is considerably more computationally intensive; this heavier per-layer workload amortizes the overhead of the parallel scheme more effectively, allowing parallelization to yield a greater relative improvement. Finally, the efficiency of parallelization on the hardware used (different clusters for each test case) and the specific implementation details could contribute to this discrepancy. Thus, the higher speedup in MNIST likely results from a combination of architectural suitability for parallelization and higher per-layer computational cost.", "Reasoning_answer2": "Analyzing the trend in speedup values across the test cases in table_1 as the number of cores increases, we can infer a general pattern. For the Peaks example, with 256 cores and 1024 layers, the speedup is 6.0. For Indian Pines and MNIST, both using 128 cores with 512 layers, the speedups are 4.4 and 8.5, respectively. Although the number of cores varies across test cases, a trend emerges when considering the ratio of layers to cores (approximately 4 layers per core in all cases). The speedup generally increases with the number of cores and layers, as seen in related data from other tables like table_2 (tab:MGRITspeedup), where speedup consistently grows with more cores and layers (e.g., for MNIST, speedup rises from 3.4 at 64 cores to 16.0 at 512 cores). This suggests that layer-parallel training scales effectively with increased computational resources. The variation in speedup values among test cases indicates that the effectiveness of parallelization also depends on the specific problem characteristics and network architecture. 
Therefore, the trend shows that speedup improves with more cores, but the rate of improvement is influenced by the inherent properties of each test case." }, { "paper_id": "1812.04352.json", "table_id": "table_2", "table_content": "\\begin{table}\n \\center\n \\begin{tabular}{@ { } lllrrr @ { }}\n \\toprule\n Test case & $\\#$Layers & $\\#$Cores & Serial & Parallel & Speedup\\\\\n\n\t \\midrule\n Peaks & \t256 & 64 & 1.8sec & 1.2sec & 1.5 \\\\\n & \t512 & 128 & 3.7sec & 1.5sec & 2.5 \\\\\n & \t1024 & 256 & 7.1sec & 1.6sec & 4.3 \\\\\n & \t2048 & 512 & 13.9sec & 1.8sec & 7.7 \\\\\n\t \\midrule\n\t Indian Pines & \t256 & 64 & 157.1sec & 77.6sec & 2.0 \\\\\n & \t512 & 128 & 311.6sec & 94.5sec & 3.3 \\\\\n & \t1024 & 256 & 624.0sec & 102.6sec & 6.1 \\\\\n & \t2048 & 512 & 1248.0sec & 120.6sec & 10.3 \\\\\n\t \\midrule\n\t MNIST & \t256 & 64 & 272.3sec & 79.5sec & 3.4 \\\\\n & \t512 & 128 & 545.3sec & 113.3sec & 4.8 \\\\\n & \t1024 & 256 & 1095.2sec & 104.0sec & 10.5 \\\\\n & \t2048 & 512 & 2193.5sec & 137.3sec & 16.0 \\\\\t \n \\bottomrule\n \\end{tabular}\n \\caption{Runtime and speedup of layer-parallel gradient evaluation over layer-serial propagation.}\n \\label{tab:MGRITspeedup}\n\\end{table}", "caption": "Runtime and speedup of layer-parallel gradient evaluation over layer-serial propagation.", "label": "tab:MGRITspeedup", "section_info": "5 Numerical Results\n\\section{Numerical Results}\n\\label{sec:numerics}\n\n\nWe investigate the computational benefits of the simultaneous layer-parallel training approach on three test cases. \nFor all test cases, our focus is on the ability to achieve speedup in training runtimes for very deep neural networks by introducing parallelism between the layers. It is likely, though not explored here, that greater combined speedups are possible by additionally using data-parallelism or parallelizing inside of each layer. 
Further studies are required to better understand the trade-off of distributing parallel work between the layer-parallel and data-parallel approaches.\n\n\subsection{Test Cases}\n\n\begin{enumerate}\n\item Level set classification (\textit{Peaks example}):\n\nAs a first step, we consider the test problem suggested in~\cite{HaberRuthotto2017} for classifying grid points into five (non-convex) level sets of a smooth nonlinear function $f\colon[-3,3]^2\to\R$ (Figure \ref{fig:testcases:peaks}).\nThe training data set consists of $s=5000$ randomly chosen points $\bfy_k \in[-3,3]^2, \, k=1,\ldots, s,$ and {standard basis vectors} $c_k\in\R^5$ which represent the probability that a point $\bfy_k$ belongs to level set $i\in\{1,2,3,4,5\}$.\nThe goal is to train a network that predicts the correct level sets for new points in $[-3,3]^2$ (validation points). \n\nWe choose a ResNet architecture with {smoothed ReLU activation defined as}\n\begin{equation}{\sigma(x) = \begin{cases}\n\t\t\max\{x,0\}, & |x|>0.1\\\n\t\t\frac{5}{2} x^2 + \frac{1}{2} x + \frac{1}{40}, & |x| \leq 0.1 \n\t\end{cases}.}\n\end{equation}\n Also, we define the linear operations $\bfK(\cdot)$ at each layer to be a dense matrix representation of the weights $\bftheta^n$. We choose a network depth of $T=5$ discretized with up to $N=2048$ layers and a network width of $8$ such that $\bfu^n\in\R^8, \forall \,n=0, \dots, N$. {In order to map the data set to the network width, we choose $L_{\rm in}$ to be a dense $\R^{8\times 2}$ matrix whose entries are learned alongside the network parameters, followed by an initial application of the activation function.}\n\n\item Hyperspectral image segmentation (\textit{Indian Pines}):\n\nIn this test case, we consider a soil segmentation problem based on a hyperspectral image data set. 
The input data consists of hyperspectral bands over a single landscape in Indiana, US, ({Indian Pines data set}~\cite{indianpinesdata}) with $145\times 145$ pixels. For each pixel, the data set contains $220$ spectral reflectance bands which represent different portions of the electromagnetic spectrum in the wavelength range $0.4 - 2.5 \cdot 10^{-6}$~m. \nThe goal is to train a network that assigns each pixel of the scene to one of $16$ class labels that represent the type of land-cover present at that pixel (such as alfalfa, corn, soybean, wheat, etc.), see Figure \ref{fig:testcases:IP}.\n\nWe use the spectral bands of $s=1000$ randomly chosen pixel points, $\bfy_k\in\R^{220}, \, k=1,\dots,s$, together with their corresponding class probability vectors $\bfc_k\in\R^{16}$ (unit vectors) for training. The network architecture is a ResNet with smoothed ReLU activation (i.e. $\sigma(x) = \max\{0,x\}$, smoothed around zero), and we define the linear operations $\bfK(\cdot)$ at each layer to be a dense matrix representation of the weights $\bftheta^n$. We choose a network depth of $T=20$ discretized with up to $N=2048$ layers and a network width of $220$ channels, corresponding to the 220 reflectance bands. {The initial operator $L_{\rm in}$ is chosen to be the identity.}\n\n\item MNIST image classification (\textit{MNIST}):\n\nAs a final example, we consider the now classic MNIST~\cite{Lecun1998Gradient} test case for classification of handwritten digits encoded in a $28\times 28$ grey scale image (Figure \ref{fig:testcases:MNIST}). Our objective for this test case is to demonstrate the scalability of the layer-parallel approach over an increasing number of layers. While we obtain reasonable validation accuracy, the objective is not to develop an optimal ResNet to solve this problem. 
Further, we obtained the timings below with our own straightforward implementation of convolutions, to ensure compatible layer-to-layer propagators with XBraid for our initial tests. Future work will use a fast convolution library, which will provide a substantial speedup to both the serial and layer-parallel codes.\n\n{For the weak scaling runs below,} we use a ResNet architecture with $\\tanh$ activation and define internal layers by the linear operator $\\bfK(\\cdot)$ using $8$ convolution kernels of width $3${; we used similar architectures in~\\cite{HaberHolthamRuthotto2017,HaberRuthotto2017}}. This yields a weight tensor at each layer of size $\\mathbb{R}^{3\\times3\\times 8\\times 8}$. The parameters to be trained are {in} $\\mathbb{R}^{28\\times 28}$ at each layer. {The strong scaling training tests below used $4$ convolutional kernels to reduce memory requirements.} \nThe network is defined to have a depth of $T=5$ and is discretized with up to $N = 2048$ layers. {The initial operator $L_{\\rm in}$ is chosen to be the identity copied over the $8$ (or $4$) convolutional kernels.}\n\n\\begin{figure}\n\t\\center\n\t\\begin{subfigure}{0.3\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[height=3.4cm]{figures/peaks/Mesh2D_all_scale.png}\n\t\t\\caption{Peaks}\n\t\t\\label{fig:testcases:peaks}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.3\\textwidth}\n\t\\center\n\n\n\t\t\\includegraphics[height=3.4cm]{figures/hyperspectral/IP_sampleband_classes.png} \n\t\t\\caption{Indian Pines}\n\t\t\\label{fig:testcases:IP}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.3\\textwidth}\n\t\\center\n\t\t\\includegraphics[height=3.4cm]{figures/mnist/mnist.png}\n\t\t\\caption{MNIST}\n\t\t\\label{fig:testcases:MNIST}\n\t\\end{subfigure}\n\t\\caption{Classes of the Peaks example (test case 1), sample band and true classes of the Indian Pines data set (test case 2), and examples from the MNIST data set (test case 3).}\n\\end{figure}\n\\end{enumerate} \n\nThe Peaks and Indian Pines 
computations were performed on the RHRK cluster Elwetritsch II at TU Kaiserslautern. Elwetritsch II has 485 nodes based on Haswell (2x8 cores, 64GB) and Skylake (2x12 cores, 96GB) architectures. \nThe computations for the MNIST results were performed on the Skybridge capacity cluster at Sandia National Laboratories. Skybridge is a Cray containing 1848 nodes with two 8 core Intel 2.6 GHz Sandy Bridge processors, 64GB of RAM per node and an Infiniband interconnect. \n{The source code is available online at \\cite{dnn_pint}}.\n\n\\subsection{Layer-Parallel Scaling and Performance Validation}\n\\label{sec:numerics:MGRIT}\nFirst, we investigate {the performance of the layer-parallel MGRIT propagation for one single objective function and gradient evaluation.} Here, we keep the network weights fixed and propagate a batch of examples of sizes $s=5000, 1000, 500$ for the Peaks, Indian Pines and MNIST test case, respectively, through the network.\nWe choose a coarsening factor of $c=4$ to set up a hierarchy of ever coarser layer-grids to employ the multigrid scheme. {This coarsening strategy did not encounter any stability issues for forward Euler on the coarser layer-grids.} \n\nFigure \\ref{fig:MGRITconvergence} shows the convergence history of the MGRIT iterations for two different problem sizes using $N=256$ and $N=2048$ layers. We monitor the relative drop of the state and adjoint residual norms and observe fast convergence for all test cases that is independent of the number of layers.\n{Note that the performed multigrid iterations themselves are not dependent on the number of cores used for parallelisation, making Figure \\ref{fig:MGRITconvergence} independent of the parallel distribution. 
We report scaling results varying the number of cores next.}\n\\begin{figure}\n\t\\center \n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center \n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/xbraid/convergence_standalone.pdf}\n\t\t\\caption{Peaks example}\n\t\\end{subfigure}\n\n\n\n\n\n\t\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center \n\t\t\\includegraphics[width=\\textwidth]{figures/mnist/convergence_standalone.pdf}\n\t\t\\caption{MNIST}\n\t\\end{subfigure}\n \\caption{Convergence history of MGRIT solving the state and adjoint equations for $N=256$ and $N=2048$ layers. The MGRIT scheme achieves fast convergence independent of the number of layers.\\protect\\footnotemark}\n\t\\label{fig:MGRITconvergence}\n\\end{figure}\n\\footnotetext{The corresponding figure for the Indian Pines test case shows the same quantitative behavior, and has hence been omitted here.}\n\n\n{We investigate scaling results for the layer-parallel MGRIT scheme and compare runtimes to conventional serial-in-layer forward- and backpropagation.}\nFigure \\ref{fig:weakscaling} presents a weak-scaling study for the layer-parallel MGRIT scheme. Here, we double the number of layers as well as the number of compute cores while keeping the ratio $N/\\#\\text{cores} = 4$ fixed, such that each compute unit processes $4$ layers. Runtimes are measured for one objective function and gradient evaluation, using a relative stopping criterion of $5$ orders of magnitude for the MGRIT residual norms. Note, that the layer-serial data points have been added for comparison, even though they are executed on only one core. For the layer-serial propagation, doubling the number of layers leads to a doubling in runtime. \nThe layer-parallel MGRIT approach however yields nearly constant runtimes independent of the problem size. \nThe resulting speedups are reported in Table \\ref{tab:MGRITspeedup}. 
Since the layer-parallel MGRIT approach removes the linear runtime scale of the conventional serial-layer propagation, resulting speedups increase linearly with the problem size yielding up to a factor of $16$x for the MNIST case using $2048$ layers and $512$ cores. Further speedup can be expected when considering ever more layers (and computational resources). \n\n\n\\begin{figure}\n\t\\center\n\n\n\n\n\t\\begin{subfigure}{0.49\\textwidth}\n\t \\includegraphics[width=\\textwidth]{figures/hyperspectral/weakscaling_standalone.pdf}\n\t \\caption{Indian Pines}\n\t\\end{subfigure}\n \t\\begin{subfigure}{0.49\\textwidth}\n\t \\includegraphics[width=\\textwidth]{figures/mnist/weakscaling_standalone.pdf}\n\t \\caption{MNIST}\n\t\\end{subfigure}\n\t\\caption{Runtime comparison of a layer-parallel gradient evaluation with layer-serial forward- and backpropagation. The layer-parallel approach yields nearly constant runtimes for increasing problem sizes and computational resources.\\protect\\footnotemark}\n\t\\label{fig:weakscaling}\n\\end{figure}\n\\footnotetext{The corresponding figure for the Peaks test case shows the same quantitative behavior, and has hence been omitted here.}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n{\n\\begin{table}\n \\center\n \\begin{tabular}{@ { } lllrrr @ { }}\n \\toprule\n Test case & $\\#$Layers & $\\#$Cores & Serial & Parallel & Speedup\\\\\n\n\t \\midrule\n Peaks & \t256 & 64 & 1.8sec & 1.2sec & 1.5 \\\\\n & \t512 & 128 & 3.7sec & 1.5sec & 2.5 \\\\\n & \t1024 & 256 & 7.1sec & 1.6sec & 4.3 \\\\\n & \t2048 & 512 & 13.9sec & 1.8sec & 7.7 \\\\\n\t \\midrule\n\t Indian Pines & \t256 & 64 & 157.1sec & 77.6sec & 2.0 \\\\\n & \t512 & 128 & 311.6sec & 94.5sec & 3.3 \\\\\n & \t1024 & 256 & 624.0sec & 102.6sec & 6.1 \\\\\n & \t2048 & 512 & 1248.0sec & 120.6sec & 10.3 \\\\\n\t \\midrule\n\t MNIST & \t256 & 64 & 272.3sec & 79.5sec & 3.4 \\\\\n & \t512 & 128 & 545.3sec & 113.3sec & 4.8 \\\\\n & \t1024 & 256 & 1095.2sec & 104.0sec & 10.5 \\\\\n & \t2048 & 512 & 
2193.5sec & 137.3sec & 16.0 \\\t \n \bottomrule\n \end{tabular}\n \caption{Runtime and speedup of layer-parallel gradient evaluation over layer-serial propagation.}\n \label{tab:MGRITspeedup}\n\end{table}\n}\n\nA strong scaling study is presented in Figure \ref{fig:strongscaling} for various numbers of layers. Here, we keep the problem sizes fixed and measure the time-to-solution for one gradient evaluation with MGRIT for increasing numbers of computational resources. It shows good strong scaling behavior for all test cases, independent of the number of layers. The cross-over point where the layer-parallel MGRIT approach shows speedup over the layer-serial propagation is around $16$ cores for all cases. \n\begin{figure}\n\t\center \n\t\begin{subfigure}{.49\textwidth}\n\t\t\center \n\t\t\includegraphics[width=\textwidth]{figures/peaks/xbraid/strongscaling_standalone.pdf}\n\t\t\caption{Peaks example}\n\t\end{subfigure}\n\t\begin{subfigure}{.49\textwidth}\n\t\t\center \n\t\t\includegraphics[width=\textwidth]{figures/hyperspectral/strongscaling_standalone.pdf}\n\t\t\caption{Indian Pines}\n\t\end{subfigure}\n\t\caption{Strong scaling study for a layer-parallel gradient evaluation for various problem sizes from $N=256$ to $N=2048$ layers. {Corresponding serial runtimes are indicated by horizontal dashed lines.} The cross-over point where the layer-parallel approach yields speedup over the layer-serial propagation lies around $16$ cores.\protect\footnotemark}\n\t\label{fig:strongscaling}\n\end{figure}\n\n\footnotetext{The corresponding figure for the MNIST test case shows the same quantitative behavior, and has hence been omitted here.}\n\n\n\n\subsection{Simultaneous Layer-Parallel Training Validation}\n\label{sec:numerics:oneshot}\nNext, we investigate the simultaneous layer-parallel training, using $m_1 = m_2 = 2$ layer-parallel MGRIT iterations in each outer training iteration (see Algorithm \ref{alg:oneshot}). 
{The Hessian approximations $B_{\bftheta}, B_{\bfW}, B_{\bfmu}$ are computed by successive limited-memory BFGS updates based on the current gradient $\nabla_{(\bftheta, \bfW, \bfmu)}J$.}\nWe compare runtimes of the simultaneous layer-parallel training with a conventional layer-serial training approach, while choosing the same Hessian, as well as the same initial network parameters for both approaches. However, we tune the optimization hyper-parameters (such as regularization parameters, stepsize selection, etc.) separately for both schemes, in order to find the best setting for either approach that reaches a prescribed validation accuracy with the fewest iterations and minimum runtime. \n\nFor the Peaks example, we train a network with $N=1024$ layers distributed onto $256$ compute cores, and for the Indian Pines data set and the MNIST case we choose $N=512$ layers distributed onto $128$ compute cores, giving $4$ layers per processor in all cases. \n Figure \ref{fig:oneshot} plots the training history over iteration counts (top) as well as runtime (bottom). We verify from the top figures that both approaches reach comparable performance in terms of training result (optimization iteration counts, training loss and validation accuracy). Hence, reducing the accuracy of the inner multigrid iterations for solving the state and adjoint equations within a simultaneous training framework does not deteriorate the training behavior.\n However, each iteration of the simultaneous layer-parallel approach is much faster than for the layer-serial approach due to the layer-parallelization and the reduced state and adjoint accuracy. Therefore, the overall runtime for reaching that same final training result is reduced drastically (bottom figures). 
\n Runtime speedups are reported in Table \\ref{tab:OSspeedup}.\n\n\n While these results have been computed for selected fixed $N$, it is expected that the speedup scales linearly with increasing numbers of layers, similar to the observation in Table~\\ref{tab:MGRITspeedup}.\n\n\n\\begin{figure}[ht]\n\t\\center\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/oneshot/OSvsRef_iter_standalone.pdf}\n\t\t\\caption{Peaks: Training over iteration counts}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/hyperspectral/OSvsRef_iter_standalone.pdf}\n\t\t\\caption{Indian Pines: Training over iteration counts}\n\t\\end{subfigure}\n\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/peaks/oneshot/OSvsRef_time_standalone.pdf}\n\t\t\\caption{Peaks: Training over time}\n\t\\end{subfigure}\n\t\\begin{subfigure}{0.49\\textwidth}\n\t\t\\center\n\t\t\\includegraphics[width=\\textwidth]{figures/hyperspectral/OSvsRef_time_standalone.pdf}\n\t\t\\caption{Indian Pines: Training over time}\n\t\\end{subfigure}\n\t\\caption{Training loss (solid lines) and validation accuracy (dashed lines) over training iterations (top) and compute time (bottom). For the layer-parallel training, each core processes $4$ layers. 
The simultaneous layer-parallel approach reaches training results comparable to a layer-serial approach within much less computational time.\protect\footnotemark}\n\t\label{fig:oneshot}\n\end{figure}\n\footnotetext{The corresponding figures for the MNIST test case show the same quantitative behavior, and have hence been omitted here.}\n\n\n\begin{table}[h]\n \center\n \begin{tabular}{@ { } lrrrrr @ { }}\n \toprule\n Test case & $N$ & $\#$Cores & Layer-serial & Layer-parallel & Speedup \\\n\t \midrule\n\t Peaks example & 1024 & 256 & 4096 sec & 683 sec & 6.0 \\\n\t Indian Pines & 512 & 128 & 2623 min & 597 min & 4.4 \\\n\t MNIST & 512 & 128 & 619 min & 71 min & 8.5 \\\n\n \bottomrule\n \end{tabular}\n \caption{Runtime speedup of simultaneous layer-parallel training over layer-serial training.}\n \label{tab:OSspeedup}\n\end{table}\n\n\n\n\n", "Descriptive_question1": "What is the speedup value for the MNIST test case with 2048 layers in table_2?", "Descriptive_question2": "How many cores were used for the Peaks test case with 1024 layers in table_2?", "Reasoning_question1": "How does the speedup trend correlate with the number of layers for the Indian Pines test case in table_2?", "Reasoning_question2": "Why might the MNIST test case show a higher speedup compared to the Peaks test case at 2048 layers in table_2?", "Descriptive_answer1": "16.0", "Descriptive_answer2": "256", "Reasoning_answer1": "Looking at the data for the Indian Pines test case in the table, I observe that as the number of layers increases from 256 to 2048, the speedup also increases significantly. Specifically, with 256 layers, the speedup is 2.0, and it rises to 3.3 at 512 layers, then to 6.1 at 1024 layers, and finally reaches 10.3 at 2048 layers. This pattern indicates a clear positive correlation between the number of layers and the speedup. 
The reasoning behind this trend is that the layer-parallel approach benefits more as the network depth grows, because parallelism can be distributed across more layers, reducing the runtime compared to serial processing, which scales linearly with the number of layers. Therefore, the speedup increases almost proportionally with the number of layers, showcasing the efficiency of the parallel method for deeper networks.", "Reasoning_answer2": "To understand why the MNIST test case shows a higher speedup of 16.0 compared to the Peaks test case speedup of 7.7 at 2048 layers, I first compare the runtime data in the table. For MNIST, the serial runtime at 2048 layers is 2193.5 seconds, dropping to 137.3 seconds in parallel, which is a substantial reduction. In contrast, for Peaks, the serial runtime is 13.9 seconds, reducing to 1.8 seconds in parallel, a much smaller absolute difference. This suggests that the MNIST test case has a higher baseline serial runtime, likely due to greater computational complexity or data size, as hinted by the context of handling 28x28 grayscale images with convolutional layers. The parallel approach thus offers more room for improvement in runtime reduction for MNIST. Additionally, the parallel runtime for MNIST does not decrease as dramatically beyond a certain point (e.g., from 104.0 seconds at 1024 layers to 137.3 seconds at 2048 layers), but the serial runtime continues to double, amplifying the speedup ratio. For Peaks, with simpler data (level set classification), the serial runtime is already low, limiting the relative gain from parallelism. Hence, the higher speedup in MNIST likely stems from its higher computational demand and the efficiency of parallel processing in handling such workloads over many cores (512 cores at 2048 layers)." 
}, { "paper_id": "2007.05827.json", "table_id": "table_1", "table_content": "\\begin{table}[!h]\n\\centering\n\\caption{Plasma and field parameters for diamagnetic cavity event on 30 July 2015} \n\\begin{tabular}{p{6cm}p{2cm}p{2cm}}\n\\hline \n$r_{comet}$ (km) & \\multicolumn{2}{l}{179.5} \\\\ \n$D_{sun}$ (AU) & \\multicolumn{2}{l}{1.24} \\\\ \nNeutral density (cm\\textsuperscript{-3}) & \\multicolumn{2}{l}{$6.7\\times10^{7}$} \\\\ \nLatitude & \\multicolumn{2}{l}{-48} \\\\\nCone angle & \\multicolumn{2}{l}{149.3} \\\\\nB (nT) & \\multicolumn{2}{l}{38.8} \\\\\nDuration & \\multicolumn{2}{l}{00:10:55 (11:00:51 - 11:11:41 UTC)} \\\\\nEnergy range of reduced flux (eV) & \\multicolumn{2}{l}{56.1 - 358} \\\\\nEnergy of max. flux difference (eV) & \\multicolumn{2}{l}{74.4} \\\\\n & Inside & Outside \\\\\n LAP bulk electrons density (cm\\textsuperscript{-3}) & 997.3 & 1164.8 \\\\\n\\hline \n\\end{tabular}\n\\label{table:1}\n\\end{table}", "caption": "Plasma and field parameters for diamagnetic cavity event on 30 July 2015", "label": "table:1", "section_info": "3 Observations\n\\section{Observations}\nThe reduced electron flux inside diamagnetic regions has been discussed in a couple of studies \\citep{madanian_plasma_2016,nemeth_charged_2016,timar_modelling_2017}. Figure \\ref{figexhibeves} shows two examples of IES electron spectra inside and outside diamagnetic regions, each exhibiting a decreased flux over different energies. The top-left panel shows an event on 30 July 2015 and the top-right panel shows an event few days later on 3 August 2015. The ordinate axis in these plots represents the differential electron flux integrated over the entire IES FOV ($2.8\\pi$ solid angle). The red (blue) lines on the top panels show the time averaged spectra inside (outside) the diamagnetic cavities. The period inside the diamagnetic cavity on 30 July is from 11:00:52 to 11:11:40 UTC and on 3 August from 17:20:42 to 17:28:03 UTC. 
The outside periods for 30 July and 3 August are selected between 10:52:26 -- 11:00:46 UTC and 17:12:16 -- 17:20:36 UTC, respectively. The horizontal green and purple lines on panel (a) are shown as references to highlight energy ranges of dominant cometary and solar wind electrons \\citep{madanian_plasma_2016}. The vertical dashed lines show the energy range in which a flux difference is observed (lower and upper energies). \n\nOn 30 July, the flux of electrons in the $\\sim60 - 350$ eV range inside the diamagnetic region has decreased by variable amounts. This energy range extends to around 900 eV on 3 August. A characteristic energy indicated by 'Max. drop' at around 175 eV on panel (b) is the energy at which the highest flux difference is observed. This energy for the event on 30 July is around 74 eV. Panel (c) in Figure \\ref{figexhibeves} shows the Rosetta spacecraft trajectory around the comet between 30 July and 6 August 2015. The colorbar represents the time. The reference frame in this plot is the dynamic body-Centered Solar Equatorial (CSEQ) frame in which the $+x$ axis is toward the Sun, the $+z$ axis is aligned with the projection of the solar rotation axis on a plane perpendicular to the $x$ axis, and the $y$ axis completes the right-hand coordinate system. The frame's origin is the comet's center of mass (shown with a black dot). Rosetta was at around 180 km from the comet on 30 July, and it gradually moved to a distance of 250 km north-east of the comet on 6 August. The spacecraft speed with respect to the comet was a few meters per second. We will discuss the latitudinal dependence of several variables. The latitude is measured in the ESA/RMOC shape frame (also known as the landmark coordinates) illustrated in the surface map of the comet in panel (d).
Colors represent different longitudes, while latitudes are annotated on the map.\n\nThe flux difference across the diamagnetic boundaries creates an energy density difference between inside and outside plasmas. As seen in Figure \\ref{figexhibeves}, the energy range of flux difference varies for different diamagnetic events, and this variability has not been studied so far. In Section 3.1 we provide a statistical analysis of a subset of diamagnetic events and in Section 3.2 a detailed case study for one of these events is presented.\n\n\\subsection{Analysis of Suprathermal Electron Flux Difference across Diamagnetic Boundaries}\n\nWe use a subset of diamagnetic events reported in \\citet{goetz_first_2016} and limit our study to July and August of 2015, when comet activity was relatively high and the majority of diamagnetic events were observed. With the IES measurement cycle in mind, we down-selected events lasting longer than 256 s and with at least 512 s separation from another event on at least one side. These criteria ensure that at least one full IES measurement cycle exists inside the diamagnetic region and that the outside measurements are not contaminated by shorter events. This brought down the number of events from a total of 313 to 62 events. For the list of events see Section \\ref{sec:data}. We used an algorithm to search and compare the IES energy spectra inside and outside each event and record energy bins with reduced electron fluxes. For 31 events we had the option to choose the outside spectrum from the trailing or the leading side. For these cases measurements from the side with the higher magnetic field strength were selected. For events that showed multiple drops corresponding to multiple energy ranges (i.e., the inside spectrum would drop below the outside spectrum multiple times due to similar overlapping spectra), the widest energy range was recorded.
We present our observations in the context of total energy flux difference, $\\Delta\\psi$, across boundaries as seen in IES electron spectra and defined by:\n\\begin{equation}\n\\Delta\\psi = \\sum_{k=E_{lower}}^{E_{upper}} (\\psi(E_k)_{out} - \\psi(E_k)_{in})\\times E_k\n\\end{equation}\n\n \\begin{figure}[H]\n \\centering\n \\includegraphics[width=0.9\\linewidth]{Figure2_multipanel_statistics_rev3.pdf}\n \\caption{Distributions of plasma parameters for 62 diamagnetic events included in the study. The ordinate axis in all panels shows the parameter $\\Delta\\psi$. Panels (a - c) show the distributions of cometocentric distance, local neutral density, and bulk electron density, respectively. Data in these panels are also color coded by the cometary latitude. Panels (d - f) show distributions of cone angle, magnetic field strength, and event duration. Panels (g - i) show the distributions of upper energy limit of flux variations, energy of the highest flux difference, and bulk electron density difference. The red star marks the event on 30 July 2015 which is considered further in Section \\ref{sec:30julyeve}.}\n\\label{figstatmultipanel}\n\\end{figure}\n\n\\noindent where $\\psi(E_k)$ is the integrated differential electron flux over the IES FOV at energy $E_k$. $\\Delta\\psi$ distributions against several other parameters are presented in Figure \\ref{figstatmultipanel}. The first row in this figure shows $\\Delta\\psi$ as a function of cometocentric distance, neutral density, and bulk electron density $N_e$ measured by the LAP instrument, respectively. These panels are also color coded based on the cometary latitude at each event (see Figure \\ref{figexhibeves}, panel (d)). As shown in panel (a), observations are mostly within 300 km from the comet and neutral densities vary between $5\\times10^6 - 10^{8}$ cm\\textsuperscript{-3}. The neutral densities also show a clear latitudinal dependence.
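As a concrete illustration, the flux-difference sum defined above can be evaluated directly from two binned spectra. The following is a minimal sketch, not part of the original analysis pipeline; the function name, array layout, and the small example energy grid are all assumptions for illustration:

```python
import numpy as np

def delta_psi(energies, psi_out, psi_in, e_lower, e_upper):
    """Total energy flux difference across a diamagnetic boundary:
    sum of (psi_out - psi_in) * E_k over energy bins in [e_lower, e_upper]."""
    e = np.asarray(energies, dtype=float)
    mask = (e >= e_lower) & (e <= e_upper)
    diff = np.asarray(psi_out, dtype=float) - np.asarray(psi_in, dtype=float)
    return float(np.sum(diff[mask] * e[mask]))

# Hypothetical FOV-integrated fluxes on a 4-bin energy grid (eV):
energies = [50.0, 100.0, 200.0, 400.0]
psi_outside = [2.0, 3.0, 4.0, 5.0]
psi_inside = [1.0, 2.0, 3.0, 5.0]
print(delta_psi(energies, psi_outside, psi_inside, 60.0, 350.0))  # 300.0
```

Only the 100 eV and 200 eV bins fall inside the chosen energy window, so the sum reduces to $(3-2)\times100 + (4-3)\times200 = 300$ in these made-up flux units.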
Data in panel (b) shows that the comet is significantly more active in the southern hemisphere \\citep{hansen_evolution_2016,hassig_time_2015,lauter_surface_2018}. During the perihelion passage, the southern hemisphere of the comet receives higher insolation and this period is in the midst of the southern summer in cometary seasons \\citep{keller_insolation_2015}.\n\nPanel (c) shows LAP electron densities measured at the beginning of each diamagnetic crossing. The LAP densities are clustered around 200 cm\\textsuperscript{-3} and 1000 cm\\textsuperscript{-3}. The higher density cluster, corresponding to events over the southern latitudes where the comet activity is higher, shows more variations. The lower density events around 200 cm\\textsuperscript{-3} are more contained and show less variation. A few points that exhibit the highest $\\Delta\\psi$ are within this group. Since neutral densities show a gradual increase with decreasing latitude, one would expect to see a gradual increase in LAP electron densities at lower latitudes. However, there is a distinct separation in electron densities measured in the southern versus northern latitudes. This may reflect that the bulk radial plasma velocity is higher on the less active side. A reason for this could be that ion-neutral collisions, on the less active side, occur less frequently and thus are less efficient in hampering ion-acceleration along an ambipolar electric field (e.g., \\citet{vigren_1d_2017}). In addition, in a simplified view, an equally pronounced outward radial acceleration on the more active side would conflict with momentum conservation. Most events in the southern hemisphere occurred between 26 July and 3 August, when most of the long-lasting diamagnetic events have been observed. \n\nPanels (d - f) show, respectively, distributions of the magnetic field cone angle, magnetic field strength, and event duration.
The cone angle defines the angle between the magnetic field vector and the comet-Sun line. The distribution in panel (d) shows events grouped around $30^{\\circ}$ and $150^{\\circ}$ cone angles, which is expected for observations near perihelion, as significant magnetic field draping exists and the spacecraft resides mostly in the terminator plane at this time. Correlation between electron number flux and magnetic field magnitude slightly increases at higher energies when all measurements at perihelion are included \\citep{madanian_plasma_2016}, but the $\\Delta\\psi$ distribution in panel (e) exhibits no or a very weak correlation with the magnetic field strength. Event durations varied between 257 seconds and 32 minutes. The longest event, which also shows the highest $\\Delta\\psi$, is on 7 July 2015 at 09:44:22 UTC. The outside spectrum is selected from the trailing side of that event, and the flux difference extends up to 733 eV. \n\nThe third row in Figure \\ref{figstatmultipanel} shows $\\Delta\\psi$, respectively, versus the upper energy limit of flux difference, energy of the highest flux difference, and relative difference in bulk electron density between inside and outside plasmas, $dN_e = (N_{e_{out}} - N_{e_{in}})/N_{e_{out}}$. The histogram in the background of panel (g) shows the occurrence rate of the upper energy limit. Although upper limits spread across many energies, the distribution suggests that the flux decrease stops at certain energies more often. The first peak in the histogram at the 350--400 eV bin is the most dominant and includes 17 events. Flux difference for nine events extends up to 650--700 eV (the second highest peak). We will revisit this point in Section 3.2. $\\Delta\\psi$ decreases when the most affected electrons are at higher energies, which can be observed in panel (h).
In addition, panel (i) shows that for most events the bulk electron density inside the diamagnetic region decreases, confirming previous findings using MIP data \\citep{henri_diamagnetic_2017}, though this decrease shows no apparent relation with IES flux differences. Suprathermal electrons at 100 or 200 eV travel through the plasma at speeds significantly faster than bulk electrons. Their flux variability occurs on time scales very different from the bulk plasma variation observed inside diamagnetic regions \\citep{hajra_dynamic_2018}.\n\n\\subsection{Suprathermal Electron PADs Case Study for the Event on 30 July 2015, 11:00:51 UTC} \\label{sec:30julyeve}\nThe diamagnetic cavity event that we consider in this section was shown in Figure \\ref{figexhibeves} panel (a). It is observed at negative latitudes and is one of the 17 events for which the flux difference extends to $\\sim$350 eV (see panel (g) in Figure \\ref{figstatmultipanel}). Table \\ref{table:1} lists plasma and field parameters around this event.\n\n\\begin{table}[!h]\n\\centering\n\\caption{Plasma and field parameters for diamagnetic cavity event on 30 July 2015} \n\\begin{tabular}{p{6cm}p{2cm}p{2cm}}\n\\hline \n$r_{comet}$ (km) & \\multicolumn{2}{l}{179.5} \\\\ \n$D_{sun}$ (AU) & \\multicolumn{2}{l}{1.24} \\\\ \nNeutral density (cm\\textsuperscript{-3}) & \\multicolumn{2}{l}{$6.7\\times10^{7}$} \\\\ \nLatitude & \\multicolumn{2}{l}{-48} \\\\\nCone angle & \\multicolumn{2}{l}{149.3} \\\\\nB (nT) & \\multicolumn{2}{l}{38.8} \\\\\nDuration & \\multicolumn{2}{l}{00:10:55 (11:00:51 - 11:11:41 UTC)} \\\\\nEnergy range of reduced flux (eV) & \\multicolumn{2}{l}{56.1 - 358} \\\\\nEnergy of max.
flux difference (eV) & \\multicolumn{2}{l}{74.4} \\\\\n & Inside & Outside \\\\\n LAP bulk electron density (cm\\textsuperscript{-3}) & 997.3 & 1164.8 \\\\\n\\hline \n\\end{tabular}\n\\label{table:1}\n\\end{table}\nTo better understand the nature of the reduced fluxes during the transition into the diamagnetic region, we examine the 3D spatial distributions of high-energy suprathermal electrons. Figure \\ref{figpolarplotsJim} shows 2D cuts of electron distribution variations in the IES FOV for four timestamps before the diamagnetic event on 30 July 2015. Panel (a) shows the differential electron flux for IES anodes (labeled 0-15) averaged around the central elevation plane at the first timestamp and is labeled as the \"reference\" distribution. The colors are in logarithmic scale and energies between 100 eV and 5 keV are shown. Panels (b - d) show the flux ratios in the next three timestamps (all still outside the cavity) as compared to the reference distribution. The disconnection at 3 o'clock on these panels is an artefact of the plotting software.\n\nRelative enhancements (red segments) are observed in anodes 0, 6, 8, and 12 of panels (b), (c), and (d); while decreases (blue segments) occur in anodes 2, 14, and 15 of panels (b) and (d) and in anodes 4, 6, and 12 of panel (c). From this figure we notice directional changes for electrons at different energies close to the diamagnetic cavity. It is important to consider these changes in the electron trajectory with respect to the magnetic field. To better analyze these spatial changes, we analyze the electron pitch angle distributions. \n\n\\begin{figure}[H]\n \\centering\n \\includegraphics[width=0.95\\textwidth]{Figure3_Rev3_Polar_plots.pdf}\n \\caption{2D cuts of the IES FOV showing electron differential flux variations in four timestamps between 10:46 and 11:00 UTC before the diamagnetic cavity crossing on 30 July 2015. Panel (a) shows the electron differential flux at the first timestamp.
Panels (b - d) show the corresponding flux ratios with respect to the distribution in panel (a).}\n \\label{figpolarplotsJim}\n \\end{figure}\n\nWe should note that electron PAD is not an official data product of the IES instrument. A few factors may complicate the derivation of PADs and limit our ability to interpret them: (1) the low time resolution of IES data does not allow resolving plasma effects such as wave-particle interactions in the distributions; (2) the IES FOV does not cover the full sky, and if the magnetic field points toward the gaps in the FOV (i.e., the instrument symmetry axis), part of the distribution will be lost; and (3) IES onboard averaging can reduce the resolution of the derived PADs. It is not our intention to study fine timescale effects on electrons; rather, we are looking at effects of changing magnetic field topology, and our results show that PADs at the current resolution can provide valuable information about those effects. We inspected the IES FOV for pitch angle coverage and ensured that the magnetic field direction during this event is favorable for PAD analysis.\n\nThe IES time resolution for a full cycle in the current mode is 256 s, resulting in a 2 s sampling time per energy bin. At each energy step, the deflector plates are biased in a see-saw fashion to conserve power and reduce sweep time. We track the time at which different energies and sectors were scanned within a cycle and update the magnetic field vector accordingly before calculating the pitch angles. An array consisting of 12 bins, each $15^{\\circ}$ wide, is used to sort fluxes into the pitch angle space. To account for straddling of sectors that covered more than one pitch angle bin, sector flux is distributed across all overlapping bins and the final PADs are normalized by the sampling rate at each bin. \n\nThe event on 30 July 2015 at 11:00:51 UTC meets our selection criteria.
Specifically, we searched for periods of gradual changes in magnetic field strength over a few consecutive IES timestamps, where high amplitude magnetic field fluctuations were relatively low, as they can modulate the distribution faster than the IES can record and therefore cannot be studied. For the event studied in this section, although we do not observe the typical signatures of ultra-low frequency (ULF) waves, or circularly polarized whistler waves (see panel (a) of Figure \\ref{figPADtimeseries}), we have to assume that wave-particle interactions are negligible.\n\nFigure \\ref{figPADtimeseries} shows an overview of magnetic field data and electron PADs across this event. The top panel in this figure shows the magnetic field components and magnitude in the CSEQ coordinates. The diamagnetic cavity event is identified between 11:00:51 and 11:11:41 UTC. The cone angle ($\\theta_{cone}$) is shown in panel (b). The spectrogram in panel (c) shows the FOV integrated differential electron flux (cm\\textsuperscript{2} s eV)\\textsuperscript{-1} as a function of energy in the 200-1000 eV range. Flux reductions inside the cavity for this event were previously illustrated in panel (a) of Figure \\ref{figexhibeves}, and can also be identified in panel (c). Panels (d - h) show the electron PAD time series at different energies normalized by the maximum flux value in each panel. The distributions have been averaged over consecutive energy bins to improve the counting statistics. The energy ranges are specified in the parentheses. All colorbars are in logarithmic scale. The white lines overplotted on these panels are contours of constant magnetic moment, $\\mu_m = W_{\\perp} / |B|$, where $|B|$ is the magnetic field magnitude and $W_{\\perp} = 1/2 \\mbox{ } m_e V_{\\perp}^2$ is the perpendicular energy of electrons. 
The pitch angle distributions and contours inside the cavity have no physical meaning.\n\n\n\\begin{figure}[H]\n \\centering\n \\includegraphics[width=0.9\\linewidth]{Fig4_July30_ts_rev3.pdf}\n \\caption{Magnetic field and electron distribution time series around the diamagnetic cavity on 30 July 2015. The field-free cavity is observed between 11:00:51 and 11:11:41 UTC and is marked with a grey box. Panel (a) shows magnetic field components and magnitude in CSEQ coordinates, panel (b) shows the magnetic field cone angle, panel (c) is the differential electron flux spectrogram in units of $\\log_{10}$(cm\\textsuperscript{2} s eV)\\textsuperscript{-1}, and panels (d - h) show electron pitch angle distributions in five different energy ranges. The fluxes are normalized by the maximum flux value in each panel. The white lines on these panels are the contours of the constant adiabatic invariant. The vertical dashed-dotted lines mark four IES timestamps before the onset of the diamagnetic cavity.}\n\\label{figPADtimeseries}\n\\end{figure}\n\nBetween 10:53:00 and 11:00:00 UTC, the magnetic field shows, on average, a gradual decrease in the field strength. There are perturbations due to the turbulent plasma environment. The $B_x$ component is shown with the blue color in panel (a) of Figure \\ref{figPADtimeseries}, and is highly negative throughout this period. In fact, most of the variations in the magnetic field strength originate from the $B_x$ component while the two other components are relatively quiet. Close to the diamagnetic region the $y$ component of the field becomes dominant and shows a continuous decline. The magnetic field direction changes from anti-sunward (cone angle $\\sim180^{\\circ}$) to a direction perpendicular to the comet-Sun line (cone angle $\\sim90^{\\circ}$).
This period corresponds to four IES timestamps identified by vertical dashed-dotted lines drawn across all panels and labeled by $t_1$, $t_2$, $t_3$, and $t_4$.\n\nAt 10:45 UTC electrons show a fairly scattered distribution occupying most of the pitch angle bins with similar intensities, except for the distributions in panels (f) and (g). In the next four timestamps, flux reductions around $90^{\\circ}$ pitch angles are observed and accompanied by increased fluxes in directions parallel ($0^{\\circ}$) and anti-parallel ($180^{\\circ}$) to the magnetic field. This is indicative of a changing distribution from isotropic to field-aligned. The effect is particularly evident for $151 - 293$ eV electrons, while $306-358$ eV electrons exhibit this change in the last two timestamps before the cavity. The redistributed electrons seem to follow along the white contours of the first adiabatic invariant. In contrast, the distributions of $375-440$ eV electrons in panel (h) show a different pattern. With the exception of timestamp ($t_3$) where an enhanced anti-parallel flux is observed, distributions are relatively disordered and chaotic with respect to the onset of the diamagnetic cavity, or the adiabatic invariant contours. This implies that the first adiabatic invariant is only conserved up to a certain energy. \n\nIn Figure \\ref{figlineplots} we examine these spectra in a more quantitative way. In panels (a - d), differential electron flux at selected energies are plotted versus pitch angle. Each panel corresponds to an IES timestamp marked with vertical, dashed lines in Figure \\ref{figPADtimeseries}. The corresponding energies are annotated in panel (a), and error bars reflect the uncertainty due to the counting statistics. Error estimates for most data points are reasonably low and for a few points are larger. 
Larger error bars do not necessarily indicate that the observation must be discarded, but rather more measurements are needed to improve the confidence on the observation.\n\nThe 99 eV electrons have a maximum in the parallel direction until 10:52 UTC. In the next timestamp, the anti-parallel flux increases while the fluxes in $\\sim0-80^{\\circ}$ pitch angles decrease. Given that between 10:52 and 10:56 UTC the magnetic field vector rotates to mostly $-y$ direction, changes in 99 eV PADs indicate that a large flux of these electrons travel anti-parallel to the magnetic field and away from the comet. The 99 eV line also shows a sharp peak at around $50^{\\circ}$ at 11:00 UTC. The 185 eV electron distribution in panel (a) shows a rapid fall in the last pitch angle bin. This is most likely due to the low count rate in that bin, as is evident by the larger error bars. The 185 eV distribution changes into a bidirectional, field-aligned pattern in the next three timestamps. Similarly, the 202 eV (cyan) and 250 eV (green) electrons start roughly isotropic and evolve into double-peak bidirectional distributions, while $90^{\\circ}$ electrons become depleted. The net flux in these distributions remains almost the same from one timestamp to the next. In other words, the enhancements in pitch angles near $0^{\\circ}$ and $180^{\\circ}$ (as seen in panels (c) and (d)) are compensated by the depletions in $\\sim90^{\\circ}$ bins. The 358 and 396 eV lines, although moderately changing in time, do not show any depletion in perpendicular flux. The energies we discussed here are sensitive traces of the magnetic field topology. 
The depletion of $\\sim90^{\\circ}$ pitch angle electrons is consistent with adiabatic transport of electrons in sharply decreasing magnetic fields.\n\n\\begin{figure}[H]\n\\centering\n\\includegraphics[width=0.9\\linewidth]{Fig5_lineplot_newlabels.pdf}\n\\caption{Differential electron flux at selected energies (different colors) versus pitch angle at four timestamps prior to the cavity encounter on 30 July 2015. Energies are annotated in panel (a). Each panel correspond to a timestamp identified by vertical dashed-dotted lines in Figure \\ref{figPADtimeseries}.}\n\\label{figlineplots}\n\\end{figure}\n\nSpectra in timestamp $t_3$ show noticeably more depletion around $90^{\\circ}$ pitch angles than the neighboring timestamps, but the net fluxes are still higher than those inside the cavity. The origin of this behavior has not been clearly identified at this time.\n\n3.2 Suprathermal Electron PADs Case Study for the Event on 30 July 2015, 11:00:51 UTC\n\\subsection{Suprathermal Electron PADs Case Study for the Event on 30 July 2015, 11:00:51 UTC} \\label{sec:30julyeve}\nThe diamagnetic cavity event that we consider in this section was shown in Figure \\ref{figexhibeves} panel (a). It is observed at negative latitudes and is one of the 17 events for which flux difference extends to $\\sim$350 eV (see panel (g) in Figure \\ref{figstatmultipanel}). 
Table \\ref{table:1} lists plasma and field parameters around this event.\n\n\\begin{table}[!h]\n\\centering\n\\caption{Plasma and field parameters for diamagnetic cavity event on 30 July 2015} \n\\begin{tabular}{p{6cm}p{2cm}p{2cm}}\n\\hline \n$r_{comet}$ (km) & \\multicolumn{2}{l}{179.5} \\\\ \n$D_{sun}$ (AU) & \\multicolumn{2}{l}{1.24} \\\\ \nNeutral density (cm\\textsuperscript{-3}) & \\multicolumn{2}{l}{$6.7\\times10^{7}$} \\\\ \nLatitude & \\multicolumn{2}{l}{-48} \\\\\nCone angle & \\multicolumn{2}{l}{149.3} \\\\\nB (nT) & \\multicolumn{2}{l}{38.8} \\\\\nDuration & \\multicolumn{2}{l}{00:10:55 (11:00:51 - 11:11:41 UTC)} \\\\\nEnergy range of reduced flux (eV) & \\multicolumn{2}{l}{56.1 - 358} \\\\\nEnergy of max. flux difference (eV) & \\multicolumn{2}{l}{74.4} \\\\\n & Inside & Outside \\\\\n LAP bulk electrons density (cm\\textsuperscript{-3}) & 997.3 & 1164.8 \\\\\n\\hline \n\\end{tabular}\n\\label{table:1}\n\\end{table}\nTo better understand the nature of the reduced fluxes during the transition into the diamagnetic region we examine the 3D spatial distributions of high energy suprathermal electrons. Figure \\ref{figpolarplotsJim} shows 2D cuts of electron distribution variations in the IES FOV for four timestamps before the diamagnetic event on 30 July 2015. Panel (a) shows the differential electron flux for IES anodes (labeled 0-15) averaged around the central elevation plane at the first timestamp and is labeled as the \"reference\" distribution. The colors are in logarithmic scale and energies between 100 eV and 5 keV are shown. Panels (b - d) show the flux ratios in the next three timestamps (all still outside the cavity) as compared to the reference distribution. 
The disconnection at 3 o\\textsc{\\char13}clock on these panels is an artefact of the plotting software.\n\nRelative enhancements (red segments) are observed in anodes 0, 6, 8, and 12 of panels (b), (c), and (d); while decreases (blue segments) occur in anodes 2, 14, and 15 of panels (b) and (d) and in anodes 4, 6, 12 of panel (c). From this figure we notice directional changes for electrons at different energies close to the diamagnetic cavity. It is important to consider these changes in the electron trajectory with respect to the magnetic field. To better analyze these spatial changes, we analyze the electron pitch angle distributions. \n\n\\begin{figure}[H]\n \\centering\n \\includegraphics[width=0.95\\textwidth]{Figure3_Rev3_Polar_plots.pdf}\n \\caption{2D cuts of the IES FOV showing electron differential flux variations in four timestamps between 10:46 and 11:00 UTC before the diamagnetic cavity crossing on 30 July 2015. Panel (a) shows the electron differential flux at the first timestamp. Panels (b - d) show the corresponding flux ratios with respect to the distribution in panel (a).}\n \\label{figpolarplotsJim}\n \\end{figure}\n\nWe should note that electron PAD is not an official data product of the IES instrument. Few factors that may complicate derivation of PADs and limit our ability to interpret them include, (1) low time resolution in IES data does not allow to resolve plasma effects such as wave-particle interactions in the distributions, (2) IES FOV does not cover the full sky and if the magnetic field points toward these gaps in the FOV (i.e. instrument symmetry axis,) part of the distribution will be lost, and (3) IES onboard averaging can reduce the resolution of the derived PADs. It is not our intention to study fine timescale effects on electrons, but rather we are looking at effects of changing magnetic field topology and our results prove that PADs at the current resolution can provide valuable information about those effects. 
We inspected the IES FOV for pitch angle coverage and ensured that the magnetic field direction during this event is favorable for PAD analysis.\n\nThe IES time resolution for a full cycle in the current mode is 256 s, resulting in a 2 s sampling time per energy bin. At each energy step, the deflector plates are biased in a see-saw fashion to conserve power and reduce sweep time. We track the time at which different energies and sectors were scanned within a cycle and update the magnetic field vector accordingly before calculating the pitch angles. An array consisting of 12 bins, each $15^{\\circ}$ wide, is used to sort fluxes into the pitch angle space. To account for straddling of sectors that covered more than one pitch angle bin, sector flux is distributed across all overlapping bins and the final PADs are normalized by the sampling rate at each bin. \n\nThe event on 30 July 2015 at 11:00:51 UTC meets our selection criteria. Specifically, we searched for periods of gradual changes in magnetic field strength over a few consecutive IES timestamps, where high amplitude magnetic field fluctuations were relatively low, as they can modulate the distribution faster than the IES can record and therefore cannot be studied. For the event studied in this section, although we do not observe the typical signatures of ultra-low frequency (ULF) waves, or circularly polarized whistler waves (see panel (a) of Figure \\ref{figPADtimeseries}), we have to assume that wave-particle interactions are negligible.\n\nFigure \\ref{figPADtimeseries} shows an overview of magnetic field data and electron PADs across this event. The top panel in this figure shows the magnetic field components and magnitude in the CSEQ coordinates. The diamagnetic cavity event is identified between 11:00:51 and 11:11:41 UTC. The cone angle ($\\theta_{cone}$) is shown in panel (b). 
The spectrogram in panel (c) shows the FOV integrated differential electron flux (cm\\textsuperscript{2} s eV)\\textsuperscript{-1} as a function of energy in the 200-1000 eV range. Flux reductions inside the cavity for this event were previously illustrated in panel (a) of Figure \\ref{figexhibeves}, and can also be identified in panel (c). Panels (d - h) show the electron PAD time series at different energies normalized by the maximum flux value in each panel. The distributions have been averaged over consecutive energy bins to improve the counting statistics. The energy ranges are specified in the parentheses. All colorbars are in logarithmic scale. The white lines overplotted on these panels are contours of constant magnetic moment, $\\mu_m = W_{\\perp} / |B|$, where $|B|$ is the magnetic field magnitude and $W_{\\perp} = 1/2 \\mbox{ } m_e V_{\\perp}^2$ is the perpendicular energy of electrons. The pitch angle distributions and contours inside the cavity have no physical meaning.\n\n\n\\begin{figure}[H]\n \\centering\n \\includegraphics[width=0.9\\linewidth]{Fig4_July30_ts_rev3.pdf}\n \\caption{Magnetic field and electron distribution time series around the diamagnetic cavity on 30 July 2015. The field free cavity is observed between 11:00:51 and 11:11:41 UTC and is marked with a grey box. Panel (a) shows magnetic field components and magnitude in CSEQ coordinates, panel (b) shows the magnetic field cone angle, panel (c) is the differential electron flux spectrogram in units of $\\log_{10}$(cm\\textsuperscript{2} s eV)\\textsuperscript{-1}, and panels (d - h) show electron pitch angle distributions in five different energy ranges. The fluxes are normalized by the maximum flux value in each panel. The white lines on these panels are the contours of the constant adiabatic invariant. 
The vertical dashed-dotted lines mark four IES timestamps before the onset of the diamagnetic cavity.}\n\\label{figPADtimeseries}\n\\end{figure}\n\nBetween 10:53:00 and 11:00:00 UTC, the magnetic field shows, on average, a gradual decrease in strength, with perturbations due to the turbulent plasma environment. The $B_x$ component, shown in blue in panel (a) of Figure \\ref{figPADtimeseries}, is highly negative throughout this period. In fact, most of the variations in the magnetic field strength originate from the $B_x$ component, while the two other components are relatively quiet. Close to the diamagnetic region the $y$ component of the field becomes dominant and shows a continuous decline. The magnetic field direction changes from anti-sunward (cone angle $\\sim180^{\\circ}$) to a direction perpendicular to the comet-Sun line (cone angle $\\sim90^{\\circ}$). This period corresponds to four IES timestamps identified by vertical dashed-dotted lines drawn across all panels and labeled $t_1$, $t_2$, $t_3$, and $t_4$.\n\nAt 10:45 UTC electrons show a fairly scattered distribution occupying most of the pitch angle bins with similar intensities, except for the distributions in panels (f) and (g). In the next four timestamps, flux reductions around $90^{\\circ}$ pitch angles are observed, accompanied by increased fluxes in directions parallel ($0^{\\circ}$) and anti-parallel ($180^{\\circ}$) to the magnetic field. This is indicative of a distribution changing from isotropic to field-aligned. The effect is particularly evident for $151 - 293$ eV electrons, while $306-358$ eV electrons exhibit this change in the last two timestamps before the cavity. The redistributed electrons seem to follow the white contours of the first adiabatic invariant. In contrast, the distributions of $375-440$ eV electrons in panel (h) show a different pattern. 
With the exception of timestamp $t_3$, where an enhanced anti-parallel flux is observed, the distributions are relatively disordered with respect to both the onset of the diamagnetic cavity and the adiabatic invariant contours. This implies that the first adiabatic invariant is only conserved up to a certain energy. \n\nIn Figure \\ref{figlineplots} we examine these spectra in a more quantitative way. In panels (a - d), differential electron fluxes at selected energies are plotted versus pitch angle. Each panel corresponds to an IES timestamp marked with vertical dashed-dotted lines in Figure \\ref{figPADtimeseries}. The corresponding energies are annotated in panel (a), and error bars reflect the uncertainty due to counting statistics. Error estimates for most data points are reasonably low, though for a few points they are larger. Larger error bars do not necessarily indicate that an observation must be discarded, but rather that more measurements are needed to improve confidence in it.\n\nThe 99 eV electrons have a maximum in the parallel direction until 10:52 UTC. In the next timestamp, the anti-parallel flux increases while the fluxes at $\\sim0-80^{\\circ}$ pitch angles decrease. Given that between 10:52 and 10:56 UTC the magnetic field vector rotates to a mostly $-y$ direction, the changes in the 99 eV PADs indicate that a large flux of these electrons travels anti-parallel to the magnetic field and away from the comet. The 99 eV line also shows a sharp peak at around $50^{\\circ}$ at 11:00 UTC. The 185 eV electron distribution in panel (a) shows a rapid fall in the last pitch angle bin. This is most likely due to the low count rate in that bin, as is evident from the larger error bars. The 185 eV distribution changes into a bidirectional, field-aligned pattern in the next three timestamps. 
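The adiabatic interpretation can be made concrete with a short calculation. If the first invariant $\mu_m = W_{\perp}/|B|$ is conserved and the electron kinetic energy is unchanged, then $\sin^2\alpha_2 = (B_2/B_1)\sin^2\alpha_1$, so pitch angles fold toward the field-aligned direction as $|B|$ drops. The sketch below is our own illustration (restricted to $\alpha \le 90^{\circ}$; the anti-parallel half of the distribution behaves symmetrically), not code from the instrument team:

```python
import math

def adiabatic_pitch_angle(alpha1_deg, b1, b2):
    """Pitch angle after transport from field strength b1 to b2, assuming
    conservation of the first adiabatic invariant mu = E*sin^2(alpha)/B
    at fixed kinetic energy E.  Valid for alpha1 <= 90 degrees; the
    anti-parallel half of the distribution mirrors this behavior.
    Returns None if the electron magnetically mirrors before reaching b2
    (only possible when b2 > b1)."""
    s2 = math.sin(math.radians(alpha1_deg)) ** 2 * (b2 / b1)
    if s2 > 1.0:
        return None  # mirrored: sin^2(alpha) cannot exceed 1
    return math.degrees(math.asin(math.sqrt(s2)))
```

Halving $|B|$ takes a $60^{\circ}$ electron to about $37.8^{\circ}$ and a $90^{\circ}$ electron to exactly $45^{\circ}$, which is the qualitative behavior seen in the PAD panels: flux leaves the $\sim90^{\circ}$ bins and accumulates near $0^{\circ}$ and $180^{\circ}$.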
Similarly, the 202 eV (cyan) and 250 eV (green) electrons start roughly isotropic and evolve into double-peaked bidirectional distributions, while $90^{\\circ}$ electrons become depleted. The net flux in these distributions remains almost the same from one timestamp to the next. In other words, the enhancements in pitch angles near $0^{\\circ}$ and $180^{\\circ}$ (as seen in panels (c) and (d)) are compensated by the depletions in the $\\sim90^{\\circ}$ bins. The 358 and 396 eV lines, although moderately changing in time, do not show any depletion in perpendicular flux. The energies discussed here are sensitive tracers of the magnetic field topology. The depletion of $\\sim90^{\\circ}$ pitch angle electrons is consistent with adiabatic transport of electrons in sharply decreasing magnetic fields.\n\n\\begin{figure}[H]\n\\centering\n\\includegraphics[width=0.9\\linewidth]{Fig5_lineplot_newlabels.pdf}\n\\caption{Differential electron flux at selected energies (different colors) versus pitch angle at four timestamps prior to the cavity encounter on 30 July 2015. Energies are annotated in panel (a). Each panel corresponds to a timestamp identified by vertical dashed-dotted lines in Figure \\ref{figPADtimeseries}.}\n\\label{figlineplots}\n\\end{figure}\n\nSpectra at timestamp $t_3$ show noticeably more depletion around $90^{\\circ}$ pitch angles than the neighboring timestamps, but the net fluxes are still higher than those inside the cavity. 
The origin of this behavior has not been clearly identified at this time.\n\n", "Descriptive_question1": "What is the cometocentric distance reported for the diamagnetic cavity event on 30 July 2015 in table_1?", "Descriptive_question2": "What is the energy of the maximum flux difference for the event on 30 July 2015 as shown in table_1?", "Reasoning_question1": "How might the difference in LAP bulk electron density inside versus outside the diamagnetic cavity on 30 July 2015, as presented in table_1, reflect on the plasma dynamics in the region?", "Reasoning_question2": "Considering the energy range of reduced flux provided in table_1 for the 30 July 2015 event, what could be the potential reasons for the observed reduction in electron flux within this specific range?", "Descriptive_answer1": "179.5 km", "Descriptive_answer2": "74.4 eV", "Reasoning_answer1": "The difference in LAP bulk electron density inside (997.3 cm⁻³) versus outside (1164.8 cm⁻³) the diamagnetic cavity on 30 July 2015 suggests a notable reduction in electron density within the cavity. This observation indicates that plasma dynamics are altered inside the diamagnetic region, likely due to the exclusion of the magnetic field, which affects the confinement and movement of charged particles. Outside the cavity, the higher density could be attributed to the presence of magnetic field lines that guide and trap electrons more effectively. Inside, the absence of a magnetic field may lead to a more diffuse plasma environment where electrons are less constrained, potentially resulting in a lower density as particles disperse more freely. 
Additionally, this difference might reflect variations in ionization rates or interactions with neutral particles, which could be less effective inside the cavity due to altered electric fields or reduced suprathermal electron fluxes contributing to ionization.", "Reasoning_answer2": "The energy range of reduced flux (56.1 - 358 eV) for the 30 July 2015 event, as shown in table_1, suggests specific processes impacting electrons within this energy bracket. First, this range likely corresponds to suprathermal electrons that are particularly sensitive to changes in the magnetic field topology as the spacecraft transitions into the diamagnetic cavity, where the magnetic field is significantly reduced or absent. The reduction in flux could be due to adiabatic effects, where electrons with energies in this range lose perpendicular energy as the magnetic field decreases, leading to a field-aligned distribution and reduced omnidirectional flux. Additionally, interactions with the local plasma environment, such as wave-particle interactions or scattering by neutral particles (with a neutral density of 6.7×10⁷ cm⁻³), might preferentially affect electrons in this energy range, depleting their flux. Finally, the upper limit of 358 eV might indicate a threshold beyond which electrons are less influenced by these mechanisms, possibly due to higher energies allowing them to traverse the cavity boundaries with less perturbation." }, { "paper_id": "2007.05808.json", "table_id": "table_1", "table_content": "\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for connected graphs with 5 vertices }\n\\label{jedge2}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{3}{|c|}{}&\\multicolumn{4}{|c|}{LB from lit.}&\\multicolumn{3}{|c|}{ New LB }&\\\\\n \\hline\n Num& $|E|$ & $\\beta(G)$ & $\\beta_E(G) $& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$ \\\\\n\\hline\n\\hline\n1. 
& 4&3&3& 2 & 1 &$\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 & $\\underline{\\textbf{4}}$ & 2 & 4\\\\\n \\hline\n 2.& 4&2&2 &2 &1 &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&$\\underline{\\textbf{3}}$ &2&3 \\\\\n \\hline\n3.& 5 & 2 & 3 & 2 & 1 & $\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2& 4\\\\\n \\hline\n 4. & 5 & 2 & 2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 & $\\underline{\\textbf{3}}$ & 2 &3 \\\\\n \\hline\n 5.&5&2& 2&2 & 1 & 2& $\\underline{\\textbf{3}}$ & 2 &2& 2&3\\\\\n \\hline\n6. &6& 2& 3& 2&1& 3& $\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ & 2&4\\\\\n \\hline\n 7. &6&3&3 & 2& 2&2 &3 & 3 &2&2 &4 \\\\\n \\hline\n 8. &7&3&4 & 2& 2&$\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n9. & 4 &1&1 & 1&1 & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &2\\\\\n \\hline\n 10.& 5&2&2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 &$\\underline{\\textbf{3}}$ &2& 3 \\\\\n \\hline\n11. &6&2&3 &2 &2 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 12. &6&2&3 &2 &1 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n13. &7&3&3 & 2&1 &$\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2&4 \\\\\n \\hline\n14. &5&2&2 & 1 & 2& 0& $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n15. 
&6&2&2 &2 & 2& 1& $\\underline{\\textbf{3}}$ & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n16.&7&2&3 & 2& 2& 2& $\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n\n 17.&8&3& 4&2 &2 & $\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ &2&5 \\\\\n \\hline\n18.&7&2&3 & 2&2 &3 &$\\underline{\\textbf{4}}$ &3 &3&2&4 \\\\\n \\hline\n19. &8&2&$\\underline{\\textbf{4}}$ & 2&3 & 2&$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 20. &9&3& 4& 2&3 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n 21. &10&4& 4& 2& 3&$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 4& $\\underline{\\textbf{5}}$ &3& 5\\\\\n \\hline\n\\end{tabular}\n\\end{table}", "caption": "Direct comparison of lower bounds for connected graphs with 5 vertices ", "label": "jedge2", "section_info": "4 Direct comparison of lower bounds\n\\section{Direct comparison of lower bounds}\n\nIn this section, it will be given direct comparison between lower bounds known in the literature (\\cite{yer17},\\cite{fil19}) with the new lower bounds obtained in this paper.\n\nFirst, comparison will be performed on all connected graphs with 5 vertices. There are 21 such graphs and their graphic representations could be find at \\url{https://mathworld.wolfram.com/ConnectedGraph.html}. The results in the Table $\\ref{jedge2}$ are given in the same order as graphic representation on link and in this table are shown the comparisons of the various lower bounds for these graphs. In Table $\\ref{jedge2}$, $|E|$ is number of edges, $\\beta(G)$ and $\\beta_E(G)$ are metric dimension and edge metric dimension, respectively. In the following columns, L1 and L2\nare the notation for lower bounds from Proposition 4 and Theorem 1. Each\nof Proposition 1, Proposition 2, Proposition 3 and Corollary 1 determines\none lower bound. 
For transparency of Table 7, we have decided to give a unified lower bound that encompasses all of them, denoted L3. This lower bound cannot be obtained in a general form; for each specific graph, however, the three lower bounds from the propositions and the bound from Corollary 1 can be calculated separately and combined. Lower bound L4 is an LP relaxation of the mixed metric dimension problem. In the following columns, the new lower bounds N1, N2 and N3, from Corollary 3, Theorem 3 and Theorem 4 respectively, are given.\n\nIt should be noted that total enumeration can quickly compute the metric dimension, edge metric dimension and mixed metric dimension for graphs with up to 36 vertices, so it is used to obtain the data for $\\beta(G)$, $\\beta_E(G)$ and $\\beta_M(G)$ in Table $\\ref{jedge2}$ and Table $\\ref{jedge}$. The data in column L4 of these tables, which represents an LP relaxation of the mixed metric dimension problem, can be quickly obtained by any linear programming software: CPLEX, Gurobi, GLPK, LP\\_solve, etc. The data in column N2 is also computed by total enumeration.\n\n\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for connected graphs with 5 vertices }\n\\label{jedge2}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{3}{|c|}{}&\\multicolumn{4}{|c|}{LB from lit.}&\\multicolumn{3}{|c|}{ New LB }&\\\\\n \\hline\n Num& $|E|$ & $\\beta(G)$ & $\\beta_E(G) $& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$ \\\\\n\\hline\n\\hline\n1. & 4&3&3& 2 & 1 &$\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 & $\\underline{\\textbf{4}}$ & 2 & 4\\\\\n \\hline\n 2.& 4&2&2 &2 &1 &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&$\\underline{\\textbf{3}}$ &2&3 \\\\\n \\hline\n3.& 5 & 2 & 3 & 2 & 1 & $\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2& 4\\\\\n \\hline\n 4. 
& 5 & 2 & 2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 & $\\underline{\\textbf{3}}$ & 2 &3 \\\\\n \\hline\n 5.&5&2& 2&2 & 1 & 2& $\\underline{\\textbf{3}}$ & 2 &2& 2&3\\\\\n \\hline\n6. &6& 2& 3& 2&1& 3& $\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ & 2&4\\\\\n \\hline\n 7. &6&3&3 & 2& 2&2 &3 & 3 &2&2 &4 \\\\\n \\hline\n 8. &7&3&4 & 2& 2&$\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n9. & 4 &1&1 & 1&1 & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &2\\\\\n \\hline\n 10.& 5&2&2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 &$\\underline{\\textbf{3}}$ &2& 3 \\\\\n \\hline\n11. &6&2&3 &2 &2 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 12. &6&2&3 &2 &1 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n13. &7&3&3 & 2&1 &$\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2&4 \\\\\n \\hline\n14. &5&2&2 & 1 & 2& 0& $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n15. &6&2&2 &2 & 2& 1& $\\underline{\\textbf{3}}$ & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n16.&7&2&3 & 2& 2& 2& $\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n\n 17.&8&3& 4&2 &2 & $\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ &2&5 \\\\\n \\hline\n18.&7&2&3 & 2&2 &3 &$\\underline{\\textbf{4}}$ &3 &3&2&4 \\\\\n \\hline\n19. &8&2&$\\underline{\\textbf{4}}$ & 2&3 & 2&$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 20. &9&3& 4& 2&3 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n 21. 
&10&4& 4& 2& 3&$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 4& $\\underline{\\textbf{5}}$ &3& 5\\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\nAs can be seen from Table \\ref{jedge2}, the new lower bounds obtain better results than bounds L1, L2, L3 and L4 from the literature. In three of the 21 cases the mixed metric dimension was not reached by the new lower bounds.\n\nThese results are not fully representative, since the graphs in question have a small number of vertices ($|V| = 5$). To improve the comparison, we also conducted it on some well-known graphs.\n\n\n\nThis additional comparison is conducted on 12 well-known graphs. Table $\\ref{jedge}$ shows the characteristics of each graph, while the comparisons of the various lower bounds are shown in Table $\\ref{jedge1}$. The columns of Table $\\ref{jedge1}$, denoted L1, L2, L3, L4, N1, N2 and N3, have the same meaning as in Table \\ref{jedge2}. From Table \\ref{jedge1} it can be seen that the new lower bounds are often better than the ones from the literature.\n\nHowever, in only two cases does the mixed metric dimension equal a lower bound (one from the literature and one of the new ones). 
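The total-enumeration baseline used to obtain $\beta_M(G)$ in these tables can be sketched as a brute-force search: compute all-pairs shortest paths, assign every vertex and edge a distance vector with respect to a candidate vertex set $S$ (with $d(v,\{u,w\}) = \min(d(v,u), d(v,w))$), and return the smallest $|S|$ whose vectors are all distinct. The sketch below is our own minimal illustration (adjacency-dictionary input and function name are assumptions), practical only for small graphs since the search is exponential, which is one reason cheap lower bounds are useful.

```python
from itertools import combinations

def mixed_metric_dimension(adj):
    """Mixed metric dimension of a small connected graph by total enumeration.

    adj: adjacency dict {vertex: iterable of neighbours}.
    Items to be distinguished are all vertices and all edges, where the
    distance from a vertex v to an edge {u, w} is min(d(v, u), d(v, w)).
    """
    verts = sorted(adj)
    # all-pairs shortest path lengths via BFS from every vertex
    dist = {}
    for s in verts:
        d = {s: 0}
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in d:
                        d[w] = d[u] + 1
                        nxt.append(w)
            frontier = nxt
        dist[s] = d
    edges = {frozenset((u, w)) for u in verts for w in adj[u]}
    items = [frozenset((v,)) for v in verts] + sorted(edges, key=sorted)
    # smallest vertex set giving every item a unique distance vector
    for k in range(1, len(verts) + 1):
        for cand in combinations(verts, k):
            vectors = {tuple(min(dist[v][x] for x in item) for v in cand)
                       for item in items}
            if len(vectors) == len(items):
                return k
    return len(verts)
```

For the path on five vertices (graph 9 in the 5-vertex table: $|E|=4$, $\beta(G)=1$, $\beta_M(G)=2$) the two endpoints already form a mixed metric generator, while for the triangle $K_3$ all three vertices are needed.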
The important feature of presented lower bounds is that their calculation complexity is much smaller in comparison with standard/edge/mixed metric dimension problem complexity.\n\n\n\\begin{table}\n\\small\n\\caption{Graph characteristic}\n\\label{jedge}\n\\begin{tabular}{|l|l|l|l|l|l|l|}\n\n\\hline\nNum& Name & $|V|$& $|E|$& $\\beta(G)$&$\\beta_E(G)$& Another notions \\\\\n\\hline\n1.& Rook's graph & 36& 180& 7 & 8 & srg(36,10,4,2) \\\\\n\\hline\n\n2.& 9-triangular graph & 36& 252& 6 & 32 & Johnson graph; srg(36,14,7,4) \\\\\n\n\n\\hline\n3.& Clebsch graph &16 &40 & 4 &9& srg(16,5,0,2)\\\\\n\\hline\n\n4.& Generalized quadrangle &27 &135& 5 & 18 & srg(27,10,1,5)\\\\\n\\hline\n\n5.& Hypercube $Q_5$ &32 &80& 4 & 4 & $5-$cube graph \\\\\n\n\\hline\n\n6.& Kneser (7,2)&21 &105& 5 & 12 & srg(21,10,3,6)\\\\\n\n \\hline\n7.& Mobius Kantor &16 &24& 4 & 4 & Generalized Petersen $GP(8,3)$ \\\\\n\n \\hline\n8.& Paley graph &13 &39& 4 & 6 & srg(13,6,2,3)\\\\\n \\hline\n9.& Petersen graph &10 &15& 3 & 4 & Generalized Petersen $GP(5,2)$\\\\\n \\hline\n10.& Small graph 6 vert. &6 &11& 3 & 4 & \\\\\n \\hline\n11.& Hamming H(2,6)& 36 & 180 & 7 & 8 & $K_6 \\Box K_6$\\\\\n \\hline\n12.& Hamming H(3,3) & 27 & 81 & 4 & 5 & $K_3 \\Box K_3 \\Box K_3$\\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for some graphs}\n\\label{jedge1}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{8}{|c|}{LB from lit.}\\\\\n \\hline\nNum& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$\\\\\n\\hline\n\\hline\n\n\n\n1. & 4& 5 &0 & 6 & 5& 6 &8&9 \\\\\n \\hline\n 2.&4& 5 & 0& 18 & 5 & 9 & 8 & 32\\\\\n \\hline\n3. & 3 & 4 & 0& 4 & 4 & 5 & 5& 9\\\\\n \\hline\n 4. & 4 & 5 &0 & 4 & 5 & 6 & 8& 18\\\\\n \\hline\n 5.& 3& \\bf{\\underline{4}}&0 & 2& $\\underline{\\textbf{4}}$ & 2 & 3& 4 \\\\\n \\hline\n 6.& 4& 5 &0& 4& 5 & 6 & 6& 12\\\\\n \\hline\n 7.& 2& 3 &0& 2& 3 & 3 & 3& 4\\\\\n \\hline\n8. 
& 3& 4 &0& 4 & 4 & 5 & 5& 6\\\\\n \\hline\n 9. & 2 & 3 &0& 4& 3 & 4 & 4&6 \\\\\n \\hline\n 10.& 2 & 2 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 & 4 & 3& 5\\\\\n \\hline\n 11. & 4 & 5 &0&6 &5 & 6 & 8 & 9\\\\\n \\hline\n 12. & 3& 4 &0& 3& 4& 3 & 4 &6 \\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\nUsing the results shown in Table $\\ref{jedge1}$, it can be concluded that these bounds do not give perfect results, as expected, but there are graphs on which lower bounds L1 and L2 are reached, as well as the new lower bounds N1 and N2.\n\nFinally, it must be noted that even with this additional comparison the sample is not statistically representative; it is given to illustrate the usage and efficiency of the presented lower bounds. The graphs we considered ranged from order 10 up to order 36. The upper limit (order 36) was chosen because the exact values of the mixed metric dimensions of such graphs could be quickly determined. Since the problem of finding the exact mixed metric dimension is NP-hard, our selection does not include large graphs.\n\n", "Descriptive_question1": "What is the value of β(G) for graph number 1 in table_1?", "Descriptive_question2": "How many edges does graph number 8 have in table_1?", "Reasoning_question1": "Which lower bound, among L1, L2, L3, L4, N1, N2, and N3, most frequently matches the mixed metric dimension β_M(G) for the graphs listed in table_1, and why might this be significant?", "Reasoning_question2": "Based on the data in table_1, what trend can be observed in the relationship between the number of edges (|E|) and the mixed metric dimension β_M(G) across the 21 graphs, and what might explain this trend?", "Descriptive_answer1": "3", "Descriptive_answer2": "7", "Reasoning_answer1": "Among the lower bounds L1, L2, L3, L4, N1, N2, and N3 in table_1, L4 and N2 most frequently match the mixed metric dimension β_M(G). 
Observing the data, L4 matches β_M(G) in 18 out of 21 graphs (e.g., graphs 1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21), and N2 also matches β_M(G) in 18 out of 21 graphs (e.g., graphs 1, 2, 3, 4, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21). This frequent matching indicates that both L4 (a linear programming relaxation of the mixed metric dimension problem) and N2 (computed by total enumeration) are more effective at approximating the true mixed metric dimension compared to other bounds. This is significant because it suggests that these methods, particularly L4 and N2, may capture critical structural properties of small graphs (like those with 5 vertices) more accurately. Additionally, their ability to often reach the exact value highlights their potential reliability for theoretical analysis and practical applications in graph theory, especially for problems related to metric dimensions.", "Reasoning_answer2": "Analyzing the data in table_1, a clear trend emerges: as the number of edges (|E|) increases across the 21 graphs, the mixed metric dimension β_M(G) also tends to increase. For instance, graphs with |E|=4 (e.g., graphs 1, 2, 9) have β_M(G) values of 4, 3, and 2 respectively, averaging around 3. In contrast, graphs with |E|=10 (graph 21) have β_M(G)=5, and graphs with |E|=9 (graph 20) also show β_M(G)=5. This positive correlation suggests that denser graphs (with more edges) generally require a larger mixed metric dimension to uniquely identify vertices and edges. A possible explanation for this trend is that more edges create a more complex graph structure, increasing the number of unique 'signatures' needed to distinguish between elements. In graph theory, the mixed metric dimension combines aspects of vertex and edge metric dimensions, so a higher number of edges likely contributes to greater structural complexity, necessitating a larger set of landmarks or resolving sets to cover all distinctions." 
}, { "paper_id": "2007.05808.json", "table_id": "table_2", "table_content": "\\begin{table}\n\\small\n\\caption{Graph characteristic}\n\\label{jedge}\n\\begin{tabular}{|l|l|l|l|l|l|l|}\n\n\\hline\nNum& Name & $|V|$& $|E|$& $\\beta(G)$&$\\beta_E(G)$& Another notions \\\\\n\\hline\n1.& Rook's graph & 36& 180& 7 & 8 & srg(36,10,4,2) \\\\\n\\hline\n\n2.& 9-triangular graph & 36& 252& 6 & 32 & Johnson graph; srg(36,14,7,4) \\\\\n\n\n\\hline\n3.& Clebsch graph &16 &40 & 4 &9& srg(16,5,0,2)\\\\\n\\hline\n\n4.& Generalized quadrangle &27 &135& 5 & 18 & srg(27,10,1,5)\\\\\n\\hline\n\n5.& Hypercube $Q_5$ &32 &80& 4 & 4 & $5-$cube graph \\\\\n\n\\hline\n\n6.& Kneser (7,2)&21 &105& 5 & 12 & srg(21,10,3,6)\\\\\n\n \\hline\n7.& Mobius Kantor &16 &24& 4 & 4 & Generalized Petersen $GP(8,3)$ \\\\\n\n \\hline\n8.& Paley graph &13 &39& 4 & 6 & srg(13,6,2,3)\\\\\n \\hline\n9.& Petersen graph &10 &15& 3 & 4 & Generalized Petersen $GP(5,2)$\\\\\n \\hline\n10.& Small graph 6 vert. &6 &11& 3 & 4 & \\\\\n \\hline\n11.& Hamming H(2,6)& 36 & 180 & 7 & 8 & $K_6 \\Box K_6$\\\\\n \\hline\n12.& Hamming H(3,3) & 27 & 81 & 4 & 5 & $K_3 \\Box K_3 \\Box K_3$\\\\\n \\hline\n\\end{tabular}\n\\end{table}", "caption": "Graph characteristic", "label": "jedge", "section_info": "4 Direct comparison of lower bounds\n\\section{Direct comparison of lower bounds}\n\nIn this section, it will be given direct comparison between lower bounds known in the literature (\\cite{yer17},\\cite{fil19}) with the new lower bounds obtained in this paper.\n\nFirst, comparison will be performed on all connected graphs with 5 vertices. There are 21 such graphs and their graphic representations could be find at \\url{https://mathworld.wolfram.com/ConnectedGraph.html}. The results in the Table $\\ref{jedge2}$ are given in the same order as graphic representation on link and in this table are shown the comparisons of the various lower bounds for these graphs. 
In Table $\\ref{jedge2}$, $|E|$ is number of edges, $\\beta(G)$ and $\\beta_E(G)$ are metric dimension and edge metric dimension, respectively. In the following columns, L1 and L2\nare the notation for lower bounds from Proposition 4 and Theorem 1. Each\nof Proposition 1, Proposition 2, Proposition 3 and Corollary 1 determines\none lower bound. For the purpose of transparency of the Table 7, we have\ndecided to give a unified lower bound that encompasses all of them. It will be\ndenoted as L3. This lower bound cannot be obtained generally, while for each\nspecific graph, all three lower bounds from propositions and Corollary 1 can be\ncalculated separately and unified together. Lower bound L4 is a LP relaxation\nof the mixed metric dimension problem. In the following columns new lower\nbounds N1, N2 and N3 from Corollary 3, Theorem 3 and Theorem 4 are\ngiven respectively.\n\nIt should be noted that a total enumeration is able to quickly compute metric dimension, edge metric dimension and mixed metric dimension\nfor graphs up to 36 vertices, so it is used to obtain data for $\\beta(G)$, $\\beta_E(G)$ and $\\beta_M(G)$\nin Table $\\ref{jedge2}$ and Table $\\ref{jedge}$. Data of column L4 in these tables, that represents a LP relaxation\nof the mixed metric dimension problem, can be quickly obtained by any linear programming software:\nCPLEX, Gurobi, GLPK, LP\\_solve, etc. Data of column N2 is also computed by a total enumeration.\n\n\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for connected graphs with 5 vertices }\n\\label{jedge2}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{3}{|c|}{}&\\multicolumn{4}{|c|}{LB from lit.}&\\multicolumn{3}{|c|}{ New LB }&\\\\\n \\hline\n Num& $|E|$ & $\\beta(G)$ & $\\beta_E(G) $& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$ \\\\\n\\hline\n\\hline\n1. 
& 4&3&3& 2 & 1 &$\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 & $\\underline{\\textbf{4}}$ & 2 & 4\\\\\n \\hline\n 2.& 4&2&2 &2 &1 &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&$\\underline{\\textbf{3}}$ &2&3 \\\\\n \\hline\n3.& 5 & 2 & 3 & 2 & 1 & $\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2& 4\\\\\n \\hline\n 4. & 5 & 2 & 2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 & $\\underline{\\textbf{3}}$ & 2 &3 \\\\\n \\hline\n 5.&5&2& 2&2 & 1 & 2& $\\underline{\\textbf{3}}$ & 2 &2& 2&3\\\\\n \\hline\n6. &6& 2& 3& 2&1& 3& $\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ & 2&4\\\\\n \\hline\n 7. &6&3&3 & 2& 2&2 &3 & 3 &2&2 &4 \\\\\n \\hline\n 8. &7&3&4 & 2& 2&$\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n9. & 4 &1&1 & 1&1 & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &2\\\\\n \\hline\n 10.& 5&2&2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 &$\\underline{\\textbf{3}}$ &2& 3 \\\\\n \\hline\n11. &6&2&3 &2 &2 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 12. &6&2&3 &2 &1 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n13. &7&3&3 & 2&1 &$\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2&4 \\\\\n \\hline\n14. &5&2&2 & 1 & 2& 0& $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n15. 
&6&2&2 &2 & 2& 1& $\\underline{\\textbf{3}}$ & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n16.&7&2&3 & 2& 2& 2& $\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n\n 17.&8&3& 4&2 &2 & $\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ &2&5 \\\\\n \\hline\n18.&7&2&3 & 2&2 &3 &$\\underline{\\textbf{4}}$ &3 &3&2&4 \\\\\n \\hline\n19. &8&2&$\\underline{\\textbf{4}}$ & 2&3 & 2&$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 20. &9&3& 4& 2&3 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n 21. &10&4& 4& 2& 3&$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 4& $\\underline{\\textbf{5}}$ &3& 5\\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\nAs it can be seen from Table \\ref{jedge2} new lower bounds obtain better results than bounds L1, L2, L3 and L4 from the literature. In three of 21 cases mixed metric dimension was not reached by new lower bounds.\n\nThese results are not quite representative since graphs in question are with small number of vertices ($|V| = 5$). In order to improve comparison we conducted it on some well-known graphs.\n\n\n\nAdditional comparison will be conducted on 12 well-known graphs. In the Table $\\ref{jedge}$ are shown graph characteristics for each graph, while the comparisons of the various lower bounds are shown in the Table $\\ref{jedge1}$. Columns in Table $\\ref{jedge1}$, nominated as L1, L2, L3, L4, N1, N2 and N3, have the same meaning as in Table \\ref{jedge2}. From the Table \\ref{jedge1} it can be seen that the new lower bounds are often better than the ones from the literature.\n\nHowever, only in two cases mixed metric dimension equals lower bound (1 from literature and 1 from the new ones). 
In four cases, the lower bounds differ from the exact values by 1.\nIt should be noted that all seven lower bounds should be used together, since different lower bounds are applicable to different graphs and none is uniquely dominant over the others. An important feature of the presented lower bounds is that their computational complexity is much smaller than the complexity of the standard, edge, and mixed metric dimension problems themselves.\n\n\n\\begin{table}\n\\small\n\\caption{Graph characteristics}\n\\label{jedge}\n\\begin{tabular}{|l|l|l|l|l|l|l|}\n\n\\hline\nNum& Name & $|V|$& $|E|$& $\\beta(G)$&$\\beta_E(G)$& Other notions \\\\\n\\hline\n1.& Rook's graph & 36& 180& 7 & 8 & srg(36,10,4,2) \\\\\n\\hline\n2.& 9-triangular graph & 36& 252& 6 & 32 & Johnson graph; srg(36,14,7,4) \\\\\n\\hline\n3.& Clebsch graph &16 &40 & 4 &9& srg(16,5,0,2)\\\\\n\\hline\n4.& Generalized quadrangle &27 &135& 5 & 18 & srg(27,10,1,5)\\\\\n\\hline\n5.& Hypercube $Q_5$ &32 &80& 4 & 4 & $5-$cube graph \\\\\n\\hline\n6.& Kneser (7,2)&21 &105& 5 & 12 & srg(21,10,3,6)\\\\\n \\hline\n7.& Mobius Kantor &16 &24& 4 & 4 & Generalized Petersen $GP(8,3)$ \\\\\n \\hline\n8.& Paley graph &13 &39& 4 & 6 & srg(13,6,2,3)\\\\\n \\hline\n9.& Petersen graph &10 &15& 3 & 4 & Generalized Petersen $GP(5,2)$\\\\\n \\hline\n10.& Small graph 6 vert. &6 &11& 3 & 4 & \\\\\n \\hline\n11.& Hamming H(2,6)& 36 & 180 & 7 & 8 & $K_6 \\Box K_6$\\\\\n \\hline\n12.& Hamming H(3,3) & 27 & 81 & 4 & 5 & $K_3 \\Box K_3 \\Box K_3$\\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\n\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for some graphs}\n\\label{jedge1}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{8}{|c|}{LB from lit.}\\\\\n \\hline\nNum& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$\\\\\n\\hline\n\\hline\n1. & 4& 5 &0 & 6 & 5& 6 &8&9 \\\\\n \\hline\n 2.&4& 5 & 0& 18 & 5 & 9 & 8 & 32\\\\\n \\hline\n3. & 3 & 4 & 0& 4 & 4 & 5 & 5& 9\\\\\n \\hline\n 4. 
& 4 & 5 &0 & 4 & 5 & 6 & 8& 18\\\\\n \\hline\n 5.& 3& $\\underline{\\textbf{4}}$&0 & 2& $\\underline{\\textbf{4}}$ & 2 & 3& 4 \\\\\n \\hline\n 6.& 4& 5 &0& 4& 5 & 6 & 6& 12\\\\\n \\hline\n 7.& 2& 3 &0& 2& 3 & 3 & 3& 4\\\\\n \\hline\n8. & 3& 4 &0& 4 & 4 & 5 & 5& 6\\\\\n \\hline\n 9. & 2 & 3 &0& 4& 3 & 4 & 4&6 \\\\\n \\hline\n 10.& 2 & 2 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 & 4 & 3& 5\\\\\n \\hline\n 11. & 4 & 5 &0&6 &5 & 6 & 8 & 9\\\\\n \\hline\n 12. & 3& 4 &0& 3& 4& 3 & 4 &6 \\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\nUsing the results shown in Table $\\ref{jedge1}$, it can be concluded that, as expected, these bounds do not give perfect results, but there are graphs on which the lower bounds L1 and L2 from the literature are attained, as well as the new lower bounds N1 and N2.\n\nFinally, it must be noted that even with this additional comparison the sample is not statistically representative; it is given to illustrate the usage and efficiency of the presented lower bounds. The graphs we took into consideration range from order 10 up to order 36. The second number (order 36) is chosen because the exact values of the mixed metric dimensions of such graphs could still be quickly determined. 
Since the problem of finding the exact mixed metric dimension is NP-hard, our selection does not include large graphs.\n\n\n", "Descriptive_question1": "What is the number of vertices for the Rook's graph in table_2?", "Descriptive_question2": "What is the edge metric dimension of the Petersen graph in table_2?", "Reasoning_question1": "How does the metric dimension compare to the edge metric dimension across the graphs listed in table_2, and what might this indicate about the structural properties of these graphs?", "Reasoning_question2": "In table_2, why might the mixed metric dimension often be higher than both the metric dimension and edge metric dimension for most of the listed graphs?", "Descriptive_answer1": "36", "Descriptive_answer2": "4", "Reasoning_answer1": "Upon examining table_2, it is evident that for most graphs, the edge metric dimension (β_E(G)) is greater than or equal to the metric dimension (β(G)). For instance, in Rook's graph, β(G) is 7 while β_E(G) is 8; in 9-triangular graph, β(G) is 6 while β_E(G) is 32, showing a significant difference; and in Petersen graph, β(G) is 3 while β_E(G) is 4. This trend holds for many graphs like Clebsch graph (4 vs 9) and Generalized quadrangle (5 vs 18). However, there are exceptions like Hypercube Q_5 and Mobius Kantor, where both dimensions are equal (4 vs 4). This comparison suggests that resolving edges (identifying unique edge signatures) often requires more landmarks or distinguishing points than resolving vertices. Structurally, this may indicate that these graphs have higher edge complexity or symmetry, making it harder to uniquely identify edges compared to vertices. 
Graphs with equal metric and edge metric dimensions might have balanced structural properties, where vertex and edge identification complexities are similar, possibly due to regular or symmetric configurations.", "Reasoning_answer2": "Analyzing table_2 alongside the context of mixed metric dimension from other tables like table_4 (jedge1), it becomes clear that the mixed metric dimension (β_M(G)) is often higher than both the metric dimension (β(G)) and edge metric dimension (β_E(G)) for most graphs. For instance, for Rook's graph, β(G) is 7, β_E(G) is 8, but β_M(G) is 9; for 9-triangular graph, β(G) is 6, β_E(G) is 32, and β_M(G) is 32 (equal to β_E(G) but much higher than β(G)); for Petersen graph, β(G) is 3, β_E(G) is 4, and β_M(G) is 6. The mixed metric dimension represents the smallest set of vertices needed to uniquely identify both vertices and edges simultaneously. This dual requirement inherently increases the complexity, as the set must cover distinguishing conditions for two different structural elements. Logically, since it encompasses the constraints of both metric and edge metric dimensions, the mixed metric dimension is at least as large as the maximum of the two individual dimensions and often higher when additional vertices are needed to resolve overlapping or ambiguous cases in vertex-edge interactions. This reflects the combined intricacy of the graph's vertex and edge structures." }, { "paper_id": "2007.05808.json", "table_id": "table_3", "table_content": "\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for some graphs}\n\\label{jedge1}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{8}{|c|}{LB from lit.}\\\\\n \\hline\nNum& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$\\\\\n\\hline\n\\hline\n\n\n\n1. & 4& 5 &0 & 6 & 5& 6 &8&9 \\\\\n \\hline\n 2.&4& 5 & 0& 18 & 5 & 9 & 8 & 32\\\\\n \\hline\n3. & 3 & 4 & 0& 4 & 4 & 5 & 5& 9\\\\\n \\hline\n 4. 
& 4 & 5 &0 & 4 & 5 & 6 & 8& 18\\\\\n \\hline\n 5.& 3& $\\underline{\\textbf{4}}$&0 & 2& $\\underline{\\textbf{4}}$ & 2 & 3& 4 \\\\\n \\hline\n 6.& 4& 5 &0& 4& 5 & 6 & 6& 12\\\\\n \\hline\n 7.& 2& 3 &0& 2& 3 & 3 & 3& 4\\\\\n \\hline\n8. & 3& 4 &0& 4 & 4 & 5 & 5& 6\\\\\n \\hline\n 9. & 2 & 3 &0& 4& 3 & 4 & 4&6 \\\\\n \\hline\n 10.& 2 & 2 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 & 4 & 3& 5\\\\\n \\hline\n 11. & 4 & 5 &0&6 &5 & 6 & 8 & 9\\\\\n \\hline\n 12. & 3& 4 &0& 3& 4& 3 & 4 &6 \\\\\n \\hline\n\\end{tabular}\n\\end{table}", "caption": "Direct comparison of lower bounds for some graphs", "label": "jedge1", "section_info": "4 Direct comparison of lower bounds\n\\section{Direct comparison of lower bounds}\n\nIn this section, we give a direct comparison between the lower bounds known in the literature (\\cite{yer17},\\cite{fil19}) and the new lower bounds obtained in this paper.\n\nFirst, the comparison is performed on all connected graphs with 5 vertices. There are 21 such graphs, and their graphical representations can be found at \\url{https://mathworld.wolfram.com/ConnectedGraph.html}. The results in Table $\\ref{jedge2}$ are given in the same order as the graphical representations at that link, and the table shows the comparisons of the various lower bounds for these graphs. In Table $\\ref{jedge2}$, $|E|$ is the number of edges, while $\\beta(G)$ and $\\beta_E(G)$ are the metric dimension and edge metric dimension, respectively. In the following columns, L1 and L2 denote the lower bounds from Proposition 4 and Theorem 1. Each of Proposition 1, Proposition 2, Proposition 3 and Corollary 1 determines one lower bound. For the purpose of transparency of Table 7, we have decided to give a unified lower bound that encompasses all of them, denoted L3. This lower bound cannot be obtained in general form, while for each specific graph, the three lower bounds from the propositions and Corollary 1 can be calculated separately and unified together. 
Lower bound L4 is an LP relaxation of the mixed metric dimension problem. In the following columns, the new lower bounds N1, N2 and N3, from Corollary 3, Theorem 3 and Theorem 4 respectively, are given.\n\nIt should be noted that total enumeration is able to quickly compute the metric dimension, edge metric dimension and mixed metric dimension for graphs with up to 36 vertices, so it is used to obtain the data for $\\beta(G)$, $\\beta_E(G)$ and $\\beta_M(G)$ in Table $\\ref{jedge2}$ and Table $\\ref{jedge}$. The data in column L4 of these tables, which represents an LP relaxation of the mixed metric dimension problem, can be quickly obtained by any linear programming software: CPLEX, Gurobi, GLPK, LP\\_solve, etc. The data in column N2 is also computed by total enumeration.\n\n\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for connected graphs with 5 vertices}\n\\label{jedge2}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{3}{|c|}{}&\\multicolumn{4}{|c|}{LB from lit.}&\\multicolumn{3}{|c|}{ New LB }&\\\\\n \\hline\n Num& $|E|$ & $\\beta(G)$ & $\\beta_E(G)$& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$ \\\\\n\\hline\n\\hline\n1. & 4&3&3& 2 & 1 &$\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 & $\\underline{\\textbf{4}}$ & 2 & 4\\\\\n \\hline\n 2.& 4&2&2 &2 &1 &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&$\\underline{\\textbf{3}}$ &2&3 \\\\\n \\hline\n3.& 5 & 2 & 3 & 2 & 1 & $\\underline{\\textbf{4}}$ & $\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2& 4\\\\\n \\hline\n 4. & 5 & 2 & 2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 & $\\underline{\\textbf{3}}$ & 2 &3 \\\\\n \\hline\n 5.&5&2& 2&2 & 1 & 2& $\\underline{\\textbf{3}}$ & 2 &2& 2&3\\\\\n \\hline\n6. &6& 2& 3& 2&1& 3& $\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ & 2&4\\\\\n \\hline\n 7. &6&3&3 & 2& 2&2 &3 & 3 &2&2 &4 \\\\\n \\hline\n 8. 
&7&3&4 & 2& 2&$\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n9. & 4 &1&1 & 1&1 & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ & $\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &$\\underline{\\textbf{2}}$ &2\\\\\n \\hline\n 10.& 5&2&2 & 2 & 1 & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2 &$\\underline{\\textbf{3}}$ &2& 3 \\\\\n \\hline\n11. &6&2&3 &2 &2 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 12. &6&2&3 &2 &1 & $\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2&$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n13. &7&3&3 & 2&1 &$\\underline{\\textbf{4}}$ &$\\underline{\\textbf{4}}$ & 2 &$\\underline{\\textbf{4}}$ &2&4 \\\\\n \\hline\n14. &5&2&2 & 1 & 2& 0& $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n15. &6&2&2 &2 & 2& 1& $\\underline{\\textbf{3}}$ & $\\underline{\\textbf{3}}$ &$\\underline{\\textbf{3}}$ & 2&3\\\\\n \\hline\n16.&7&2&3 & 2& 2& 2& $\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n\n 17.&8&3& 4&2 &2 & $\\underline{\\textbf{5}}$ & $\\underline{\\textbf{5}}$ &3 &$\\underline{\\textbf{5}}$ &2&5 \\\\\n \\hline\n18.&7&2&3 & 2&2 &3 &$\\underline{\\textbf{4}}$ &3 &3&2&4 \\\\\n \\hline\n19. &8&2&$\\underline{\\textbf{4}}$ & 2&3 & 2&$\\underline{\\textbf{4}}$ & 3 &$\\underline{\\textbf{4}}$ &2 &4\\\\\n \\hline\n 20. &9&3& 4& 2&3 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 &$\\underline{\\textbf{5}}$ & 2&5\\\\\n \\hline\n 21. &10&4& 4& 2& 3&$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 4& $\\underline{\\textbf{5}}$ &3& 5\\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\nAs it can be seen from Table \\ref{jedge2} new lower bounds obtain better results than bounds L1, L2, L3 and L4 from the literature. 
In three of the 21 cases, the mixed metric dimension was not reached by the new lower bounds.\n\nThese results are not fully representative, since the graphs in question have a small number of vertices ($|V| = 5$). To improve the comparison, we also conducted it on some well-known graphs.\n\n\n\nAn additional comparison is conducted on 12 well-known graphs. Table $\\ref{jedge}$ shows the characteristics of each graph, while the comparisons of the various lower bounds are shown in Table $\\ref{jedge1}$. The columns of Table $\\ref{jedge1}$, denoted L1, L2, L3, L4, N1, N2 and N3, have the same meaning as in Table \\ref{jedge2}. From Table \\ref{jedge1} it can be seen that the new lower bounds are often better than the ones from the literature.\n\nHowever, in only two cases does the mixed metric dimension equal a lower bound (one from the literature and one of the new ones). In four cases the lower bounds differ from the exact values by 1.\nIt should be noted that all seven lower bounds should be used together, since different lower bounds are applicable to different graphs and none is uniquely dominant over the others. 
An important feature of the presented lower bounds is that their computational complexity is much smaller than the complexity of the standard, edge, and mixed metric dimension problems themselves.\n\n\n\\begin{table}\n\\small\n\\caption{Graph characteristics}\n\\label{jedge}\n\\begin{tabular}{|l|l|l|l|l|l|l|}\n\n\\hline\nNum& Name & $|V|$& $|E|$& $\\beta(G)$&$\\beta_E(G)$& Other notions \\\\\n\\hline\n1.& Rook's graph & 36& 180& 7 & 8 & srg(36,10,4,2) \\\\\n\\hline\n2.& 9-triangular graph & 36& 252& 6 & 32 & Johnson graph; srg(36,14,7,4) \\\\\n\\hline\n3.& Clebsch graph &16 &40 & 4 &9& srg(16,5,0,2)\\\\\n\\hline\n4.& Generalized quadrangle &27 &135& 5 & 18 & srg(27,10,1,5)\\\\\n\\hline\n5.& Hypercube $Q_5$ &32 &80& 4 & 4 & $5-$cube graph \\\\\n\\hline\n6.& Kneser (7,2)&21 &105& 5 & 12 & srg(21,10,3,6)\\\\\n \\hline\n7.& Mobius Kantor &16 &24& 4 & 4 & Generalized Petersen $GP(8,3)$ \\\\\n \\hline\n8.& Paley graph &13 &39& 4 & 6 & srg(13,6,2,3)\\\\\n \\hline\n9.& Petersen graph &10 &15& 3 & 4 & Generalized Petersen $GP(5,2)$\\\\\n \\hline\n10.& Small graph 6 vert. &6 &11& 3 & 4 & \\\\\n \\hline\n11.& Hamming H(2,6)& 36 & 180 & 7 & 8 & $K_6 \\Box K_6$\\\\\n \\hline\n12.& Hamming H(3,3) & 27 & 81 & 4 & 5 & $K_3 \\Box K_3 \\Box K_3$\\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\n\\begin{table}\n\\small\n\\caption{Direct comparison of lower bounds for some graphs}\n\\label{jedge1}\n\\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}\n\\hline\n\n&\\multicolumn{8}{|c|}{LB from lit.}\\\\\n \\hline\nNum& L1 &L2 & L3&L4&N1&N2&N3& $\\beta_{M}(G)$\\\\\n\\hline\n\\hline\n1. & 4& 5 &0 & 6 & 5& 6 &8&9 \\\\\n \\hline\n 2.&4& 5 & 0& 18 & 5 & 9 & 8 & 32\\\\\n \\hline\n3. & 3 & 4 & 0& 4 & 4 & 5 & 5& 9\\\\\n \\hline\n 4. & 4 & 5 &0 & 4 & 5 & 6 & 8& 18\\\\\n \\hline\n 5.& 3& $\\underline{\\textbf{4}}$&0 & 2& $\\underline{\\textbf{4}}$ & 2 & 3& 4 \\\\\n \\hline\n 6.& 4& 5 &0& 4& 5 & 6 & 6& 12\\\\\n \\hline\n 7.& 2& 3 &0& 2& 3 & 3 & 3& 4\\\\\n \\hline\n8. 
& 3& 4 &0& 4 & 4 & 5 & 5& 6\\\\\n \\hline\n 9. & 2 & 3 &0& 4& 3 & 4 & 4&6 \\\\\n \\hline\n 10.& 2 & 2 &$\\underline{\\textbf{5}}$ &$\\underline{\\textbf{5}}$ & 3 & 4 & 3& 5\\\\\n \\hline\n 11. & 4 & 5 &0&6 &5 & 6 & 8 & 9\\\\\n \\hline\n 12. & 3& 4 &0& 3& 4& 3 & 4 &6 \\\\\n \\hline\n\\end{tabular}\n\\end{table}\n\n\nUsing the results shown in Table $\\ref{jedge1}$, it can be concluded that, as expected, these bounds do not give perfect results, but there are graphs on which the lower bounds L1 and L2 from the literature are attained, as well as the new lower bounds N1 and N2.\n\nFinally, it must be noted that even with this additional comparison the sample is not statistically representative; it is given to illustrate the usage and efficiency of the presented lower bounds. The graphs we took into consideration range from order 10 up to order 36. The second number (order 36) is chosen because the exact values of the mixed metric dimensions of such graphs could still be quickly determined. Since the problem of finding the exact mixed metric dimension is NP-hard, our selection does not include large graphs.\n\n\n", "Descriptive_question1": "What is the value of L1 for the first graph listed in table_3?", "Descriptive_question2": "What is the mixed metric dimension value for the second graph in table_3?", "Reasoning_question1": "Which lower bound, L1 or N2, consistently provides a closer estimate to the mixed metric dimension across most graphs in table_3, and why might this be the case?", "Reasoning_question2": "Based on the data in table_3, can you infer a trend or pattern in the performance of new lower bounds (N1, N2, N3) compared to literature bounds (L1, L2, L3, L4) as the number of vertices or edges increases?", "Descriptive_answer1": "4", "Descriptive_answer2": "32", "Reasoning_answer1": "To determine which lower bound, L1 or N2, consistently provides a closer estimate to the mixed metric dimension (β_M(G)) across most graphs in Table 3 (referring to Table \\ref{jedge1}), I 
compared the values of L1 and N2 to β_M(G) for each graph. For L1, the differences from β_M(G) are as follows: Graph 1 (5), Graph 2 (28), Graph 3 (6), Graph 4 (14), Graph 5 (1), Graph 6 (8), Graph 7 (2), Graph 8 (3), Graph 9 (4), Graph 10 (3), Graph 11 (5), Graph 12 (3). For N2, the differences are: Graph 1 (3), Graph 2 (23), Graph 3 (4), Graph 4 (12), Graph 5 (2), Graph 6 (6), Graph 7 (1), Graph 8 (1), Graph 9 (2), Graph 10 (1), Graph 11 (3), Graph 12 (3). Analyzing these differences, N2 is closer to β_M(G) in 10 out of 12 cases (Graphs 1, 2, 3, 4, 6, 7, 8, 9, 10, 11), while L1 is closer or equal in only 2 cases (Graphs 5 and 12). This suggests that N2 consistently provides a closer estimate. The reason might be that N2, as a new lower bound (possibly from Theorem 3 as described in the context), incorporates improved methodologies or considers additional graph properties compared to L1 (from Proposition 4), which is an older bound from the literature and may not account for certain structural complexities of the graphs as effectively.", "Reasoning_answer2": "To infer a trend or pattern in the performance of new lower bounds (N1, N2, N3) compared to literature bounds (L1, L2, L3, L4) as the number of vertices or edges increases, I first correlated the graph characteristics from Table \\ref{jedge} with the lower bound data in Table \\ref{jedge1}. The number of vertices (|V|) for the graphs are: Graph 1 (36), Graph 2 (36), Graph 3 (16), Graph 4 (27), Graph 5 (32), Graph 6 (21), Graph 7 (16), Graph 8 (13), Graph 9 (10), Graph 10 (6), Graph 11 (36), Graph 12 (27). The number of edges (|E|) follows a similar trend. Comparing the lower bounds to β_M(G), I calculated how often the new bounds (N1, N2, N3) are closer to β_M(G) than the literature bounds (L1, L2, L3, L4) for each graph. 
For graphs with higher vertices (e.g., Graphs 1, 2, 11 with |V|=36), new bounds like N2 and N3 often show smaller differences (e.g., N2 differs by 3, 23, 3 respectively) compared to L1 or L2 (differences of 5, 28, 5). For Graph 2 (9-triangular graph, |V|=36, |E|=252), N2 (difference of 23) is much closer than L1 (difference of 28). However, for smaller graphs like Graph 9 (|V|=10) and Graph 10 (|V|=6), the differences are smaller for both sets, and literature bounds sometimes match or are close (e.g., L4 matches β_M(G) in Graph 10). Notably, in Graph 5 (|V|=32), N1 matches β_M(G) exactly, unlike any literature bound. The general trend suggests that new lower bounds tend to perform better (i.e., are closer to β_M(G)) for graphs with a larger number of vertices or edges, possibly because the new bounds (N1, N2, N3) are derived from more recent methodologies or theorems (like Corollary 3, Theorem 3, Theorem 4) that better capture the complexity of larger graphs. Literature bounds like L3 often show a value of 0, indicating they might not be as applicable or refined for diverse graph structures. However, the sample size is small, and exceptions exist (e.g., Graph 12 with |V|=27 shows similar performance). Overall, as graph size increases, new bounds seem to provide incrementally better approximations, though no single bound dominates universally." }, { "paper_id": "2007.05808.json", "table_id": "table_4", "table_content": "\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\\label{vtor2}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\\\n&$1\\leq i \\leq k-1$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m+n-i-j,j-l+m-i,,$\\\\\n&$l+1\\leq j \\leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\\\n$(k,j)$ & $1 \\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\\\n$(k+1,j)$ & $1 \\leq j \\leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\\\n$(k,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\\\n$(k+1,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\\\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}", "caption": " Metric coordinates of vertices of $T_{2k+1,2l+1}$", "label": "vtor2", "section_info": "3 Exact results on torus graph\n\\section{Exact results on torus graph}\n\nIn this section we will use previously introduced general lower bounds to obtain the exact values of mixed metric dimension of torus graph.\n\n\\begin{thm} For $m,n \\geq 3$ it holds $\\beta_M(T_{m,n}) = 4$.\\end{thm}\n\\begin{proof} \\textbf{\\underline{Step 1}:} {\\em Upper bound is 4}. \\\\\n\nThere are four cases:\\\\\n\\textbf{Case 1.} $m=2k+1, n=2l+1$\\\\\nLet $S = \\{(0,0), (0,l), (1,l+1), (k+1,l+1)\\}$. Let us prove that $S$ is mixed metric resolving set. 
The representation of coordinates of each vertex and each edge,with respect to $S$, is shown in the Table \\ref{vtor2} and Table \\ref{etor2}.\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\\label{vtor2}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. & $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\\\n&$1\\leq i \\leq k-1$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m+n-i-j,j-l+m-i,,$\\\\\n&$l+1\\leq j \\leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\\\n$(k,j)$ & $1 \\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\\\n$(k+1,j)$ & $1 \\leq j \\leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\\\n$(k,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\\\n$(k+1,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\\\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\\label{etor2}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($k+j,k+l-j-1,k+l-j,l-j$)\\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & 
$j-l+m-i,j-l+i-k-2)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\\\n\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\\newpage\n\n\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l+1})\\leq 4.$\n\n\\textbf{Case 2.} $m=2k+1, n=2l$\\\\\nLet $S = \\{(0,0), (0,l), (1,0), (k+1,1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in the Table \\ref{vtor7} and Table \\ref{vtor6}.\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\\label{vtor7}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+2$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, j+1, k+j-1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\\\n\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(k+1,j)$ & $1\\leq j \\leq l$ & ($m-i+j, m-i+l-j,k+j ,j-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+n-j,m-i+j-l,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\\\n$(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,k+j-l,n-j+k ,n-j+1$) \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\\\n& $l+1\\leq j \\leq n-2$ & \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+j,i-k+j-2$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+1+j,i-k+j-2$) \\\\\n\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j+1,k+j-2$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\\\n& $l+1\\leq j \\leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+m-i-1,j-l+m-i,$\\\\\n& & $k+n-j-1,n-j+i-k-1$) \\\\\n$(i,l)(i,l+1)$ & $k+2\\leq i \\leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\\\n\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\\\n\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\n\nSince metric coordinates of all items are mutually different, so $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l})\\leq 4.$\n\n\n\n\n\\newpage\n\n\n\n\n\n\\textbf{Case 3.} $m=2k, n=2l+1$\\\\\nLet $S =\\{(0,0),(k,0),(0,1),(1,l+1)\\}$. Since $C_m \\Box C_n$ is the same as $C_n \\Box C_m$, the proof of this case is similar to the proof of Case 2.\\\\\n\n\\textbf{Case 4.} $m=2k, n=2l$\\\\\nLet $S = \\{(0,0), (0,1), (1,l), (k,0)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in the Table \\ref{etor} and Table \\ref{vtor}.\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\\label{etor}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ 
& ($n-j+i,n-j+i+1,$ \\\n& $l+1\leq j \leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\n$(0,j)(1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\n& $1\leq j \leq l$ & $m-i+l-j,i-k+j$)\\\n$(i,j)(i,j+1)$ & $k+1\leq i \leq m-1$ & ($m-i+j,m-i+j-1,$ \\\n& $1\leq j \leq l-1$ &$m-i+l-j,j+i-k$) \\\n$(i,0)(i+1,0)$ & $k+1\leq i \leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\n\n$(i,0)(i,1)$ & $k+1\leq i \leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\n$(k,j)(k+1,j)$ & $1\leq j \leq l$ & ($m-k+j-1,m-k+j-2,$\\\n& & $m-k+l-j-1,j$) \\\n$(0,j)(m-1,j)$ & $1\leq j \leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\n$(k+1,j)(k+1,j+1)$ & $1\leq j \leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\n& & $m-k+l-j-1,j+1$) \\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\n& $l+1\leq j \leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\n$(i,j)(i,j+1)$ & $k+1\leq i \leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\n& $l+1\leq j \leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\n$(i,l)(i,l+1)$ & $k+1\leq i \leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\n& & $m-i+1,l+i-k-1)$ \\\n$(k,j)(k+1,j)$ & $l+1\leq j \leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\n& & $j-l+k-1,n-j)$ \\\n$(i,0)(i,n-1)$ & $k+1\leq i \leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\n$(0,j)(m-1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0)$\\\n\n\n\t \hline\n\end{tabular}\n\end{center}\n\end{table}\nSince the metric coordinates of all items are mutually different, $S$ is a mixed metric resolving set. Therefore, $\beta_{M}(T_{2k,2l})\leq 4.$\\\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\textbf{\underline{Step 2}:} {\em Lower bound is 4}. 
\\\nThe torus graph is a $4$-regular graph, so by Corollary \ref{lb1reg} it follows that\n$\beta_M(T_{m,n}) \geq 1 + \lceil \log_2 (r+1) \rceil = 1 + \lceil \log_2 5 \rceil = 4$.\n\nTherefore, from the previous two steps, it follows that $\beta_M(T_{m,n})=4.$\n\end{proof}\n\n\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the metric coordinate value for vertex (0,0) in table_4?", "Descriptive_question2": "What condition applies to vertex (i,0) for 1 ≤ i ≤ k in table_4?", "Reasoning_question1": "How do the metric coordinates of vertices in table_4 vary with different ranges of i and j values, and what does this suggest about the structure of the torus graph T_{2k+1,2l+1}?", "Reasoning_question2": "By comparing the metric coordinates across different vertices in table_4, what can be inferred about the symmetry or patterns in the torus graph T_{2k+1,2l+1}?", "Descriptive_answer1": "0, l, l+1, l+k", "Descriptive_answer2": "1 ≤ i ≤ k", "Reasoning_answer1": "Looking at table_4 for the torus graph T_{2k+1,2l+1}, I notice that the metric coordinates r(v,S) for vertices vary systematically based on the ranges of i and j. For instance, at vertex (0,0), the coordinates are (0, l, l+1, l+k), which seem to represent distances to the set S. As i increases for vertices (i,0) where 1 ≤ i ≤ k, the coordinates change to (i, i+l, l+i-1, l+k-i+1), indicating a pattern where some values increase with i while others decrease, reflecting a shift along one axis of the torus. Similarly, for vertices (0,j) where 1 ≤ j ≤ l, the coordinates (j, l-j, l-j+2, l-j+k+1) show a dependency on j, with terms decreasing as j increases, suggesting movement along the other axis. For combined variations in (i,j), such as 1 ≤ i ≤ k-1 and 1 ≤ j ≤ l, coordinates like (j+i, i+l-j, l-j+i, l-j+k-i+2) incorporate both i and j, blending the effects seen in individual axes. 
In other ranges, like l+1 ≤ j ≤ n-1 or k+2 ≤ i ≤ m-1, coordinates adjust with terms like n-j or m-i, indicating wrap-around effects typical of a torus structure. This systematic variation suggests that the torus graph T_{2k+1,2l+1} is a grid-like structure with periodic boundaries, where distances are measured in a wrapped manner, reflecting the toroidal topology connecting opposite edges.", "Reasoning_answer2": "By comparing metric coordinates across different vertices in table_4 for T_{2k+1,2l+1}, I can infer patterns and symmetry in the graph. First, I observe that coordinates for vertices like (0,0) and (k+1,0) involve terms like l, k, and m-i, which repeat or mirror in certain positions, hinting at a central symmetry. For instance, (0,0) has (0, l, l+1, l+k), while (k+1,0) involves (m-i, m-i+l, k+l, l), showing a transformation involving m and k that might correspond to half the grid size or a symmetric point. Looking at vertices along axes, such as (i,0) and (0,j), the coordinates change incrementally (e.g., for (i,0): i increases in the first term, decreases in the last term as i approaches k), suggesting a balanced progression around the torus. For vertices in the middle ranges like (i,j), the coordinates blend i and j contributions (e.g., j+i in first term), and in higher ranges like l+1 ≤ j ≤ n-1, terms like n-j appear, mirroring the lower j ranges with j terms. This indicates a cyclic nature where crossing the boundary (from j=l to j=l+1) reflects a symmetric distance calculation. Additionally, comparing opposite corners or edges, such as (0,j) for j from 1 to l versus l+1 to n-1, shows coordinates flipping from l-j to j-l, reinforcing a mirror symmetry across the grid’s midpoint. These observations suggest that the torus graph T_{2k+1,2l+1} exhibits rotational and reflectional symmetry, consistent with a toroidal structure where paths wrap around, creating symmetric distance patterns from any point relative to the resolving set S." 
}, { "paper_id": "2007.05808.json", "table_id": "table_5", "table_content": "\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\\label{etor2}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & 
($k+j,k+l-j-1,k+l-j,l-j$)\\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\n& $l+1\leq j \leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\n$(i,j)(i,j+1)$ & $k+2\leq i \leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\n& $l+1\leq j \leq n-2$ & $j-l+m-i,j-l+i-k-2)$ \\\n$(i,l)(i,l+1)$ & $k+1\leq i \leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\n$(k,j)(k+1,j)$ & $l+1\leq j \leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\n$(i,0)(i,n-1)$ & $k+2\leq i \leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\n$(0,j)(m-1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\n$(k+1,j)(k+1,j+1)$ & $l+1\leq j \leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\n\n\hline\n\end{tabular}\n\end{center}\n\end{table}", "caption": " Metric coordinates of edges of $T_{2k+1,2l+1}$", "label": "etor2", "section_info": "3 Exact results on torus graph\n\section{Exact results on torus graph}\n\nIn this section we use the previously introduced general lower bounds to obtain the exact value of the mixed metric dimension of the torus graph.\n\n\begin{thm} For $m,n \geq 3$ it holds $\beta_M(T_{m,n}) = 4$.\end{thm}\n\begin{proof} \textbf{\underline{Step 1}:} {\em Upper bound is 4}. \\\n\nThere are four cases:\\\n\textbf{Case 1.} $m=2k+1, n=2l+1$\\\nLet $S = \{(0,0), (0,l), (1,l+1), (k+1,l+1)\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in Table \ref{vtor2} and Table \ref{etor2}.\n\n\n\begin{table}\n\tiny\n\begin{center}\n\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\label{vtor2}\n\begin{tabular}{|c|c|c|}\n \hline\n vertex & cond. 
& $r(v,S)$\\\n \hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\n$(i,0)$ & $1\leq i \leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\n$(0,j)$ & $1 \leq j \leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\n$(i,j)$ & $1\leq i \leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\n&$1\leq j \leq l$ & \\\n$(0,j)$ & $l+1 \leq j \leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\n\n\n$(i,j)$ & $l+1\leq j \leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\n&$1\leq i \leq k-1$ & \\\n\n\n\n$(i,0)$ & $k+2 \leq i \leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\n\n$(i,j)$ & $k+2\leq i \leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\n&$1\leq j \leq l$ & \\\n\n\n$(i,j)$ & $k+2\leq i \leq m-1$ &($m+n-i-j,j-l+m-i,$\\\n&$l+1\leq j \leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\n$(k,j)$ & $1 \leq j \leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\n$(k+1,j)$ & $1 \leq j \leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\n$(k,j)$ & $l+1 \leq j \leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\n$(k+1,j)$ & $l+1 \leq j \leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\n\t \hline\n\end{tabular}\n\end{center}\n\end{table}\n\n\n\n\n\n\begin{table}\n\tiny\n\begin{center}\n\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\label{etor2}\n\begin{tabular}{|c|c|c|}\n\n \hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($k+j,k+l-j-1,k+l-j,l-j$)\\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & 
$j-l+m-i,j-l+i-k-2)$ \\\n$(i,l)(i,l+1)$ & $k+1\leq i \leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\n$(k,j)(k+1,j)$ & $l+1\leq j \leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\n$(i,0)(i,n-1)$ & $k+2\leq i \leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\n$(0,j)(m-1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\n$(k+1,j)(k+1,j+1)$ & $l+1\leq j \leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\n\n\hline\n\end{tabular}\n\end{center}\n\end{table}\n\n\n\n\newpage\n\n\nSince the metric coordinates of all items are mutually different, $S$ is a mixed metric resolving set. Therefore, $\beta_{M}(T_{2k+1,2l+1})\leq 4.$\n\n\textbf{Case 2.} $m=2k+1, n=2l$\\\nLet $S = \{(0,0), (0,l), (1,0), (k+1,1)\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in Table \ref{vtor7} and Table \ref{vtor6}.\n\begin{table}\n\tiny\n\begin{center}\n\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\label{vtor7}\n\begin{tabular}{|c|c|c|}\n \hline\n vertex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+2$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, j+1, k+j-1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\\\n\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(k+1,j)$ & $1\\leq j \\leq l$ & ($m-i+j, m-i+l-j,k+j ,j-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+n-j,m-i+j-l,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\\\n$(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,k+j-l,n-j+k ,n-j+1$) \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\n \hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\n$(0,j)(0,j+1)$ & $1\leq j \leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\n$(i,0)(i+1,0)$ & $1\leq i \leq k$ & ($i,i+l,i-1,k-i+1$) \\\n$(i,j)(i,j+1)$ & $1\leq i \leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\n& $1\leq j \leq l-1$ & \\\n$(i,l)(i,l+1)$ & $1\leq i \leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\n$(i,0)(i,1)$ & $1\leq i \leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\n$(0,j)(1,j)$ & $1\leq j \leq l$ & ($j,l-j,j,k+j-1$) \\\n$(i,j)(i,j+1)$ & $1\leq i \leq k$ & ($n-j+i-1,n-j+i,$ \\\n& $l+1\leq j \leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\n$(0,j)(0,j+1)$ & $l+1\leq j \leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\n$(i,j)(i,j+1)$ & $1\leq i \leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\n& $l+1\leq j \leq n-2$ & \\\n$(i,j)(i+1,j)$ & $1\leq i \leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\n& $1\leq j \leq l$ & \\\n$(i,0)(i,n-1)$ & $1\leq i \leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\n$(i,j)(i+1,j)$ & $1\leq i \leq k-1$ & ($n-j+i,j-l+i,$ \\\n& $l+1\leq j \leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\n$(0,j)(1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\n& $1\leq j \leq l$ & $m-i+j,i-k+j-2$)\\\n$(i,j)(i,j+1)$ & $k+2\leq i \leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\n& $1\leq j \leq l-1$ &$m-i+1+j,i-k+j-2$) \\\n\n$(k+1,j)(k+1,j+1)$ & $1\leq j \leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\n$(i,0)(i+1,0)$ & $k+1\leq i \leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\n$(i,0)(i,1)$ & $k+2\leq i \leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\n$(0,j)(m-1,j)$ & $1\leq j \leq l$ & ($j,l-j,j+1,k+j-2$) \\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\n& $l+1\leq j \leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\n$(i,j)(i,j+1)$ & $k+2\leq i \leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\n& $l+1\leq j \leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\n$(k+1,j)(k+1,j+1)$ & $l+1\leq j \leq n-2$ & ($n-j+m-i-1,j-l+m-i,$\\\n& & $k+n-j-1,n-j+i-k-1$) \\\n$(i,l)(i,l+1)$ & $k+2\leq i \leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2)$\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\n$(i,0)(i,n-1)$ & $k+2\leq i \leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\n\n$(0,j)(m-1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\n\n\n\n\t \hline\n\end{tabular}\n\end{center}\n\end{table}\n\n\n\n\n\n\n\n\nSince the metric coordinates of all items are mutually different, $S$ is a mixed metric resolving set. Therefore, $\beta_{M}(T_{2k+1,2l})\leq 4.$\n\n\n\n\n\newpage\n\n\n\n\n\n\textbf{Case 3.} $m=2k, n=2l+1$\\\nLet $S =\{(0,0),(k,0),(0,1),(1,l+1)\}$. Since $C_m \Box C_n$ is the same as $C_n \Box C_m$, the proof of this case is similar to the proof of Case 2.\\\n\n\textbf{Case 4.} $m=2k, n=2l$\\\nLet $S = \{(0,0), (0,1), (1,l), (k,0)\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in Table \ref{etor} and Table \ref{vtor}.\n\n\begin{table}\n\tiny\n\begin{center}\n\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\label{etor}\n\begin{tabular}{|c|c|c|}\n \hline\n vertex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ 
& ($n-j+i,n-j+i+1,$ \\\n& $l+1\leq j \leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\n$(0,j)(1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\n& $1\leq j \leq l$ & $m-i+l-j,i-k+j$)\\\n$(i,j)(i,j+1)$ & $k+1\leq i \leq m-1$ & ($m-i+j,m-i+j-1,$ \\\n& $1\leq j \leq l-1$ &$m-i+l-j,j+i-k$) \\\n$(i,0)(i+1,0)$ & $k+1\leq i \leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\n\n$(i,0)(i,1)$ & $k+1\leq i \leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\n$(k,j)(k+1,j)$ & $1\leq j \leq l$ & ($m-k+j-1,m-k+j-2,$\\\n& & $m-k+l-j-1,j$) \\\n$(0,j)(m-1,j)$ & $1\leq j \leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\n$(k+1,j)(k+1,j+1)$ & $1\leq j \leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\n& & $m-k+l-j-1,j+1$) \\\n$(i,j)(i+1,j)$ & $k+1\leq i \leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\n& $l+1\leq j \leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\n$(i,j)(i,j+1)$ & $k+1\leq i \leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\n& $l+1\leq j \leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\n$(i,l)(i,l+1)$ & $k+1\leq i \leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\n& & $m-i+1,l+i-k-1)$ \\\n$(k,j)(k+1,j)$ & $l+1\leq j \leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\n& & $j-l+k-1,n-j)$ \\\n$(i,0)(i,n-1)$ & $k+1\leq i \leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\n$(0,j)(m-1,j)$ & $l+1\leq j \leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0)$\\\n\n\n\t \hline\n\end{tabular}\n\end{center}\n\end{table}\nSince the metric coordinates of all items are mutually different, $S$ is a mixed metric resolving set. Therefore, $\beta_{M}(T_{2k,2l})\leq 4.$\\\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\textbf{\underline{Step 2}:} {\em Lower bound is 4}. 
\\\nThe torus graph is a $4$-regular graph, so by Corollary \ref{lb1reg} it follows that\n$\beta_M(T_{m,n}) \geq 1 + \lceil \log_2 (r+1) \rceil = 1 + \lceil \log_2 5 \rceil = 4$.\n\nTherefore, from the previous two steps, it follows that $\beta_M(T_{m,n})=4.$\n\end{proof}\n\n\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the metric coordinate for the edge (0,0)(1,0) in table_5?", "Descriptive_question2": "Which edge in table_5 has the condition '0≤ j ≤ l-1' for the coordinates of (0,j)(0,j+1)?", "Reasoning_question1": "How do the metric coordinates of edges in table_5 vary based on the conditions specified for different ranges of i and j?", "Reasoning_question2": "What patterns or trends can be observed in the metric coordinates of edges in table_5 when comparing edges with similar vertex conditions?", "Descriptive_answer1": "(0,l,l,k+l)", "Descriptive_answer2": "(0,j)(0,j+1)", "Reasoning_answer1": "The metric coordinates of edges in table_5, which details the edges of the torus graph T_{2k+1,2l+1}, vary significantly based on the specified conditions for indices i and j. First, for edges with lower index ranges like 1≤i≤k or 1≤j≤l, the coordinates often involve linear combinations of i, j, k, and l, reflecting positions closer to the origin or central axes of the grid, as seen with edges like (i,0)(i+1,0) having coordinates (i,l+i,l+i-1,k-i+l). As indices increase to ranges like l+1≤j≤n-2 or k+1≤i≤m-2, the coordinates incorporate terms like n-j or m-i, indicating a shift towards the boundaries of the grid and adjusting for wrap-around effects inherent in a torus structure, evident in edges like (0,j)(0,j+1) with coordinates (n-j-1,j-l,j-l,k+j-l-1). Additionally, specific conditions for fixed points, such as edges involving (0,0) or (k+1,l+1), yield coordinates with simpler forms or constants, reflecting their role in the resolving set S. 
This structured variation ensures unique identification of each edge based on its position and the torus's cyclic nature.", "Reasoning_answer2": "Analyzing table_5 for edges of T_{2k+1,2l+1}, several patterns emerge when comparing metric coordinates of edges with similar vertex conditions. For edges sharing the same vertex type, such as horizontal connections like (i,j)(i,j+1), coordinates consistently include terms like i+j or l-j, adjusted by the range of j, indicating a systematic shift based on horizontal progression; for instance, under 1≤j≤l-1, coordinates are (i+j,l-j-1+i,l+i-j-1,l-j+k-i+1), while for l+1≤j≤n-2, they shift to (n-j+i-1,j-l+i,j-l+i-2,k+j-l-i), reflecting boundary effects. Similarly, vertical edges like (i,j)(i+1,j) show coordinates with repeating terms adjusted by i or j, such as (i+j,l-j+i,l-j+i,l-j+k-i+1) for 1≤j≤l, suggesting a mirrored or cyclic adjustment. Edges involving specific vertices like (0,0) or (k+1,0) often have coordinates with fixed values or simple increments, like (0,l,l,k+l) for (0,0)(1,0), highlighting their foundational role in the resolving set S. These patterns underscore a structured design in the torus graph, ensuring distinct metric representations through incremental and cyclic adjustments based on vertex conditions." }, { "paper_id": "2007.05808.json", "table_id": "table_6", "table_content": "\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\\label{vtor7}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\n \hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\n$(i,0)$ & $1\leq i \leq k$ & ($i,i+l,i-1,k-i+2$) \\\n$(0,j)$ & $1 \leq j \leq l$ & ($j,l-j, j+1, k+j-1$) \\\n$(i,j)$ & $1\leq i \leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\n&$1\leq j \leq l$ & \\\n$(0,j)$ & $l+1 \leq j \leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\n\n\n$(i,j)$ & $l+1\leq j \leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\n&$1\leq i \leq k$ & \\\n\n\n\n$(i,0)$ & $k+2 \leq i \leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\n\n\n\n$(i,j)$ & $k+2\leq i \leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\n&$1\leq j \leq l$ & \\\n$(k+1,j)$ & $1\leq j \leq l$ & ($m-i+j, m-i+l-j, k+j, j-1$) \\\n\n$(i,j)$ & $k+2\leq i \leq m-1$ &($m-i+n-j,m-i+j-l,$\\\n&$l+1\leq j \leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\n$(k+1,j)$ & $l+1\leq j \leq n-1$ & ($n-j+k,k+j-l,n-j+k, n-j+1$) \\\n\n\n\t \hline\n\end{tabular}\n\end{center}\n\end{table}", "caption": " Metric coordinates of vertices of $T_{2k+1,2l}$", "label": "vtor7", "section_info": "3 Exact results on torus graph\n\section{Exact results on torus graph}\n\nIn this section we use the previously introduced general lower bounds to obtain the exact value of the mixed metric dimension of the torus graph.\n\n\begin{thm} For $m,n \geq 3$ it holds $\beta_M(T_{m,n}) = 4$.\end{thm}\n\begin{proof} \textbf{\underline{Step 1}:} {\em Upper bound is 4}. \\\n\nThere are four cases:\\\n\textbf{Case 1.} $m=2k+1, n=2l+1$\\\nLet $S = \{(0,0), (0,l), (1,l+1), (k+1,l+1)\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in Table \ref{vtor2} and Table \ref{etor2}.\n\n\n\begin{table}\n\tiny\n\begin{center}\n\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\label{vtor2}\n\begin{tabular}{|c|c|c|}\n \hline\n vertex & cond. 
& $r(v,S)$\\\n \hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\n$(i,0)$ & $1\leq i \leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\n$(0,j)$ & $1 \leq j \leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\n$(i,j)$ & $1\leq i \leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\n&$1\leq j \leq l$ & \\\n$(0,j)$ & $l+1 \leq j \leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\n\n\n$(i,j)$ & $l+1\leq j \leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\n&$1\leq i \leq k-1$ & \\\n\n\n\n$(i,0)$ & $k+2 \leq i \leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\n\n$(i,j)$ & $k+2\leq i \leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\n&$1\leq j \leq l$ & \\\n\n\n$(i,j)$ & $k+2\leq i \leq m-1$ &($m+n-i-j,j-l+m-i,$\\\n&$l+1\leq j \leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\n$(k,j)$ & $1 \leq j \leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\n$(k+1,j)$ & $1 \leq j \leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\n$(k,j)$ & $l+1 \leq j \leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\n$(k+1,j)$ & $l+1 \leq j \leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\n\t \hline\n\end{tabular}\n\end{center}\n\end{table}\n\n\n\n\n\n\begin{table}\n\tiny\n\begin{center}\n\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\label{etor2}\n\begin{tabular}{|c|c|c|}\n\n \hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($k+j,k+l-j-1,k+l-j,l-j$)\\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & 
$j-l+m-i,j-l+i-k-2)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\\\n\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\\newpage\n\n\nSince the metric coordinates of all items are mutually different, $S$ is a mixed metric resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l+1})\\leq 4.$\n\n\\textbf{Case 2.} $m=2k+1, n=2l$\\\\\nLet $S = \\{(0,0), (0,l), (1,0), (k+1,1)\\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of the coordinates of each vertex and each edge, with respect to $S$, is shown in Table \\ref{vtor7} and Table \\ref{vtor6}.\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\\label{vtor7}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vertex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+2$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, j+1, k+j-1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\\\n\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j, k+l-j, k+j, j-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+n-j,m-i+j-l,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\\\n$(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k, k+j-l, n-j+k, n-j+1$) \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\\\n& $l+1\\leq j \\leq n-2$ & \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+j,i-k+j-2$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+1+j,i-k+j-2$) \\\\\n\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j+1,k+j-2$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\\\n& $l+1\\leq j \\leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+k-1,j-l+k,$\\\\\n& & $k+n-j-1,n-j$) \\\\\n$(i,l)(i,l+1)$ & $k+2\\leq i \\leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\\\n\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\\\n\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\n\nSince the metric coordinates of all items are mutually different, $S$ is a mixed metric resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l})\\leq 4.$\n\n\n\n\n\\newpage\n\n\n\n\n\n\\textbf{Case 3.} $m=2k, n=2l+1$\\\\\nLet $S =\\{(0,0),(k,0),(0,1),(1,l+1)\\}$. Since $C_m \\Box C_n$ is the same as $C_n \\Box C_m$, the proof of this case is similar to the proof of Case 2.\\\\\n\n\\textbf{Case 4.} $m=2k, n=2l$\\\\\nLet $S = \\{(0,0), (0,1), (1,l), (k,0)\\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of the coordinates of each vertex and each edge, with respect to $S$, is shown in Table \\ref{etor} and Table \\ref{vtor}.\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\\label{etor}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vertex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ 
& ($n-j+i,n-j+i+1,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j,i-k+j$)\\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+j,m-i+j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$,m-i+l-j,j+i-k$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\\\n\n$(i,0)(i,1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $,m-k+l-j-1,j$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $,m-k+l-j-1,j+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\\\n& & $,m-i+1,l+i-k-1)$ \\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\\\n& & $,j-l+k-1,n-j)$ \\\\\n$(i,0)(i,n-1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0$\\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k,2l})\\leq 4.$\\\\\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\textbf{\\underline{Step 2}:} {\\em Lower bound is 4}. 
\\\\\nThe torus graph is a $4$-regular graph, so by Corollary \\ref{lb1reg} it follows that\n$\\beta_M(T_{m,n}) \\geq 1 + \\lceil \\log_2 (r+1) \\rceil = 1 + \\lceil \\log_2 5 \\rceil = 4$.\n\nTherefore, from the previous two steps, it follows that $\\beta_M(T_{m,n})=4.$\n\\end{proof}\n\n\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the metric coordinate value for vertex (0,0) in table_6?", "Descriptive_question2": "What is the condition for vertex (i,0) in table_6 for the range 1 ≤ i ≤ k?", "Reasoning_question1": "How do the metric coordinates for vertices in table_6 change as the value of j increases from 1 to l for a fixed i?", "Reasoning_question2": "What can be inferred about the structure of the torus graph T_{2k+1,2l} from the metric coordinates provided in table_6 for different vertex conditions?", "Descriptive_answer1": "(0, l, 1, k+1)", "Descriptive_answer2": "1≤ i ≤ k", "Reasoning_answer1": "For a fixed i, as j increases from 1 to l, I observe the metric coordinates for vertex (i,j) in table_6, under the condition 1≤ i ≤ k and 1≤ j ≤ l, are given as (j+i, l-j+i, j+i-1, j+k-i). Let's analyze each component: the first component (j+i) increases linearly with j, reflecting a direct increase in distance or position as j grows. The second component (l-j+i) decreases with j, suggesting a reduction in distance from a reference point related to l as j increases. The third component (j+i-1) also increases with j, similar to the first, indicating a consistent progression. The fourth component (j+k-i) increases with j, combining the growth in j with constants k and i. Overall, the trend shows a systematic shift in coordinates, balancing increases and decreases, likely representing movement along a dimension of the torus graph.", "Reasoning_answer2": "Analyzing the metric coordinates in table_6 for T_{2k+1,2l}, I notice distinct patterns across different vertex conditions, which provide insight into the graph's structure. 
For instance, the coordinates for (0,0) are (0, l, 1, k+1), suggesting a baseline or origin-like point with specific distances to reference points. As I look at vertices along axes, such as (i,0) for 1≤ i ≤ k with coordinates (i, i+l, i-1, k-i+2), I see incremental changes in components, indicating a grid-like progression with wrap-around considerations, typical of a torus due to its cyclic nature. Similarly, for (0,j) with 1≤ j ≤ l, coordinates (j, l-j, j+1, k+j-1) show a balanced shift, reflecting symmetry or periodicity. For combined indices (i,j), multiple conditions reveal complex interactions in coordinates, often involving terms like m-i or n-j, suggesting wrap-around effects at boundaries, a hallmark of toroidal topology. Overall, these patterns infer that T_{2k+1,2l} is a structured, periodic graph, likely a 2D grid wrapped into a torus shape, where distances are computed considering cyclic connections both horizontally and vertically." }, { "paper_id": "2007.05808.json", "table_id": "table_7", "table_content": "\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\\\n& $l+1\\leq j \\leq n-2$ & \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+j,i-k+j-2$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+1+j,i-k+j-2$) \\\\\n\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j+1,k+j-2$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\\\n& $l+1\\leq j \\leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+m-i-1,j-l+m-i,$\\\\\n& & $k+n-j-1,n-j+i-k-1$) \\\\\n$(i,l)(i,l+1)$ & $k+2\\leq i \\leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\\\n\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\\\n\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}", "caption": " Metric coordinates of edges of $T_{2k+1,2l}$", "label": "vtor6", "section_info": "3 Exact results on torus graph\n\\section{Exact results on torus graph}\n\nIn this section we will use previously introduced general lower bounds to obtain the exact values of mixed metric dimension of torus graph.\n\n\\begin{thm} For $m,n \\geq 3$ it holds $\\beta_M(T_{m,n}) = 4$.\\end{thm}\n\\begin{proof} \\textbf{\\underline{Step 1}:} {\\em Upper bound is 4}. \\\\\n\nThere are four cases:\\\\\n\\textbf{Case 1.} $m=2k+1, n=2l+1$\\\\\nLet $S = \\{(0,0), (0,l), (1,l+1), (k+1,l+1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge,with respect to $S$, is shown in the Table \\ref{vtor2} and Table \\ref{etor2}.\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\\label{vtor2}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\\\n&$1\\leq i \\leq k-1$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m+n-i-j,j-l+m-i,,$\\\\\n&$l+1\\leq j \\leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\\\n$(k,j)$ & $1 \\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\\\n$(k+1,j)$ & $1 \\leq j \\leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\\\n$(k,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\\\n$(k+1,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\\\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\\label{etor2}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($k+j,k+l-j-1,k+l-j,l-j$)\\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & 
$j-l+m-i,j-l+i-k-2)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\\\n\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\\newpage\n\n\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l+1})\\leq 4.$\n\n\\textbf{Case 2.} $m=2k+1, n=2l$\\\\\nLet $S = \\{(0,0), (0,l), (1,0), (k+1,1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in the Table \\ref{vtor7} and Table \\ref{vtor6}.\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\\label{vtor7}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+2$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, j+1, k+j-1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\\\n\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(k+1,j)$ & $1\\leq j \\leq l$ & ($m-i+j, m-i+l-j,k+j ,j-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+n-j,m-i+j-l,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\\\n$(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,k+j-l,n-j+k ,n-j+1$) \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\\\n& $l+1\\leq j \\leq n-2$ & \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+j,i-k+j-2$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+1+j,i-k+j-2$) \\\\\n\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j+1,k+j-2$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\\\n& $l+1\\leq j \\leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+m-i-1,j-l+m-i,$\\\\\n& & $k+n-j-1,n-j+i-k-1$) \\\\\n$(i,l)(i,l+1)$ & $k+2\\leq i \\leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\\\n\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\\\n\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\n\nSince metric coordinates of all items are mutually different, so $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l})\\leq 4.$\n\n\n\n\n\\newpage\n\n\n\n\n\n\\textbf{Case 3.} $m=2k, n=2l+1$\\\\\nLet $S =\\{(0,0),(k,0),(0,1),(1,l+1)\\}$. Since $C_m \\Box C_n$ is the same as $C_n \\Box C_m$, the proof of this case is similar to the proof of Case 2.\\\\\n\n\\textbf{Case 4.} $m=2k, n=2l$\\\\\nLet $S = \\{(0,0), (0,1), (1,l), (k,0)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in the Table \\ref{etor} and Table \\ref{vtor}.\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\\label{etor}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ 
& ($n-j+i,n-j+i+1,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j,i-k+j$)\\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+j,m-i+j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$,m-i+l-j,j+i-k$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\\\n\n$(i,0)(i,1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $,m-k+l-j-1,j$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $,m-k+l-j-1,j+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\\\n& & $,m-i+1,l+i-k-1)$ \\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\\\n& & $,j-l+k-1,n-j)$ \\\\\n$(i,0)(i,n-1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0$\\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k,2l})\\leq 4.$\\\\\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\textbf{\\underline{Step 2}:} {\\em Lower bound is 4}. 
\\\\\nThe torus graph is a $4$-regular graph, so by Corollary \\ref{lb1reg} it follows that\n$\\beta_M(T_{m,n}) \\geq 1 + \\lceil \\log_2 (r+1) \\rceil = 1 + \\lceil \\log_2 5 \\rceil = 4$.\n\nTherefore, from the previous two steps, it follows that $\\beta_M(T_{m,n})=4.$\n\\end{proof}\n\n\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the metric coordinate of the edge (0,0)(0,1) in table_7?", "Descriptive_question2": "Which edge in table_7 has the condition '1≤j≤l-1' for the coordinates (j, l-j-1, j+1, k+j-1)?", "Reasoning_question1": "What can be inferred about the pattern of metric coordinates for edges of the form (i,j)(i,j+1) across different ranges of j in table_7?", "Reasoning_question2": "How do the metric coordinates of edges in table_7 vary between horizontal and vertical connections in the torus graph T_{2k+1,2l}?", "Descriptive_answer1": "0,l-1,1,k", "Descriptive_answer2": "(0,j)(0,j+1)", "Reasoning_answer1": "Looking at table_7, for edges of the form (i,j)(i,j+1), the metric coordinates show distinct patterns based on the range of j. For 1≤j≤l-1, the coordinates are (i+j, i+l-j-1, j+i-1, k+j-i), indicating a combination of i and j terms with adjustments based on l and k. When j ranges from l+1≤j≤n-2, the coordinates shift to (n-j-1+i, j-l+i, n-j-2+i, k+n-j-i+1), reflecting a dependency on n and a reversal in the trend of j, suggesting a wrap-around effect typical of a torus structure. This variation highlights how the torus grid's periodicity influences coordinate calculations across different segments of the graph.", "Reasoning_answer2": "In table_7, metric coordinates for edges in the torus graph T_{2k+1,2l} differ significantly between horizontal and vertical connections due to the graph's structure. Horizontal edges, such as (i,j)(i,j+1), often have coordinates that incorporate terms like i+j or n-j, reflecting movement along the j-axis with periodic adjustments (e.g., for 1≤j≤l-1, coordinates are i+j, i+l-j-1, etc.). 
Vertical edges, like (i,j)(i+1,j), show coordinates with terms like i+l or m-i, indicating movement along the i-axis (e.g., for 1≤i≤k and 1≤j≤l, coordinates are i+j, i+l-j, etc.). This distinction arises from the torus's cyclic nature in both dimensions, where horizontal connections wrap around the j-direction and vertical ones wrap around the i-direction, resulting in different coordinate formulations to capture unique positions in the grid." }, { "paper_id": "2007.05808.json", "table_id": "table_8", "table_content": "\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\\label{etor}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. & $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}", "caption": " Metric coordinates of vertices of $T_{2k,2l}$", "label": "etor", "section_info": "3 Exact results on torus graph\n\\section{Exact results on torus graph}\n\nIn this section we will use previously introduced general lower bounds to obtain the exact values of mixed metric dimension of torus graph.\n\n\\begin{thm} For $m,n \\geq 3$ it holds $\\beta_M(T_{m,n}) = 4$.\\end{thm}\n\\begin{proof} \\textbf{\\underline{Step 1}:} {\\em Upper bound is 4}. 
\\\\\n\nThere are four cases:\\\\\n\\textbf{Case 1.} $m=2k+1, n=2l+1$\\\\\nLet $S = \\{(0,0), (0,l), (1,l+1), (k+1,l+1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge,with respect to $S$, is shown in the Table \\ref{vtor2} and Table \\ref{etor2}.\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\\label{vtor2}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. & $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\\\n&$1\\leq i \\leq k-1$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m+n-i-j,j-l+m-i,,$\\\\\n&$l+1\\leq j \\leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\\\n$(k,j)$ & $1 \\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\\\n$(k+1,j)$ & $1 \\leq j \\leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\\\n$(k,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\\\n$(k+1,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\\\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\\label{etor2}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($k+j,k+l-j-1,k+l-j,l-j$)\\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & 
$j-l+m-i,j-l+i-k-2)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\\\n\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\\newpage\n\n\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l+1})\\leq 4.$\n\n\\textbf{Case 2.} $m=2k+1, n=2l$\\\\\nLet $S = \\{(0,0), (0,l), (1,0), (k+1,1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in the Table \\ref{vtor7} and Table \\ref{vtor6}.\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\\label{vtor7}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+2$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, j+1, k+j-1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\\\n\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(k+1,j)$ & $1\\leq j \\leq l$ & ($m-i+j, m-i+l-j,k+j ,j-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+n-j,m-i+j-l,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\\\n$(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,k+j-l,n-j+k ,n-j+1$) \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\\\n& $l+1\\leq j \\leq n-2$ & \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+j,i-k+j-2$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+1+j,i-k+j-2$) \\\\\n\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j+1,k+j-2$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\\\n& $l+1\\leq j \\leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+m-i-1,j-l+m-i,$\\\\\n& & $k+n-j-1,n-j+i-k-1$) \\\\\n$(i,l)(i,l+1)$ & $k+2\\leq i \\leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\\\n\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\\\n\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\n\nSince the metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l})\\leq 4.$\n\n\n\n\n\\newpage\n\n\n\n\n\n\\textbf{Case 3.} $m=2k, n=2l+1$\\\\\nLet $S =\\{(0,0),(k,0),(0,1),(1,l+1)\\}$. Since $C_m \\Box C_n$ is the same as $C_n \\Box C_m$, the proof of this case is similar to the proof of Case 2.\\\\\n\n\\textbf{Case 4.} $m=2k, n=2l$\\\\\nLet $S = \\{(0,0), (0,1), (1,l), (k,0)\\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in Table \\ref{etor} and Table \\ref{vtor}.\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\\label{etor}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vertex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ 
& ($n-j+i,n-j+i+1,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j,i-k+j$)\\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+j,m-i+j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+l-j,j+i-k$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\\\n\n$(i,0)(i,1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $m-k+l-j-1,j$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $m-k+l-j-1,j+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\\\n& & $m-i+1,l+i-k-1)$ \\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\\\n& & $j-l+k-1,n-j)$ \\\\\n$(i,0)(i,n-1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0)$\\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k,2l})\\leq 4.$\\\\\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\textbf{\\underline{Step 2}:} {\\em Lower bound is 4}. 
\\\\\nThe torus graph is a $4$-regular graph, so by Corollary \\ref{lb1reg} it follows that\n$\\beta_M(T_{m,n}) \\geq 1 + \\lceil \\log_2 (r+1) \\rceil = 1 + \\lceil \\log_2 5 \\rceil = 4$.\n\nTherefore, from the previous two steps, it follows that $\\beta_M(T_{m,n})=4.$\n\\end{proof}\n\n\n\n\n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the metric coordinate for vertex (0,0) in table_8?", "Descriptive_question2": "What is the condition for vertex (i,0) in table_8?", "Reasoning_question1": "How do the metric coordinates of vertices change with increasing i for (i,0) in table_8, and what pattern can be observed?", "Reasoning_question2": "Why might the metric coordinates differ between vertices in different ranges of j values in table_8?", "Descriptive_answer1": "(0, 1, l+1, k)", "Descriptive_answer2": "1≤i≤k", "Reasoning_answer1": "Looking at the metric coordinates for vertices (i,0) in table_8, as i increases from 1 to k, the coordinates are given by (i, i+1, l+i-1, k-i). Let's break this down: the first coordinate increases linearly with i, indicating a direct correspondence to the position along the i-axis. The second coordinate, i+1, also increases with i, suggesting a step ahead in a related dimension. The third coordinate, l+i-1, increases with i, reflecting a combined effect of a constant l and the incremental i. The fourth coordinate, k-i, decreases as i increases, showing an inverse relationship with i relative to a constant k. The pattern observed here is that the coordinates generally increase or decrease systematically with i, capturing positional shifts in a grid-like structure of the torus graph, with three coordinates increasing and one decreasing to maintain unique identification of each vertex position.", "Reasoning_answer2": "The metric coordinates differ between vertices in different ranges of j values in table_8 due to the structural properties of the torus graph, which is a grid wrapped around at the edges to form a toroidal surface. 
For j from 1 to l, the coordinates for (0,j) are (j, j-1, l-j+1, k+j), reflecting a specific positional encoding relative to one side of the torus. In contrast, for j from l+1 to n-1, the coordinates for (0,j) become (n-j, n-j+1, j-l+1, n-j+k), indicating a shift in reference, likely due to the wrapping effect of the torus where positions are measured relative to the opposite side or a different boundary. This difference in coordinate calculation is necessary to ensure that each vertex has a unique set of metric coordinates with respect to the chosen resolving set S, accounting for the periodic nature of the torus grid. The change in j ranges effectively splits the torus into different sectors, requiring distinct formulas to maintain the uniqueness of representations across the entire graph." }, { "paper_id": "2007.05808.json", "table_id": "table_9", "table_content": "\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,n-j+i+1,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j,i-k+j$)\\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+j,m-i+j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$,m-i+l-j,j+i-k$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\\\n\n$(i,0)(i,1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $,m-k+l-j-1,j$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $,m-k+l-j-1,j+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j 
\\leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\\\n& & $,m-i+1,l+i-k-1)$ \\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\\\n& & $,j-l+k-1,n-j)$ \\\\\n$(i,0)(i,n-1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0$\\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}", "caption": " Metric coordinates of edges of $T_{2k,2l}$", "label": "vtor", "section_info": "3 Exact results on torus graph\n\\section{Exact results on torus graph}\n\nIn this section we will use previously introduced general lower bounds to obtain the exact values of mixed metric dimension of torus graph.\n\n\\begin{thm} For $m,n \\geq 3$ it holds $\\beta_M(T_{m,n}) = 4$.\\end{thm}\n\\begin{proof} \\textbf{\\underline{Step 1}:} {\\em Upper bound is 4}. \\\\\n\nThere are four cases:\\\\\n\\textbf{Case 1.} $m=2k+1, n=2l+1$\\\\\nLet $S = \\{(0,0), (0,l), (1,l+1), (k+1,l+1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge,with respect to $S$, is shown in the Table \\ref{vtor2} and Table \\ref{etor2}.\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l+1}$}\n\\label{vtor2}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, l+1, l+k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,l+i-1,l+k-i+1$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, l-j+2, l-j+k+1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k-1$ &($j+i,i+l-j,l-j+i,l-j+k-i+2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,j-l,j-l+k-1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,j-l+i-1,k-i+j-l$)\\\\\n&$1\\leq i \\leq k-1$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+l+1, l-k+i-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i+l-j,m-i+l-j+2,i-k+l-j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m+n-i-j,j-l+m-i,,$\\\\\n&$l+1\\leq j \\leq n-1$ & $j-l-i+m,j-l+i-k-2)$ \\\\\n\n$(k+1,0)$ & & ($m-i, m-i+l, k+l,l$) \\\\\n$(k,j)$ & $1 \\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+2$) \\\\\n$(k+1,j)$ & $1 \\leq j \\leq l$ & ($k+j, k+l-j, k+l-j+1, l-j+1$) \\\\\n$(k,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j, j-l+k, k+j-l-2, j-l$) \\\\\n$(k+1,j)$ & $l+1 \\leq j \\leq n-1$ & ($k+n-j,k+j-l,k+j-l-1,j-l-1$) \\\\\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l+1}$}\n\\label{etor2}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n\n $(0,0)(1,0)$ & & ($0,l,l,k+l$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l,l+1,k+l-1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,l+1,l+k-1$) \\\\\n$(0,j)(0,j+1)$ & $0\\leq j \\leq l-1$ & ($j, l-j-1, l-j+1, k+l-j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,l+i,l+i-1,k-i+l$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,l-j-1+i,l+i-j-1,l-j+k-i+1$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i,i,i-1,k-i+1$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l+i-1,l+i-1,l+k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+1,k+l-j+1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,j-l,k+j-l-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-2,k+j-l-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,l-j+i,l-j+i,l-j+k-i+1$) \\\\\n\n\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,l+i,l+i-1,l+k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-2,k-i+j-l-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l-1,k+j-l-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($j+m-i-1,l-j+m-i-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j+1,l+1-j+i-k-1$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,l-j-1+m-i,$ \\\\\n& $1\\leq j \\leq l-1$ &$l-j+m-i,l-j+i-k-1$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i+l,i-k-1+l$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l-1,m-i+l+1,l+i-j-k-1$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($k+j,k+l-j,k+l-j,l-j+1$)\\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,l-j+2,k+l-j$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($k+j,k+l-j-1,k+l-j,l-j$)\\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,m-i+j-l-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & $m-i+j-l-1,i-k+j-l-2)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j-1+m-i,j-l+m-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & 
$j-l+m-i,j-l+i-k-2)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+l,m-i,m-i+1,i-k-1)$\\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,j-l+k,k+j-l-2,j-l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+l,l+i-k-2)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,j-l,k+j-l-2)$\\\\\n$(0,l)(0,l+1)$ & & ($l,0,1,k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,0,2,k)$\\\\\n$(k,0)(k+1,0)$ & & ($k,k+l,k+l-1,l)$\\\\\n$(k+1,0)(k+1,1)$ & & ($k,k+l-1,k+l,l)$\\\\\t\n$(k+1,0)(k+1,n-1)$ & & ($k,k+l,k+l-1,l-1)$\\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+k-1,j-l+k,j-l+k-1,j-l-1)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($k+l,k,k,0)$\\\\\n\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\\newpage\n\n\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l+1})\\leq 4.$\n\n\\textbf{Case 2.} $m=2k+1, n=2l$\\\\\nLet $S = \\{(0,0), (0,l), (1,0), (k+1,1)\\}$. Let us prove that $S$ is mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in the Table \\ref{vtor7} and Table \\ref{vtor6}.\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k+1,2l}$}\n\\label{vtor7}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vetex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, l, 1, k+1$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+2$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,l-j, j+1, k+j-1$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,l-j+i,j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,n-j+k+1$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,j-l+i,n-j+i-1,n-j+k-i+2$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+2 \\leq i \\leq m-1$ & ($m-i, m-i+l, m-i+1, i-k$) \\\\\n\n$(k+1,0)$ & & ($k, k+l, k, 1$) \\\\\n\n\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+j,m-i-j+l,m-i+j+1,i-k+j-2$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(k+1,j)$ & $1\\leq j \\leq l$ & ($m-i+j, m-i+l-j,k+j ,j-1$) \\\\\n\n$(i,j)$ & $k+2\\leq i \\leq m-1$ &($m-i+n-j,m-i+j-l,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m-i+1+n-j,n+i-k-j)$ \\\\\n$(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+k,k+j-l,n-j+k ,n-j+1$) \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k+1,2l}$}\n\\label{vtor6}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. 
& $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,l-1,1,k$) \\\\\n $(0,0)(1,0)$ & & ($0,l,0,k+1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,l-1,1,k+1$) \\\\\n $(0,0)(m-1,0)$ & & ($0,l,1,k$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, l-j-1, j+1, k+j-1$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k$ & ($i,i+l,i-1,k-i+1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j-1,j+i-1,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,i,l+i-2,k-i+l$) \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,l-1+i,i-1,k-i+1$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j,k+j-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,j-l,n-j,k+n-j$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j-1+i,j-l+i,n-j-2+i,k+n-j-i+1$) \\\\\n& $l+1\\leq j \\leq n-2$ & \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k$ & ($i+j,i+l-j,j+i-1,k-i+j-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+l-1,i-1,k-i+2$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($n-j+i,j-l+i,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $n-j+i-1,n+k-i-j+1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j,n-j+k+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+j,i-k+j-2$)\\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($m-i+j,m-i+l-j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+1+j,i-k+j-2$) \\\\\n\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-i+j,m-i+l-j-1,k+j,i-k+j-2$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i+l-1,m-i,i-k$) \\\\\n$(i,0)(i,1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i-1+l,m-i+1,i-k-1$) \\\\\n $(k+1,0)(k+1,1)$ & & ($k,k+l-1,k,0$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,l-j,j+1,k+j-2$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,j-l+m-i-1,$\\\\\n& $l+1\\leq j \\leq n-1$ & 
$m+n-j-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+2\\leq i \\leq m-1$ & ($n-j+m-i-1,m+j-i-l,$\\\\\n& $l+1\\leq j \\leq n-2$ & $n-j+m-i,n-j+i-k-1)$ \\\\\n$(k+1,j)(k+1,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j+m-i-1,j-l+m-i,$\\\\\n& & $k+n-j-1,n-j+i-k-1$) \\\\\n$(i,l)(i,l+1)$ & $k+2\\leq i \\leq m-1$ & ($l+m-i-1,m-i,l+m-i,l+i-k-2)$\\\\\n$(k+1,l)(k+1,l+1)$ & & ($l+k-1,k,l+k-1,l-1)$\\\\\n$(i,0)(i,n-1)$ & $k+2\\leq i \\leq m-1$ & ($m-i,m-i+l,m-i+1,i-k)$\\\\\n$(k+1,0)(k+1,n-1)$ & & ($k,l+k-1,k,1)$\\\\\n\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,j-l,n-j+1,k+n-j)$\\\\\n\n\n\n\n$(0,l)(0,l+1)$ & & ($l-1,0,l,k+l-1)$\\\\\n$(k+1,0)(k+2,0)$ & & ($k,k+l-1,k,0)$\\\\\n\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\n\nSince the metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k+1,2l})\\leq 4.$\n\n\n\n\n\\newpage\n\n\n\n\n\n\\textbf{Case 3.} $m=2k, n=2l+1$\\\\\nLet $S =\\{(0,0),(k,0),(0,1),(1,l+1)\\}$. Since $C_m \\Box C_n$ is the same as $C_n \\Box C_m$, the proof of this case is similar to the proof of Case 2.\\\\\n\n\\textbf{Case 4.} $m=2k, n=2l$\\\\\nLet $S = \\{(0,0), (0,1), (1,l), (k,0)\\}$. Let us prove that $S$ is a mixed metric resolving set. The representation of coordinates of each vertex and each edge, with respect to $S$, is shown in Table \\ref{etor} and Table \\ref{vtor}.\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of vertices of $T_{2k,2l}$}\n\\label{etor}\n\\begin{tabular}{|c|c|c|}\n \\hline\n vertex & cond. 
& $r(v,S)$\\\\\n \\hline\n$(0,0)$ & & ($0, 1, l+1, k$) \\\\\n$(i,0)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-1,k-i$) \\\\\n$(0,j)$ & $1 \\leq j \\leq l$ & ($j,j-1, l-j+1, k+j$) \\\\\n$(i,j)$ & $1\\leq i \\leq k$ &($j+i,j-1+i,l-j+i-1,j+k-i$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n$(0,j)$ & $l+1 \\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k$) \\\\\n\n\n$(i,j)$ & $l+1\\leq j \\leq n-1$ &($n-j+i,n-j+i+1,j-l+i-1,n-j+k-i$)\\\\\n&$1\\leq i \\leq k$ & \\\\\n\n\n\n$(i,0)$ & $k+1 \\leq i \\leq m-1$ & ($m-i, m-i+1, m-i+l+1, i-k$) \\\\\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m-i+j,m-i+j-1,m-i+l-j+1,i-k+j$)\\\\\n&$1\\leq j \\leq l$ & \\\\\n\n\n$(i,j)$ & $k+1\\leq i \\leq m-1$ &($m+n-i-j,m+n-i-j+1,$\\\\\n&$l+1\\leq j \\leq n-1$ & $m+j-l-i+1,n+i-k-j)$ \\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\n\n\\begin{table}\n\\tiny\n\\begin{center}\n\\caption{ Metric coordinates of edges of $T_{2k,2l}$}\n\\label{vtor}\n\\begin{tabular}{|c|c|c|}\n\n \\hline\n edge & cond. & $r(e,S)$\\\\\n \\hline\n $(0,0)(0,1)$ & & ($0,0,l,k$) \\\\\n $(0,0)(1,0)$ & & ($0,1,l,k-1$) \\\\\n $(0,0)(0,n-1)$ & & ($0,1,l,k$) \\\\\n $(0,0)(m-1,0)$ & & ($0,1,l+1,k-1$) \\\\\n$(0,j)(0,j+1)$ & $1\\leq j \\leq l-1$ & ($j, j-1, l-j, k+j$) \\\\\n$(i,0)(i+1,0)$ & $1\\leq i \\leq k-1$ & ($i,i+1,l+i-1,k-i-1$) \\\\\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($i+j,i+j-1,l-j+i-2,k+j-i$) \\\\\n& $1\\leq j \\leq l-1$ & \\\\\n$(i,0)(i,1)$ & $1\\leq i \\leq k$ & ($i,i,l+i-2,k-i$) \\\\\n$(0,j)(1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j,k+j-1$) \\\\\n$(0,j)(0,j+1)$ & $l+1\\leq j \\leq n-2$ & ($n-j-1,n-j,j-l+1,n-j-1+k$) \\\\\n\n$(i,j)(i,j+1)$ & $1\\leq i \\leq k$ & ($n-j+i-1,n-j+i,$ \\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+i-1,n-j+k-i-1$) \\\\\n\n\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ & ($i+j,i+j-1,l-j+i-1,j+k-i-1$) \\\\\n& $1\\leq j \\leq l$ & \\\\\n\n$(i,l)(i,l+1)$ & $1\\leq i \\leq k$ & ($l+i-1,l+i-1,i-1,l+k-i-1$) \\\\\n\n$(i,0)(i,n-1)$ & $1\\leq i \\leq k$ & ($i,i+1,l+i-2,k-i$) \\\\\n$(i,j)(i+1,j)$ & $1\\leq i \\leq k-1$ 
& ($n-j+i,n-j+i+1,$ \\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+i-1,n-l+k-i-1$)\\\\\n$(0,j)(1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l,n-j+k-1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($m-i+j-1,m-i+j-2,$ \\\\\n& $1\\leq j \\leq l$ & $m-i+l-j,i-k+j$)\\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($m-i+j,m-i+j-1,$ \\\\\n& $1\\leq j \\leq l-1$ &$m-i+l-j,j+i-k$) \\\\\n$(i,0)(i+1,0)$ & $k+1\\leq i \\leq m-2$ & ($m-i-1,m-i,m-i+l,i-k$) \\\\\n\n$(i,0)(i,1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i,m-i+l,i-k$) \\\\\n$(k,j)(k+1,j)$ & $1\\leq j \\leq l$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $m-k+l-j-1,j$) \\\\\n$(0,j)(m-1,j)$ & $1\\leq j \\leq l$ & ($j,j-1,l-j+1,m-k+j-1$) \\\\\n$(k+1,j)(k+1,j+1)$ & $1\\leq j \\leq l-1$ & ($m-k+j-1,m-k+j-2,$\\\\\n& & $m-k+l-j-1,j+1$) \\\\\n$(i,j)(i+1,j)$ & $k+1\\leq i \\leq m-2$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-1$ & $j-l+m-i,n-j+i-k)$ \\\\\n$(i,j)(i,j+1)$ & $k+1\\leq i \\leq m-1$ & ($n-j+m-i-1,n+m-j-i,$\\\\\n& $l+1\\leq j \\leq n-2$ & $j-l+m-i+1,n-j+i-k-1)$ \\\\\n$(i,l)(i,l+1)$ & $k+1\\leq i \\leq m-1$ & ($l+m-i-1,m+l-i-1,$\\\\\n& & $m-i+1,l+i-k-1)$ \\\\\n$(k,j)(k+1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j+m-i-1,m+n-j-i+1,$\\\\\n& & $j-l+k-1,n-j)$ \\\\\n$(i,0)(i,n-1)$ & $k+1\\leq i \\leq m-1$ & ($m-i,m-i+1,m-i+l,i-k)$\\\\\n$(0,j)(m-1,j)$ & $l+1\\leq j \\leq n-1$ & ($n-j,n-j+1,j-l+1,n-j+k-1)$\\\\\n$(0,l)(0,l+1)$ & & ($l-1,l-1,1,l-1+k)$\\\\\n$(0,l)(m-1,l)$ & & ($l,l-1,1,k+l-1)$\\\\\n$(k,0)(k+1,0)$ & & ($k-1,k,m-k-1+l,0)$\\\\\n\n\n\t \\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\nSince metric coordinates of all items are mutually different, $S$ is a mixed resolving set. Therefore, $\\beta_{M}(T_{2k,2l})\\leq 4.$\\\\\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\textbf{\\underline{Step 2}:} {\\em Lower bound is 4}. 
\\\\\nThe torus graph is a $4$-regular graph, so by Corollary \\ref{lb1reg} it follows that\n$\\beta_M(T_{m,n}) \\geq 1 + \\lceil \\log_2 (r+1) \\rceil = 1 + \\lceil \\log_2 5 \\rceil = 4$.\n\nTherefore, from the previous two steps, it follows that $\\beta_M(T_{m,n})=4.$\n\\end{proof}\n\n", "Descriptive_question1": "What is the metric coordinate for the edge (0,0)(0,1) in table_9?", "Descriptive_question2": "What is the condition for the edge (0,j)(0,j+1) in table_9 for the range of j between 1 and l-1?", "Reasoning_question1": "How do the metric coordinates of edges in table_9 reflect the structure of the torus graph T_{2k,2l} compared to other torus graph configurations?", "Reasoning_question2": "Why might the metric coordinates for edges in table_9 differ based on the specific conditions of i and j, and what does this indicate about the graph's symmetry?", "Descriptive_answer1": "(0,0,l,k)", "Descriptive_answer2": "1≤j≤l-1", "Reasoning_answer1": "To understand how the metric coordinates of edges in table_9 reflect the structure of the torus graph T_{2k,2l} compared to other torus graph configurations, we start by noting that table_9 specifically pertains to T_{2k,2l}, which represents a torus graph with even dimensions in both directions (m=2k and n=2l). The metric coordinates in this table are defined with respect to the resolving set S = {(0,0), (0,1), (1,l), (k,0)}, as mentioned in Case 4 of the proof. Examining the coordinates, such as (0,0,l,k) for edge (0,0)(0,1), we see that they encapsulate distances from the edge to each vertex in S, reflecting the toroidal structure through wrap-around effects (e.g., coordinates involving n-j or m-i for edges near boundaries). Comparing this to other configurations like T_{2k+1,2l+1} (Case 1, table_2), which uses a different resolving set S = {(0,0), (0,l), (1,l+1), (k+1,l+1)}, the coordinates differ due to the odd dimensions, adjusting for different central points and symmetry.
For instance, in T_{2k+1,2l+1}, coordinates often include terms like l-j+2, indicating a shift in distance calculations. This comparison shows that in T_{2k,2l}, the even dimensions lead to a more symmetric and balanced resolving set placement, reflected in simpler coordinate expressions without additional offset terms seen in odd-dimension cases. Thus, the metric coordinates in table_9 uniquely capture the even-even torus structure, highlighting a uniform distance distribution compared to mixed or odd-odd configurations where resolving sets must account for dimensional asymmetry.", "Reasoning_answer2": "To explore why the metric coordinates for edges in table_9 differ based on the specific conditions of i and j, and what this indicates about the graph's symmetry, let’s first consider that table_9 lists metric coordinates for edges of T_{2k,2l}, a torus graph with even dimensions. The coordinates are calculated relative to the resolving set S = {(0,0), (0,1), (1,l), (k,0)}, and vary with conditions on i and j because these indices define the position of edges within the grid-like structure of the torus. For example, for edge (0,j)(0,j+1) with 1≤j≤l-1, the coordinate is (j, j-1, l-j, k+j), while for l+1≤j≤n-2, it becomes (n-j-1, n-j, j-l+1, n-j-1+k), reflecting a shift due to the toroidal wrap-around effect at the boundary. Similarly, edges like (i,j)(i+1,j) have different coordinate formulas depending on whether j is in 1≤j≤l or l+1≤j≤n-1, showing how distances adjust near middle or edge regions of the graph. This variation indicates that the torus graph T_{2k,2l} does not have complete translational symmetry in terms of metric coordinates; instead, its symmetry is periodic with respect to m and n, requiring different expressions as edges approach or cross the 'seams' of the torus. The resolving set placement also influences this, as it is not centrally symmetric for all dimensions, leading to asymmetric distance calculations depending on proximity to S vertices. 
Thus, the differing coordinates highlight that while the torus graph has a regular, cyclic structure, the metric perspective imposed by a specific resolving set breaks full symmetry, revealing directional and positional dependencies in how distances are measured across the graph." }, { "paper_id": "1908.06383.json", "table_id": "table_1", "table_content": "\\begin{table}\n\t\\centering\n\t\\begin{tabular}{c|cccc}\n\t\t$n$ & 0 & 1& 2&\\\\\\hline\n\t\t$\\gamma_\\star^{(n)}$ & 2.071&13.307&27.783 &\\\\[2mm]\n\t\t$k_\\star^{(n)}$ & 1.065&4.318 &7.529 &\n\t\\end{tabular}\n\t\\caption{Approximate values of gain-and-loss amplitudes $\\gamma=\\gamma_\\star^{(n)}$ and wavenumbers $k=k_\\star^{(n)}$, $n=0, 1, \\ldots$, corresponding to spectral singularities with lowest $\\gamma$ in the limit $\\ell=0$, see Table~I in \\cite{Mostafazadeh2009}. \\label{tbl:1}}\n\\end{table}", "caption": "Approximate values of gain-and-loss amplitudes $\\gamma=\\gamma_\\star^{(n)}$ and wavenumbers $k=k_\\star^{(n)}$, $n=0, 1, \\ldots$, corresponding to spectral singularities with lowest $\\gamma$ in the limit $\\ell=0$, see Table~I in \\cite{Mostafazadeh2009}. \\label{tbl:1}", "label": "tbl:1", "section_info": "5 Spectral singularities\n\\section{Spectral singularities}\n\\label{sec:ss}\n\\subsection{General analytical expressions}\n\nIn this section we study real zeroes of the function $F$ corresponding to spectral singularities, i.e. to zero-width resonances.\nFor such zeroes, equation (\\ref{2.10}) is a pair of two real equations for one real variable $k$ and two parameters $\\ell$ and $\\g$.\nThanks to the symmetry of the zeroes with respect to the imaginary axis, it is sufficient to find only positive real resonances since the negative ones are located symmetrically with respect to the origin. 
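Before proceeding, the parametrization used below can be illustrated numerically: each $(\gamma, k)$ pair of Table~\ref{tbl:1} corresponds to an auxiliary amplitude $\beta=\sqrt{2\gamma}$ and a value $u\in(0,1)$ satisfying $u^{-2}-u^2=4k^2\beta^{-2}$, cf. (\ref{4.12}). A small sketch (illustrative only; the table values are rounded, and the consistency check holds exactly by construction):

```python
import math

# (gamma, k) pairs from Table tbl:1 (spectral singularities at ell = 0).
table = [(2.071, 1.065), (13.307, 4.318), (27.783, 7.529)]

for gamma, k in table:
    beta = math.sqrt(2 * gamma)  # auxiliary amplitude beta = sqrt(2*gamma)
    # u = beta/R with R = sqrt(2k^2 + sqrt(4k^4 + beta^4)) solves
    # u^{-2} - u^2 = 4 k^2 / beta^2, cf. equation (4.12).
    R = math.sqrt(2 * k**2 + math.sqrt(4 * k**4 + beta**4))
    u = beta / R
    assert 0 < u < 1
    assert abs((u**-2 - u**2) - 4 * k**2 / beta**2) < 1e-9
```

Each pair thus yields a point $u\in(0,1)$, which is the variable the analysis below works with.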
Similar to the proof of Lemma~\\ref{lm3.6}, for real positive $k$ we make the change of variables (\\ref{4.1})\nand rewrite equation (\\ref{2.10}) in the following form:\n\\begin{align*}\n(1-2u^{-4})\\cos\\b u^{-1}&+(2u^4-1)\\cosh\\b u + 2i \\sqrt{1-u^4}\\left(u^2\\sinh\\b u - u^{-4}\\sin\\b u^{-1}\\right)\n\\\\\n&-e^{-2i\\b\\ell\\sqrt{u^{-2}-u^2}} \\left(\\cosh(\\b u)-\\cos\\b u^{-1}\\right)=0,\\qquad \\b=\\sqrt{2\\g}.\n\\end{align*}\nTaking the real and imaginary parts of this equation and multiplying the equation by $u^4$, we obtain:\n\\begin{equation}\\label{4.3}\n\\begin{aligned}\n&(u^4-2)\\cos\\b u^{-1}+u^4(2u^4-1)\\cosh\\b u\n=u^4\\cos\\left(2\\b\\ell\\sqrt{u^{-2}-u^2}\\right)\\left(\\cosh \\b u-\\cos\\b u^{-1}\\right),\n\\\\\n&2\\sqrt{1-u^4}\\Big(u^6\\sinh\\b u-\\sin\\b u^{-1}\\Big)\n=-u^4\\sin\\left(2\\b\\ell\\sqrt{u^{-2}-u^2}\\right)\\left(\\cosh \\b u-\\cos\\b u^{-1}\\right).\n\\end{aligned}\n\\end{equation}\nThis is a system of two real equations with three real variables. If we are given $(\\b,\\ell)$ and we try to find $u$, the system is overdetermined and does not necessarily have a root. In other words, it is solvable with respect to $u$ only if $(\\b,\\ell)$ are located on some (solvability) curves. In order to avoid working with an overdetermined system, in what follows we regard (\\ref{4.3}) as a system for two unknown variables with one parameter.\n\nTo find the curves in the $(\\b,\\ell)$ plane on which equations (\\ref{4.3}) are solvable with respect to $u$, we shall regard $u$ as a parameter and $(\\b,\\ell)$ as unknown variables. We take the sum of squares of equations (\\ref{4.3}) and divide the result by $(1-u^4)$. We also divide one of equations (\\ref{4.3}) by the other.
This leads us to a pair of equations:\n\\begin{align}\\label{4.4}\n&u^4(1-u^4) \\cosh\\b u\\cos\\b u^{-1}-2u^6 \\sinh\\b u\\sin\\b u^{-1} +1-u^{12}=0,\n\\\\\n&\\frac{2\\sqrt{1-u^4}(u^6\\sinh\\b u-\\sin\\b u^{-1})}{(2-u^4)\\cos\\b u^{-1}+u^4(1-2u^4)\\cosh\\b u}=\\tan \\Big( 2\\b\\ell\\sqrt{u^{-2}-u^2}\\Big).\\nonumber\n\\end{align}\nThe second equation can be solved explicitly with respect to $\\ell$:\n\\begin{equation}\\label{4.10}\n\\ell=\\frac{1}{2\\b\\sqrt{u^{-2}-u^2}} \\left(\\arctan \\frac{2\\sqrt{1-u^4}(u^6\\sinh\\b u-\\sin\\b u^{-1})}{(2-u^4)\\cos\\b u^{-1}+u^4(1-2u^4)\\cosh\\b u}+\\pi n\\right),\n\\end{equation}\nwhere $n\\in\\mathds{N}$ is an arbitrary natural number. As the next lemma states, to make equations (\\ref{4.4}), (\\ref{4.10}) equivalent to (\\ref{4.3}), we should also assume that\n\\begin{equation}\\label{4.5}\n(-1)^n=\\sign \\big((u^4-2)\\cos\\b u^{-1}+u^4(2u^4-1)\\cosh\\b u\\big).\n\\end{equation}\n\n\\begin{lemma}\\label{lm4.1}\nEquations (\\ref{4.3}) are equivalent to (\\ref{4.4}), (\\ref{4.10}), (\\ref{4.5}).\n\\end{lemma}\n\n\\begin{proof}\nWe rewrite equations (\\ref{4.3}) compactly as\n$A_1=B\\cos\\a$, $A_2=-B\\sin\\a$,\nwhere $A_1$, $A_2$ are the left-hand sides in (\\ref{4.3}), $\\a=2\\b\\ell\\sqrt{u^{-2}-u^2}$ and $B=u^4(\\cosh\\b u-\\cos\\b u^{-1})\\geqslant 0$. Then equations (\\ref{4.4}), (\\ref{4.10}) become\n$A_1^2+A_2^2=B^2$, $\\a=-\\arctan\\frac{A_2}{A_1}+\\pi n$.\nWe have:\n\\begin{equation*}\nB\\cos \\a=(-1)^n B\\cos\\arctan\\frac{A_2}{A_1}=\\frac{(-1)^n B}{\\sqrt{1+\\frac{A_2^2}{A_1^2}}} = \\frac{(-1)^n B|A_1|}{\\sqrt{A_1^2+A_2^2}}=(-1)^n|A_1|\n\\end{equation*}\nand we get the first equation $A_1=B\\cos\\a$ provided condition (\\ref{4.5}) is satisfied.
In the same way we check that the latter condition also ensures the second equation $A_2=-B\\sin\\a$.\n\\end{proof}\n\nEquation (\\ref{4.4}) is transcendental, and we cannot solve it analytically.\nNevertheless, for each $u\\in(0,1)$, this is an equation for a single variable $\\b$ only, not for two as in the case of equations (\\ref{4.3}). So, we propose the following algorithm for recovering the aforementioned solvability curves: choose $u\\in(0,1)$, then solve equation (\\ref{4.4}) and recover the sequence of distances $\\ell$ by formula (\\ref{4.10}) with different integers $n$. Then the gain-and-loss amplitude $\\gamma$ and the corresponding wavenumber $k$ can be readily recovered from $\\beta$ and $u$. In a similar way, one can first fix some value of $\\beta$ (i.e., fix the gain-and-loss strength) and then solve equation (\\ref{4.4}) with respect to $u$ and recover $\\ell$ by (\\ref{4.10}). Equation (\\ref{4.4}) is well-behaved, and for each $\\beta$ all its zeros $u$ can be easily found numerically.\n\nAlternatively, as explained below in Section~\\ref{sec:general}, the values corresponding to spectral singularities can be found systematically by means of the numerical continuation from the limit $\\ell=0$. However, in this case equation (\\ref{4.4}) is still useful because it allows one to check that all spectral singularities have been found for the given value of the gain-and-loss $\\gamma$.\n\n\\subsection{Absence of spectral singularities}\n\\label{sec:gap}\n\nFor $u=0$ and $u=1$, the left-hand side of equation (\\ref{4.4}) is equal to $1$ and $-2\\sinh\\beta\\sin\\beta$, respectively. Then a sufficient condition for the existence of a spectral singularity at the given gain-and-loss amplitude $\\gamma$ is $\\sin\\beta=\\sin\\sqrt{2\\gamma}>0$. At the same time, it is also possible to establish sufficient conditions that forbid the existence of spectral singularities in a certain interval of parameters.
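The endpoint signs just mentioned give a simple intermediate-value test for a root of equation (\ref{4.4}), and the forbidden zone established below can be probed on a grid. A minimal hand-rolled sketch (illustrative only, not the authors' code; the grid sizes and the sample values of $\beta$ are arbitrary assumptions):

```python
import math

def g(u, beta):
    """Left-hand side of equation (4.4)."""
    return (u**4 * (1 - u**4) * math.cosh(beta * u) * math.cos(beta / u)
            - 2 * u**6 * math.sinh(beta * u) * math.sin(beta / u)
            + 1 - u**12)

def bisect(f, a, b, tol=1e-12):
    """Dichotomy method for a root of f on [a, b] with f(a)*f(b) <= 0."""
    fa = f(a)
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = f(m)
        if fa * fm <= 0:
            b = m
        else:
            a, fa = m, fm
    return 0.5 * (a + b)

# Existence: for beta with sin(beta) > 0, g is ~1 near u = 0 while
# g(1, beta) = -2*sinh(beta)*sin(beta) < 0, so a root lies in (0, 1).
beta = 0.5 * math.pi
assert g(1.0, beta) < 0 < g(1e-9, beta)
u_root = bisect(lambda u: g(u, beta), 1e-9, 1.0)

# Forbidden zone: for pi < beta < beta_* ~ 4.808, g stays positive;
# probe a few sample values of beta on a u-grid.
for beta in (3.5, 4.0, 4.5):
    assert all(g(i / 1000, beta) > 0 for i in range(1, 1000))
```

The first assertion pair reproduces the sufficient existence condition $\sin\beta>0$, while the grid check is consistent with the two "forbidden gaps" proved in this subsection.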
In this subsection we prove the existence of two ``forbidden gaps'' for the roots of equation (\\ref{4.4}). The first one exists for all $\\b\\geqslant 0$ and states that there are no roots in a certain interval. The second gap is a certain interval of values of $\\b$ for which equation (\\ref{4.4}) has no zeroes at all.\n\nFor convenience, by $g(u,\\b)$ we denote the left-hand side of equation (\\ref{4.4}). The first ``forbidden gap'' is described in the following lemma.\n\n\\begin{lemma}\\label{lm5.1}\nFor all $\\b\\geqslant 0$, equation (\\ref{4.4}) has no roots in the interval $\\big[0,(1+\\tfrac{\\b}{4})^{-1}\\big)$.\n\\end{lemma}\n\\begin{proof}\nEmploying the standard inequality $|a\\cos\\a+b\\sin\\a|\\leqslant \\sqrt{a^2+b^2}$, we estimate the first two terms in equation (\\ref{4.4}) from below as\n\\begin{equation*}\nu^4(1-u^4) \\cosh\\b u\\cos\\b u^{-1}-2u^6 \\sinh\\b u\\sin\\b u^{-1} \\geqslant -\\sqrt{u^8(1-u^4)^2\\cosh^2 \\b u+4u^{12}\\sinh^2\\b u}.\n\\end{equation*}\nHence, equation (\\ref{4.4}) surely has no roots for values of $u$ satisfying\n\\begin{equation*}\n \\sqrt{u^8(1-u^4)^2\\cosh^2 \\b u+4u^{12}\\sinh^2\\b u} <1-u^{12}.\n\\end{equation*}\nExpressing $\\cosh^2 \\b u$ via $\\sinh^2 \\b u$ and simplifying this inequality, we obtain $ u^4(1+u^4)\\cosh \\b u <1+u^{12}$ and hence,\n\\begin{equation}\\label{3.37}\n\\cosh \\b u-1 <\\frac{1-2u^4+u^8}{u^4},\\qquad \\sqrt{2}\\sinh \\frac{\\b u}{2}<\\frac{1-u^4}{u^2},\n\\end{equation}\nand the latter inequality certainly holds for $u^{-1}>1+\\tfrac{\\b}{4}$.\nThe proof is complete.\n\\end{proof}\n\nThe next lemma is auxiliary and will be employed in studying the second forbidden zone.\n\n\\begin{lemma}\\label{lm5.2}\nThe function $g(u,\\pi)$ is positive on $[0,1)$.\n\\end{lemma}\n\n\\begin{proof}\nWe have $g(0,\\pi)=1$ and by Lemma~\\ref{lm5.1}, it is positive for $u<(1+\\tfrac{\\pi}{4})^{-1}$. This is why in what follows we consider only the values $u\\geqslant (1+\\tfrac{\\pi}{4})^{-1}$.
For such values of $u$ we have\n$\\pi\\leqslant \\pi u^{-1}\\leqslant \\pi (1+\\tfrac{\\pi}{4})<1.79\\pi$.\nAs $\\tfrac{3\\pi}{2} \\leqslant \\pi u^{-1}\\leqslant \\pi (1+\\tfrac{\\pi}{4})$, i.e., for $(1+\\tfrac{\\pi}{4})^{-1}\\leqslant u\\leqslant \\tfrac{2}{3}$,\nthe function $\\sin\\b u^{-1}$ is negative, while $\\cos \\b u^{-1}$ is positive. Hence, for such values of $u$, the function $g(u,\\pi)$ is positive. It remains to consider the values $\\tfrac{2}{3}<u<1$, for which\n\\begin{equation*}\n2\\pi(\\pi u^3+6u^2-2)\\sinh\\pi u>2\\pi\\sinh\\pi u>0.\n\\end{equation*}\nHence,\n$g_1(u)\\geqslant g_1\\left(\\frac{2}{3}\\right)>-9.05$.\nFor the function $g_2$ we have the following representation and estimate:\n\\begin{equation*}\ng_2(u)=1+\\sum\\limits_{j=1}^{4}(u^j+u^{-j})+\\sum\\limits_{j=5}^{8} u^j\\geqslant 9 +\\sum\\limits_{j=5}^{8}\\left(\\frac{2}{3}\\right)^j>9.31.\n\\end{equation*}\nThe two last estimates and (\\ref{3.12}) imply the positivity of the function $g$ for $u\\in[\\tfrac{2}{3},1)$.\n\\end{proof}\n\nDenote\n\\begin{align*}\ng_*(u,\\b):=&\\b u^3(1-3u^4)\\cosh\\b u \\sin \\b u^{-1}+\\b u^5 (3-u^4)\\sinh\\b u \\cos \\b u^{-1}\n\\\\\n&-2 u^4(1+u^4) \\cosh\\b u \\cos \\b u^{-1} -6(1+u^{12}).\n\\end{align*}\nThe next lemma states the existence of the second forbidden zone.\n\n\\begin{lemma}\\label{lm5.3}\nEquation (\\ref{4.4}) has no roots for $\\pi<\\b<\\b_*<5$, where $(u_*, \\b_*)$ is the root of the system of equations\n\\begin{equation}\\label{3.13}\ng(u,\\b)=0,\\qquad g_*(u,\\b)=0,\\qquad u\\in[0,1],\\qquad \\pi<\\b<5,\n\\end{equation}\nwith minimal possible $\\b$.\nTheir approximate values are\n\\begin{equation}\\label{3.14}\n\\b_*=4.808438,\\qquad u_*=0.611772.\n\\end{equation}\n\\end{lemma}\n\n\\begin{proof}\nThe function $g(u,\\pi)$ is positive on $[0,1)$ and $g(1,\\pi)=0$, see Figure~\\ref{fig:forbidden}a. As $\\pi<\\b<2\\pi$, we have $g(0,\\b)=1>0$ and $g(1,\\b)=-2\\sin\\b\\sinh\\b>0$. Hence, for $\\b$ close enough to $\\pi$, the function $g(u,\\b)$ is positive for all $u\\in[0,1]$.
At the same time, we have $g(0.65,5)<-0.617<0$ and therefore, for $\\b=5$, equation (\\ref{4.4}) possesses at least two roots, one in $(0,0.65)$ and another in $(0.65,1)$, cf. Figure~\\ref{fig:forbidden}c. We also observe that the function $g$ is jointly continuous in $(u,\\b)$. The above facts mean that as $\\b$ grows from $\\pi$ to $5$, at some value $\\b=\\b_*$, the graph of the function $g$ is still located in the upper half-plane but touches the $u$-axis at some point $u=u_*$,\nsee Figure~\\ref{fig:forbidden}b. The function $g(u,\\b)$ is positive for $\\pi<\\b<\\b_*$ and $u\\in[0,1]$. Then the point $u=u_*$ is obviously the global minimum of $g$ and hence, $(u_*,\\b_*)$ is a solution to the system of equations $g(u,\\b)=0$, $\\frac{\\p g}{\\p u}(u,\\b)=0$. It is easy to check that $g_*=u\\frac{\\p g}{\\p u}-6 g$ and hence, $(u_*,\\b_*)$ solves system (\\ref{3.13}). These roots can be found numerically and this gives (\\ref{3.14}). The proof is complete.\n\\end{proof}\n\n\\begin{figure}\n\t\\centering\n\\includegraphics[width=0.99\\columnwidth]{fig03.eps}\n\\caption{Illustration for the proof of Lemma~\\ref{lm5.3}. Graphs of the function $g(u, \\beta)$ for $\\b=\\pi$ (a), $\\b=\\b_*\\approx 4.808438$ (b), and $\\b=5$ (c). Notice the broken vertical axes in (b) and (c).
}\n\\label{fig:forbidden}\n\\end{figure}\n\nReturning from the auxiliary variable $\\beta$ to the gain-and-loss amplitude $\\gamma=\\beta^2/2$, from Lemma~\\ref{lm5.3} we deduce the following important result:\n\\begin{equation}\n\\label{eq:gap}\n\\textrm{there are no spectral singularities for\\quad } \\frac{\\pi^2}{2} < \\gamma< \\gamma_*\\approx 11.561.\n\\end{equation}\n\n\\subsection{Creating a spectral singularity at a given wavenumber}\n\n\\begin{table}\n\t\\centering\n\t\\begin{tabular}{c|cccc}\n\t\t$n$ & 0 & 1& 2&\\\\\\hline\n\t\t$\\gamma_\\star^{(n)}$ & 2.071&13.307&27.783 &\\\\[2mm]\n\t\t$k_\\star^{(n)}$ & 1.065&4.318 &7.529 &\n\t\\end{tabular}\n\t\\caption{Approximate values of gain-and-loss amplitudes $\\gamma=\\gamma_\\star^{(n)}$ and wavenumbers $k=k_\\star^{(n)}$, $n=0, 1, \\ldots$, corresponding to spectral singularities with lowest $\\gamma$ in the limit $\\ell=0$, see Table~I in \\cite{Mostafazadeh2009}. \\label{tbl:1}}\n\\end{table}\n\nFor $\\ell=0$, a spectral singularity can only be obtained for some isolated values of the wavenumber $k$ and the gain-and-loss amplitude $\\gamma$ \\cite{Mostafazadeh2009}. Several lowest values of $\\gamma$ corresponding to the spectral singularities and the associated wavenumbers $k$ are listed in Table~\\ref{tbl:1}. An important advantage of the more general system with nonzero gain-to-loss distance $\\ell>0$ consists in the possibility of creating a spectral singularity at any wavenumber $k$ given beforehand. Indeed, let us return to equations (\\ref{4.4}), (\\ref{4.10}) and discuss the following issue: given a point $k$ on the real axis, how to choose $\\b$ and $\\ell$ to have a resonance at this point?
Equations (\\ref{4.4})--(\\ref{4.10}) allow us to answer this question easily.\n\nWe fix $k>0$ and find the associated value of $u$ by resolving (\\ref{4.1}):\n\\begin{equation}\\label{4.12}\nu^{-2}-u^2=4k^2\\b^{-2},\\qquad u=\\b R^{-1}, \\quad \\mbox{where\\ } R = \\sqrt{2k^2+\\sqrt{4k^4+\\b^4}}.\n\\end{equation}\nWe divide equation (\\ref{4.4}) by $u^6$ and then substitute the above formulae and\n\\begin{equation*}\n\\frac{1-u^{12}}{u^6}=\\frac{1-u^4}{u^2}\\frac{1+u^4+u^8}{u^4}=(u^{-2}-u^2)\\big((u^{-2}-u^2)^2+3\\big).\n\\end{equation*}\nThis gives the equation:\n\\begin{equation}\\label{4.11}\n2\\b^4 k^2\\cosh(\\b^2R^{-1}) \\cos R - \\b^6 \\sinh(\\b^2R^{-1}) \\sin R +2k^2(16k^4+3\\b^4)=0.\n\\end{equation}\nAn algorithm for creating a resonance at a prescribed point $k$ is as follows. Given $k>0$, we first solve equation (\\ref{4.11}) with respect to $\\b$ and we also find $u$ by (\\ref{4.12}). Then the needed values of $\\ell$ are determined by (\\ref{4.10}), (\\ref{4.5}).\n\nIn order to illustrate this algorithm, let us consider a finite interval of wavenumbers $k\\in (0, k_1]$, where we set $k_1 = 10$ for the numerics reported in what follows. We scan the chosen interval with a sufficiently small step ($\\Delta k=0.01$) and for each value of $k$ solve equation (\\ref{4.11}) numerically using the simple dichotomy method. While for each $k$ equation (\\ref{4.11}) might have several roots $\\beta$, in our numerical procedure we always choose the minimal positive root, i.e., the one which allows one to achieve the spectral singularity with given $k$ at the smallest possible value of the gain-and-loss amplitude $\\gamma=\\gamma_\\textrm{min}$.
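The dichotomy scan just described can be sketched in a few lines; this is an illustrative reimplementation (the scan bound, step and tolerances are assumptions, not the authors' actual parameters), which for $k=1.065$ should recover the $\ell=0$ root $\beta=\sqrt{2\gamma_\star^{(0)}}\approx 2.035$ of Table~\ref{tbl:1}:

```python
import math

def R(beta, k):
    """R = sqrt(2k^2 + sqrt(4k^4 + beta^4)), cf. equation (4.12)."""
    return math.sqrt(2 * k**2 + math.sqrt(4 * k**4 + beta**4))

def f(beta, k):
    """Left-hand side of equation (4.11) for a prescribed wavenumber k."""
    r = R(beta, k)
    return (2 * beta**4 * k**2 * math.cosh(beta**2 / r) * math.cos(r)
            - beta**6 * math.sinh(beta**2 / r) * math.sin(r)
            + 2 * k**2 * (16 * k**4 + 3 * beta**4))

def roots_in_beta(k, beta_max=6.0, step=0.01, tol=1e-10):
    """Scan (0, beta_max] for sign changes of f and refine by dichotomy."""
    roots = []
    b_prev, f_prev = step, f(step, k)
    b = b_prev + step
    while b <= beta_max:
        f_cur = f(b, k)
        if f_prev * f_cur < 0:
            a, c = b_prev, b
            while c - a > tol:
                m = 0.5 * (a + c)
                if f(a, k) * f(m, k) <= 0:
                    c = m
                else:
                    a = m
            roots.append(0.5 * (a + c))
        b_prev, f_prev = b, f_cur
        b += step
    return roots

# For k = 1.065 one of the roots should lie near beta = sqrt(2*2.071),
# i.e. near the ell = 0 spectral singularity of Table tbl:1.
betas = roots_in_beta(1.065)
gammas = [b**2 / 2 for b in betas]
```

Selecting the smallest element of `betas` then gives $\gamma_\textrm{min}=\beta^2/2$ for the chosen $k$, as in the procedure above.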
Next, we choose the minimal positive distance $\\ell_\\textrm{min}$ which satisfies the conditions (\\ref{4.10}), (\\ref{4.5}) and then we use the periodicity in $\\ell$ to generate the sequence of larger gain-to-loss distances $\\ell_n = \\ell_\\textrm{min} + n\\pi/(2k)$, $n=1,2,\\ldots$ [see (\\ref{eq:periodic})]. The resulting dependencies $\\gamma_\\textrm{min}(k)$ and $\\ell_\\textrm{min}(k)$, $\\ell_n(k)$ are shown in figure~\\ref{fig:min}. The minimal gain-and-loss amplitude $\\g_{min}(k)$ and the minimal gain-to-loss distance $\\ell_{min}$ are discontinuous, which means that a small variation in the wavenumber $k$ might require a significant change either in $\\gamma$ or in $\\ell$. It is especially important that the values of the distance $\\ell$ are generically different from zero, which points out explicitly that the new degree of freedom offered by the nonzero gain-to-loss distance is important for achieving a spectral singularity at the given wavenumber $k$.\n\n\\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.8\\columnwidth]{fig04.eps}\n\t\\caption{(a) Minimal value of the gain-and-loss amplitude $\\gamma_\\textrm{min}$ which corresponds to a spectral singularity with the given value of the wavenumber $k$. (b) Minimal gain-to-loss distance $\\ell_\\textrm{min}$ which corresponds to a spectral singularity with the $k$ and $\\gamma$ from the left panel (bold curves) and larger distances $\\ell_n$ obtained using the periodicity in $\\ell$ (thin \\textcolor{black}{dotted} curves).}\n\t\\label{fig:min}\n\\end{figure}\n\n\\subsection{$\\PT$-symmetry breaking laser-antilaser threshold}\n\nA particularly important characteristic of any $\\PT$-symmetric structure is the $\\PT$ symmetry breaking threshold, i.e., the amplitude of the gain-and-loss corresponding to the ``phase transition'' from a purely real to a complex spectrum.
The best studied scenario of the phase transition is the collision of two real discrete eigenvalues at an exceptional point with the subsequent splitting into a complex-conjugate pair. However, in systems with nonempty continuous spectrum, the phase transition can also occur through the splitting of a self-dual spectral singularity, which results in a bifurcation of a complex-conjugate pair from an interior point of the continuum \\cite{Yang17,KZ17,Konotop2019}. At the moment corresponding to the formation of the spectral singularity, the system operates in the CPA-laser regime \\cite{Longhi10}. Thus, in such a system, the $\\PT$-symmetry breaking threshold at the same time corresponds to the CPA-laser threshold.\n\nLemma~\\ref{lm3.3} guarantees that the spectrum of our system is real for sufficiently small gain-and-loss amplitudes $\\gamma$. Additionally, according to Lemma~\\ref{lm3.5}, the spectrum does not have any real discrete eigenvalue. Hence, the $\\PT$-symmetry breaking is expected to occur through the emergence of a self-dual spectral singularity. In order to identify the $\\PT$-symmetry breaking threshold in our system, we start from the limit $\\ell=0$, where the phase transition takes place at $\\gamma_\\star^{(0)} \\approx 2.071$, see \\cite{Mostafazadeh2009,KZ17} and Table~\\ref{tbl:1}. Thus, the spectrum with $\\ell=0$ is purely real and continuous for $\\gamma \\in [0, \\gamma_\\star^{(0)}]$, while the increase of the gain-and-loss just above $\\gamma_\\star^{(0)}$ leads to the bifurcation of a complex-conjugate pair from an interior point of the continuum. The spectral singularity forming at $\\gamma_\\star^{(0)}$ takes place at wavenumber $k=k_\\star^{(0)} \\approx 1.065$. Respectively, the complex-conjugate pair of eigenvalues bifurcates from $\\lambda_0 = [k_\\star^{(0)}]^2$.
(Notice that the further increase of $\\gamma$ above the next threshold values listed in Table~\\ref{tbl:1} leads to the formation of new spectral singularities and, respectively, to bifurcations of new complex-conjugate pairs in the spectrum.)\n\nNext, we use the numerical continuation in $\\ell$ in order to continue the known solution at $\\ell=0$ to the domain $\\ell>0$. The obtained dependence of the threshold value of the gain-and-loss amplitude on the distance $\\ell$ is shown in figure~\\ref{fig:threshold}(a), where we observe that the phase transition threshold decreases monotonically with the growth of $\\ell$. This means that, by introducing additional space between the gain and loss, one can decrease the $\\PT$-symmetry breaking threshold, i.e. achieve the laser-antilaser operation at \\textcolor{black}{lower} gain-and-loss amplitudes than in a waveguide with adjacent gain and loss.\n\n\\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.85\\columnwidth]{fig05.eps}\n\t\\caption{(a) $\\PT$-symmetry breaking threshold $\\gamma_\\star^{(0)}$ \\textit{vs} the distance between the gain and loss $\\ell$. The spectrum is purely real and continuous for $\\gamma\\leq \\gamma_\\star^{(0)}$, but acquires a pair of complex conjugate eigenvalues as the gain-and-loss amplitude exceeds the threshold $\\gamma_\\star^{(0)}$. (b) Values of the wavevector $k_\\star^{(0)}$ corresponding to the dependence in (a).}\n\t\\label{fig:threshold}\n\\end{figure}\n\n\\subsection{General picture of spectral singularities}\n\\label{sec:general}\n\n\\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.99\\columnwidth]{fig06.eps}\n\t\\caption{(a) Values of the gain-and-loss amplitude $\\gamma$ and distance $\\ell$, for which spectral singularities occur. (b) Corresponding wavenumbers $k$.
\\textcolor{black}{Panel (c) magnifies the region $(\\ell, \\gamma)\\in [2,4]\\times[10,50]$ from (a), and panel (d) is the magnification of the corresponding curves from panel (b).}}\n\t\\label{fig:general}\n\\end{figure}\n\nLet us now turn to the description of the general picture of spectral singularities. In order to systematically construct different solutions, we again start from the limit $\\ell=0$, where the values of $\\gamma$ and $k$ corresponding to spectral singularities are known, see Table~\\ref{tbl:1}. Then we use the periodicity of the function $e^{-4ik\\ell}$ in $\\ell$ in order to construct new branches of solutions having no counterparts in the limit $\\ell=0$, cf. equation (\\ref{eq:periodic}). This procedure results in a fairly complicated picture containing a multitude of spectral singularities, a part of which (corresponding to relatively small values of the gain-and-loss) is shown in figure~\\ref{fig:general}(a,b) as the curves on the plane $\\gamma$ {\\it vs} $\\ell$ and $k$ {\\it vs} $\\ell$.\n\nLet us describe the structure of the found solutions using the diagram $\\gamma$ {\\it vs} $\\ell$ in figure~\\ref{fig:general}(a). The multitude of curves shown in this plot can be divided into three groups (plotted with red, blue and green curves) which have been obtained by means of the continuation from three different solutions in the limit $\\ell=0$. There is a clearly visible vertical gap between the red and blue curves, which corresponds to the ``forbidden'' values of the gain-and-loss amplitudes $\\gamma$ found above in (\\ref{eq:gap}). At the same time, there is no gap between the blue and green curves, which results in a multitude of intersections between these curves.\nThe first group of spectral singularities, corresponding to red curves in Figure~\\ref{fig:general}(a), is obtained through the continuation from the spectral singularity in the limit $\\ell=0$ with the smallest gain-and-loss amplitude, i.e.
from values $\\gamma_\\star^{(0)}$ and $k_\\star^{(0)}$ in Table~\\ref{tbl:1}. The leftmost (bold) curve in this group which originates in the limit $\\ell=0$ is the $\\PT$-symmetry breaking threshold which was already shown in figure~\\ref{fig:threshold}(a). For values of $\\gamma$ above this curve, there are always one or more (but finitely many, see Section~\\ref{sec:zeros}) complex-conjugate pairs of eigenvalues in the spectrum. Several curves situated to the right of the bold red curve are obtained using the fact that if $\\gamma$ and $k$ are solutions in the limit $\\ell=0$, then the same values of $\\gamma$ and $k$ also correspond to a spectral singularity with $\\ell_n = (n\\pi)/(2k)$, $n=1,2,\\ldots$. Once a single solution with a new distance $\\ell_n$ is obtained, a new branch of solutions can be constructed using numerical continuation in $\\gamma$ or in $\\ell$. In the limit $\\ell \\to \\infty$ all the red curves in figure~\\ref{fig:general}(a) approach the asymptotic values $\\gamma=0$ or $\\gamma= \\pi^2/2\\approx 4.935$. Notice that, except for the bold line demarcating the $\\PT$-symmetry breaking threshold, none of the red curves can be continued to the limit $\\ell\\to 0$.\n\nThe multitude of red curves in figure~\\ref{fig:general}(a) demonstrates explicitly how new spectral singularities emerge with the increase of $\\ell$. Indeed, drawing an imaginary horizontal line, say, at $\\gamma=3$ [see the vertical axis tick in figure~\\ref{fig:general}(a)], we observe that with the increase of $\\ell$ this line intersects more and more red curves. Each intersection corresponds to the values of $\\gamma$ and $\\ell$ at which some root $k$ crosses the real axis and goes down from the upper to the lower complex half-plane.
Thus, a finite-width resonance transforms into a complex eigenvalue through the spectral singularity (we recall again that in view of $\\PT$ symmetry any root $k\\ne 0$ crosses the real line simultaneously with its counterpart $-\\bar{k}$, i.e. the corresponding spectral singularity is self-dual).\nIn order to illustrate this process, in Figure~\\ref{fig:g=3} we show the evolution of three numerically found complex zeros of the function $F(k, \\gamma, \\ell)$ under the increase of $\\ell$. In this figure the imaginary part of each complex zero changes sign from positive to negative and then asymptotically approaches zero (remaining negative). In the limit of large $\\ell$, this behavior agrees with expansion (\\ref{3.4}) of Lemma~\\ref{lm3.4}, where, for the chosen value of $\\gamma$, we have $\\sin\\sqrt{2\\gamma}>0$. Thus the growing distance $\\ell$ results in a sequence of self-dual spectral singularities and in an increasing number of complex-conjugate eigenvalues in the spectrum.\n\n \\begin{figure}\n \t\\centering\n \t\\includegraphics[width=0.8\\columnwidth]{fig07.eps}\n \t\\caption{(a,b) Real and imaginary parts of three complex zeros of the function $F$ for fixed $\\gamma=3$ and increasing $\\ell$. For each curve, the imaginary part is positive for small $\\ell$ and becomes negative for all sufficiently large $\\ell$. \\textcolor{black}{For each of the three shown eigenvalues, the real and imaginary parts are shown in the same colour in both panels.}}\n \t\\label{fig:g=3}\n \\end{figure}\n\nThe second group of spectral singularities [blue curves in figure~\\ref{fig:general}(a)] was obtained by means of the continuation from the next solution in the limit $\\ell=0$, i.e. from $\\gamma_\\star^{(1)}$ and $k_\\star^{(1)}$ in Table~\\ref{tbl:1}.
Again, one of the curves [the leftmost bold curve in figure~\\ref{fig:general}(a)] was obtained through the direct continuation from the limit $\\ell=0$, while the other blue curves were generated using the periodicity of the function $F(k, \\gamma, \\ell)$ in $\\ell$ and cannot be continued to the limit $\\ell\\to0$. In the limit $\\ell\\to\\infty$ the blue curves approach the horizontal asymptotes $\\gamma=2\\pi^2\\approx19.739$ and $\\gamma=9\\pi^2/2\\approx 44.413$. In the $(\\gamma, \\ell)$-plane the gain-and-loss amplitudes corresponding to the group of blue curves are well separated from those for the red curves: indeed, all red curves are bounded from above by the asymptote $\\gamma= \\pi^2/2\\approx 4.935$, while all blue curves are bounded from below by $\\gamma_*\\approx 11.561$, see (\\ref{eq:gap}). Thus, the emergence of new spectral singularities with the increase of $\\ell$ is sensitive to the value of the gain-and-loss amplitude and does not occur for the gain-and-loss amplitudes lying in the gap between the red and blue curves.\n\nIn comparison with the red curves discussed above, the curves from the blue group in figure~\\ref{fig:general}(a) feature more complicated behavior and, in particular, can intersect each other (and also intersect the curves from the next, third group of green curves discussed below). The intersections between the blue curves occur for the gain-and-loss amplitudes in the interval $11.561 \\lessapprox \\gamma < 2\\pi^2\\approx 19.739$, where $\\sin\\sqrt{2\\gamma}$ is negative. \\textcolor{black}{At first glance,} this might seem to contradict the expansion (\\ref{3.4}) of Lemma~\\ref{lm3.4}, which suggests that in this case the multitude of complex zeros accumulates in the upper complex half-plane with the growth of $\\ell$. However, this apparent contradiction is resolved if we trace the behavior of the complex roots more closely.
Indeed, choosing for an example $\\gamma =16$ [see the vertical axis tick in figure~\\ref{fig:general}(a)] and computing several complex roots under the increase of $\\ell$, we observe that each considered root first goes down from the upper half-plane to the lower one but then again returns to the upper half-plane, see Figure~\\ref{fig:g=16}(b) and (b$_1$). Thus, in this interval of the gain-and-loss amplitudes the increase of $\\ell$ results either in the transformation from the resonance to the eigenvalue or to the opposite process, i.e. to the disappearance of the complex-conjugate pair. Respectively, the intersection between the two blue curves corresponds to two coexisting spectral singularities, i.e. to the moment when one complex-conjugate pair of eigenvalues disappears and another pair (with different $k$) emerges. For sufficiently large $\\ell$ the imaginary part of each considered root remains positive and approaches zero, in accordance with expansion (\\ref{3.4}). Thus, in this interval of the gain-and-loss strengths, the limit $\\ell\\to\\infty$ the spectrum contains only a finite number of complex-conjugate eigenvalues, which correspond to complex zeros $k$ whose behavior is not covered by expansion (\\ref{3.4}).\n\n \\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.99\\columnwidth]{fig08.eps}\n\t\\caption{(a,b) Real and imaginary parts for three complex zeros of function $F$ for fixed $\\gamma=16$ and increasing $\\ell$. Panel b$_1$ is the magnification of some region of (b) and shows more clearly that imaginary parts of all three shown eigenvalues are positive for all sufficiently large $\\ell$. 
\\textcolor{black}{For each three shown eigenvalues, its real and \timaginary parts are of the same colour in both panels.}}\n\t\\label{fig:g=16}\n\\end{figure}\n\nThe third group of spectral singularities [green curves in figure~\\ref{fig:general}(a)] was obtained by means of the continuation from the solution $\\gamma_\\star^{(2)}$ and $k_\\star^{(2)}$ in Table~\\ref{tbl:1}. In view of the very complicated structure of the overall resulting picture, we only show a section of these curves corresponding to relatively small values of the gain-and-loss amplitudes $\\gamma$. Quite interestingly, there is no gap between the blue and green groups of curves, which results in the multitude of intersections between blue and green curves, \\textcolor{black}{see Fig.~\\ref{fig:general}(a) and the magnified view in Fig.~\\ref{fig:general}(c)}. These intersections suggest a possibility of simultaneous emergence of two complex-conjugate pairs of eigenvalues from two different interior points of the continuous spectra.\n\nConsidering further solutions $\\gamma_\\star^{(n)}$, $k_\\star^{(n)}$, $n=3,4,\\ldots$, in the limit $\\ell=0$ one can construct new groups of spectral singularities with larger values of $\\gamma$, which are not shown in Figure~\\ref{fig:general}.\n\n\n5.3 Creating a spectral singularity at a given wavenumber\n\\subsection{Creating a spectral singularity at a given wavenumber}\n\n\\begin{table}\n\t\\centering\n\t\\begin{tabular}{c|cccc}\n\t\t$n$ & 0 & 1& 2&\\\\\\hline\n\t\t$\\gamma_\\star^{(n)}$ & 2.071&13.307&27.783 &\\\\[2mm]\n\t\t$k_\\star^{(n)}$ & 1.065&4.318 &7.529 &\n\t\\end{tabular}\n\t\\caption{Approximate values of gain-and-loss amplitudes $\\gamma=\\gamma_\\star^{(n)}$ and wavenumbers $k=k_\\star^{(n)}$, $n=0, 1, \\ldots$, corresponding to spectral singularities with lowest $\\gamma$ in the limit $\\ell=0$, see Table~I in \\cite{Mostafazadeh2009}. 
\\label{tbl:1}}\n\\end{table}\n\n\nFor $\\ell=0$, a spectral singularity can be obtained only for certain isolated values of the wavenumber $k$ and the gain-and-loss amplitude $\\gamma$ \\cite{Mostafazadeh2009}. Several lowest values of $\\gamma$ corresponding to spectral singularities, together with the associated wavenumbers $k$, are listed in Table~\\ref{tbl:1}. An important advantage of the more general system with nonzero gain-to-loss distance $\\ell>0$ is the possibility of creating a spectral singularity at any wavenumber $k$ given beforehand. Indeed, let us return to equations (\\ref{4.4}), (\\ref{4.10}) and address the following question: given a point $k$ on the real axis, how should $\\b$ and $\\ell$ be chosen to have a resonance at this point? Equations (\\ref{4.4})--(\\ref{4.10}) allow us to answer this question easily.\n\nWe fix $k>0$ and find the associated value of $u$ by resolving (\\ref{4.1}):\n\\begin{equation}\\label{4.12}\nu^{-2}-u^2=4k^2\\b^{-2},\\qquad\nu=\\b R^{-1}, \\quad \\mbox{where\\ } R = \\sqrt{2k^2+\\sqrt{4k^4+\\b^4}}.\n\\end{equation}\nWe divide equation (\\ref{4.4}) by $u^6$ and then substitute the above formulae together with\n\\begin{equation*}\n\\frac{1-u^{12}}{u^6}=\\frac{1-u^4}{u^2}\\frac{1+u^4+u^8}{u^4}=(u^{-2}-u^2)\\big((u^{-2}-u^2)^2+3\\big).\n\\end{equation*}\nThis gives the equation\n\\begin{equation}\\label{4.11}\n\\begin{aligned}\n2\\b^4 k^2&\\cosh(\\b^2R^{-1}) \\cos R - \\b^6 \\sinh(\\b^2R^{-1}) \\sin R +2k^2(16k^4+3\\b^4)=0.\n\\end{aligned}\n\\end{equation}\nAn algorithm for creating a resonance at a prescribed point $k$ is as follows. Given $k>0$, we first solve equation (\\ref{4.11}) with respect to $\\b$ and find $u$ from (\\ref{4.12}). The needed values of $\\ell$ are then determined by (\\ref{4.10}), (\\ref{4.5}).\n\n\nIn order to illustrate this algorithm, let us consider a finite interval of wavenumbers $k\\in (0, k_1]$, where we set $k_1 = 10$ for the numerics reported below. 
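As an illustration only (not the authors' code), the scan-and-bisect step of this algorithm can be sketched in Python. The function names, the scanned range of $\b$, the scan step, and the demo value $k=1$ are our assumptions, and the final selection of $\ell$ via (\ref{4.10}), (\ref{4.5}) is omitted because those relations are not reproduced in this section.

```python
import math

def R_of(k, b):
    """R = sqrt(2 k^2 + sqrt(4 k^4 + b^4)), as in Eq. (4.12); b stands for beta."""
    return math.sqrt(2.0*k**2 + math.sqrt(4.0*k**4 + b**4))

def G(b, k):
    """Left-hand side of Eq. (4.11) as a function of beta for fixed k."""
    R = R_of(k, b)
    return (2.0*b**4*k**2*math.cosh(b**2/R)*math.cos(R)
            - b**6*math.sinh(b**2/R)*math.sin(R)
            + 2.0*k**2*(16.0*k**4 + 3.0*b**4))

def smallest_positive_root(k, b_max=20.0, db=1e-2, tol=1e-12):
    """Scan (0, b_max] for the first sign change of G and refine it by
    bisection (the dichotomy method mentioned in the text)."""
    b_prev, g_prev = db, G(db, k)
    b = b_prev + db
    while b <= b_max:
        g = G(b, k)
        if g_prev * g < 0.0:               # root bracketed in (b_prev, b)
            lo, hi = b_prev, b
            while hi - lo > tol:
                mid = 0.5*(lo + hi)
                if G(lo, k) * G(mid, k) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            return 0.5*(lo + hi)
        b_prev, g_prev = b, g
        b += db
    return None                            # no root found in the scanned range

k = 1.0
beta = smallest_positive_root(k)           # minimal positive root of Eq. (4.11)
u = beta / R_of(k, beta)                   # u = beta R^{-1}, Eq. (4.12)
```

Choosing the first sign change found while scanning upward gives the minimal positive root, i.e. the smallest gain-and-loss amplitude realizing the singularity at the chosen $k$.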
We scan the chosen interval with a sufficiently small step ($\\Delta k=0.01$) and for each value of $k$ solve equation (\\ref{4.11}) numerically using the simple bisection (dichotomy) method. While for each $k$ equation (\\ref{4.11}) might have several roots $\\beta$, in our numerical procedure we always choose the minimal positive root, i.e., the one which allows one to achieve the spectral singularity with the given $k$ at the smallest possible value of the gain-and-loss amplitude $\\gamma=\\gamma_\\textrm{min}$. Next, we choose the minimal positive distance $\\ell_\\textrm{min}$ which satisfies the conditions (\\ref{4.10}), (\\ref{4.5}), and then use the periodicity in $\\ell$ to generate a sequence of larger gain-to-loss distances $\\ell_n = \\ell_\\textrm{min} + n\\pi/(2k)$, $n=1,2,\\ldots$ [see (\\ref{eq:periodic})]. The resulting dependencies $\\gamma_\\textrm{min}(k)$ and $\\ell_\\textrm{min}(k)$, $\\ell_n(k)$ are shown in figure~\\ref{fig:min}. The minimal gain-and-loss amplitude $\\gamma_\\textrm{min}(k)$ and the minimal gain-to-loss distance $\\ell_\\textrm{min}(k)$ are discontinuous functions of $k$, which means that a small variation in the wavenumber $k$ might require a significant change in either $\\gamma$ or $\\ell$. It is especially important that the values of the distance $\\ell$ are generically different from zero, which shows explicitly that the new degree of freedom offered by the nonzero gain-to-loss distance is essential for achieving a spectral singularity at a given wavenumber $k$.\n\n\n\\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.8\\columnwidth]{fig04.eps}\n\t\\caption{(a) Minimal value of the gain-and-loss amplitude $\\gamma_\\textrm{min}$ which corresponds to a spectral singularity with the given value of the wavenumber $k$. 
(b) Minimal gain-to-loss distance $\\ell_\\textrm{min}$ which corresponds to a spectral singularity with the $k$ and $\\gamma$ from the left panel (bold curves), and larger distances $\\ell_n$ obtained using the periodicity in $\\ell$ (thin \\textcolor{black}{dotted} curves).}\n\t\\label{fig:min}\n\\end{figure}\n\n\n5.4 $\\PT$-symmetry breaking laser-antilaser threshold\n\\subsection{$\\PT$-symmetry breaking laser-antilaser threshold}\n\nA particularly important characteristic of any $\\PT$-symmetric structure is the $\\PT$-symmetry breaking threshold, i.e., the gain-and-loss amplitude\ncorresponding to the ``phase transition'' from a purely real to a complex spectrum. The best studied scenario of the phase transition is the collision of two real discrete eigenvalues at an exceptional point with the subsequent splitting into a complex-conjugate pair. However, in systems with nonempty continuous spectrum, the phase transition can also occur through the splitting of a self-dual spectral singularity, which results in a bifurcation of a complex-conjugate pair from an interior point of the continuum \\cite{Yang17,KZ17,Konotop2019}. At the moment corresponding to the formation of the spectral singularity, the system operates in the CPA-laser regime \\cite{Longhi10}. Thus, in such a system the $\\PT$-symmetry breaking threshold at the same time corresponds to the CPA-laser threshold.\n\nLemma~\\ref{lm3.3} guarantees that the spectrum of our system is real for sufficiently small gain-and-loss amplitudes $\\gamma$. Additionally, according to Lemma~\\ref{lm3.5}, the spectrum does not have any real discrete eigenvalues. Hence, the $\\PT$-symmetry breaking is expected to occur through the emergence of a self-dual spectral singularity. 
In order to identify the $\\PT$-symmetry breaking threshold in our system, we start from the limit $\\ell=0$, where the phase transition takes place at $\\gamma_\\star^{(0)} \\approx 2.071$, see \\cite{Mostafazadeh2009,KZ17} and Table~\\ref{tbl:1}. Thus, at $\\ell=0$ the spectrum is purely real and continuous for $\\gamma \\in [0, \\gamma_\\star^{(0)}]$, while increasing the gain-and-loss just above $\\gamma_\\star^{(0)}$ leads to the bifurcation of a complex-conjugate pair from an interior point of the continuum. The spectral singularity forming at $\\gamma_\\star^{(0)}$ occurs at the wavenumber $k=k_\\star^{(0)} \\approx 1.065$. Respectively, the complex-conjugate pair of eigenvalues bifurcates from $\\lambda_0 = [k_\\star^{(0)}]^2$. (Notice that a further increase of $\\gamma$ above the next threshold values listed in Table~\\ref{tbl:1} leads to the formation of new spectral singularities and, respectively, to bifurcations of new complex-conjugate pairs in the spectrum.)\n\nNext, we use numerical continuation in $\\ell$ in order to continue the known solution at $\\ell=0$ to the domain $\\ell>0$. The obtained dependence of the threshold value of the gain-and-loss amplitude on the distance $\\ell$ is shown in figure~\\ref{fig:threshold}(a), where we observe that the phase transition threshold decreases monotonically with the growth of $\\ell$. This means that, by introducing additional space between the gain and loss, one can decrease the $\\PT$-symmetry breaking threshold, i.e. achieve the laser-antilaser operation at \\textcolor{black}{lower} gain-and-loss amplitudes than in a waveguide with adjacent gain and loss.\n\n\n\\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.85\\columnwidth]{fig05.eps}\n\t\\caption{(a) $\\PT$-symmetry breaking threshold $\\gamma_\\star^{(0)}$ \\textit{vs} the distance between the gain and loss $\\ell$. 
The spectrum is purely real and continuous for $\\gamma\\leq \\gamma_\\star^{(0)}$, but acquires a pair of complex-conjugate eigenvalues as the gain-and-loss amplitude exceeds the threshold $\\gamma_\\star^{(0)}$. (b) Values of the wavenumber $k_\\star^{(0)}$ corresponding to the dependence in (a).}\n\t\\label{fig:threshold}\n\\end{figure}\n\n5.5 General picture of spectral singularities\n\\subsection{General picture of spectral singularities}\n\\label{sec:general}\n\n\\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.99\\columnwidth]{fig06.eps}\n\t\\caption{(a) Values of the gain-and-loss amplitude $\\gamma$ and distance $\\ell$ for which spectral singularities occur. (b) Corresponding wavenumbers $k$. \\textcolor{black}{Panel (c) magnifies the region $(\\ell, \\gamma)\\in [2,4]\\times[10,50]$ from (a), and panel (d) is the magnification of the corresponding curves from panel (b).}}\n\t\\label{fig:general}\n\\end{figure}\n\nLet us now turn to the description of the general picture of spectral singularities. In order to construct different solutions systematically, we again start from the limit $\\ell=0$, where the values of $\\gamma$ and $k$ corresponding to spectral singularities are known, see Table~\\ref{tbl:1}. Then we use the periodicity of the function $e^{-4ik\\ell}$ in $\\ell$ in order to construct new branches of solutions having no counterparts in the limit $\\ell=0$, cf. equation (\\ref{eq:periodic}). This procedure results in a fairly complicated picture containing a multitude of spectral singularities, a part of which (corresponding to relatively small values of the gain-and-loss) is shown in figure~\\ref{fig:general}(a,b) as curves in the plane $\\gamma$ {\\it vs} $\\ell$ and $k$ {\\it vs} $\\ell$.\n\n\nLet us describe the structure of the found solutions using the diagram $\\gamma$ {\\it vs} $\\ell$ in figure~\\ref{fig:general}(a). 
The multitude of curves shown in this plot can be divided into three groups (plotted with red, blue and green curves), which have been obtained by means of the continuation from three different solutions in the limit $\\ell=0$. There is a clearly visible vertical gap between the red and blue curves, which corresponds to the ``forbidden'' values of the gain-and-loss amplitude $\\gamma$ found above in (\\ref{eq:gap}). At the same time, there is no gap between the blue and green curves, which results in a multitude of intersections between these curves.\nThe first group of spectral singularities, corresponding to the red curves in Figure~\\ref{fig:general}(a), is obtained through the continuation from the spectral singularity in the limit $\\ell=0$ with the smallest gain-and-loss amplitude, i.e. from the values $\\gamma_\\star^{(0)}$ and $k_\\star^{(0)}$ in Table~\\ref{tbl:1}. The leftmost (bold) curve in this group, which originates in the limit $\\ell=0$, is the $\\PT$-symmetry breaking threshold already shown in figure~\\ref{fig:threshold}(a). For values of $\\gamma$ above this curve, there are always one or more (but finitely many, see Section~\\ref{sec:zeros}) complex-conjugate pairs of eigenvalues in the spectrum. Several curves situated to the right of the bold red curve are obtained using the fact that if $\\gamma$ and $k$ correspond to a spectral singularity in the limit $\\ell=0$, then the same values of $\\gamma$ and $k$ also correspond to a spectral singularity with $\\ell_n = n\\pi/(2k)$, $n=1,2,\\ldots$. Once a single solution with a new distance $\\ell_n$ is obtained, a new branch of solutions can be constructed using numerical continuation in $\\gamma$ or in $\\ell$. In the limit $\\ell \\to \\infty$ all the red curves in figure~\\ref{fig:general}(a) approach either the asymptotic value $\\gamma=0$ or the asymptote $\\gamma= \\pi^2/2\\approx 4.935$. 
Notice that, except for the bold line demarcating the $\\PT$-symmetry breaking threshold, none of the red curves can be continued to the limit $\\ell\\to 0$.\n\nThe multitude of red curves in figure~\\ref{fig:general}(a) demonstrates explicitly how new spectral singularities emerge with the increase of $\\ell$. Indeed, drawing an imaginary horizontal line, say, at $\\gamma=3$ [see the vertical axis tick in figure~\\ref{fig:general}(a)], we observe that with the increase of $\\ell$ this line intersects more and more red curves. Each intersection corresponds to the values of $\\gamma$ and $\\ell$ at which some root $k$ crosses the real axis and descends from the upper to the lower complex half-plane. Thus, a finite-width resonance transforms into a complex eigenvalue through the spectral singularity (we recall again that, in view of $\\PT$ symmetry, any root $k\\ne 0$ crosses the real line simultaneously with its counterpart $-\\bar{k}$, i.e. the corresponding spectral singularity is self-dual).\nIn order to illustrate this process, in Figure~\\ref{fig:g=3} we show the evolution of three numerically found complex zeros of the function $F(k, \\gamma, \\ell)$ under the increase of $\\ell$. In this figure the imaginary part of each complex zero changes sign from positive to negative and then asymptotically approaches zero (remaining negative). In the limit of large $\\ell$, this behavior agrees with expansion (\\ref{3.4}) of Lemma~\\ref{lm3.4}, where, for the chosen value of $\\gamma$, we have $\\sin\\sqrt{2\\gamma}>0$. Thus the growing distance $\\ell$ results in a sequence of self-dual spectral singularities and in an increasing number of complex-conjugate eigenvalues in the spectrum.\n\n \\begin{figure}\n \t\\centering\n \t\\includegraphics[width=0.8\\columnwidth]{fig07.eps}\n \t\\caption{(a,b) Real and imaginary parts of three complex zeros of the function $F$ for fixed $\\gamma=3$ and increasing $\\ell$. 
For each curve, the imaginary part is positive for small $\\ell$ and becomes negative for all sufficiently large $\\ell$. \\textcolor{black}{For each of the three shown eigenvalues, the real and imaginary parts are plotted in the same colour in both panels.}}\n \t\\label{fig:g=3}\n \\end{figure}\n\n\nThe second group of spectral singularities [blue curves in figure~\\ref{fig:general}(a)] was obtained by means of the continuation from the next solution in the limit $\\ell=0$, i.e. from $\\gamma_\\star^{(1)}$ and $k_\\star^{(1)}$ in Table~\\ref{tbl:1}. Again, one of the curves [the leftmost bold curve in figure~\\ref{fig:general}(a)] was obtained through the direct continuation from the limit $\\ell=0$, while the other blue curves were generated using the periodicity of the function $F(k, \\gamma, \\ell)$ in $\\ell$ and cannot be continued to the limit $\\ell\\to0$. In the limit $\\ell\\to\\infty$ the blue curves approach the horizontal asymptotes $\\gamma=2\\pi^2\\approx19.739$ and $\\gamma=9\\pi^2/2\\approx 44.413$. In the $(\\gamma, \\ell)$-plane the gain-and-loss amplitudes corresponding to the group of blue curves are well separated from those for the red curves: indeed, all red curves are bounded from above by the asymptote $\\gamma= \\pi^2/2\\approx 4.935$, while all blue curves are bounded from below by $\\gamma_*\\approx 11.561$, see (\\ref{eq:gap}). Thus, the emergence of new spectral singularities with the increase of $\\ell$ is sensitive to the value of the gain-and-loss amplitude and does not occur for gain-and-loss amplitudes lying in the gap between the red and blue curves.\n\nIn comparison with the red curves discussed above, the curves from the blue group in figure~\\ref{fig:general}(a) feature more complicated behavior and, in particular, can intersect each other (and also intersect the curves from the next, third group of green curves discussed below). 
The intersections between the blue curves occur for gain-and-loss amplitudes in the interval $11.561 \\lessapprox \\gamma < 2\\pi^2\\approx 19.739$, where $\\sin\\sqrt{2\\gamma}$ is negative. \\textcolor{black}{At first glance,} this might seem to contradict the expansion (\\ref{3.4}) of Lemma~\\ref{lm3.4}, which suggests that in this case the multitude of complex zeros accumulates in the upper complex half-plane with the growth of $\\ell$. However, this apparent contradiction is resolved if we trace the behavior of the complex roots more closely. Indeed, choosing as an example $\\gamma =16$ [see the vertical axis tick in figure~\\ref{fig:general}(a)] and computing several complex roots under the increase of $\\ell$, we observe that each considered root first descends from the upper half-plane to the lower one but then returns to the upper half-plane, see Figure~\\ref{fig:g=16}(b) and (b$_1$). Thus, in this interval of gain-and-loss amplitudes the increase of $\\ell$ results either in the transformation of a resonance into an eigenvalue or in the opposite process, i.e. the disappearance of a complex-conjugate pair. Respectively, an intersection between two blue curves corresponds to two coexisting spectral singularities, i.e. to the moment when one complex-conjugate pair of eigenvalues disappears and another pair (with a different $k$) emerges. For sufficiently large $\\ell$ the imaginary part of each considered root remains positive and approaches zero, in accordance with expansion (\\ref{3.4}). 
Thus, in this interval of gain-and-loss strengths, in the limit $\\ell\\to\\infty$ the spectrum contains only a finite number of complex-conjugate eigenvalues, which correspond to complex zeros $k$ whose behavior is not covered by expansion (\\ref{3.4}).\n\n \\begin{figure}\n\t\\centering\n\t\\includegraphics[width=0.99\\columnwidth]{fig08.eps}\n\t\\caption{(a,b) Real and imaginary parts of three complex zeros of the function $F$ for fixed $\\gamma=16$ and increasing $\\ell$. Panel (b$_1$) magnifies a region of (b) and shows more clearly that the imaginary parts of all three shown eigenvalues are positive for all sufficiently large $\\ell$. \\textcolor{black}{For each of the three shown eigenvalues, the real and imaginary parts are plotted in the same colour in both panels.}}\n\t\\label{fig:g=16}\n\\end{figure}\n\nThe third group of spectral singularities [green curves in figure~\\ref{fig:general}(a)] was obtained by means of the continuation from the solution $\\gamma_\\star^{(2)}$ and $k_\\star^{(2)}$ in Table~\\ref{tbl:1}. In view of the very complicated structure of the overall resulting picture, we only show a section of these curves corresponding to relatively small values of the gain-and-loss amplitude $\\gamma$. Quite interestingly, there is no gap between the blue and green groups of curves, which results in a multitude of intersections between blue and green curves, \\textcolor{black}{see Fig.~\\ref{fig:general}(a) and the magnified view in Fig.~\\ref{fig:general}(c)}. 
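The root tracking used to produce Figures~\ref{fig:g=3} and \ref{fig:g=16} is a standard continuation in $\ell$: step the parameter and re-converge the complex zero by Newton's method from its previous position. Since the function $F(k,\gamma,\ell)$ is not reproduced in this section, the sketch below (ours, not the authors') uses a hypothetical stand-in `F_toy` whose zero deliberately crosses the real axis as the parameter grows, mimicking a self-dual spectral singularity; everything except the generic Newton/continuation steps is an assumption.

```python
import math

def F_toy(k, l):
    """Hypothetical stand-in for the secular function F(k, gamma, l): its zero
    k = 1 + 0.3i sin(l) crosses the real k-axis whenever sin(l) changes sign."""
    return (k - (1.0 + 0.3j*math.sin(l))) * (k + 2.0)

def newton_zero(f, k0, l, tol=1e-12, max_iter=60):
    """Re-converge a complex zero of f(., l) from the initial guess k0."""
    k = k0
    for _ in range(max_iter):
        h = 1e-7
        df = (f(k + h, l) - f(k - h, l)) / (2.0*h)  # central-difference derivative
        step = f(k, l) / df
        k -= step
        if abs(step) < tol:
            break
    return k

def track_zero(f, k0, l_values):
    """Continuation in l: the zero found at the previous l seeds the next solve."""
    path, k = [], k0
    for l in l_values:
        k = newton_zero(f, k, l)
        path.append(k)
    return path

ls = [0.05*i for i in range(1, 200)]
path = track_zero(F_toy, 1.0 + 0.0j, ls)
# the tracked zero visits both half-planes, i.e. it crosses the real axis
```

The moments at which the tracked imaginary part changes sign are exactly the parameter values one would record as spectral singularities.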
These intersections suggest the possibility of the simultaneous emergence of two complex-conjugate pairs of eigenvalues from two different interior points of the continuous spectrum.\n\nConsidering further solutions $\\gamma_\\star^{(n)}$, $k_\\star^{(n)}$, $n=3,4,\\ldots$, in the limit $\\ell=0$, one can construct new groups of spectral singularities with larger values of $\\gamma$, which are not shown in Figure~\\ref{fig:general}.\n\n\n", "Descriptive_question1": "What is the approximate value of the gain-and-loss amplitude for n=0 in table_1?", "Descriptive_question2": "What is the wavenumber corresponding to n=1 in table_1?", "Reasoning_question1": "How does the gain-and-loss amplitude change as n increases from 0 to 2 in table_1, and what might this indicate about the system behavior?", "Reasoning_question2": "Comparing the wavenumbers for n=0, 1, and 2 in table_1, what trend can be observed, and what could be a possible explanation for this pattern?", "Descriptive_answer1": "2.071", "Descriptive_answer2": "4.318", "Reasoning_answer1": "As n increases from 0 to 2 in Table 1, the gain-and-loss amplitude (γ⋆^{(n)}) rises significantly from 2.071 to 13.307 to 27.783. This indicates a clear upward trend. To understand this behavior, let's consider the increments: from n=0 to n=1, the amplitude increases by approximately 11.236, and from n=1 to n=2, it increases by approximately 14.476. This suggests that the amplitude is not only increasing but doing so at an accelerating rate. Such a pattern might indicate that the system requires increasingly higher gain-and-loss amplitudes to achieve spectral singularities at higher n values, possibly reflecting greater energy or instability thresholds as the system's mode or order (represented by n) increases. 
This could be tied to the underlying physics of spectral singularities, where higher-order singularities demand more pronounced gain-loss interactions to manifest.", "Reasoning_answer2": "Comparing the wavenumbers (k⋆^{(n)}) for n=0, 1, and 2 in Table 1, we observe values of 1.065, 4.318, and 7.529, respectively. This shows a consistent increase in wavenumber as n increases. Analyzing the differences, from n=0 to n=1, the wavenumber increases by approximately 3.253, and from n=1 to n=2, it increases by approximately 3.211. The roughly equal increments suggest a near-linear trend in wavenumber growth with n. A possible explanation for this pattern could be related to the spatial or frequency characteristics of the system. Since the wavenumber k is related to the spatial frequency of the wave, an increasing k with n might indicate that higher-order spectral singularities correspond to waves with shorter wavelengths or higher frequencies. This could reflect a systematic progression in the system's resonant modes or boundary conditions as n increases, aligning with the theoretical framework of spectral singularities in the limit ℓ=0." 
}, { "paper_id": "1605.00335.json", "table_id": "table_1", "table_content": "\\begin{table}[!t]\n\\footnotesize\n\\centering\n\\caption{Comparison of the AUC and runtime for OGM, I-GPOM, and I-GPOM2 using the Intel dataset.}\n\\begin{tabular}{lcc}\n\\toprule\nMethod\t\t\t& AUC\t\t& Runtime (min) \\\\ \\midrule\nOGM\t\t\t& 0.9300\t& 7.28 \t\\\\\nI-GPOM\t\t\t& 0.9439\t& 102.44 \t\\\\\nI-GPOM2\t\t\t& 0.9668\t& 114.53 \\\\ \\bottomrule\n\\end{tabular}\n\\label{tab:aucroc}\n\\end{table}", "caption": "Comparison of the AUC and runtime for OGM, I-GPOM, and I-GPOM2 using the Intel dataset.", "label": "tab:aucroc", "section_info": "3 Mapping\n\\section{Mapping}\n\\label{sec:mapping}\nThe GP mapper module, shown in Figure~\\ref{fig:mapper}, takes the processed measurements, i.e.\\@ training data, and a test-point window centered at the current robot pose as inputs, performs the regression and classification steps to generate local maps, and fuses them incrementally into the global frame through the BCM technique~\\citep{tresp2000bayesian}.\n\nBefore the formal statement of the problem, we clarify the following assumptions.\n\\begin{assumption}[Static environment]\nThe environment that the robot navigates in is static.\n\\end{assumption}\n\\begin{assumption}[Gaussian occupancy map points]\n Any sampled point from the occupancy map representation of the environment is a random variable whose distribution is Gaussian.\n\\end{assumption}\n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=.7\\columnwidth]{mapper}\n \\caption{Schematic illustration of the GP Mapper module. The GP models the correlations in the data and places distributions on the test points. 
The logistic regression classifier squashes the output of the GP into probabilities and returns the local map, while the BCM module updates the global map incrementally.}\n \\label{fig:mapper}\n\\end{figure}\n\n\\subsection{Gaussian Processes}\n\\label{subsec:GPs}\nA Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution \\citep{rasmussen2006gaussian}. The joint distribution of the observed target values, $\\boldsymbol y$, and the function values (the latent variable), $\\boldsymbol f_*$, at the query points can be written as\n\\begin{equation}\n\\label{eq:gp_joint}\n \\begin{bmatrix}\n\t\\boldsymbol y \\\\\n\t\\boldsymbol f_*\n \\end{bmatrix} \\sim \\mathcal{N}(\\boldsymbol 0,\n \\begin{bmatrix}\n\t\\boldsymbol K(\\boldsymbol X,\\boldsymbol X)+\\sigma_n^2 \\boldsymbol I_{n} & \\boldsymbol K(\\boldsymbol X,\\boldsymbol X_*) \\\\\n\t\\boldsymbol K(\\boldsymbol X_*,\\boldsymbol X)\t\t\t& \\boldsymbol K(\\boldsymbol X_*,\\boldsymbol X_*) \n \\end{bmatrix})\n\\end{equation}\nwhere $\\boldsymbol X$ is the $d\\times n$ design matrix of aggregated input vectors $\\boldsymbol x$, $\\boldsymbol X_*$ is a $d\\times n_*$ matrix of query points, $\\boldsymbol K(\\cdot,\\cdot)$ is the GP covariance matrix, and $\\sigma_n^2$ is the variance of the observation noise, which is assumed to be independent and identically distributed (i.i.d.) Gaussian. Define a training set \\mbox{$\\mathcal{D} = \\{(\\boldsymbol x^{[i]},y^{[i]}) \\mid i=1\\colon n\\}$}. 
The predictive conditional distribution for a single query point, $f_*|\\mathcal{D},\\boldsymbol x_* \\sim \\mathcal{N}(\\EV{f_*},\\Var{f_*})$, can be derived as\n\\begin{equation}\n \\label{eq:gp_mean}\n \\mu = \\EV{f_*} = \\boldsymbol k(\\boldsymbol X,\\boldsymbol x_*)^{T}[\\boldsymbol K(\\boldsymbol X,\\boldsymbol X)+\\sigma_n^2 \\boldsymbol I_{n}]^{-1}\\boldsymbol y\n\\end{equation}\n\\begin{align}\n\\label{eq:gp_cov}\n \\sigma = \\Var{f_*} = k(\\boldsymbol x_*,\\boldsymbol x_*) - \\boldsymbol k(\\boldsymbol X,\\boldsymbol x_*)^{T}[\\boldsymbol K(\\boldsymbol X,\\boldsymbol X)+\\sigma_n^2 \\boldsymbol I_{n}]^{-1}\\boldsymbol k(\\boldsymbol X,\\boldsymbol x_*)\n\\end{align}\n\nThe Mat\\'ern family of covariance functions \\citep{stein1999interpolation} has proven powerful for modeling structural correlations \\citep{jadidi2013exploration,kim2013occupancy,maani2014com,kim2015gpmap}, and hence we select it as the kernel of the GPs. For a single query point $\\boldsymbol x_*$ the covariance function is given by\n\\begin{align}\n\\label{eq:Matern}\nk(\\boldsymbol x,\\boldsymbol x_*) = \\frac{1}{\\Gamma(\\nu) 2^{\\nu-1}}\\left[\\frac{\\sqrt{2\\nu}\\lVert \\boldsymbol x - \\boldsymbol x_* \\rVert}{\\kappa}\\right]^{\\nu} K_{\\nu}\\left(\\frac{\\sqrt{2\\nu}\\lVert \\boldsymbol x - \\boldsymbol x_* \\rVert}{\\kappa} \\right)\n\\end{align}\nwhere $\\Gamma$ is the Gamma function, $K_{\\nu}(\\cdot)$ is the modified Bessel function of the second kind of order $\\nu$, $\\kappa$ is the characteristic length scale, and $\\nu$ is a positive parameter used to control the smoothness of the covariance.\n\nThe hyperparameters of the covariance and mean functions, $\\boldsymbol\\theta$, can be computed by minimization of the negative log marginal likelihood (NLML) function:\n\\begin{align}\n\\label{eq:nlml}\n\t\\log p(\\boldsymbol y|\\boldsymbol X,\\boldsymbol\\theta) = &-\\frac{1}{2}\\boldsymbol y^{T}[\\boldsymbol K(\\boldsymbol X,\\boldsymbol X)+\\sigma_n^2 \\boldsymbol 
I_{n}]^{-1}\\boldsymbol y -\\frac{1}{2}\\log \\arrowvert \\boldsymbol K(\\boldsymbol X,\\boldsymbol X)+\\sigma_n^2 \\boldsymbol I_{n} \\arrowvert-\\frac{n}{2}\\log 2\\pi\n\\end{align}\n\n\\subsection{Problem statement and formulation}\n\\label{subsec:statement}\nLet $\\mathcal{M}$ be the set of possible occupancy maps. We consider the map of the environment to be static and model it as an $n_m$-tuple random variable \\mbox{$(M^{[1]},...,M^{[n_m]})$} whose elements are described by normal distributions \\mbox{$m^{[i]} \\sim \\mathcal{N}(\\mu^{[i]},\\sigma^{[i]})$, $i \\in \\{1\\colon n_m\\}$}. Let \\mbox{$\\mathcal{Z} \\subset \\mathbb{R}_{\\geq 0}$} be the set of possible range measurements. The observation consists of an $n_z$-tuple random variable $(Z^{[1]},...,Z^{[n_z]})$ whose elements take values \\mbox{$\\boldsymbol z^{[k]} \\in \\mathcal{Z}$, $k \\in \\{1\\colon n_z\\}$}. Let $\\mathcal{X} \\subset \\mathbb{R}^2$ be the set of spatial coordinates to build a map on. Let $\\boldsymbol x_o^{[k]} \\in \\mathcal{X}_o \\subset \\mathcal{X}$ be an occupied point observed by the $k$-th sensor beam which, at any time-step $t$, can be calculated by transforming the local observation $\\boldsymbol z^{[k]}$ to the global frame using the robot pose $\\boldsymbol x_t \\in \\mathrm{SE(2)}$. Let $\\boldsymbol X_f^{[k]} \\in \\mathcal{X}_f \\subset \\mathcal{X}$ be the matrix of unoccupied points sampled from the line segment with the robot pose and the corresponding observed occupied point as its endpoints. Let $\\mathcal{D}=\\mathcal{D}_o \\cup \\mathcal{D}_f$ be the set of all training points. 
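For illustration, the predictive equations (\ref{eq:gp_mean}) and (\ref{eq:gp_cov}) translate directly into code. The snippet below is our sketch, not the authors' implementation: it uses the $\nu=3/2$ closed form of the Mat\'ern covariance (\ref{eq:Matern}), $k(r)=(1+\sqrt{3}\,r/\kappa)\,e^{-\sqrt{3}\,r/\kappa}$, and assumed hyperparameters $\kappa=1$, $\sigma_n=0.1$.

```python
import numpy as np

def matern32(A, B, kappa=1.0):
    """Matern covariance, Eq. (eq:Matern), in its nu = 3/2 closed form:
    k(r) = (1 + sqrt(3) r / kappa) exp(-sqrt(3) r / kappa)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    s = np.sqrt(3.0) * d / kappa
    return (1.0 + s) * np.exp(-s)

def gp_predict(X, y, X_star, kappa=1.0, sigma_n=0.1):
    """Predictive mean and variance, Eqs. (eq:gp_mean) and (eq:gp_cov)."""
    K = matern32(X, X, kappa) + sigma_n**2 * np.eye(len(X))
    k_star = matern32(X, X_star, kappa)          # n x n_* cross-covariances
    alpha = np.linalg.solve(K, y)                # [K + sigma_n^2 I]^{-1} y
    mu = k_star.T @ alpha                        # Eq. (eq:gp_mean)
    v = np.linalg.solve(K, k_star)
    var = np.diag(matern32(X_star, X_star, kappa)) - np.sum(k_star * v, axis=0)
    return mu, var                               # Eq. (eq:gp_cov), per query point

# tiny one-beam demo: two free points (y = -1) and one occupied point (y = +1)
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])
mu, var = gp_predict(X, y, np.array([[2.0, 0.0], [5.0, 5.0]]))
```

Near the training data the predictive mean approaches the target labels with small variance, while far from the data it reverts to the zero prior mean with unit prior variance, which is exactly the behavior the logistic-regression squashing step relies on.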
We define a training set of occupied points \\mbox{$\\mathcal{D}_o = \\{(\\boldsymbol x_o^{[i]},y_o^{[i]}) \\mid i=1\\colon n_o\\}$} and a training set of unoccupied points \\mbox{$\\mathcal{D}_f = \\{(\\boldsymbol x_f^{[i]},y_f^{[i]}) \\mid i=1\\colon n_f\\}$}, in which \\mbox{$\\boldsymbol y_o = \\mathrm{vec}(y_o^{[1]},...,y_o^{[n_o]})$} and \\mbox{$\\boldsymbol y_f = \\mathrm{vec}(y_f^{[1]},...,y_f^{[n_f]})$} are target vectors whose elements belong to the set $\\mathcal{Y}=\\{-1,+1\\}$, where $-1$ and $+1$ correspond to unoccupied and occupied locations, respectively; $n_o$ is the total number of occupied points and $n_f$ is the total number of unoccupied points. Given the robot pose $\\boldsymbol x_t$ and observations $Z_t= \\boldsymbol z_t$, we wish to estimate \\mbox{$p(M=m\\mid \\boldsymbol x_t, Z_t= \\boldsymbol z_t)$}. Placing a joint distribution over $M$, the map can be inferred as a Gaussian process by defining the process as the function $y:\\mathcal{X}\\rightarrow\\mathcal{M}$; therefore\n\\begin{equation}\n \\label{eq:mapGP}\n y(\\boldsymbol x) \\sim \\mathcal{GP}(f_m(\\boldsymbol x), k(\\boldsymbol x,\\boldsymbol x'))\n\\end{equation}\nThe mean function $f_m(\\boldsymbol x)$ is set to zero unless it is mentioned explicitly that $f_m(\\boldsymbol x)\\neq0$. For a given query point in the map, $\\boldsymbol x_*$, the GP predicts a mean, $\\mu$, and an associated variance, $\\sigma$. 
We can write\n\\begin{equation}\n \\label{eq:mapy}\n m^{[i]} = y(\\boldsymbol x^{[i]}_*) \\sim \\mathcal{N}(\\mu^{[i]},\\sigma^{[i]})\n\\end{equation}\nTo obtain a valid probabilistic representation of the map $p(m^{[i]})$, a classification step using a logistic regression classifier~\\citep[Sections 3.1 and 3.2]{rasmussen2006gaussian},~\\citep[Chapter 8]{murphy2012machine},~\\citep{maani2014com} squashes the data into the range $[0,1]$.\n\n\\subsection{Sensor model, training and test data}\n\\label{subsec:sensor}\nThe robot is assumed to be equipped with a 2D range-finder sensor. The raw measurements include points returned from obstacle locations. For any sensor beam, the segment from the sensor position to the detected obstacle along that beam passes through the unoccupied region of the environment. To build training data points for the unoccupied part of the map, samples are drawn along this segment. Figure~\\ref{fig:setup} shows a conceptual illustration of the environment and the training points generation.\n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=.45\\columnwidth]{model}\n \\caption{Conceptual illustration of the robot, the environment, and observations. Training data consists of free and occupied points labeled $y_f=-1$ and $y_o=+1$, respectively. Free points are sampled along each beam, i.e.\\@ negative sensor information, while occupied points are directly observable.}\n \\label{fig:setup}\n\\end{figure}\n\nA sensor observation $\\boldsymbol z_t = (\\boldsymbol z_t^{[1]},...,\\boldsymbol z_t^{[n_z]})$ contains $n_z$ range measurements, each at a specific bearing, where $n_z$ depends on the density of the beams. 
The observation model for each $\\boldsymbol z_t^{[k]}$ can be written as\n\\begin{equation}\n \\boldsymbol z_t^{[k]} =\n \\begin{bmatrix}\n\tr_t^{[k]} \\\\\n\t\\alpha_t^{[k]}\n \\end{bmatrix} = h(\\boldsymbol x_t,\\boldsymbol x_o^{[k]})+\\boldsymbol v, \\quad \\boldsymbol v \\sim \\mathcal{N}(\\boldsymbol 0,\\boldsymbol R)\n\\end{equation}\n\\begin{equation}\n h(\\boldsymbol x_t,\\boldsymbol x_o^{[k]}) \\triangleq \n \\begin{bmatrix}\n\t\\sqrt{(\\boldsymbol x_o^{[k,1]} - \\boldsymbol x_t^{[1]})^2 + (\\boldsymbol x_o^{[k,2]} - \\boldsymbol x_t^{[2]})^2}\\\\\n\t\\mathrm{atan2}(\\boldsymbol x_o^{[k,2]} - \\boldsymbol x_t^{[2]},\\boldsymbol x_o^{[k,1]} - \\boldsymbol x_t^{[1]}) - \\boldsymbol x_t^{[3]}\n \\end{bmatrix}\n\\end{equation}\nwhere $r_t^{[k]}$ is the range measurement from the $k$-th sensor beam, $\\alpha_t^{[k]}$ is the corresponding angle of $r_t^{[k]}$, and $\\mathrm{atan2}(\\cdot,\\cdot)$ is the two-argument arctangent. \nThe observation model noise $\\boldsymbol v$ is assumed to be Gaussian with zero mean and covariance $\\boldsymbol R$. To find $\\boldsymbol x_o^{[k]}$, which is in the map space, the inverse model can be calculated as\n\\begin{equation}\n\\label{eq:occnt}\n \\boldsymbol x_o^{[k]} = \\boldsymbol x_t^{[1:2]} + r_t^{[k]} R(\\boldsymbol x_t^{[3]}) \n \\begin{bmatrix}\n\t\\cos(\\alpha_t^{[k]}) \\\\\n\t\\sin(\\alpha_t^{[k]})\n \\end{bmatrix}\n\\end{equation}\nwhere $R(\\boldsymbol x_t^{[3]}) \\in \\mathrm{SO(2)}$ indicates a $2\\times2$ rotation matrix.\n\nHaving defined the observed occupied points in the map space, we can now construct the training set of occupied points as \\mbox{$\\mathcal{D}_o = \\{(\\boldsymbol x_o^{[k]},y_o^{[k]}) \\mid k=1\\colon n_z\\}$}. \nOne simple way to build the free-area training points is to uniformly sample along the line segment, $l_z^{[k]}$, with the robot position and any occupied point $\\boldsymbol x_o^{[k]}$ as its end points. 
Therefore,\n\\begin{equation}\n\\label{eq:unoccnt}\n \\boldsymbol X_f^{[k,j]} = \\boldsymbol x_t^{[1:2]} + \\delta_j R(\\boldsymbol x_t^{[3]})\n \\begin{bmatrix}\n\t\\cos(\\alpha_t^{[k]}) \\\\\n\t\\sin(\\alpha_t^{[k]})\n \\end{bmatrix}\n\\end{equation}\nwhere \\mbox{$\\delta_j \\sim \\mathcal{U}(0, r_t^{[k]}) \\quad j = 1\\colon n_f^{[k]}$}, $\\mathcal{U}(0, r_t^{[k]})$ is a uniform distribution with the support $[0,r_t^{[k]}]$, and $n_f^{[k]}$ is the desired number of samples for the $k$-th sensor beam. $n_f^{[k]}$ can be a fixed value for all the beams or variable, e.g.\\@ a function of the line segment length $\\lVert l_z^{[k]}\\rVert=r_t^{[k]}$. In the case of a variable number of points for each beam, it is useful to set a minimum value $n_{fmin}$. Therefore we can write\n\\begin{equation}\n n_f^{[k]} \\triangleq \\max (\\{n_{fmin},s_l(r_t^{[k]})\\})\n\\end{equation}\nwhere $s_l(\\cdot)$ is a function that adaptively generates a number of sampled points based on the input distance. This minimum value controls the sparsity of the training set of unoccupied points. Alternatively, we can select a number of equidistant points instead of sampling. However, as the number of training points increases, the computational time grows cubically. We can construct the training set of unoccupied points as \\mbox{$\\mathcal{D}_f = \\bigcup_{k=1}^{n_z} \\mathcal{D}_f^{[k]}$} where \\mbox{$\\mathcal{D}_f^{[k]} = \\{(\\boldsymbol X_f^{[k]},\\boldsymbol y_f^{[k]})\\}$} and \\mbox{$\\boldsymbol y_f^{[k]} = \\mathrm{vec}(y_f^{[1]},...,y_f^{[n_f^{[k]}]})$}.\n\n\\begin{remark}\nGenerally speaking, query points can have any desired distribution, and the actual representation of the map depends on that distribution. However, building the map over a grid facilitates comparison with standard occupancy grid-based methods, i.e.\\@ at similar map resolutions. 
We use function \\texttt{TestDataWindow}, in Algorithms~\\ref{alg:GPOM} and \\ref{alg:GPOM2}, for generating a grid at a given position. The size of this grid can be set according to the maximum sensor range, the environment size, or available computational resources for data processing.\n\\end{remark}\n\n\\begin{remark}\nThroughout all algorithms, when we write $m$ for a map, it is assumed that the mean $\\boldsymbol\\mu$, the variance $\\boldsymbol\\sigma$, the occupancy probability $p(m)$, and the corresponding spatial coordinates are available even if they are not mentioned or used explicitly. For simplicity, when $m$ is used for computations such as in $\\log(p(m))$, we write $\\log(m)$.\n\\end{remark}\n\n\\subsection{Map management}\n\\label{subsec:management}\n\nAn important advantage of a mapping method is its capability to use past information appropriately. The mapping module returns local maps centered at the robot pose. Therefore, in order to keep track of the global map, a map management step is required where the local inferred map can be fused with the current global map. This incremental approach allows for handling larger map sizes, and map inference at the local level is independent of the global map.\n\nTo incorporate new information incrementally, map updates are performed using BCM. The technique combines estimators which were trained on different data sets. 
Assuming a Gaussian prior with zero mean and covariance $\\boldsymbol\\Sigma$ and each GP with mean $\\EV{f_*|\\mathcal{D}^{[i]}}$ and covariance $\\Cov{f_*|\\mathcal{D}^{[i]}}$, it follows that~\\citep{tresp2000bayesian}\n\\vspace{-0.1cm}\n\\begin{equation}\t\n\\label{eq:bcm}\t\n\n\t\\EV{f_*|\\mathcal{D}} = \\boldsymbol{C}^{-1} \\sum_{i=1}^{p_m}\\Cov{f_*|\\mathcal{D}^{[i]}}^{-1}\\EV{f_*|\\mathcal{D}^{[i]}}\n\\end{equation}\n\n\\begin{equation}\n\n\t\\label{eq:bcm2}\t\n\t \\boldsymbol{C} = \\Cov{f_*|\\mathcal{D}}^{-1} = -(p_m-1)(\\boldsymbol\\Sigma)^{-1} + \\sum_{i=1}^{p_m}\\Cov{f_*|\\mathcal{D}^{[i]}}^{-1}\n\\end{equation}\nwhere $p_m$ is the total number of mapping processes. In this work, we use BCM for combining a local and a previously existing global map, or merging two global maps; therefore $p_m = 2$. In addition, in the case of uninformative prior over map points the term $\\boldsymbol\\Sigma^{-1}$ can be set to zero, i.e.\\@ very large covariances/variances.\n\n\\begin{algorithm}[t!]\n\n\n\\caption[IGPOM]{\\texttt{IGPOM}()}\n\\label{alg:GPOM}\n\\begin{algorithmic}[1]\n\\Require Robot pose $\\boldsymbol p$ and measurements $\\boldsymbol z$;\n\n\n\\If{$\\mathrm{firstFrame}$}\n\\State $m \\gets \\varnothing$ $\\quad$ // Initialize the map\n\n\\State optimize GP hyperparameters $\\boldsymbol\\theta$ // Minimize the NLML, Equation~\\eqref{eq:nlml}\n\\EndIf\n\\State $\\boldsymbol X_* \\gets \\texttt{TestDataWindow}(\\boldsymbol p)$ // Query points grid centered at the robot pose\n\\State $\\boldsymbol X_o, \\boldsymbol y_o \\gets \\texttt{Transform2Global}(\\boldsymbol p, \\boldsymbol z)$ // Occupied training data, label $+1$, Equation~\\eqref{eq:occnt}\n\\State $\\boldsymbol X_f, \\boldsymbol y_f \\gets \\texttt{TrainingData}(\\boldsymbol p, \\boldsymbol z)$ // Unoccupied training data, label $-1$, Equation~\\eqref{eq:unoccnt}\n\\State $[\\boldsymbol\\mu_*, \\boldsymbol\\sigma_*] \\gets \\texttt{GP}(\\boldsymbol\\theta, [\\boldsymbol X_o; \\boldsymbol X_f], 
[\\boldsymbol y_o; \\boldsymbol y_f], \\boldsymbol X_*)$ // Compute predictive mean and variance, Equation~\\eqref{eq:gp_mean} and \\eqref{eq:gp_cov}\n\\State $m \\gets \\texttt{UpdateMap}(\\boldsymbol\\mu_*,\\boldsymbol\\sigma_*, m)$ // Algorithm~\\ref{alg:update}\n\\Return $m$\n\\end{algorithmic}\n\\end{algorithm}\n\n\\begin{algorithm}[t]\n\n\n\\caption[FusionBCM]{\\texttt{FusionBCM}($\\mu_a, \\mu_b, \\sigma_a, \\sigma_b$)}\n\\label{alg:bcm}\n\\begin{algorithmic}[1]\n\\State $\\sigma_c \\gets (\\sigma_a^{-1}+\\sigma_b^{-1})^{-1}$ // Point-wise calculation of Equation~\\eqref{eq:bcm2}\n\\State $\\mu_c \\gets \\sigma_c(\\sigma_a^{-1}\\mu_a + \\sigma_b^{-1}\\mu_b)$ // Point-wise calculation of Equation~\\eqref{eq:bcm}\n\\Return $\\mu_c, \\sigma_c$\n\\end{algorithmic}\n\\end{algorithm}\n\n\\begin{algorithm}[t]\n\n\n\\caption[UpdateMap]{\\texttt{UpdateMap}()}\n\\label{alg:update}\n\\begin{algorithmic}[1]\n\\Require Global map $m$, $\\boldsymbol\\mu$, $\\boldsymbol\\sigma$ and local map $m_*$, $\\boldsymbol\\mu_*$, $\\boldsymbol\\sigma_*$;\n\\For{all $i\\in\\mathcal{M}_*$}\n\\State $j \\gets$ find the corresponding global index of $i$ using the map spatial coordinates and a nearest neighbor search\n\\State $\\boldsymbol\\mu^{[j]}, \\boldsymbol\\sigma^{[j]} \\gets \\texttt{FusionBCM}(\\boldsymbol\\mu^{[j]}, \\boldsymbol\\mu_*^{[i]}, \\boldsymbol\\sigma^{[j]}, \\boldsymbol\\sigma_*^{[i]})$ // Algorithm~\\ref{alg:bcm}\n\\EndFor\n\\State $m \\gets \\texttt{LogisticRegression}(\\boldsymbol\\mu, \\boldsymbol\\sigma)$ // Squash data into (0,1)\n\\Return $m$\n\\end{algorithmic}\n\\end{algorithm}\n\nThe steps of the incremental GPOM (I-GPOM) are shown in Figure~\\ref{fig:mapper} and Algorithms~\\ref{alg:GPOM}, \\ref{alg:bcm}, and \\ref{alg:update}, where a BCM module updates the global map as new observations are taken. 
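The point-wise fusion in \texttt{FusionBCM} amounts to a precision-weighted average; the following minimal numpy sketch (our own illustration, with our own function and variable names) assumes $p_m = 2$ and an uninformative prior, so the $-(p_m-1)\boldsymbol\Sigma^{-1}$ term in Equation~\eqref{eq:bcm2} vanishes:

```python
import numpy as np

def fusion_bcm(mu_a, sigma_a, mu_b, sigma_b):
    """Point-wise BCM fusion of two map estimates (p_m = 2, uninformative
    prior, so the -(p_m - 1) * Sigma^{-1} term is dropped)."""
    prec_c = 1.0 / sigma_a + 1.0 / sigma_b               # fused precision, Eq. (bcm2)
    sigma_c = 1.0 / prec_c                               # fused variance
    mu_c = sigma_c * (mu_a / sigma_a + mu_b / sigma_b)   # fused mean, Eq. (bcm)
    return mu_c, sigma_c

# A confident estimate (variance 0.1) dominates an uncertain one (variance 0.9):
mu_c, sigma_c = fusion_bcm(np.array([1.0]), np.array([0.1]),
                           np.array([-1.0]), np.array([0.9]))
# mu_c = 0.8, sigma_c = 0.09: the fused mean is pulled toward the confident
# estimate, and the fused variance is smaller than either input variance.
```

With $p_m = 2$ and $\boldsymbol\Sigma^{-1} = 0$, this is exactly the update \texttt{FusionBCM} applies to each map point when merging a local map into the global one.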
Figure~\\ref{fig:bcmanalysis} presents a comparison of the incremental (\\mbox{I-GPOM}) and batch (GPOM) GP occupancy mapping methods using the Intel dataset~\\citep{Radish_data_set} with respect to the area under the receiver operating characteristic curve (AUC) and runtime. The AUC of a classifier can be interpreted as the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance; furthermore, the AUC is useful for domains with a skewed class distribution and unequal classification error costs~\\citep{fawcett2006introduction}. Due to the memory limitation imposed by the batch GP computations, each run uses a set of $25$ laser scans, where each scan contains about $180$ points, with the gap between successive laser scans growing from $1$ to $29$. The proposed incremental mapping approach using BCM performs accurately and close to the batch form, even with an intermission of about $8$ steps between successive observations, and is faster.\n\nOptimization of the hyper-parameters is performed once at the beginning of each experiment by minimization of the negative log of the marginal likelihood function. For the prevailing case of multiple runs in the same environment, the optimized values can then be loaded off-line.\n\n\\begin{figure}[t]\n \\centering \n \\subfloat{\\includegraphics[width=.5\\columnwidth,trim={0.5cm 0cm 2cm 0cm},clip]{bcmEffect}\n \\label{fig:bcmeffect}}\n \\subfloat{\\includegraphics[width=.5\\columnwidth,trim={0.5cm 0cm 2cm 0cm},clip]{bcmEffectTime}\n \\label{fig:bcmeffecttime}} \n \\caption{Comparison of the I-GPOM and batch GPOM methods using the Intel dataset, with an observation size of $25$ laser scans at each step due to the memory limitation of the batch GP computations. The left plot shows the AUC and the right plot depicts the runtime for each step. The horizontal axes indicate the observation gap. 
As the observation gap grows, the batch GP outperforms the incremental method, since it learns the correlations between all observations at once, albeit at a higher computational cost. On the other hand, the incremental method produces a similar average map quality (mean difference of $0.0078$) in nearly constant time per update.}\n \\label{fig:bcmanalysis}\n\\end{figure}\n\n\\subsection{I-GPOM2; an improved mapping strategy}\n\\label{challanges}\n\nInferring a high quality map compatible with the actual shape of the environment\ncan be non-trivial (see Figure~9 in \\citet{t2012gaussian} and Figure~3 in \\citet{kim2013continuous}). \nAlthough modeling the correlations of map points through regression \nmakes it possible to handle sparse measurements, training a single GP for both\noccupied and free areas has two major challenges:\n\\begin{itemize}\n \\item It limits the selection of an appropriate kernel that suits both occupied and unoccupied regions of the map, \n effectively resulting in poorly extrapolated obstacles or low quality free areas.\n \\item Most importantly, it leads to a mixed variance surface. In other words, it is not\n possible to disambiguate between boundaries of occupied-unknown and\n free-unknown space, unless the continuous map is thresholded (see Figure~6 in \\citet{t2012gaussian}).\n\\end{itemize}\n\nThe first problem is directly related to the inferred map quality, while the second is a challenge for exploration using continuous occupancy maps. The integral kernel approach \\citep{o2011continuous} can mitigate the first deficiency; however, the integration over GP kernels is computationally demanding and results in less tractable methods. In order to address these problems, we propose training two separate GPs, one for free areas and one for obstacles, and merging them to build a single continuous occupancy map (I-GPOM2). 
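To make the two-GP idea concrete, the following self-contained 1-D sketch (our own simplified illustration, not the authors' implementation: it uses a squared-exponential kernel rather than the Matérn, and made-up training points) regresses occupied and free observations separately, fuses the two predictive surfaces with BCM, and squashes the fused mean into an occupancy probability:

```python
import numpy as np

def sqexp(a, b, ell=0.5, sf=1.0):
    # Squared-exponential covariance between 1-D input vectors a and b.
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def gp_predict(X, y, Xs, sn=0.1):
    # Standard GP regression: predictive mean and variance at query points Xs.
    K = sqexp(X, X) + sn**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = sqexp(Xs, X)
    mu = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(sqexp(Xs, Xs)) - np.sum(v**2, axis=0)
    return mu, var

# Occupied training points (label +1) near x = 2; free points (label -1) near x = 0.
X_o, y_o = np.array([1.9, 2.0, 2.1]), np.ones(3)
X_f, y_f = np.array([0.0, 0.3, 0.6, 0.9]), -np.ones(4)
Xs = np.linspace(0.0, 2.2, 23)              # query grid

mu_o, var_o = gp_predict(X_o, y_o, Xs)      # occupied-map GP
mu_f, var_f = gp_predict(X_f, y_f, Xs)      # free-map GP

# Merge the two maps point-wise with BCM (uninformative prior), then squash.
var = 1.0 / (1.0 / var_o + 1.0 / var_f)
mu = var * (mu_o / var_o + mu_f / var_f)
p_occ = 1.0 / (1.0 + np.exp(-mu))           # occupancy probability in (0, 1)
```

Near $x = 2$ the fused probability rises above $0.5$ and near $x = 0$ it falls below $0.5$, while the two separate variance surfaces remain available to distinguish occupied-unknown from free-unknown boundaries.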
The complete results of occupancy mapping with the three different methods on the Intel dataset are presented in Figure~\\ref{fig:Intelmaps}, while the AUCs are compared in Table~\\ref{tab:aucroc}. The I-GPOM2 method demonstrates more flexibility in modeling the cluttered rooms and achieves higher performance than the other methods. The ground truth map was generated using the registered points map and an image dilation technique to remove outliers. In this way, the ground truth map has the same orientation, which makes the comparison convenient. GPOM-based maps infer partially observed regions; however, in the absence of a complete ground truth map, this fact can only be verified using Figure~\\ref{fig:Intelmaps} and is not reflected in the AUC of I-GPOM and I-GPOM2. Algorithms~\\ref{alg:GPOM2} and \\ref{alg:merge} encapsulate the I-GPOM2 method as implemented in the present work.\n\n\\begin{figure}[t!]\n \\centering \n \\subfloat{\\includegraphics[width=.3\\columnwidth]{ogmap}\\label{fig:ogmap}}~\n \\subfloat{\\includegraphics[width=.345\\columnwidth]{gpmap}\\label{fig:gpmap}}~\n \\subfloat{\\includegraphics[width=.345\\columnwidth]{gpmap2}\\label{fig:gpmap2}}\n \\caption{Occupancy map visualization; from left to right: OGM, I-GPOM, I-GPOM2. The maps are built incrementally using all observations available in the Intel dataset. For the I-GPOM and I-GPOM2 maps, the Mat\\'ern ($\\nu = 3/2$) covariance function is used. I-GPOM and I-GPOM2 can complete partially observable areas, i.e.\\@ incomplete areas in the OGM; however, using two GPs in the I-GPOM2 method produces more accurate maps for navigation purposes. 
The SLAM problem is solved by using the Pose SLAM algorithm and the map qualities depend on the robot localization accuracy.}\n \\label{fig:Intelmaps}\n\\end{figure}\n\n\\begin{table}[!t]\n\\footnotesize\n\\centering\n\\caption{Comparison of the AUC and runtime for OGM, I-GPOM, and I-GPOM2 using the Intel dataset.}\n\\begin{tabular}{lcc}\n\\toprule\nMethod\t\t\t& AUC\t\t& Runtime (min) \\\\ \\midrule\n\nOGM\t\t\t& 0.9300\t& 7.28 \t\\\\\nI-GPOM\t\t\t& 0.9439\t& 102.44 \t\\\\\nI-GPOM2\t\t\t& 0.9668\t& 114.53 \\\\ \\bottomrule\n\\end{tabular}\n\\label{tab:aucroc}\n\n\\end{table}\n\n\\subsection{Frontier map}\n\\label{subsec:Frontier_Maps}\nConstructing a frontier map is the fundamental ingredient of any geometry-based exploration approach. It reveals the boundaries between known-free and unknown areas which are potentially informative regions for map expansion. In contrast to the classical binary representation, defining frontiers in a probabilistic form using map uncertainty is more suitable for computing expected behaviors. The boundaries that correspond to frontiers can be computed using the following heuristic formula.\n\\begin{equation}\n\\label{eq:frontier}\n\t\\bar{f}^{[i]} \\triangleq \\lVert\\nabla p(m^{[i]})\\rVert_1 - \\beta(\\lVert\\nabla p(m_o^{[i]})\\rVert_1 + p(m_o^{[i]}) - 0.5)\n\\end{equation}\nwhere $\\nabla$ denotes the gradient operator, and $\\beta$ is a factor that controls the effect of obstacle boundaries. $\\lVert\\nabla p(m^{[i]})\\rVert_1$ indicates all boundaries while $\\lVert\\nabla p(m_o^{[i]})\\rVert_1$ defines obstacle outlines. 
The subtracted constant is to remove the biased probability for unknown areas in the obstacles probability map.\n\n\n\\begin{algorithm}[t]\n\n\n\\caption[IGPOM2]{\\texttt{IGPOM2}()}\n\\label{alg:GPOM2}\n\\begin{algorithmic}[1]\n\\Require Robot pose $\\boldsymbol p$ and measurements $\\boldsymbol z$;\n\n\n\\If{$\\mathrm{firstFrame}$}\n\\State $m, m_o, m_f \\gets \\varnothing$ $\\quad$ // Initialize the map\n\n\\State optimize GP hyperparameters $\\boldsymbol\\theta_o$, $\\boldsymbol\\theta_f$ // Minimize the NLML, Equation~\\eqref{eq:nlml}\n\n\n\n\\EndIf\n\\State $\\boldsymbol X_* \\gets \\texttt{TestDataWindow}(\\boldsymbol p)$ // Query points grid centered at the robot pose\n\\State $\\boldsymbol X_o, \\boldsymbol y_o \\gets \\texttt{Transform2Global}(\\boldsymbol p, \\boldsymbol z)$ // Occupied training data, label $+1$, Equation~\\eqref{eq:occnt}\n\\State $\\boldsymbol X_f, \\boldsymbol y_f \\gets \\texttt{TrainingData}(\\boldsymbol p, \\boldsymbol z)$ // Unoccupied training data, label $-1$, Equation~\\eqref{eq:unoccnt}\n\\State $[\\boldsymbol\\mu_{o*}, \\boldsymbol\\sigma_{o*}] \\gets \\texttt{GP}(\\boldsymbol\\theta_o, \\boldsymbol X_o, \\boldsymbol y_o, \\boldsymbol X_*)$ // Compute occupied map predictive mean and variance, Equation~\\eqref{eq:gp_mean} and \\eqref{eq:gp_cov}\n\\State $[\\boldsymbol\\mu_{f*}, \\boldsymbol\\sigma_{f*}] \\gets \\texttt{GP}(\\boldsymbol\\theta_f, \\boldsymbol X_f, \\boldsymbol y_f, \\boldsymbol X_*)$ // Compute unoccupied map predictive mean and variance using \\eqref{eq:gp_mean} and \\eqref{eq:gp_cov}\n\\State $m_o \\gets \\texttt{UpdateMap}(\\boldsymbol\\mu_{o*}, \\boldsymbol\\sigma_{o*}, m_o)$ // Algorithm~\\ref{alg:update}\n\\State $m_f \\gets \\texttt{UpdateMap}(\\boldsymbol\\mu_{f*}, \\boldsymbol\\sigma_{f*}, m_f)$ \n\\State $m \\gets \\texttt{MergeMap}(m_o, m_f)$ // Algorithm~\\ref{alg:merge}\n\n\n\\Return $m, 
m_o$\n\\end{algorithmic}\n\\end{algorithm}\n\n\\begin{algorithm}[t!]\n\n\\caption[MergeMap]{\\texttt{MergeMap}()}\n\\label{alg:merge}\n\\begin{algorithmic}[1]\n\\Require Unoccupied map $m_f$, $\\boldsymbol\\mu_f$, $\\boldsymbol\\sigma_f$ and occupied map $m_o$, $\\boldsymbol\\mu_o$, $\\boldsymbol\\sigma_o$; \n\\For{all $i\\in\\mathcal{M}$}\n\\State $\\boldsymbol\\mu^{[i]}, \\boldsymbol\\sigma^{[i]} \\gets \\texttt{FusionBCM}(\\boldsymbol\\mu_{o}^{[i]}, \\boldsymbol\\mu_{f}^{[i]}, \\boldsymbol\\sigma_{o}^{[i]}, \\boldsymbol\\sigma_{f}^{[i]})$ // Algorithm~\\ref{alg:bcm}\n\\EndFor\n\\State $m \\gets \\texttt{LogisticRegression}(\\boldsymbol\\mu, \\boldsymbol\\sigma)$ // Squash data into (0,1)\n\\Return $m$\n\\end{algorithmic}\n\\end{algorithm}\n\nThe frontier surface is converted to a probability frontier map through the incorporation of the map uncertainty. To squash the frontier and variance values into the range $[0, 1]$, a logistic regression classifier with inputs from $\\bar{f}^{[i]}$ and map uncertainty $\\sigma^{[i]}$ is applied to data which yields\n\\begin{equation}\n\\label{eq:logisticf}\n\t\\vspace{-0.1cm}\n\tp(f^{[i]}|m^{[i]}, w_f^{[i]}) = \\frac{1}{1+\\exp(-w_f^{[i]} \\bar{f}^{[i]})}\n\\end{equation}\nwhere $w_f^{[i]} = \\gamma_f \\sqrt{\\lambda^{[i]}}$ denotes the required weights, $\\lambda^{[i]} \\triangleq \\sigma_{min} / {\\sigma^{[i]}}$ is the bounded information associated with location $i$, and $\\gamma_f > 0$ is a constant to control the sigmoid shape. The details of the frontier map computations are presented in Algorithm~\\ref{alg:frontier}. Figure~\\ref{fig:ex_maps} (middle) depicts an instance of the frontier map from an exploration experiment in the Cave environment~\\citep{Radish_data_set}.\n\nIn practice, the following steps are required to use the frontier map and check the termination condition:\n \\begin{enumerate}\n\\item The probabilistic frontier map is converted to a binary map using a pre-defined threshold. 
Note that any point with a probability higher than $0.5$ is potentially a valid frontier.\n\\item The binary map of frontiers is clustered into subsets of candidate macro-actions.\n\\item The cluster centroids form a discrete action set at time-step $t$, i.e.\\@ $\\mathcal{A}_t$, that is used in the utility maximization step.\n\\item The robot plans a path to each centroid (macro-action) to check its reachability. A centroid that is not reachable is removed from the action set.\n\\item The exploration mission continues, repeating from step 1, as long as the action set $\\mathcal{A}_t$ is not empty.\n\\end{enumerate}\n\n\\begin{figure}[!t]\n \\centering \n \\subfloat{\\includegraphics[width=.32\\columnwidth]{ex_com}\n \\label{fig:ex_com}}~\n \\subfloat{\\includegraphics[width=.32\\columnwidth]{ex_frontier}\n \\label{fig:frontier}}~\n \\subfloat{\\includegraphics[width=.32\\columnwidth]{ex_mi}\n \\label{fig:ex_MIsurf}}\n \\caption{Inferred continuous occupancy map (left); associated probabilistic frontier map (middle); and mutual information surface (right, discussed in Section~\\ref{sec:miexp}). The frontier map highlights the informative regions for further exploration by assigning higher probabilities to frontier points. The lower probabilities show the obstacles and walls, while values greater than the \\emph{no discrimination} probability, $0.5$, can be considered as frontiers. In the MI surface, the areas beyond the current perception field of the robot preserve their initial entropy values, and the higher values indicate regions with greater information gain. 
The map dimensions are in meters and the MI values in nats.}\n \\label{fig:ex_maps}\n\\end{figure}\n\n\\begin{algorithm}[t]\n\n\n\\caption[BuildFrontierMap]{\\texttt{BuildFrontierMap}()}\n\\label{alg:frontier}\n\\begin{algorithmic}[1]\n\\Require Current map $m$, $\\boldsymbol\\sigma$ and occupied map $m_o$, $\\boldsymbol\\sigma_o$;\n\\State // Compute boundaries\n\\State $dm \\gets$ $\\lVert\\nabla p(m)\\rVert_1$, $dm_o \\gets$ $\\lVert\\nabla p(m_o)\\rVert_1$\n\\State $\\sigma_{min} \\gets \\min(\\boldsymbol\\sigma)$\n\\State $f \\gets \\varnothing$\n\\State // Compute probabilistic frontiers \n\\For{all $i \\in \\mathcal{M}$}\n\\State $\\bar{f}^{[i]} \\gets{dm}^{[i]} - \\beta({dm}_o^{[i]} + m_o^{[i]} - 0.5)$\n\\State $w_f^{[i]} \\gets \\gamma_f\\ \\mathrm{sqrt}(\\sigma_{min} / \\boldsymbol\\sigma^{[i]})$ // Logistic regression weights\n\\State $f^{[i]} \\gets (1 + \\exp(-w_f^{[i]} \\bar{f}^{[i]}))^{-1}$ // Squash data into (0,1), Equation~\\eqref{eq:logisticf}\n\\EndFor\n\\Return $f$\n\\end{algorithmic}\n\\end{algorithm}\n\n\\subsection{Computational complexity}\n\\label{subsec:timecomplex}\n\nFor the mapping algorithms, the computational cost of GPs is $\\bigO{n^3_t}$, given the need to invert a matrix of the size of training data, $n_t = n_o + n_f$. BCM scales linearly with the number of map points, $n_m$. The overall map update operation involves a nearest neighbor query for each test point, $n_q$, and the logistic regression classifier is at worst linear in the number of map points resulting in $\\bigO{n_{t}^3 + n_{q} \\log n_{q} + n_m}$.\n\nA more sophisticated approximation approach can reduce the computational complexity further. The fully independent training conditional (FITC)~\\citep{snelson2006sparse} based on inducing conditionals suggests an $\\bigO{n_t n_i^2}$ upper bound where $n_i$ is the number of inducing points. 
More recently, in~\\cite{hensman2013gaussian}, the GP computation upper bound is reduced to $\\bigO{n_i^3}$, which brings more flexibility in increasing the number of inducing points.\n\n", "Descriptive_question1": "What is the AUC value for I-GPOM2 in table_1?", "Descriptive_question2": "Which method has the shortest runtime in table_1?", "Reasoning_question1": "Why might I-GPOM2 have a higher AUC compared to OGM and I-GPOM in table_1?", "Reasoning_question2": "What could be the reasons for the significant difference in runtime between OGM and the other methods in table_1?", "Descriptive_answer1": "0.9668", "Descriptive_answer2": "OGM", "Reasoning_answer1": "I-GPOM2 likely has a higher AUC compared to OGM and I-GPOM because it uses two separate Gaussian Processes (GPs) to model occupied and unoccupied areas, allowing for more tailored kernel selection and better handling of distinct spatial characteristics. This dual approach, as described in the context, addresses limitations in I-GPOM where a single GP struggles to model both regions effectively, resulting in improved map quality and classification accuracy, as reflected by the AUC of 0.9668 versus 0.9439 for I-GPOM and 0.9300 for OGM. 
The flexibility to model cluttered environments, as noted in the document, further enhances its performance in capturing the true structure of the Intel dataset.", "Reasoning_answer2": "The significant difference in runtime between OGM (7.28 minutes) and the other methods, I-GPOM (102.44 minutes) and I-GPOM2 (114.53 minutes), can be attributed to the computational complexity of Gaussian Process (GP) computations used in the latter two methods. OGM likely relies on simpler grid-based mapping techniques that do not involve intensive matrix operations, whereas I-GPOM and I-GPOM2 require cubic time complexity operations (O(n^3)) for GP training due to matrix inversions, as outlined in the computational complexity section. Additionally, I-GPOM2's use of two separate GPs increases the computational burden further compared to I-GPOM, contributing to its longer runtime. The incremental map updates and Bayesian Committee Machine (BCM) techniques, while efficient for large maps, still add overhead compared to the straightforward OGM approach." 
}, { "paper_id": "1605.00335.json", "table_id": "table_2", "table_content": "\\begin{table}[t]\n\\footnotesize\n\\centering\n\\caption{The compared exploration methods and their corresponding attributes.}\n\n\\begin{tabular}{lcccc}\n\\toprule\n\t\t& NF\t\t& OGMI \t\t& GPNF\t \t& GPMI \t\\\\ \\midrule\n\nSLAM\t\t& Pose SLAM\t& Pose SLAM\t& Pose SLAM\t& Pose SLAM\t\\\\\nMapping\t\t& OGM\t\t& OGM\t\t& I-GPOM2\t& I-GPOM2\t\\\\\nFrontiers \t& binary\t& binary\t& probabilistic\t& probabilistic\t\\\\\nUtility \t& path length\t& MI+path length & path length\t& MI+path length \\\\\nPlanner\t\t& $A^*$\t\t& $A^*$\t\t& $A^*$\t\t& $A^*$\t\t\\\\ \\bottomrule\n\\end{tabular}\n\\label{tab:expmethods}\n\\end{table}", "caption": "The compared exploration methods and their corresponding attributes.", "label": "tab:expmethods", "section_info": "5 Results and Discussion\n\\section{Results and Discussion}\n\\label{sec:Results}\nWe now present results using two publicly available datasets~\\citep{Radish_data_set}. In the first scenario, we use the Intel research lab. map which is a highly structured indoor environment. The second scenario is based on the University of Freiburg campus area. The second map is almost ten times larger than the Intel map and is an example of a large-scale environment with open areas.\n\nThe experiments include comparison among the original nearest frontier (NF)~\\citep{yamauchi1997frontier}, MI-based exploration using OGM (OGMI), the natural extension of NF with a GPOM representation (GPNF) \\citep{maani2014com}, and the proposed MI-based (GPMI) exploration approaches. NF and OGMI results are computed using OGMs while for the GPOM-based methods the I-GPOM2 representation and the probabilistic frontier map proposed in this work are employed. For all the techniques, we use the $A^*$ algorithm to find the shortest path from the robot position to any frontier. The path cost is calculated using the Euclidean distance between map points. 
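A minimal version of this planning step — $A^*$ over an occupancy grid with Euclidean step costs — could look as follows (our own sketch; the 8-connectivity, tie-breaking, and grid representation are illustrative assumptions, not details taken from the text):

```python
import heapq
import math

def astar(grid, start, goal):
    """A* shortest path on a 2D occupancy grid (0 = free, 1 = occupied),
    8-connected, with Euclidean step costs and an admissible Euclidean
    heuristic. Returns the path length, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    open_set = [(h(start), 0.0, start)]          # (f, g, cell)
    best_g = {start: 0.0}
    while open_set:
        _, g, cur = heapq.heappop(open_set)
        if cur == goal:
            return g
        if g > best_g.get(cur, math.inf):
            continue                              # stale queue entry
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di == 0 and dj == 0:
                    continue
                nxt = (cur[0] + di, cur[1] + dj)
                if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                    continue
                if grid[nxt[0]][nxt[1]]:
                    continue                      # occupied cell
                ng = g + math.hypot(di, dj)       # Euclidean step cost
                if ng < best_g.get(nxt, math.inf):
                    best_g[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt))
    return None  # unreachable goal

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
cost = astar(grid, (0, 0), (2, 0))  # detours around the wall: 2 + 2*sqrt(2)
```

A frontier for which such a planner returns \texttt{None} is unreachable and can be discarded from the candidate set before the utility maximization step.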
Details about the compared methods are described in Table~\\ref{tab:expmethods}.\n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=.4\\columnwidth,trim={1.5cm 1.5cm 1.5cm 1.5cm},clip]{Intel_obsmap}\n \\caption{The constructed environment for exploration experiments using the binary map of obstacles from the Intel dataset.}\n \\label{fig:intel_obsmap}\n \n\\end{figure}\n\n\\begin{table}[t]\n\\footnotesize\n\\centering\n\\caption{The compared exploration methods and their corresponding attributes.}\n\n\\begin{tabular}{lcccc}\n\\toprule\n\t\t& NF\t\t& OGMI \t\t& GPNF\t \t& GPMI \t\\\\ \\midrule\n\nSLAM\t\t& Pose SLAM\t& Pose SLAM\t& Pose SLAM\t& Pose SLAM\t\\\\\nMapping\t\t& OGM\t\t& OGM\t\t& I-GPOM2\t& I-GPOM2\t\\\\\nFrontiers \t& binary\t& binary\t& probabilistic\t& probabilistic\t\\\\\nUtility \t& path length\t& MI+path length & path length\t& MI+path length \\\\\nPlanner\t\t& $A^*$\t\t& $A^*$\t\t& $A^*$\t\t& $A^*$\t\t\\\\ \\bottomrule\n\\end{tabular}\n\\label{tab:expmethods}\n\\end{table}\n\n\\begin{table}[t]\n\\scriptsize\n\\centering\n\\caption{Parameters for frontier and MI maps computations. 
Note that the employed maximum sensor range and the maximum range used in the MI algorithm for prediction do not need to be the same.}\n\\begin{tabular}{lll}\n\\toprule\nParameter\t\t\t& Symbol\t\t\t\t & Value \\\\ \\midrule\n\\multicolumn{3}{l}{$1)$ Beam-based mixture measurement model:} \\\\\n\nHit std\t\t\t\t& $\\sigma_{hit}$\t& 0.03 $\\m$\t\\\\\nShort decay\t\t\t& $\\lambda_{short}$\t& 0.2 $\\m$\t \t\\\\\nMax range and size of \\texttt{TestDataWindow}\t\t& $r_{max}$\t\t& \t\\\\\n$-$ Intel map\t\t\t& \t\t\t& 14.0 $\\m$\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 60.0 $\\m$\t\\\\\nHit weight\t\t\t& $z_{hit}$\t\t& 0.7\t\t\\\\\nShort weight\t\t\t& $z_{short}$\t\t& 0.1\t\t\\\\ \nMax weight\t\t\t& $z_{max}$\t\t& 0.1\t \t\\\\\nRandom weight\t\t\t& $z_{rand}$\t\t& 0.1\t\t\\\\ \\midrule\n\\multicolumn{3}{l}{$2)$ Frontier map:} \\\\\nOccupied boundaries factor\t& $\\beta$\t\t& 3.0\t\t\\\\\nLogistic regression weight\t& $\\gamma$\t\t& 10.0\t\t\\\\ \nFrontier probability threshold & $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 0.6\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 0.55\t\\\\ \nFrontier cluster size \t\t& $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 14\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 3\t\\\\ \nNumber of clusters\t\t& $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 20\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 5\t\t\\\\ \\midrule\n\\multicolumn{3}{l}{$3)$ MI map and utility function:} \\\\\nNo. 
of sensor beams over 360 deg \t\t& $n_z$\t\t& 133\t\t\\\\\nMax range\t\t\t& $r_{max}$\t& \t\\\\\n$-$ Intel map\t\t\t& \t\t& 4.0 $\\m$\t\\\\\n$-$ Freiburg map\t\t& \t\t& 60.0 $\\m$\t\\\\\nNumerical integration resolution \t\t& $s_z$\t\t& \t\t\\\\\n$-$ Intel map\t\t\t& \t\t& 10/3 $\\m^{-1}$\t\\\\\n$-$ Freiburg map\t\t& \t\t& 1 $\\m^{-1}$\t\\\\\nInformation gain factor \t\t\t& $\\alpha$\t& \t\t\\\\\n$-$ Intel map\t\t\t& \t\t& 0.1\t\\\\\n$-$ Freiburg map\t\t& \t\t& 0.5\t\\\\\nOccupied probability threshold \t\t\t& $p_{o}$\t& 0.65\t\t\\\\\nUnoccupied probability threshold\t\t& $p_{f}$\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 0.35\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 0.4\t\\\\\t\\bottomrule\n\\end{tabular}\n\\label{tab:param}\n\\end{table}\n\n\\begin{figure}[t!]\n \\centering \n \\subfloat[]{\n \\includegraphics[width=0.345\\columnwidth]{gpmi_gpom}\n \\label{fig:intel_gpom}\n }\n \\subfloat[]{\n \\includegraphics[width=0.3\\columnwidth]{gpmi_ogm}\n \\label{fig:intel_ogm}\n }\n \\subfloat[]{\n \\includegraphics[width=0.345\\columnwidth]{gpmi_ent}\n \\label{fig:intel_ent}\n }\n \\caption{MI-based exploration in the Intel map derived from the Intel dataset. (a) I-GPOM2, (b) the equivalent OGM computed at the end of the experiment (c) corresponding entropy map of the GPOM (nats). The sparse observations due to the occluded perception field in a complex environment such as the Intel map signifies the capabilities of OGM and GPOM methods to cope with such limitations. Map dimensions are in meters, and the maps are built with the resolution $0.135\\m$.}\n \\label{fig:intelResults2}\n\\end{figure}\n\n\\begin{figure}\n \\centering\n \\includegraphics[width=.4\\columnwidth]{gpmi_poseslam}\n \\caption{Pose SLAM map of the MI-based exploration in the Intel map derived from the Intel dataset. Dotted (red) curves are the robot path and connecting lines (green) indicate loop-closures. Map dimensions are in meters. 
The starting robot position is at (18,26), horizontally and vertically, respectively, and the robot terminates the exploration mission in the bottom-right-most room.}\n \label{fig:intel_slam}\n\end{figure}\n\n\begin{figure}[th]\n \centering \n \includegraphics[width=.6\columnwidth,trim={1.cm 1.cm 1.cm 1.cm},clip]{intel_boxplot}\n \caption{The box plots show a comparison of different exploration strategies in the Intel dataset from $10$ independent runs. The compared criteria are travel distance ($\m$), time (min), map entropy rate (nats/step), the mapping performance using the area under the receiver operating characteristic curve, localization root mean-squared error ($\m$), and the number of closed loops by Pose SLAM.}\n \label{fig:intel_boxplot}\n\end{figure}\n\n\subsection{Experimental setup}\n\label{subsec:setup}\nThe environment is constructed using a binary map of obstacles and, for the Intel map, is shown in Figure~\ref{fig:intel_obsmap}. The simulated robot is equipped with odometric and laser range-finder sensors to provide the required sensory inputs for Pose SLAM. The odometric and laser range-finder sensor noise covariances are set to \mbox{$\boldsymbol \Sigma_u = \diag(0.1\m, 0.1\m, 0.0026\rad)^2$} and $\boldsymbol \Sigma_y = \diag(0.03\m, 0.03\m, 0.0013\rad)^2$, respectively. The motion of the robot is modeled using a velocity motion model~\citep[Chapter 5]{thrun2005probabilistic} and a proportional control law for following a planned trajectory. Laser beams are simulated through a ray-casting operation over the ground truth map using the true robot pose. In all the presented results, Pose SLAM \citep{ila2010information} is included as the backbone to provide localization data together with the number of closed loops. Additionally, for each map, Pose SLAM parameters are set and fixed regardless of the exploration method. 
\n\nThe localization Root Mean-Squared Error (RMSE) is computed at the end of each experiment by the difference in the robot traveled path (estimated and ground truth poses) to highlight the effect of each exploration approach on the localization accuracy. The required parameters for the beam-based mixture measurement model~\citep{thrun2005probabilistic}, frontier maps, and MI maps computations are listed in Table~\ref{tab:param}. The sensitivity of the parameters in Table~\ref{tab:param} is not high and slight variations of them ($\sim 10\%$) do not affect the presented results.\n\nThe implementation has been developed in MATLAB and GP computations have been implemented by modifying the open source GP library in \citet{rasmussen2006gaussian}. As described in Section~\ref{subsec:regeneration}, during exploration, map drifts occur due to loop-closure in the SLAM process. As it is computationally expensive to process all measurements from scratch at each iteration, a mechanism has been adopted to address the problem. The cumulative relative entropy, obtained by summing the computed JSD, can detect such map drifts.\n\nEach technique is evaluated based on six different criteria, namely, travel distance, mapping and planning time, Map Entropy Rate (MER), AUC of the GP occupancy map calculated at the end of each experiment using all available observations, localization RMSE, and the Number of Closed Loops (NCL). The map entropy at any time-step can be computed using~\eqref{mapEnt}. The map entropy calculation can become independent of the map resolution following the idea in~\citet{stachniss2005information}; that is, the cell area, i.e., the square of the map resolution, weights each entropy term. To see the performance of decision-making across an entire experiment, the MER is then computed at the end of each experiment using the difference between final and initial map entropies divided by the number of exploration steps. 
Note that none of the compared exploration strategies explicitly plans for loop-closing actions. For each dataset, the results are from $10$ independent runs using the same setup and parameters.\n\n\begin{figure}[!t]\n \centering \n \subfloat{\includegraphics[width=.335\columnwidth, trim={0cm 0cm 0cm 0.5cm},clip]{frcampus}\n \label{fig:frcampus_sat}} \n \subfloat{\includegraphics[width=.335\columnwidth]{frcampus_10cm_s}\n \label{fig:frcampus_og}} \n \subfloat{\includegraphics[width=.32\columnwidth, trim={1.25cm 1.75cm 2.25cm 1.5cm},clip]{frcampus_grid}\n \label{fig:frcampus_grid}}\n \caption{The left picture shows the satellite map of the Freiburg University Campus where the yellow dashed line indicates the robot trajectory. The middle figure shows the corresponding occupancy map of the dataset~\citep{Radish_data_set}. The right figure shows the corresponding binary map of obstacles used for exploration experiments. Map dimensions are in meters.}\n \label{fig:frcampus}\n \n\end{figure}\n\n\subsection{Exploration results in the Intel map}\nAn example of the exploration results using GPMI is shown in Figures~\ref{fig:intelResults2} and~\ref{fig:intel_slam}. \nThe statistical summary of the results is depicted in Figure~\ref{fig:intel_boxplot}.\nThe most significant part of the results is related to the map entropy rate, in which a negative value means the map entropy has been reduced at each step. In the nearest frontier techniques there is no prediction step regarding map entropy reduction; therefore, the results are purely based on chance and the structural shape of the environment. OGMI shows marginal improvements over NF with roughly similar computational times for the exploration mission. Thus, it is the preferred technique in comparison with NF.\n\nGPNF and GPMI exploit I-GPOM2 for mapping, exploration, and planning. 
GP-based methods handle sparse sensor measurements by learning the structural dependencies (spatial correlations) present in the environment. The significant improvement in the map entropy rate is due to this fact. The results from GPMI show a higher travel distance and a higher number of closed loops, which can be understood from the fact that the information gain in the utility function drives the robot to possibly farther but more informative targets. As this behavior does not show any undesirable effect on the localization accuracy, it can be concluded that it performs better than the other techniques, albeit with a higher computational time. The information gain calculation could be sped up by using CSQMI due to its similar behavior to MI~\citep{charrow2015information}. Under the GPMI scheme, the robot chooses macro-actions that balance the cost of traveling and the MI between the map and future measurements. Although the utility function does not include the localization uncertainty explicitly, the correlation between robot poses and the map helps to improve the localization accuracy.\n\n\begin{figure}[t]\n \centering \n \includegraphics[width=.6\columnwidth,trim={1cm 1cm 1cm 1cm},clip]{frcamp_boxplot}\n \caption{The box plots show a comparison of different exploration strategies in the Freiburg campus dataset from $10$ independent runs. The compared criteria are travel distance ($\m$), time (min), map entropy rate (nats/step), the mapping performance using the area under the receiver operating characteristic curve, localization root mean-squared error ($\m$), and the number of closed loops by Pose SLAM.}\n \label{fig:fr_boxplot}\n\end{figure}\n\n\subsection{Outdoor scenario: Freiburg Campus}\nIn the second scenario, the map is an outdoor area with a larger size (almost ten times). Figure~\ref{fig:frcampus} shows the satellite map of the area as well as the trajectory along which the robot was driven for data collection. 
Similar to the first experiment, a binary map of the dataset is constructed and used for exploration experiments. The statistical summary of the results is shown in Figure~\ref{fig:fr_boxplot}. To keep the computational time manageable, the occupancy maps are built with a coarse resolution of $1 \m$.\n\nOverall, the trend is similar to the previous test, and specifically, the map entropy rate plot shows a significant difference between GPMI and the other techniques. Again, this significant map entropy rate improvement has been achieved without any undesirable effects on the localization accuracy. The sharpness of the localization error distribution can be seen as the reliability and repeatability characteristic of GPMI. Since this map has large open areas relative to the robot's sensing range, it is highly unlikely that the robot closes loops by chance. For GPMI, the number of closed loops has a higher median, which supports the idea of implicit loop-closing actions due to the correlations between the map and the robot pose. However, the NCL distribution has wider tails, which does not support its repeatability. The exploration times in this environment are less than those of the previous experiment in the Intel map. We associate the faster map exploration results with the combination of the difference in map resolutions and the open shape of the Freiburg campus map. In contrast, the Intel map is highly structured with narrow hallways and small rooms, which require a finer map resolution, leading to a higher number of query points. 
Furthermore, in the Intel map, unlike the Freiburg campus map, a larger maximum range does not help the robot to explore the map faster due to the occlusion problem.\n\n\begin{figure}[t]\n \centering \n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_nf}}~\n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_ogmi}}\\\n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_gpnf}}~\n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_gpmi}}\n \caption{Illustrative examples of exploration in the Freiburg Campus map. The top left and right, and the bottom left and right figures show the results for NF, OGMI, GPNF, and GPMI, respectively.}\n \label{fig:frcampus_ex}\n\end{figure}\n\nFigure~\ref{fig:frcampus_ex} shows the results from an exploration run in the Freiburg campus map using NF, OGMI, GPNF, and GPMI. The robot behavior is distinguishable in all four maps. In the NF case, the robot tends to travel to every corner in the map to complete the partially observable parts of the map. This behavior leads to trajectories along the boundaries of the map. In OGMI, the prediction of the information gain reduces this effect. However, the OGM requires a higher number of measurements to cover an area; therefore, the robot still needs to travel to the corners. In the GPNF case, this effect has been alleviated since the continuous mapping algorithm can deal with sparse measurements. 
However, in the GPMI case, the robot behaves completely differently: by taking the expectation over future measurements (calculating MI), the robot does not act to minimize the current map uncertainty, but to improve the future map state in expectation.\n\n\n", "Descriptive_question1": "What mapping technique is used by NF in table_2?", "Descriptive_question2": "Which exploration method uses probabilistic frontiers in table_2?", "Reasoning_question1": "How does the utility function differ between NF and GPMI in table_2, and what might be the impact on exploration strategy?", "Reasoning_question2": "Why might GPMI and GPNF show different performance outcomes compared to NF and OGMI in table_2, considering their mapping and frontier approaches?", "Descriptive_answer1": "OGM", "Descriptive_answer2": "GPNF", "Reasoning_answer1": "In table_2, NF uses a utility function based solely on 'path length,' which prioritizes the shortest distance to the frontier, likely leading to a more straightforward, distance-minimizing exploration strategy that may overlook areas with higher information potential. On the other hand, GPMI employs a utility function that combines 'MI+path length,' integrating mutual information (MI) with path length. This approach balances the cost of travel with the potential information gain, encouraging the robot to explore further if the information reward is significant. The impact on exploration strategy is substantial: while NF might result in shorter, less informative paths, GPMI is likely to drive the robot towards more informative targets, potentially increasing travel distance but enhancing map accuracy and entropy reduction, as supported by results discussed in the Intel and Freiburg scenarios.", "Reasoning_answer2": "GPMI and GPNF are likely to show different performance outcomes compared to NF and OGMI due to their distinct mapping and frontier approaches in table_2. 
Both GPMI and GPNF utilize 'I-GPOM2' for mapping, a Gaussian Process-based method that handles sparse sensor data by learning spatial correlations, enabling better prediction and representation of the environment compared to the 'OGM' (Occupancy Grid Map) used by NF and OGMI, which requires more measurements to achieve similar coverage. Additionally, GPMI and GPNF employ 'probabilistic' frontiers, allowing for a more nuanced uncertainty estimation in identifying exploration targets, unlike the 'binary' frontiers of NF and OGMI, which oversimplify frontier detection. This combination likely results in more efficient exploration for GPMI and GPNF, as they can prioritize areas with higher uncertainty or information potential, leading to improved map entropy reduction and localization accuracy, as evidenced by experimental results in structured and large-scale environments." }, { "paper_id": "1605.00335.json", "table_id": "table_3", "table_content": "\\begin{table}[t]\n\\scriptsize\n\\centering\n\\caption{Parameters for frontier and MI maps computations. 
Note that the employed maximum sensor range and the maximum range used in the MI algorithm for prediction do not need to be the same.}\n\\begin{tabular}{lll}\n\\toprule\nParameter\t\t\t& Symbol\t\t\t\t & Value \\\\ \\midrule\n\\multicolumn{3}{l}{$1)$ Beam-based mixture measurement model:} \\\\\n\nHit std\t\t\t\t& $\\sigma_{hit}$\t& 0.03 $\\m$\t\\\\\nShort decay\t\t\t& $\\lambda_{short}$\t& 0.2 $\\m$\t \t\\\\\nMax range and size of \\texttt{TestDataWindow}\t\t& $r_{max}$\t\t& \t\\\\\n$-$ Intel map\t\t\t& \t\t\t& 14.0 $\\m$\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 60.0 $\\m$\t\\\\\nHit weight\t\t\t& $z_{hit}$\t\t& 0.7\t\t\\\\\nShort weight\t\t\t& $z_{short}$\t\t& 0.1\t\t\\\\ \nMax weight\t\t\t& $z_{max}$\t\t& 0.1\t \t\\\\\nRandom weight\t\t\t& $z_{rand}$\t\t& 0.1\t\t\\\\ \\midrule\n\\multicolumn{3}{l}{$2)$ Frontier map:} \\\\\nOccupied boundaries factor\t& $\\beta$\t\t& 3.0\t\t\\\\\nLogistic regression weight\t& $\\gamma$\t\t& 10.0\t\t\\\\ \nFrontier probability threshold & $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 0.6\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 0.55\t\\\\ \nFrontier cluster size \t\t& $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 14\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 3\t\\\\ \nNumber of clusters\t\t& $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 20\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 5\t\t\\\\ \\midrule\n\\multicolumn{3}{l}{$3)$ MI map and utility function:} \\\\\nNo. 
of sensor beams over 360 deg \t\t& $n_z$\t\t& 133\t\t\\\\\nMax range\t\t\t& $r_{max}$\t& \t\\\\\n$-$ Intel map\t\t\t& \t\t& 4.0 $\\m$\t\\\\\n$-$ Freiburg map\t\t& \t\t& 60.0 $\\m$\t\\\\\nNumerical integration resolution \t\t& $s_z$\t\t& \t\t\\\\\n$-$ Intel map\t\t\t& \t\t& 10/3 $\\m^{-1}$\t\\\\\n$-$ Freiburg map\t\t& \t\t& 1 $\\m^{-1}$\t\\\\\nInformation gain factor \t\t\t& $\\alpha$\t& \t\t\\\\\n$-$ Intel map\t\t\t& \t\t& 0.1\t\\\\\n$-$ Freiburg map\t\t& \t\t& 0.5\t\\\\\nOccupied probability threshold \t\t\t& $p_{o}$\t& 0.65\t\t\\\\\nUnoccupied probability threshold\t\t& $p_{f}$\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 0.35\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 0.4\t\\\\\t\\bottomrule\n\\end{tabular}\n\\label{tab:param}\n\\end{table}", "caption": "Parameters for frontier and MI maps computations. Note that the employed maximum sensor range and the maximum range used in the MI algorithm for prediction do not need to be the same.", "label": "tab:param", "section_info": "5 Results and Discussion\n\\section{Results and Discussion}\n\\label{sec:Results}\nWe now present results using two publicly available datasets~\\citep{Radish_data_set}. In the first scenario, we use the Intel research lab. map which is a highly structured indoor environment. The second scenario is based on the University of Freiburg campus area. The second map is almost ten times larger than the Intel map and is an example of a large-scale environment with open areas.\n\nThe experiments include comparison among the original nearest frontier (NF)~\\citep{yamauchi1997frontier}, MI-based exploration using OGM (OGMI), the natural extension of NF with a GPOM representation (GPNF) \\citep{maani2014com}, and the proposed MI-based (GPMI) exploration approaches. NF and OGMI results are computed using OGMs while for the GPOM-based methods the I-GPOM2 representation and the probabilistic frontier map proposed in this work are employed. 
For all the techniques, we use the $A^*$ algorithm to find the shortest path from the robot position to any frontier. The path cost is calculated using the Euclidean distance between map points. Details about the compared methods are described in Table~\\ref{tab:expmethods}.\n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=.4\\columnwidth,trim={1.5cm 1.5cm 1.5cm 1.5cm},clip]{Intel_obsmap}\n \\caption{The constructed environment for exploration experiments using the binary map of obstacles from the Intel dataset.}\n \\label{fig:intel_obsmap}\n \n\\end{figure}\n\n\\begin{table}[t]\n\\footnotesize\n\\centering\n\\caption{The compared exploration methods and their corresponding attributes.}\n\n\\begin{tabular}{lcccc}\n\\toprule\n\t\t& NF\t\t& OGMI \t\t& GPNF\t \t& GPMI \t\\\\ \\midrule\n\nSLAM\t\t& Pose SLAM\t& Pose SLAM\t& Pose SLAM\t& Pose SLAM\t\\\\\nMapping\t\t& OGM\t\t& OGM\t\t& I-GPOM2\t& I-GPOM2\t\\\\\nFrontiers \t& binary\t& binary\t& probabilistic\t& probabilistic\t\\\\\nUtility \t& path length\t& MI+path length & path length\t& MI+path length \\\\\nPlanner\t\t& $A^*$\t\t& $A^*$\t\t& $A^*$\t\t& $A^*$\t\t\\\\ \\bottomrule\n\\end{tabular}\n\\label{tab:expmethods}\n\\end{table}\n\n\\begin{table}[t]\n\\scriptsize\n\\centering\n\\caption{Parameters for frontier and MI maps computations. 
Note that the employed maximum sensor range and the maximum range used in the MI algorithm for prediction do not need to be the same.}\n\\begin{tabular}{lll}\n\\toprule\nParameter\t\t\t& Symbol\t\t\t\t & Value \\\\ \\midrule\n\\multicolumn{3}{l}{$1)$ Beam-based mixture measurement model:} \\\\\n\nHit std\t\t\t\t& $\\sigma_{hit}$\t& 0.03 $\\m$\t\\\\\nShort decay\t\t\t& $\\lambda_{short}$\t& 0.2 $\\m$\t \t\\\\\nMax range and size of \\texttt{TestDataWindow}\t\t& $r_{max}$\t\t& \t\\\\\n$-$ Intel map\t\t\t& \t\t\t& 14.0 $\\m$\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 60.0 $\\m$\t\\\\\nHit weight\t\t\t& $z_{hit}$\t\t& 0.7\t\t\\\\\nShort weight\t\t\t& $z_{short}$\t\t& 0.1\t\t\\\\ \nMax weight\t\t\t& $z_{max}$\t\t& 0.1\t \t\\\\\nRandom weight\t\t\t& $z_{rand}$\t\t& 0.1\t\t\\\\ \\midrule\n\\multicolumn{3}{l}{$2)$ Frontier map:} \\\\\nOccupied boundaries factor\t& $\\beta$\t\t& 3.0\t\t\\\\\nLogistic regression weight\t& $\\gamma$\t\t& 10.0\t\t\\\\ \nFrontier probability threshold & $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 0.6\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 0.55\t\\\\ \nFrontier cluster size \t\t& $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 14\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 3\t\\\\ \nNumber of clusters\t\t& $-$\t\t\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 20\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 5\t\t\\\\ \\midrule\n\\multicolumn{3}{l}{$3)$ MI map and utility function:} \\\\\nNo. 
of sensor beams over 360 deg \t\t& $n_z$\t\t& 133\t\t\\\\\nMax range\t\t\t& $r_{max}$\t& \t\\\\\n$-$ Intel map\t\t\t& \t\t& 4.0 $\\m$\t\\\\\n$-$ Freiburg map\t\t& \t\t& 60.0 $\\m$\t\\\\\nNumerical integration resolution \t\t& $s_z$\t\t& \t\t\\\\\n$-$ Intel map\t\t\t& \t\t& 10/3 $\\m^{-1}$\t\\\\\n$-$ Freiburg map\t\t& \t\t& 1 $\\m^{-1}$\t\\\\\nInformation gain factor \t\t\t& $\\alpha$\t& \t\t\\\\\n$-$ Intel map\t\t\t& \t\t& 0.1\t\\\\\n$-$ Freiburg map\t\t& \t\t& 0.5\t\\\\\nOccupied probability threshold \t\t\t& $p_{o}$\t& 0.65\t\t\\\\\nUnoccupied probability threshold\t\t& $p_{f}$\t& \t\t\\\\ \n$-$ Intel map\t\t\t& \t\t\t& 0.35\t\\\\\n$-$ Freiburg map\t\t& \t\t\t& 0.4\t\\\\\t\\bottomrule\n\\end{tabular}\n\\label{tab:param}\n\\end{table}\n\n\\begin{figure}[t!]\n \\centering \n \\subfloat[]{\n \\includegraphics[width=0.345\\columnwidth]{gpmi_gpom}\n \\label{fig:intel_gpom}\n }\n \\subfloat[]{\n \\includegraphics[width=0.3\\columnwidth]{gpmi_ogm}\n \\label{fig:intel_ogm}\n }\n \\subfloat[]{\n \\includegraphics[width=0.345\\columnwidth]{gpmi_ent}\n \\label{fig:intel_ent}\n }\n \\caption{MI-based exploration in the Intel map derived from the Intel dataset. (a) I-GPOM2, (b) the equivalent OGM computed at the end of the experiment (c) corresponding entropy map of the GPOM (nats). The sparse observations due to the occluded perception field in a complex environment such as the Intel map signifies the capabilities of OGM and GPOM methods to cope with such limitations. Map dimensions are in meters, and the maps are built with the resolution $0.135\\m$.}\n \\label{fig:intelResults2}\n\\end{figure}\n\n\\begin{figure}\n \\centering\n \\includegraphics[width=.4\\columnwidth]{gpmi_poseslam}\n \\caption{Pose SLAM map of the MI-based exploration in the Intel map derived from the Intel dataset. Dotted (red) curves are the robot path and connecting lines (green) indicate loop-closures. Map dimensions are in meters. 
The starting robot position is at (18,26), horizontally and vertically, respectively, and the robot terminates the exploration mission in the bottom-right-most room.}\n \label{fig:intel_slam}\n\end{figure}\n\n\begin{figure}[th]\n \centering \n \includegraphics[width=.6\columnwidth,trim={1.cm 1.cm 1.cm 1.cm},clip]{intel_boxplot}\n \caption{The box plots show a comparison of different exploration strategies in the Intel dataset from $10$ independent runs. The compared criteria are travel distance ($\m$), time (min), map entropy rate (nats/step), the mapping performance using the area under the receiver operating characteristic curve, localization root mean-squared error ($\m$), and the number of closed loops by Pose SLAM.}\n \label{fig:intel_boxplot}\n\end{figure}\n\n\subsection{Experimental setup}\n\label{subsec:setup}\nThe environment is constructed using a binary map of obstacles and, for the Intel map, is shown in Figure~\ref{fig:intel_obsmap}. The simulated robot is equipped with odometric and laser range-finder sensors to provide the required sensory inputs for Pose SLAM. The odometric and laser range-finder sensor noise covariances are set to \mbox{$\boldsymbol \Sigma_u = \diag(0.1\m, 0.1\m, 0.0026\rad)^2$} and $\boldsymbol \Sigma_y = \diag(0.03\m, 0.03\m, 0.0013\rad)^2$, respectively. The motion of the robot is modeled using a velocity motion model~\citep[Chapter 5]{thrun2005probabilistic} and a proportional control law for following a planned trajectory. Laser beams are simulated through a ray-casting operation over the ground truth map using the true robot pose. In all the presented results, Pose SLAM \citep{ila2010information} is included as the backbone to provide localization data together with the number of closed loops. Additionally, for each map, Pose SLAM parameters are set and fixed regardless of the exploration method. 
\n\nThe localization Root Mean-Squared Error (RMSE) is computed at the end of each experiment by the difference in the robot traveled path (estimated and ground truth poses) to highlight the effect of each exploration approach on the localization accuracy. The required parameters for the beam-based mixture measurement model~\citep{thrun2005probabilistic}, frontier maps, and MI maps computations are listed in Table~\ref{tab:param}. The sensitivity of the parameters in Table~\ref{tab:param} is not high and slight variations of them ($\sim 10\%$) do not affect the presented results.\n\nThe implementation has been developed in MATLAB and GP computations have been implemented by modifying the open source GP library in \citet{rasmussen2006gaussian}. As described in Section~\ref{subsec:regeneration}, during exploration, map drifts occur due to loop-closure in the SLAM process. As it is computationally expensive to process all measurements from scratch at each iteration, a mechanism has been adopted to address the problem. The cumulative relative entropy, obtained by summing the computed JSD, can detect such map drifts.\n\nEach technique is evaluated based on six different criteria, namely, travel distance, mapping and planning time, Map Entropy Rate (MER), AUC of the GP occupancy map calculated at the end of each experiment using all available observations, localization RMSE, and the Number of Closed Loops (NCL). The map entropy at any time-step can be computed using~\eqref{mapEnt}. The map entropy calculation can become independent of the map resolution following the idea in~\citet{stachniss2005information}; that is, the cell area, i.e., the square of the map resolution, weights each entropy term. To see the performance of decision-making across an entire experiment, the MER is then computed at the end of each experiment using the difference between final and initial map entropies divided by the number of exploration steps. 
Note that none of the compared exploration strategies explicitly plans for loop-closing actions. For each dataset, the results are from $10$ independent runs using the same setup and parameters.\n\n\begin{figure}[!t]\n \centering \n \subfloat{\includegraphics[width=.335\columnwidth, trim={0cm 0cm 0cm 0.5cm},clip]{frcampus}\n \label{fig:frcampus_sat}} \n \subfloat{\includegraphics[width=.335\columnwidth]{frcampus_10cm_s}\n \label{fig:frcampus_og}} \n \subfloat{\includegraphics[width=.32\columnwidth, trim={1.25cm 1.75cm 2.25cm 1.5cm},clip]{frcampus_grid}\n \label{fig:frcampus_grid}}\n \caption{The left picture shows the satellite map of the Freiburg University Campus where the yellow dashed line indicates the robot trajectory. The middle figure shows the corresponding occupancy map of the dataset~\citep{Radish_data_set}. The right figure shows the corresponding binary map of obstacles used for exploration experiments. Map dimensions are in meters.}\n \label{fig:frcampus}\n \n\end{figure}\n\n\subsection{Exploration results in the Intel map}\nAn example of the exploration results using GPMI is shown in Figures~\ref{fig:intelResults2} and~\ref{fig:intel_slam}. \nThe statistical summary of the results is depicted in Figure~\ref{fig:intel_boxplot}.\nThe most significant part of the results is related to the map entropy rate, in which a negative value means the map entropy has been reduced at each step. In the nearest frontier techniques there is no prediction step regarding map entropy reduction; therefore, the results are purely based on chance and the structural shape of the environment. OGMI shows marginal improvements over NF with roughly similar computational times for the exploration mission. Thus, it is the preferred technique in comparison with NF.\n\nGPNF and GPMI exploit I-GPOM2 for mapping, exploration, and planning. 
GP-based methods handle sparse sensor measurements by learning the structural dependencies (spatial correlations) present in the environment. The significant improvement in the map entropy rate is due to this fact. The results from GPMI show a higher travel distance and a higher number of closed loops, which can be understood from the fact that the information gain in the utility function drives the robot to possibly farther but more informative targets. As this behavior does not show any undesirable effect on the localization accuracy, it can be concluded that it performs better than the other techniques, albeit with a higher computational time. The information gain calculation could be sped up by using CSQMI due to its similar behavior to MI~\citep{charrow2015information}. Under the GPMI scheme, the robot chooses macro-actions that balance the cost of traveling and the MI between the map and future measurements. Although the utility function does not include the localization uncertainty explicitly, the correlation between robot poses and the map helps to improve the localization accuracy.\n\n\begin{figure}[t]\n \centering \n \includegraphics[width=.6\columnwidth,trim={1cm 1cm 1cm 1cm},clip]{frcamp_boxplot}\n \caption{The box plots show a comparison of different exploration strategies in the Freiburg campus dataset from $10$ independent runs. The compared criteria are travel distance ($\m$), time (min), map entropy rate (nats/step), the mapping performance using the area under the receiver operating characteristic curve, localization root mean-squared error ($\m$), and the number of closed loops by Pose SLAM.}\n \label{fig:fr_boxplot}\n\end{figure}\n\n\subsection{Outdoor scenario: Freiburg Campus}\nIn the second scenario, the map is an outdoor area with a larger size (almost ten times). Figure~\ref{fig:frcampus} shows the satellite map of the area as well as the trajectory along which the robot was driven for data collection. 
Similar to the first experiment, a binary map of the dataset is constructed and used for exploration experiments. The statistical summary of the results is shown in Figure~\ref{fig:fr_boxplot}. To keep the computational time manageable, the occupancy maps are built with a coarse resolution of $1 \m$.\n\nOverall, the trend is similar to the previous test; in particular, the map entropy rate plot shows a significant difference between GPMI and the other techniques. Again, this significant map entropy rate improvement has been achieved without any undesirable effects on the localization accuracy. The sharpness of the localization error distribution can be seen as the reliability and repeatability characteristic of GPMI. Since this map has large open areas relative to the robot's sensing range, it is highly unlikely that the robot closes loops by chance. For GPMI, the number of closed loops has a higher median, which supports the idea of implicit loop-closing actions due to the correlations between the map and the robot pose. However, the NCL distribution has wider tails, which undermines its repeatability. The exploration times in this environment are shorter than those of the previous experiment in the Intel map. We attribute the faster map exploration to the combination of the difference in map resolutions and the open shape of the Freiburg campus map. In contrast, the Intel map is highly structured with narrow hallways and small rooms, which require a finer map resolution and therefore a higher number of query points. 
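The map entropy rate used in these comparisons can be made concrete with a short sketch. This is an illustrative reimplementation with synthetic occupancy values, not the authors' MATLAB code; the helper names `map_entropy` and `map_entropy_rate` are ours. Per-cell Bernoulli entropies (in nats) are weighted by the cell area so that maps of different resolutions are comparable, and the MER is the entropy change per exploration step:

```python
import numpy as np

def map_entropy(p, resolution):
    """Resolution-independent map entropy: each cell's Bernoulli entropy
    (in nats) is weighted by the cell area, i.e. the squared resolution."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    h = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return float(resolution**2 * h.sum())

def map_entropy_rate(p_start, p_end, n_steps, resolution):
    """MER: difference of final and initial map entropies per step;
    negative values mean the map entropy was reduced."""
    return (map_entropy(p_end, resolution)
            - map_entropy(p_start, resolution)) / n_steps

# Toy run: a fully unknown 10 x 10 map, half of it observed after 20 steps.
p0 = np.full((10, 10), 0.5)   # unknown cells carry maximal entropy
p1 = p0.copy()
p1[:5, :] = 0.05              # cells observed as free
print(map_entropy_rate(p0, p1, n_steps=20, resolution=1.0))  # negative
```

A larger resolution simply scales every entropy term by the cell area, which is what makes the $1 \m$ Freiburg grid and a finer grid comparable on the same axis.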
Furthermore, in the Intel map, unlike the Freiburg campus map, a larger maximum range does not help the robot to explore the map faster due to the occlusion problem.\n\n\begin{figure}[t]\n \centering \n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_nf}}~\n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_ogmi}}\\\n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_gpnf}}~\n \subfloat{\includegraphics[width=.4\columnwidth]{frcamp_gpmi}}\n \caption{Illustrative examples of exploration in the Freiburg Campus map. The top left and right, and the bottom left and right figures show the results for NF, OGMI, GPNF, and GPMI, respectively.}\n \label{fig:frcampus_ex}\n\end{figure}\n\nFigure~\ref{fig:frcampus_ex} shows the results from an exploration run in the Freiburg campus map using NF, OGMI, GPNF, and GPMI. The robot behavior is distinguishable in all four maps. In the NF case, the robot tends to travel to every corner in the map to complete the partially observed parts of the map. This behavior leads to trajectories along the boundaries of the map. In OGMI, the prediction of the information gain reduces this effect. However, the OGM requires a higher number of measurements to cover an area; therefore, the robot still needs to travel to the corners. In the GPNF case, this effect is alleviated since the continuous mapping algorithm can deal with sparse measurements. In the GPMI case, however, the robot behaves completely differently: by taking the expectation over future measurements (calculating MI), it does not act to minimize the current map uncertainty, but to improve the future map state in expectation.\n\n\n\subsection{Experimental setup}\n\label{subsec:setup}\nThe environment is constructed using a binary map of obstacles and, for the Intel map, is shown in Figure~\ref{fig:intel_obsmap}. 
The simulated robot is equipped with odometric and laser range-finder sensors to provide the required sensory inputs for Pose SLAM. The odometric and laser range-finder sensor noise covariances are set to \mbox{$\boldsymbol \Sigma_u = \diag(0.1\m, 0.1\m, 0.0026\rad)^2$} and $\boldsymbol \Sigma_y = \diag(0.03\m, 0.03\m, 0.0013\rad)^2$, respectively. The motion of the robot is modeled using a velocity motion model~\citep[Chapter 5]{thrun2005probabilistic} and a proportional control law for following a planned trajectory. Laser beams are simulated through a ray-casting operation over the ground truth map using the true robot pose. In all the presented results, Pose SLAM \citep{ila2010information} is included as the backbone to provide localization data together with the number of closed loops. Additionally, for each map, the Pose SLAM parameters are set and fixed regardless of the exploration method. \n\nThe localization Root Mean-Squared Error (RMSE) is computed at the end of each experiment from the difference between the estimated and ground-truth poses along the traveled path, to highlight the effect of each exploration approach on the localization accuracy. The required parameters for the beam-based mixture measurement model~\citep{thrun2005probabilistic}, frontier map, and MI map computations are listed in Table~\ref{tab:param}. The sensitivity of the parameters in Table~\ref{tab:param} is not high, and slight variations of them ($\sim 10\%$) do not affect the presented results.\n\nThe implementation has been developed in MATLAB, and the GP computations were implemented by modifying the open-source GP library of \citet{rasmussen2006gaussian}. As described in Section~\ref{subsec:regeneration}, during exploration, map drifts occur due to loop closure in the SLAM process. As it is computationally expensive to process all measurements from scratch at each iteration, a mechanism has been adopted to address the problem. 
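The localization RMSE described above reduces to a few lines. The sketch below is a hypothetical illustration assuming the estimated and ground-truth paths are stored as $(N, 2)$ arrays of planar positions; it is not the paper's Pose SLAM output:

```python
import numpy as np

def localization_rmse(estimated, ground_truth):
    """RMSE between estimated and ground-truth positions along the
    traveled path, summarizing a strategy's effect on localization."""
    err = np.asarray(estimated) - np.asarray(ground_truth)
    return float(np.sqrt(np.mean(np.sum(err**2, axis=1))))

# Toy paths: an estimate offset 0.3 m sideways from a straight ground truth.
truth = np.stack([np.linspace(0.0, 9.0, 10), np.zeros(10)], axis=1)
est = truth + np.array([0.0, 0.3])
print(localization_rmse(est, truth))  # ~0.3
```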
The cumulative relative entropy, obtained by summing the computed JSD, can detect such map drifts.\n\nEach technique is evaluated based on six different criteria, namely, travel distance, mapping and planning time, Map Entropy Rate (MER), the AUC of the GP occupancy map calculated at the end of each experiment using all available observations, localization RMSE, and the Number of Closed Loops (NCL). The map entropy at any time-step can be computed using~\eqref{mapEnt}. The map entropy calculation can be made independent of the map resolution following the idea in~\citet{stachniss2005information}; that is, the cell area, i.e. the square of the map resolution, weights each entropy term. To see the performance of decision-making across an entire experiment, the MER is then computed at the end of each experiment as the difference between the final and initial map entropies divided by the number of exploration steps. Note that none of the compared exploration strategies explicitly plans for loop-closing actions. For each dataset, the results are from $10$ independent runs using the same setup and parameters.\n\n", "Descriptive_question1": "What is the value of the hit standard deviation in table_3?", "Descriptive_question2": "What is the maximum sensor range for the Intel map in table_3?", "Reasoning_question1": "Why might the maximum sensor range differ between the Intel map and the Freiburg map in table_3?", "Reasoning_question2": "How could the difference in frontier probability thresholds between the Intel map and the Freiburg map in table_3 affect exploration strategies?", "Descriptive_answer1": "0.03 m", "Descriptive_answer2": "14.0 m", "Reasoning_answer1": "The difference in maximum sensor range between the Intel map (14.0 m) and the Freiburg map (60.0 m) likely stems from the distinct characteristics of the environments they represent. The Intel map, representing a highly structured indoor environment with narrow hallways and small rooms, may require a shorter sensor range due to frequent occlusions and limited line-of-sight, which restricts how far a sensor can effectively detect. On the other hand, the Freiburg map, representing a large-scale outdoor campus area with open spaces, benefits from a longer sensor range as there are fewer obstacles to block the sensor's field of view, allowing for detection over greater distances. Additionally, the difference in map resolutions and the scale of the environments (Freiburg being almost ten times larger than Intel) could necessitate varied sensor ranges to balance computational efficiency and exploration effectiveness in each context.", "Reasoning_answer2": "The frontier probability threshold for the Intel map is 0.6, while for the Freiburg map, it is slightly lower at 0.55. This difference could influence exploration strategies significantly. 
A higher threshold, as seen in the Intel map, implies a stricter criterion for identifying frontiers, meaning the robot might be more selective and focus only on areas with a higher certainty of being unexplored, potentially leading to slower but more precise exploration in a structured indoor environment with many obstacles. Conversely, the lower threshold for the Freiburg map suggests a more lenient approach to identifying frontiers, which could encourage the robot to explore a broader range of potential unexplored areas more quickly, fitting for a larger, open outdoor environment where covering expansive areas might be prioritized over precision. This variation in thresholds could thus affect how aggressively or conservatively the robot explores, impacting travel distance and the number of exploration steps taken in each map." }, { "paper_id": "1704.01161.json", "table_id": "table_1", "table_content": "\\begin{table*}[t]\n\n\t\\begin{center}\n\t\t\\begin{tabular}{c | c | c | c }\n\t\t\t\\hline\n\t\t\tStepsize & Discretization Error & Martingale Noise Impact & TD(0) Behavior\\\\\n\t\t\t\\hline\n\t\t\t& & & \\\\[-2ex]\n\t\t\tLarge & Large & Large & Possibly diverging \\\\[1ex]\n\t\t\tModerate & $O(n_0)$ & $O(n_0)$ w.h.p.& Stay in $O(n_0)$ ball w.h.p. \\\\[1ex]\n\t\t\tSmall & $\\epsilon/3$ & $\\epsilon/3$ w.h.p. & Converging w.h.p.\\\\[1ex]\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\caption{\\label{tab:AnalysisOutline}Chronological Summary of Analysis Outline}\n\\end{table*}", "caption": "\\label{tab:AnalysisOutline}Chronological Summary of Analysis Outline", "label": "tab:AnalysisOutline", "section_info": "5 Proof of Theorem~\\ref{thm: convergence rate}\n\\section{Proof of Theorem~\\ref{thm: convergence rate}}\n\nIn this section we prove Theorem~\\ref{thm: convergence rate}. Throughout this section we assume \\ref{assum:bounded_feat}. All proofs for intermediate lemmas are given in Appendix~\\ref{sec: main thm appendix}. 
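Before the outline, the iteration being analyzed can be made tangible on a toy instance. The sketch below is ours, with a synthetic positive definite $A$, an arbitrary $b$, and Gaussian noise standing in for the martingale differences; it is not the paper's TD(0) setting. With stepsizes $\alpha_n = 1/(n+1)$, the iterates settle near the fixed point $\theta^* = A^{-1}b$ after an initial noisy phase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic positive definite A and arbitrary b; the ODE fixed point is A^{-1} b.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A, b)

theta = np.array([10.0, -10.0])  # start far away: early iterates may wander
for n in range(100_000):
    alpha = 1.0 / (n + 1)              # decaying stepsizes
    M = rng.normal(scale=0.5, size=2)  # stand-in for martingale-difference noise
    # SA update: theta_{n+1} = theta_n + alpha_n (b - A theta_n + M_{n+1})
    theta = theta + alpha * (b - A @ theta + M)

print(np.linalg.norm(theta - theta_star))  # small; typically well below 0.1
```

The three regimes summarized in the analysis outline are visible here: early large stepsizes let the noise push the iterates around, moderate stepsizes keep them in a bounded ball, and small stepsizes let them converge.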
\n\n\\subsection{Outline of Approach} \\label{sec:outline}\n\n\n\\begin{table*}[t]\n\n\t\\begin{center}\n\t\t\\begin{tabular}{c | c | c | c }\n\t\t\t\\hline\n\t\t\tStepsize & Discretization Error & Martingale Noise Impact & TD(0) Behavior\\\\\n\t\t\t\\hline\n\t\t\t& & & \\\\[-2ex]\n\t\t\tLarge & Large & Large & Possibly diverging \\\\[1ex]\n\t\t\tModerate & $O(n_0)$ & $O(n_0)$ w.h.p.& Stay in $O(n_0)$ ball w.h.p. \\\\[1ex]\n\t\t\tSmall & $\\epsilon/3$ & $\\epsilon/3$ w.h.p. & Converging w.h.p.\\\\[1ex]\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t\\end{center}\n\t\\caption{\\label{tab:AnalysisOutline}Chronological Summary of Analysis Outline}\n\\end{table*}\nThe limiting ODE for \\eqref{eq:SA_traj} is\n\n\\begin{equation}\n\\label{eq:limiting_ODE}\n\\dot{\\theta}(t) = h(\\theta(t)) = b - A\\theta(t) = -A(\\theta(t) - \\thS) \\enspace.\n\\end{equation}\nLet $\\theta(t, s, u_0),$ $t \\geq s,$ denote the solution to the above ODE starting at $u_0$ at time $t = s.$ When the starting point and time are unimportant, we will denote this solution by $\\theta(t)$ .\n\nAs the solutions of the ODE are continuous functions of time, we also define a linear interpolation $\\{\\bart(t)\\}$ of $\\{\\theta_n\\}.$ Let $t_0 = 0.$ For $n \\geq 0,$ let $\\tI{n + 1} = \\tI{n} + \\alpha_n$ and let\n\\begin{equation}\n\\label{eqn:LinInt}\n\\bart(\\tau) \\!=\\!\n\\begin{cases}\n\\theta_n & \\! \\! \\text{ if } \\tau = \\tI{n} \\enspace,\\\\\n\\theta_n + \\frac{\\tau - \\tI{n}}{\\alpha_n}[\\theta_{n + 1} - \\theta_n] & \\! \\! \\text{ if } \\tau \\in (\\tI{n}, \\tI{n + 1}) \\enspace.\n\\end{cases}\n\\end{equation}\n\n\n\n\nOur tool for comparing $\\bart(t)$ to $\\theta(t)$ is the \\emph{Variation of Parameters} (VoP) method \\cite{lakshmikantham1998method}.\nInitially, $\\bart(t)$ could stray away from $\\thS$ when the stepsizes may not be small enough to tame the noise. However, we show that $\\|\\bart(\\tI{n}) - \\thS\\| = O(n),$ i.e., $\\theta_n$ does not stray away from $\\thS$ too fast. 
Later, we show that we can fix some $n_0$ so that, first, the TD(0) iterates for $n \geq n_0$ stay within an $O(n_0)$ distance from $\thS.$ Then, after some additional time, once the stepsizes have decayed enough, the TD(0) iterates start behaving almost like a noiseless version. These three different behaviours are summarized in Table~\ref{tab:AnalysisOutline} and illustrated in Figure~\ref{fig:trajectory}.\n\n\begin{figure*}\n\t\begin{center}\n\n\n\t\includegraphics[scale=0.25]{trajectory}\n\t\end{center}\n\t\caption{Visualization of the proof outline. The three balls (from large to small) are respectively the $2\Ro(n_0)$ ball, $\Ro(n_0)$ ball, and $\ei$ ball, where $\Ro(n_0)$ is from Lemma~\ref{lem:WorstCaseThetaBd}.\n\t\tThe blue curve is the initial, possibly diverging phase of $\bart(t)$. The green curve is $\bart(t)$ when the stepsizes are moderate in size ($t_{n_0} \leq t \leq t_{\nMid}$ in the analysis). Similarly, the red curve is $\bart(t)$ when the stepsizes are sufficiently small ($ t > t_{\nMid}$). 
The dotted curves are the associated ODE trajectories $\\theta(t,t_n,\\theta_n)$.}\n\t\\label{fig:trajectory}\n\\end{figure*}\n\n\n\n\n\\subsection{Preliminaries}\n\nWe establish some preliminary results here that will be used throughout this section.\nLet $s \\in \\Real,$ and $u_0 \\in \\dReal.$ Using results from Chapter 6, \\cite{hirsch2012differential}, it follows that the solution $\\theta(t, s, u_0),$ $t \\geq s,$ of \\eqref{eq:limiting_ODE} satisfies the relation\n\n\\begin{equation} \\label{eq:ODE_traj}\n\\theta(t, s, u_0) = \\thS + e^{-A(t - s)} (u_0 - \\thS) \\enspace.\n\\end{equation}\n\nAs the matrix $A$ is positive definite, for $\\theta(t) \\equiv \\theta(t, s, u_0),$\n\n\\[\n\\frac{d}{dt}\\|\\theta(t) - \\thS\\|^2 = -2(\\theta(t) - \\thS)^\\top A (\\theta(t) - \\thS)<0 \\enspace.\n\\]\n\nHence\n\n\\begin{equation}\n\\label{eq: norm of thetan - thetastar mon dec}\n\\|\\theta(t', s ,u_0) - \\thS\\| \\leq \\|\\theta(t, s,u_0) - \\thS\\|\n\\enspace,\n\\end{equation}\nfor all $t' \\geq t \\geq s$ and $u_0.$\n\nLet $\\lambda$ be as in Theorem~\\ref{thm: convergence rate}. From Corollary 3.6, p71, \\cite{teschl2012ordinary}, $\\exists \\Kl \\geq 1$ so that $\\forall t \\geq s$\n\n\\begin{equation}\n\\label{eq:expMatBd}\n\\|e^{-A(t - s)}\\| \\leq \\Kl e^{-\\lambda (t - s)} \\enspace.\n\\end{equation}\n\nSeparately, as $t_{n+1}-t_{k+1} = \\sum_{\\ell = k + 1}^{n} \\alpha_\\ell = \\sum_{\\ell=k+1}^n \\tfrac{1}{\\ell+1},$\n\n\\begin{equation}\n\\label{eq:bounding the exp of tk}\n\\frac{(k + 1)^\\lambda}{(n + 1)^{\\lambda}} \\leq e^{-\\lambda(\\tI{n + 1} - \\tI{k + 1})} \\leq \\frac{(k + 2)^\\lambda}{(n + 2)^\\lambda} \\enspace.\n\\end{equation}\n\n\nThe following result is a consequence of \\ref{assum:bounded_feat} that gives a bound directly on the martingale difference noise as a function of the iterates. We emphasize that this strong behavior of TD(0) is significant in our work. 
We are also not aware of other works that utilize it, even though \ref{assum:bounded_feat} or equivalents are often assumed and accepted. \n\n\begin{lemma}[Martingale Noise Behavior] \label{lem:martingale_bound_TD0}\n\n\n\tFor all $n \geq 0,$\n\t\[\n\t\|M_{n + 1}\|\leq \Km[1 + \|\theta_n - \thS\|] \enspace ,\n\t\]\n\n\twhere\n\n\t\[\n\t\Km := \frac{1}{4}\max \left\{2 + [1 + \gamma] \|A^{-1}\| \|b\| , 1 + \gamma + 4 \| A \| \right\} \enspace.\n\t\]\n\end{lemma}\n\n\begin{remark} \label{rem: weak noise}\n\tThe noise behavior usually used in the literature (e.g., \cite{sutton2009fast,sutton2009convergent}) is the same as we assumed in \ref{assum:bounded second moments} for Theorem~\ref{thm:ExPDecayRate}:\n\n\t\[\n\t\bE[||M_{n+1}||^2|{\cal F}_n]\leq K_s(1+||\theta_n||^2)\enspace,\n\t\]\n\n\tfor some constant $K_s \geq 0$. However, here we assume the stronger \ref{assum:bounded_feat}, which, using a similar proof technique to that of Lemma~\ref{lem:martingale_bound_TD0}, implies\n\t\[\n\t||M_{n + 1}||^2\leq 3[1+\gamma + \max(\|A\|,\|b\|)]^2(1 + ||\theta_n||^2)\n\t\]\n\tfor all $n \geq 0.$\t\n\end{remark}\n\n\n\nThe remaining parts of the analysis rely on the comparison of the discrete TD(0) trajectory $\{\theta_n\}$ to the continuous solution $\theta(t)$ of the limiting ODE. For this, we first switch from directly treating $\{\theta_n\}$ to treating their linear interpolation $\{\bart(t)\}$ as defined in \eqref{eqn:LinInt}. The key idea then is to use the VoP method \cite{lakshmikantham1998method} as in Lemma~\ref{lem:vopApp}, and express $\bart(t)$ as a perturbation of $\theta(t)$ due to two factors: the discretization error and the martingale difference noise. \nOur quantification of these two factors is as follows. 
\nFor the interval $[\tI{\ell_1}, \tI{\ell_2}],$ let\n\n$$\nE^\text{d}_{[\ell_1,\ell_2]} := \sum_{k = \ell_1}^{\ell_2-1}\int_{\tI{k}}^{\tI{k + 1}} e^{-A (\tI{\ell_2} - \tau) } A [\bart(\tau) - \theta_k] \df \tau \enspace,\n$$\n\nand\n\n$$\nE^\text{m}_{[\ell_1,\ell_2]} := \sum_{k = \ell_1}^{\ell_2-1}\left[\int_{\tI{k}}^{\tI{k + 1}} e^{-A (\tI{\ell_2} - \tau)}\df \tau\right] M_{k + 1} \enspace.\n$$\n\n\n\begin{corollary}[Comparison of SA Trajectory and ODE Solution] \label{cor:ODE perturbation}\n\tFor every $\ell_2 \geq \ell_1$,\n\n\t\[\n\t\bart(\tI{\ell_2}) - \thS = \theta(\tI{\ell_2}, \tI{\ell_1}, \bart(\tI{\ell_1})) - \thS + E^\text{d}_{[\ell_1,\ell_2]} + E^\text{m}_{[\ell_1,\ell_2]} \enspace.\n\t\]\n\end{corollary}\n\nWe highlight that both paths, $\bart(t)$ and $\theta(t, \tI{\ell_1}, \bart(\tI{\ell_1})),$ $t \geq \tI{\ell_1},$ start at the same point $\bart(\tI{\ell_1})$ at time $\tI{\ell_1}.$ \nConsequently, by bounding $E^\text{d}_{[\ell_1,\ell_2]}$ and $E^\text{m}_{[\ell_1,\ell_2]}$ we can estimate the distance of interest.\n\n\subsection{Part I -- Initial Possible Divergence}\n\label{sec:phase1}\n\nIn this section, we show that the TD(0) iterates lie in an $O(n)$-ball around $\thS.$ We stress that this is one of the results that enable us to accomplish more than the existing literature. 
Previously, the distance of the initial iterates from $\\thS$ was bounded using various assumptions, often justified with an artificial projection step which we are able to avoid.\n\nLet $R_0 := 1 + \\|\\theta_0 - \\thS\\|.$\n\\begin{lemma}[Worst-case Iterates Bound]\n\t\\label{lem:WorstCaseThetaBd}\n\tFor $n \\geq 0,$\n\n\t\\[\n\t\\|\\theta_n - \\thS\\| \\leq \\Ro(n) \n\t\\enspace,\n\t\\]\n\n\twhere\n\t\\[\n\t\\Ro(n) := [n+1] \\cS R_0\n\t\\]\n\tand\n\t$\\cS:= 1 + \\|\\thS\\| \\leq 1 + \\|A^{-1}\\| \\; \\|b\\|$\n\\end{lemma}\n\n\nNext, since $\\|M_{n+1}\\|$ is linearly bounded by $\\|\\theta_n - \\thS\\|$, the following result shows that $\\|M_{n+1}\\|$ is $O(n)$ as well. It follows from Lemmas~\\ref{lem:martingale_bound_TD0} and \\ref{lem:WorstCaseThetaBd}.\n\n\\begin{corollary}[Worst-case Noise Bound]\n\t\\label{cor:WorstCaseMBd}\n\tFor $n \\geq 0,$\n\t\\[\n\t\\|M_{n + 1}\\| \\leq \\Km[1 + \\cS R_0][n + 1] \\enspace.\n\t\\]\n\\end{corollary}\n\n\\subsection{Part II -- Rate of Convergence}\n\n\n\n\n\n\n\n\n\n\nHere, we bound the probability of the event\n\n\\begin{equation*}\n\\cE(n_0, n_1) := \\{\\|\\theta_n - \\thS\\| \\leq \\epsilon \\; \\forall n > n_0 + n_1\\} \\enspace\n\\end{equation*}\n\nfor sufficiently large $n_0, n_1;$ how large they should be will be elaborated later.\nWe do this by comparing the TD(0) trajectory $\\theta_{n}$ with the ODE solution $\\theta(\\tI{n}, \\tI{n_0}, \\bart(\\tI{n_0}))$ $\\forall n \\geq n_0$; for this we will use Corollary ~\\ref{cor:ODE perturbation} along with Lemma~\\ref{lem:WorstCaseThetaBd}. Next, we show that if $n_0$ is sufficiently large, or equivalently the stepsizes $\\{\\alpha_{n}\\}_{n \\geq n_0}$ are small enough, then after waiting for a finite number of iterations from $n_0,$ the TD(0) iterates are $\\epsilon-$close to $\\thS$ w.h.p. 
The sufficiently long waiting time ensures that the ODE solution $\theta(t_{n + 1}, t_{n_0}, \bar{\theta}_{n_0})$ is $\epsilon-$close to $\thS;$ the small stepsizes ensure that the discretization error and martingale difference noise are small enough.\n\n\n\nLet $\delta \in (0, 1)$ and let $\epsilon > 0.$ Also, for an event $\cE,$ let $\cE^c$ denote its complement and let $\{\cE_1, \cE_2\}$ denote $\cE_1 \cap \cE_2.$ We begin with a careful decomposition of $\cE^c(n_0, n_1),$ the complement of the event of interest. The idea is to break it down into an incremental union of events. Each such event has an inductive structure: good up to iterate $n$ (denoted by $\Gnd$ below), with the $(n + 1)-$th iterate bad. The good event $\Gnd$ holds when all the iterates up to $n$ remain in an $O(n_0)$ ball around $\thS.$ For $n < n_0 + n_1,$ the bad event means that $\theta_{n + 1}$ is outside the $O(n_0)$ ball around $\thS,$ while for $n \geq n_0 + n_1,$ the bad event means that $\theta_{n + 1}$ is outside the $\ei$ ball around $\thS.$ Formally, for $n_1 \geq 1,$ define the events\n\n\[\n\Emid := \hspace{-0.5em}\bigcup_{n = n_0} ^{n_0 + n_1 - 1} \! \left\{ \Gnd, \|\theta_{n + 1} - \thS\| \! > \!2 \Ro(n_0) \right\} \enspace,\n\]\n\n\begin{align}\n&\Eaft \n\\\n&:= \bigcup_{n = n_0 + n_1} ^{\infty} \left\{\Gnd, \|\theta_{n + 1} - \thS\| > \min\{\epsilon, 2\Ro(n_0) \} \right\} \;,\n\end{align}\n\nand, $\forall n \geq n_0,$ let\n\[\n\Gnd \! := \! \left\{\bigcap_{k = n_0}^{n} \!\{\|\theta_k - \thS\| \! \leq \! 
2\\Ro(n_0) \\}\\right\\} \\enspace.\n\\]\n\n\nUsing the above definitions, the decomposition of $\\cE^c(n_0, n_1)$ is the following relation.\n\\begin{lemma}[Decomposition of Event of Interest] \\label{lem: decomposition}\n\tFor $n_0, n_1 \\geq 1,$\n\n\t\\[\n\t\\cE^c(n_0, n_1) \\subseteq \\Emid \\cup \\Eaft \\enspace.\n\t\\]\n\n\\end{lemma}\n\n\nFor the following results, define the constants\n\n\n\n\n\n\\[\n\\cMb :=\n\\begin{cases}\n\\frac{6 \\Km \\Kl 2^{\\lambda - 0.5}}{\\sqrt{2 \\lambda - 1}} & \\text{ if $\\lambda > 0.5$}\\\\\n\\frac{6 \\Km \\Kl }{\\sqrt{1 - 2\\lambda}} & \\text{ if $\\lambda < 0.5$ \\enspace.}\n\\end{cases}\n\\]\n\n\n\n\n\nNext, we show that on the ``good'' event $\\Gnd,$ the discretization error is small for all sufficiently large $n.$\n\n\n\n\\begin{lemma}[Part II Discretization Error Bound]\n\n\t\\label{lem:SmallDE}\n\n\n\n\n\n\n\n\n\n\n\n\tFor any\n\t$$\n\tn \\geq n_0 \\geq \\tfrac{K_\\lambda6 \\|A\\|(\\|A\\| + 2\\Km)}{\\lambda},\n\t$$\n\t\\[\n\t\\|E_{[n_0,n+1]}^d\\| \\leq \\tfrac{1}{3}[n_0+1]C_*R_0 = \\tfrac{1}{3} \\Ro(n_0).\n\t\\]\n\t\n\tFurthermore, for\n\t$$n \\geq \\nMid \\geq \\left( 1 + \\tfrac{K_\\lambda6 \\|A\\| (\\|A\\| + 2\\Km) C_* R_0}{\\lambda \\min\\{ \\epsilon, \\Ro(n_0)\\}} \\right)(n_0 + 1)$$\n\n\tit thus also holds on $G_{n_0,n}$ that\n\n\t\\begin{align*}\n\t\\|E_{[\\nMid,n+1]}^d\\| &\\leq \\tfrac{1}{3}\\min\\{\\epsilon, [n_0+1]C_*R_0\\} \\\\ &= \\tfrac{1}{3}\\min\\{\\epsilon, \\Ro(n_0)\\}\\enspace.\n\t\\end{align*}\n\\end{lemma}\n\n\nThe next result gives a bound on the probability that, on the ``good'' event $\\Gnd,$ the martingale difference noise is small when $n$ is large. 
The bound has two forms for the different values of $\\lambda$.\n\n\\begin{lemma}[Part II Martingale Difference Noise Concentration]\n\n\t\\label{lem:MartConc}\n\tLet $n_0 \\geq 1$ and $R \\geq 0.$ Let $n \\geq n' \\geq n_0.$\n\n\t\\begin{itemize}\n\n\t\t\\item For $\\lambda > 1/2,$\n\n\t\t\\begin{align*}\n\t\t\\Pr\\{\\Gnd, &\\|E_{[n',n+1]}^{m}\\| \\geq R \\} \\\\\n\t\t&\\leq 2d^2 \\exp\\left[- \\frac{(n + 1) R^2 }{ 2d^3 \\cMb^2 \\Ro^2(n_0)}\\right] \\enspace.\n\t\t\\end{align*}\n\t\t\n\t\t\\item For $\\lambda < 1/2,$\n\n\t\t\\begin{align*}\n\t\t\\Pr\\{\\Gnd, &\\|E_{[n',n+1]}^{m}\\| \\geq R \\} \\\\ &\\leq 2d^2 \\exp\\left[-\\frac{[n' + 1]^{1 - 2 \\lambda} (n + 1)^{2 \\lambda} R^2}{2d^3 \\cMb^2 \\Ro^2(n_0)}\\right] \\enspace.\n\t\t\\end{align*}\n\t\\end{itemize}\n\t\n\\end{lemma}\n\nHaving Lemma~\\ref{lem:SmallDE}, we substitute $R = \\tfrac{\\Ro(n_0)}{2}$ in Lemma~\\ref{lem:MartConc} and estimate the resulting sum to bound $\\Emid$.\n\n\\begin{lemma}[Bound on Probability of $\\Emid$] \\label{lem: bound on Emid}\n\tLet $n_0 \\geq \\max\\left\\{\\tfrac{K_\\lambda6 \\|A\\| (\\|A\\| + 2\\Km)}{\\lambda}, 2^{\\frac{1}{\\lambda}}\\right\\}$ and $n_1 \\geq 1.$\n\n\t\\begin{itemize}\n\t\t\\item For $\\lambda > 1/2,$ \n\n\t\t\\[\n\t\t\\Pr\\{\\Emid \\} \\leq 16d^5 \\cMb^2 \\exp\\left[-\\frac{n_0}{8d^3 \\cMb^2 }\\right] \\enspace.\n\t\t\\]\n\t\t\n\t\t\\item For $\\lambda < 1/2,$ \n\n\t\t\\begin{equation*}\n\t\t\\Pr\\{\\Emid \\} \\leq \\\\ 2d^2 \\left[\\frac{8d^3 \\cMb^2}{ \\lambda}\\right]^{\\frac{1}{2\\lambda}} \\frac{\\exp[- \\frac{n_0}{64d^3 \\cMb^2}]}{(n_0 + 1)^{\\frac{1 - 2 \\lambda}{2 \\lambda}}} \\enspace.\n\t\t\\end{equation*}\n\t\t\n\t\\end{itemize}\n\\end{lemma}\n\n\n\n\nLastly, we upper bound $\\Eaft$ in the same spirit as $\\Emid$ in Lemma~\\ref{lem: bound on Emid}, again using Lemmas~\\ref{lem:SmallDE} and \\ref{lem:MartConc}; this time with $ R = \\frac{\\ei}{3}$ . 
\n\\begin{lemma}[Bound on Probability of $\\Eaft$] \\label{lem: bound on Eaft}\n\tLet $$n_0 \\geq \\max\\left\\{\\tfrac{K_\\lambda6 \\|A\\| (\\|A\\| + 2\\Km)}{\\lambda}, 2^{\\frac{1}{\\lambda}}\\right\\}$$ and\n\n\t\\[\n\t\\nMid \\geq \\left(1 + \\tfrac{K_\\lambda6 \\|A\\| (\\|A\\| + 2\\Km) }{\\lambda \\min\\{ \\epsilon, \\Ro(n_0)\\}} \\right)\\Ro(n_0).\n\t\\]\n\n\n\tLet $n_1 \\equiv n_1(\\ei,\\nMid,n_0) \\geq (\\nMid + 1) \\left[ \\frac{6\\Kl \\Ro(n_0)}{\\ei} \\right]^{1/\\lambda} - n_0.$\n\n\t\\begin{itemize}\n\n\t\t\\item\n\t\tFor $\\lambda > 1/2,$\n\n\t\t\\begin{align*}\n\t\t\\Pr\\{&\\Eaft \\}\n\t\t\\leq 36 d^5 \\cMb^2 \\left[\\frac{\\Ro(n_0)}{\\ei}\\right]^2\\\\\n\n\t\t&\\times \\exp \\left[- \\frac{(6\\Kl)^{1/\\lambda}}{ 18 d^3 \\cMb^2} (\\nMid + 1) \\left[\\frac{\\ei}{\\Ro(n_0)}\\right]^{2 - \\tfrac{1}{\\lambda}} \\right] .\n\t\t\\end{align*}\n\t\t\\item\n\t\tFor $\\lambda < 1/2,$ \n\n\t\t\\begin{align*}\n\t\t\\Pr\\{\\Eaft\\}\n\t\t\\leq\n\t\t2d^2& \\left[ \\frac{ 18 d^3 \\cMb^2 [\\Ro(n_0)]^2}{\\ei^2 \\lambda } \\right]^{\\frac{1}{2\\lambda}} \\\\\n\t\t&\\times \\exp\\left[-\\frac{K_\\lambda^2 }{4 d^3 \\cMb^2 } (\\nMid + 1) \\right].\n\t\t\\end{align*}\n\t\\end{itemize}\n\\end{lemma}\n\nWe are now ready to put the pieces together for proving Theorem~\\ref{thm: convergence rate}. For the detailed calculations see end of Appendix~\\ref{sec: main thm appendix}.\n\\begin{proof}[Proof of Theorem~\\ref{thm: convergence rate}]\nFrom Lemma~\\ref{lem: decomposition}, by a union bound,\n\n\\begin{align*}\n\\Pr\\{\\cE^c(n_0, n_1)\\}\n\\leq\n\\Pr\\{\\Emid \\} + \\Pr\\{\\Eaft\\} \\enspace.\n\\end{align*}\n\nThe behavior of $\\Emid$ is dictated by $n_0$, while the behavior of $\\Eaft$ by $n_1$. Using Lemma~\\ref{lem: bound on Emid}, we set $n_0$ so that $\\Emid$ is less than $\\delta/2$, resulting in the condition $n_0 = O\\left(\\ln\\tfrac{1}{\\delta}\\right)$. 
Next, using Lemma~\ref{lem: bound on Eaft}, we set $n_1$ so that $\Eaft$ is less than $\delta/2$, resulting in \n$$\nn_1 = \tilde{O}\left(\big[{(1/\epsilon)}\ln{(1/\delta)}\big]^{\max\left\{1+{1/\lambda},2\right\}}\right)\n$$ for \n$\n\lambda > 1/2,\n$\nand \n$$\nn_1 =\tilde{O}\left(\big[{(1/\epsilon)}\ln{(1/\delta)}\big]^{1+{1/\lambda}}\right)\n$$ for \n$\lambda < 1/2.\n$\n\end{proof}\n\n\n\n\n", "Descriptive_question1": "What is the Discretization Error when the Stepsize is Small in table_1?", "Descriptive_question2": "What is the TD(0) Behavior when the Stepsize is Large in table_1?", "Reasoning_question1": "How does the Martingale Noise Impact change across different Stepsize levels in table_1, and what might this imply about the stability of the system?", "Reasoning_question2": "Why might the TD(0) Behavior transition from 'Possibly diverging' to 'Converging w.h.p.' as the Stepsize decreases in table_1?", "Descriptive_answer1": "ε/3", "Descriptive_answer2": "Possibly diverging", "Reasoning_answer1": "Looking at table_1, I observe that the Martingale Noise Impact varies with the Stepsize. When the Stepsize is Large, the Martingale Noise Impact is 'Large', indicating a significant effect of noise on the system. As the Stepsize reduces to Moderate, the impact decreases to 'O(n_0) w.h.p.', suggesting that the noise effect is bounded with high probability. Finally, with a Small Stepsize, it further reduces to 'ε/3 w.h.p.', implying a minimal noise impact. This progressive reduction in noise impact likely indicates improved system stability as smaller stepsizes seem to mitigate the randomness or variability introduced by the martingale noise, allowing for more predictable and controlled behavior in the TD(0) algorithm.", "Reasoning_answer2": "Analyzing table_1, I can see that the TD(0) Behavior changes with the Stepsize. Starting with a Large Stepsize, the behavior is 'Possibly diverging', which suggests that the algorithm may not stabilize and could move away from the optimal solution due to overly aggressive updates. 
As the Stepsize decreases to Moderate, the behavior improves to 'Stay in O(n_0) ball w.h.p.', indicating that the updates are more controlled, keeping the iterates within a bounded region with high probability. Finally, with a Small Stepsize, the behavior is 'Converging w.h.p.', meaning the algorithm is likely to approach the optimal solution. This transition likely occurs because smaller stepsizes reduce the risk of overshooting the target, allowing finer adjustments and enabling the algorithm to settle closer to the true value, as supported by the decreasing Discretization Error and Martingale Noise Impact in the table." }, { "paper_id": "1511.04027.json", "table_id": "table_1", "table_content": "\\begin{table}[h!]\n\\caption{Percent change in $\\mathcal{R}_c$ with respect to a $1\\%$ change in the parameter value, for a low and a high isolation effectiveness $r$, and a low and a high value of $f_T$, while keeping the other parameter values as presented in Table \\ref{tab:ParamsDef}.} \\label{r0change1}\n\\vspace{-6mm}\n\\begin{center}\n\\begin{tabular}{|c|c|c|c|c|c|c|c|c|}\n\\hline\n&Parameter & $\\beta$ & $r$ & $\\ell$ & $\\gamma_r$ & $\\gamma$ & $\\alpha$ & $f_T$ \\\\\n\\hline\n\n &\\% change & 1\\% & -0.23\\% & 0.423\\% & -0.423\\% & -0.382\\% & -0.195\\% & -0.119\\% \\\\\n $f_T = 0.25$& for $r = 0.35$ & & & & & & & \\\\\n\\cline{2-9}\n&\\% change & 1\\% & -1.014\\% & 0.053\\% & -0.053\\% & -0.445\\% & -0.501\\% & -0.306\\% \\\\\n & for $r = 0.95$ & & & & & & & \\\\\n\\hline\n&\\% change & 1\\% & -0.402\\% & 0.747\\% & -0.747\\% & -0.167\\% & -0.086\\% & -0.471\\% \\\\\n $f_T = 0.75$& for $r = 0.35$ & & & & & & & \\\\\n\\cline{2-9}\n&\\% change & 1\\% & -3.521\\% & 0.185\\% & -0.185\\% & -0.383\\% & -0.431\\% & -2.373\\% \\\\\n & for $r = 0.95$ & & & & & & & \\\\\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}", "caption": "Percent change in $\\mathcal{R}_c$ with respect to a $1\\%$ change in the parameter value, for a low and a high isolation 
effectiveness $r$, and a low and a high value of $f_T$, while keeping the other parameter values as presented in Table \\ref{tab:ParamsDef}.", "label": "r0change1", "section_info": "3 \\protect\\normalsize Model analysis\n\\section{\\protect\\normalsize Model analysis}\n\n \\subsection{\\protect\\normalsize Basic properties}\nSince model (\\ref{model1}) imitates the dynamics of human populations, all variables and parameters should be non-negative. Thus, following the approach shown in appendix A of [\\ref{lit:HTh}], we show the following result.\n\\begin{theorem}\nThe variables of model (\\ref{model1}) are non-negative for all time.\n\\end{theorem}\n\n\\begin{lemma}\nThe closed set \n\\begin{equation*}\n\\Omega = \\big\\lbrace (S, E_1, E_2, I, J, R)\\in \\mathbb{R}_+^6: \\frac{\\Lambda}{\\mu + q_1\\gamma + q_2\\gamma_r}\\le{S + E_1 + E_2 + I + J + R} \\le \\frac{\\Lambda}{\\mu} \\big\\rbrace\n\\end{equation*}\nis positively invariant for model (\\ref{model1}) and is absorbing.\n\\end{lemma}\n\\noindent Proof: Equation (\\ref{Neq}) implies that\n\\begin{eqnarray}\n\\frac{dN}{dt} &\\le& \\Lambda - \\mu N, \\label{Neq1}\\\\\n\\frac{dN}{dt} &\\ge& \\Lambda - (\\mu+q_1\\gamma+q_2\\gamma_r) N. \\label{Neq2}\n\\end{eqnarray}\nIt follows from (\\ref{Neq1}) that \n\\begin{equation}\nN(t) \\le \\frac{\\Lambda}{\\mu} + \\left(N(0) -\\frac{\\Lambda}{\\mu} \\right) e^{- \\mu t}\\label{ineq1}\n\\end{equation}\nand from (\\ref{Neq2}) that \n\\begin{equation}\nN(t)\\ge\\frac{\\Lambda}{\\mu + q_1\\gamma + q_2\\gamma_r} + \\left(N(0) -\\frac{\\Lambda}{\\mu + q_1\\gamma + q_2\\gamma_r} \\right) e^{-(\\mu + q_1\\gamma + q_2\\gamma_r)t}. \\label{ineq2}\n\\end{equation}\nIf we assume $N(0) > \\Lambda/\\mu$, then $dN/dt < 0$ and therefore (based on inequality (\\ref{ineq1})), $N(t)$ decreases steadily until reaching $\\Lambda/\\mu$ when $t$ tends to $\\infty$. 
Similarly, if we assume $N(0) < \\Lambda/(\\mu + q_1\\gamma + q_2\\gamma_r)$, then $dN/dt > 0$ and therefore (based on inequality (\\ref{ineq2})), $N(t)$ increases steadily until reaching $\\Lambda/(\\mu + q_1\\gamma + q_2\\gamma_r)$ as $t$ tends to $\\infty$. It remains to check the case in which $N(0)$ lies in the interval between $\\Lambda/(\\mu + q_1\\gamma + q_2\\gamma_r)$ and $\\Lambda/\\mu$. To this end, both inequalities (\\ref{ineq1}) and (\\ref{ineq2}) are combined together to get \n\\begin{equation*}\n\\frac{\\Lambda}{\\mu + q_1\\gamma + q_2\\gamma_r} + \\left(N(0) -\\frac{\\Lambda}{\\mu + q_1\\gamma + q_2\\gamma_r} \\right) e^{-(\\mu + q_1\\gamma + q_2\\gamma_r)t}\\le N(t) \\le \\frac{\\Lambda}{\\mu} + \\left(N(0) -\\frac{\\Lambda}{\\mu} \\right) e^{- \\mu t}.\n\\end{equation*}\nOn taking the limit when $t$ tends to $\\infty$, we find that $N(t)$ remains within the same interval. Thus, the set $\\Omega$ is positively invariant and absorbing.\n\n \\subsection{\\protect\\normalsize Equilibrium analysis}\n \\subsubsection*{\\protect\\normalsize Ebola-free equilibrium and the control reproduction number $\\mathcal{R}_c$}\nIt is easy to check that model (\\ref{model1}) has the Ebola-free equilibrium \n\\begin{equation}\nE_0 = \\left(\\frac{\\Lambda}{\\mu}, 0, 0, 0, 0, 0\\right)^{\\prime}\n\\end{equation}\nwhere the prime `` ${}^{\\prime}$ '' means vector transpose.\\newline \n\\indent The basic reproduction number, $\\mathcal{R}_0$, is a measure of the average number of secondary cases produced by a typical infectious individual during the entire course of infection in a completely susceptible population and in the absence of control interventions [\\ref{lit: BrauerF},\\ref{lit: AndersonRM}]. On the other hand, the control reproduction number, $\\mathcal{R}_c$, quantifies the potential for infectious disease transmission in the context of a partially susceptible population due to the implementation of control interventions. 
When $\\mathcal{R}_c > 1$, the infection may spread in the population, and the rate of spread is higher with increasingly high values of $\\mathcal{R}_c$. If $\\mathcal{R}_c < 1$, infection cannot be sustained and is unable to generate an epidemic. For our model, $\\mathcal{R}_c$ is computed using the next generation matrix approach shown in [\\ref{lit:PVDDJW2002}]. Accordingly, we compute the matrices $\\mathbf{F}$ (for the new infection terms) and $\\mathbf{V}$ (for the\ntransition terms) as\n\n\\begin{eqnarray*}\n \\mathbf{F} = \\left(\\begin{array}{cccc}\n 0 & 0 & \\beta & (1-r) \\ell \\beta \\\\\n 0 & 0 & 0 & 0 \\\\\n 0 & 0 & 0 & 0 \\\\\n 0 & 0 & 0 & 0 \\\\\n \\end{array}\\right) \\quad, \\quad\n \\mathbf{V} = \\left(\\begin{array}{cccc}\n \\kappa_1 + \\mu & 0 & 0 & 0 \\\\\n -\\kappa_1 & \\kappa_2 + \\mu & 0 & 0\\\\\n 0 & -(1-f_T) \\kappa_2 & \\alpha+\\gamma+\\mu & 0\\\\\n 0 & - f_T \\kappa_2 & - \\alpha & \\gamma_r + \\mu\\\\\n \\end{array}\\right).\n\\end{eqnarray*}\n\\noindent Thus, the control reproduction number is given by\n\\begin{eqnarray}\n\\mathcal{R}_c & = &\\rho(\\mathbf{F}\\mathbf{V}^{-1}) = \\frac{\\kappa_1 \\kappa_2 \\beta[(1-f_T) (\\mu + \\gamma_r) + (1-r)\\ell(\\alpha+f_T(\\gamma+\\mu))]}{(\\kappa_1 + \\mu)(\\kappa_2 + \\mu)(\\alpha+\\gamma+\\mu)(\\gamma_r + \\mu)} \\nonumber\\\\\n& = & \\frac{\\kappa_1\\kappa_2\\beta}{(\\kappa_1 + \\mu)(\\kappa_2 + \\mu)(\\alpha+\\gamma+\\mu)}\\left[1 - f_T + (1-r) \\ell \\left( \\frac{\\alpha}{\\gamma_r+\\mu} + f_T \\frac{\\gamma + \\mu}{\\gamma_r+\\mu} \\right) \\right]\\nonumber\\\\\n& = & \\mathcal{R}_0\\left[1-\\frac{\\alpha}{(\\alpha+\\gamma+\\mu)}\\right] \\left[1 - f_T + (1-r) \\ell \\left( \\frac{\\alpha}{\\gamma_r+\\mu} + f_T \\frac{\\gamma + \\mu}{\\gamma_r+\\mu} \\right) \\right]\\label{R0eq}\n\\end{eqnarray}\nwhere $\\rho$ is the spectral radius (dominant eigenvalue in\nmagnitude) of the matrix $\\mathbf{F}\\mathbf{V}^{-1}$ and \n\\begin{equation}\n\\mathcal{R}_0 = 
\\frac{\\kappa_1\\kappa_2\\beta}{(\\kappa_1 + \\mu)(\\kappa_2 + \\mu)(\\gamma+\\mu)}\n\\end{equation}\nis the basic reproduction number for the model.\n\n\\indent The local stability of the Ebola-free equilibrium, $E_0$, for values of $\\mathcal{R}_c < 1$ is established based on a direct use of Theorem 2 in [\\ref{lit:PVDDJW2002}]. We summarize our result in the following lemma.\n\\begin{lemma}\nThe Ebola-free equilibrium $E_0$ of model (\\ref{model1}) is locally asymptotically stable if and only if $\\mathcal{R}_c < 1$.\n\\end{lemma}\n\n\n\\subsubsection*{\\protect\\normalsize Ebola-endemic equilibrium}\nOn putting the derivatives in the left hand side of (\\ref{model1}) equal zero and solving the resulting algebraic system with respect to the variables $\\bar{S}, \\bar{E}_1, \\bar{E}_2, \\bar{I}, \\bar{J}$, and $\\bar{R}$, we obtain\n \\begin{eqnarray}\n\\bar{S} & = & \\frac{\\Lambda}{\\bar\\lambda + \\mu},\\nonumber\\\\\n\\bar{E}_1 & = & \\frac{\\Lambda}{\\bar\\lambda + \\mu} \\cdot \\frac{\\bar\\lambda}{\\kappa_1 + \\mu},\\nonumber\\\\\n\\bar{E}_2 & = & \\frac{\\kappa_1}{\\kappa_2 + \\mu}\\cdot \\frac{\\Lambda}{\\bar\\lambda + \\mu} \\cdot \\frac{\\bar\\lambda}{\\kappa_1 + \\mu},\\nonumber \\\\\n\\bar{I} & = & \\frac{(1-f_T)\\kappa_2}{\\alpha+\\gamma + \\mu} \\cdot \\frac{\\kappa_1}{\\kappa_2 + \\mu}\\cdot \\frac{\\Lambda}{\\bar\\lambda + \\mu} \\cdot \\frac{\\bar\\lambda}{\\kappa_1 + \\mu},\\label{eqvar} \\\\\n\\bar{J} & = & \\frac{\\kappa_1}{\\kappa_2 + \\mu}\\cdot \\frac{\\Lambda}{\\bar\\lambda + \\mu} \\cdot \\frac{\\bar\\lambda}{\\kappa_1 + \\mu} \\cdot \\frac{\\kappa_2}{\\gamma_r + \\mu} \\left[f_T + (1-f_T) \\frac{\\alpha}{\\alpha+\\gamma + \\mu} \\right], \\nonumber\\\\\n\\bar{R} & = & \\frac{1}{\\mu}[(1-q_1)\\gamma I + (1-q_2) \\gamma_r J]\\nonumber\n\\end{eqnarray}\nwhere\n\\begin{equation}\n\\bar\\lambda = \\frac{\\beta(I + (1-r)\\ell \\bar{J})}{\\bar{N} - r \\bar{J}}\\label{lambda}\n\\end{equation}\nis the equilibrium force of infection. 
On substituting from (\\ref{eqvar}) into (\\ref{lambda}) and simplifying (with the assumption that $\\bar\\lambda \\ne 0$), we get\n \\begin{equation}\n\\bar\\lambda = \\frac{\\mu(\\mathcal{R}_c - 1)}{1 - Term}\n\\end{equation}\nwhere \n \\begin{equation*}\nTerm = \\frac{\\kappa_1 \\kappa_2 [q_1(1-f_T)\\gamma(\\gamma_r + \\mu) + (r\\mu + q_2\\gamma_r)(f_T(\\gamma + \\mu) + \\alpha)]}{(\\kappa_1 + \\mu)(\\kappa_2 + \\mu)(\\alpha+\\gamma+\\mu)(\\gamma_r + \\mu)}.\n\\end{equation*}\nHence, the Ebola-endemic equilibrium is unique and we show the following lemma.\n \\begin{lemma}\n Model (\\ref{model1}) has a unique endemic equilibrium that exists if and only if $\\mathcal{R}_c > 1$.\n \\end{lemma}\n \n\n\n\n\n\n\n\\subsection{\\protect\\normalsize Normalized sensitivity analysis on $\\mathcal{R}_c$ }\\label{sensitivity}\n\n\n\nIn considering the dynamics of the Ebola system (\\ref{model1}), we conduct normalized sensitivity analysis on $\\mathcal{R}_c$ to determine the impact of parameter perturbations on the transmission dynamics of the system. By computing the normalized sensitivity indices, we consider the percent change in the output with respect to a percent change in the parameter input. Those parameters with the largest magnitude of change impact the compartment model the most; the sign indicates whether the change produces an increase or a decrease in $\\mathcal{R}_c$.\\newline\n\\indent The normalized sensitivity indices for $\\mathcal{R}_c$ are calculated by taking the partial derivative of $\\mathcal{R}_c$ with respect to each parameter and multiplying it by the ratio of the parameter to $\\mathcal{R}_c$. This value represents the percent change in $\\mathcal{R}_c$ with respect to a 1\\% change in the parameter value [\\ref{lit: CaswellH}]. 
\\newline\n \n\\vspace{-5mm}\n\n\\begin{table}[h!]\n\\caption{Percent change in $\\mathcal{R}_c$ with respect to a $1\\%$ change in the parameter value, for a low and a high isolation effectiveness $r$, and a low and a high value of $f_T$, while keeping the other parameter values as presented in Table \\ref{tab:ParamsDef}.} \\label{r0change1}\n\\vspace{-6mm}\n\\begin{center}\n\\begin{tabular}{|c|c|c|c|c|c|c|c|c|}\n\\hline\n&Parameter & $\\beta$ & $r$ & $\\ell$ & $\\gamma_r$ & $\\gamma$ & $\\alpha$ & $f_T$ \\\\\n\\hline\n\n &\\% change & 1\\% & -0.23\\% & 0.423\\% & -0.423\\% & -0.382\\% & -0.195\\% & -0.119\\% \\\\\n $f_T = 0.25$& for $r = 0.35$ & & & & & & & \\\\\n\\cline{2-9}\n&\\% change & 1\\% & -1.014\\% & 0.053\\% & -0.053\\% & -0.445\\% & -0.501\\% & -0.306\\% \\\\\n & for $r = 0.95$ & & & & & & & \\\\\n\\hline\n&\\% change & 1\\% & -0.402\\% & 0.747\\% & -0.747\\% & -0.167\\% & -0.086\\% & -0.471\\% \\\\\n $f_T = 0.75$& for $r = 0.35$ & & & & & & & \\\\\n\\cline{2-9}\n&\\% change & 1\\% & -3.521\\% & 0.185\\% & -0.185\\% & -0.383\\% & -0.431\\% & -2.373\\% \\\\\n & for $r = 0.95$ & & & & & & & \\\\\n\\hline\n\\end{tabular}\n\\end{center}\n\\end{table}\n\n\n\n\n\\begin{figure}[h!]\n\\begin{center}\n\\includegraphics[scale=0.57]{R0SAnew11.jpg}\n\\caption{Percent change in $\\mathcal{R}_c$ with respect to a $1\\%$ change in the parameter value, for a low value of $f_T$ ($f_T = 0.25$) and two different levels of isolation effectiveness ($r = 0.35$ and $r = 0.95$). The other parameter values are kept as shown in Table \\ref{tab:ParamsDef}.} \\label{graphofR01}\n\\end{center}\n\\end{figure}\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\indent We use the parameters values from Table \\ref{tab:ParamsDef} to study the sensitivity of $\\mathcal{R}_c$ to each parameter. We compute normalized sensitivity analysis on all parameters, but we just consider the impact of parameters that are the most sensitive: $\\beta, r, \\ell, \\gamma_r, \\gamma, \\alpha$, and $f_T$. 
The other parameters ($\\mu, \\kappa_1$, and $\\kappa_2$) have a very low impact, namely less than $0.001\\%$. The numerical results for the sensitivity of $\\mathcal{R}_c$ with respect to each of the most sensitive parameters are given in Table \\ref{r0change1}, for two different levels of isolation effectiveness ($r = 0.35$ and $r = 0.95$) and two values of $f_T$ ($f_T = 0.25$ and $f_T = 0.75$), which is the fraction of pre-symptomatic individuals diagnosed and isolated. The other parameter values are kept as shown in Table \\ref{tab:ParamsDef}.\n\\newline\n\\indent A graphical illustration of the numerical results for the scenario when $f_T = 0.25$ and the two levels of isolation effectiveness ($r = 0.35$ and $r = 0.95$) is given in Figure \\ref{graphofR01}. In the case of high isolation effectiveness ($r = 0.95$), simulations show that both the removal rate, $\\gamma_r$, of isolated individuals and the relative transmissibility parameter $\\ell$ of isolated individuals with respect to infectious individuals are the least sensitive parameters (with a $0.053\\%$ change in $\\mathcal{R}_c$), while the parameter of isolation effectiveness, $r$, is the most sensitive one, where a $1\\%$ increase in $r$ causes a $1.014\\%$ reduction in the value of $\\mathcal{R}_c$. Also, the rate at which infectious individuals get isolated, $\\alpha$, and the fraction of pre-symptomatic individuals detected and isolated, $f_T$, negatively impact the level of $\\mathcal{R}_c$, where a $1\\%$ increase in the value of $f_T$ causes approximately a $0.31\\%$ decline in the value of the reproduction number $\\mathcal{R}_c$. Thus, as pre-symptomatic individuals are diagnosed and as isolation is highly effective, the number of available infectious individuals who are capable of transmitting Ebola decreases and therefore, the reproduction number decreases. 
Also, the removal (by recovery or Ebola-induced death) rate $\\gamma$ of infectious individuals negatively affects $\\mathcal{R}_c$. Hence, for the case of highly effective isolation, the parameters concerning early diagnosis and isolation have a significant impact on the reproduction number. \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\indent These percent impacts of the parameters on $\\mathcal{R}_c$ persist as long as isolation is highly effective. However, if the effectiveness of isolation is low, in the sense that all parameter values are kept the same except the value of the parameter $r$, which is reduced to $0.35$, then we get the results presented in Table \\ref{r0change1} and Figure \\ref{graphofR01}. In this case, both the relative transmissibility $\\ell$ and the removal rate of isolated individuals, $\\gamma_r$, are the second most sensitive parameters, after $\\beta$ which is the most impactful one. Also, $\\ell$ becomes more sensitive than $r$. The implication is that, when isolation is less effective, isolated people may still make successful contacts with susceptible individuals, and therefore the possibility of causing new infections increases. This causes an increase in the reproduction number. Also, it is noted that the effect of $f_T$ and $\\alpha$ is reduced, which means that diagnosing and isolating infected individuals becomes a weak strategy if the effectiveness of isolation is low.\n \n\n\n\\begin{figure}[h!]\n\\begin{center}\n\\includegraphics[scale=0.585]{R0SAnew12.jpg}\n\\caption{Percent change in $\\mathcal{R}_c$ with respect to a $1 \\%$ change in the parameter value, for a high value of $f_T$ ($f_T = 0.75$) and two different levels of isolation effectiveness ($r = 0.35$ and $r = 0.95$). 
The other parameter values are kept as shown in Table \\ref{tab:ParamsDef}.} \\label{graphofR02}\n\\end{center}\n\\end{figure}\n\nOn repeating the previous analyses, but this time for a higher value of $f_T$ ($f_T = 0.75$), we obtain the results shown in Table \\ref{r0change1}, which are also illustrated in Figure \\ref{graphofR02}. In comparison to the scenario when $f_T = 0.25$, the simulations show that increasing the fraction of pre-symptomatic individuals who are diagnosed and isolated, $f_T$, increases the percent impact of the parameters $r, \\ell, \\gamma_r,$ and $f_T$, and decreases the percent impact of the parameters $\\gamma$ and $\\alpha$, on the value of the control reproduction number $\\mathcal{R}_c$. \n\n \n\n \n\n\n\\subsection{\\protect\\normalsize Impact of early detection and isolation on the value of $\\mathcal{R}_c$}\n\n \\begin{figure}[H]\n\\centering\n\\includegraphics[scale=0.55]{R0fT.jpg}\n\\caption{Impact of early detection of pre-symptomatic individuals on the value of $\\mathcal{R}_c$.}\n\\label{fig:R0fT}\n\\end{figure}\n\n\nTo study the impact of early detection of pre-symptomatic individuals and isolation on the reproduction number, we first depict $\\mathcal{R}_c$ as a function of $f_T$, for different levels of isolation effectiveness $r$. Figure \\ref{fig:R0fT} shows that the control reproduction number declines as the proportion, $f_T$, of pre-symptomatic individuals who get diagnosed and isolated increases. Simulations are done using parameter values from Table \\ref{tab:ParamsDef}, but for three different values of $r$. It further shows that the curve corresponding to a low and an intermediate value of isolation effectiveness $r$ (e.g. 
$r = 0.35$ for the solid curve and $r = 0.65$ for the dashed curve) hits $\\mathcal{R}_c = 1$ at some critical value of $f_T$ (say $f_T^{\\star}$), while for the high value of $r$ ($r = 0.95$), it never hits the critical threshold $\\mathcal{R}_c = 1$, as the curve lies entirely below the critical threshold. This indicates that for a high effectiveness of isolation, the control reproduction number is less than one and therefore the infection dies out. Analytically, the exact form of $f_T^{\\star}$ is \n\\begin{equation}\nf_T^{\\star} = \\left[ 1 + (1-r)\\ell \\frac{\\alpha}{\\gamma_r + \\mu} - \\frac{1}{\\mathcal{R}_0} \\left(1 + \\frac{\\alpha}{\\gamma+\\mu}\\right) \\right] / \\left[ 1 - \\frac{(1-r)\\ell(\\gamma + \\mu)}{\\gamma_r + \\mu} \\right]. \\label{fTcond1}\n\\end{equation}\nThe critical proportion $f_T^{\\star}$ represents the minimum proportion of pre-symptomatic individuals who are detected and get isolated to ensure an effective control of Ebola. This critical value remains feasible as long as the following inequality holds\n\n\n\n\\begin{equation}\n(1-r)\\ell < \\frac{\\gamma_r + \\mu}{(\\gamma + \\mu)\\mathcal{R}_0}. \\label{fTcond20}\n\\end{equation}\nIf we keep all parameters fixed except $r$, then condition (\\ref{fTcond20}) could be rewritten in a more convenient form\n\n\n\n\\begin{equation}\nr > 1 - \\frac{\\gamma_r + \\mu}{\\ell (\\gamma + \\mu) \\mathcal{R}_0}. \\label{fTcond3}\n\\end{equation}\nThis gives the minimum level of effectiveness of isolation required to obtain an isolation and early diagnosis-based control strategy for Ebola transmission. \n\n\n\n\n \\begin{figure}[H]\n\\centering\n\\includegraphics[scale=0.55]{R0alpha.jpg}\n\\caption{Impact of isolating infectious individuals on the value of $\\mathcal{R}_c$.}\n\\label{fig:R0alpha}\n\\end{figure}\n\nNow, we could also ask a similar question on the role of isolating infectious individuals to contain Ebola transmission. 
Figure \\ref{fig:R0alpha} shows the impact of changing the rate at which infectious individuals get isolated, $\\alpha$, on $\\mathcal{R}_c$, for the same three different levels of isolation effectiveness, as used above. The analysis shows that it is possible to control the epidemic if and only if $\\alpha > \\alpha^\\star$, where\n\\begin{equation}\n\\alpha^\\star = \\frac{[ (1-f_T)(\\gamma_r + \\mu)(\\gamma+\\mu) + (1-r)\\ell f_T (\\gamma+\\mu)^2]\\mathcal{R}_0 - (\\gamma_r + \\mu)(\\gamma+\\mu) }{(\\gamma_r + \\mu) - \\ell (1-r) \\mathcal{R}_0(\\gamma + \\mu)}\n\\end{equation}\nand provided that condition (\\ref{fTcond20}) holds.\n\n\n", "Descriptive_question1": "What is the percent change in R_c with respect to a 1% change in the parameter beta in table_1?", "Descriptive_question2": "What is the percent change in R_c for parameter r when f_T is 0.75 and r is 0.95 in table_1?", "Reasoning_question1": "Why does the percent change in R_c for parameter r vary significantly between low and high isolation effectiveness in table_1?", "Reasoning_question2": "How does the increase in f_T from 0.25 to 0.75 impact the sensitivity of R_c to the parameter f_T in table_1, and what does this suggest about early detection strategies?", "Descriptive_answer1": "1%", "Descriptive_answer2": "-3.521%", "Reasoning_answer1": "The percent change in R_c for parameter r varies significantly between low and high isolation effectiveness because isolation effectiveness directly influences the control of disease transmission. At low isolation effectiveness (r = 0.35), the impact of a 1% change in r on R_c is less pronounced (e.g., -0.23% at f_T = 0.25) as isolated individuals are still likely to transmit the disease, reducing the effectiveness of isolation as a control measure. 
However, at high isolation effectiveness (r = 0.95), the impact is much stronger (e.g., -1.014% at f_T = 0.25 and -3.521% at f_T = 0.75) because a small increase in r significantly reduces the transmission potential by limiting contacts between isolated individuals and susceptibles. This difference highlights that isolation is a more powerful intervention when its effectiveness is already high, as the incremental benefit of improving r yields a larger reduction in R_c.", "Reasoning_answer2": "When f_T increases from 0.25 to 0.75, the sensitivity of R_c to a 1% change in f_T becomes more negative, indicating a stronger impact. At f_T = 0.25, the percent change in R_c is -0.119% (for r = 0.35) and -0.306% (for r = 0.95), whereas at f_T = 0.75, it increases to -0.471% (for r = 0.35) and -2.373% (for r = 0.95). This trend suggests that as more pre-symptomatic individuals are detected and isolated (higher f_T), the effectiveness of early detection as a control strategy becomes increasingly significant in reducing R_c. The implication for early detection strategies is that scaling up efforts to identify and isolate pre-symptomatic cases can have a disproportionately larger effect on controlling disease spread, especially when paired with high isolation effectiveness, as it reduces the pool of infectious individuals more efficiently." 
}, { "paper_id": "2107.07729.json", "table_id": "table_1", "table_content": "\\begin{table}[]\n \\centering\n \\begin{tabular}{|c|c|c|}\n \\hline\n \\textbf{Protocol} & \\textbf{Labeled Data} & \\textbf{Unlabeled Data} \\\\\n \\hline\n \\hline\n P-1 & 10K & 1.39M \\\\\n \\hline\n P-2 & 20K & 1.38M \\\\\n \\hline\n P-3 & 30K & 1.37M \\\\\n \\hline\n P-4 & 50K & 1.35M \\\\\n \\hline\n P-5 & 140K (10\\%) & 1.26M \\\\\n \\hline\n P-6 & 0.7M (50\\%) & 0.7M \\\\\n \\hline\n \\end{tabular}\n \\caption{Six protocols have been used to evaluate the proposed SSL-MTPP algorithm with varying labeled data (from 10K to 0.7M) and varying unlabeled data on the given training set.}\n \\label{tab:protocol}\n\\end{table}", "caption": "Six protocols have been used to evaluate the proposed SSL-MTPP algorithm with varying labeled data (from 10K to 0.7M) and varying unlabeled data on the given training set.", "label": "tab:protocol", "section_info": "4 Experiments and Analysis\n\\section{Experiments and Analysis}\n\\label{sec:experiment}\nThe proposed SSL-MTPP algorithm has been evaluated under varying amount of labeled training data. Comparison has been performed with the baseline/native supervised MTPP model which follows a similar architecture as the proposed SSL-MTPP model without the unsupervised branch (described above). Details regarding the dataset, protocols, and results are given below. \n\n\n\\subsection{Dataset and Protocol}\n\nExperiments have been performed on the Retweet dataset \\cite{zhao2015seismic}, which is formed through the Seismic dataset. The dataset contains multiple sequences of retweets, where each sequence corresponds to information regarding the retweets on a particular tweet. Each sequence contains information regarding the event time (retweet) and the corresponding marker information. Here, the marker refers to type of user (based on the number of followers) who has retweeted. 
Three categories of marker are provided: (i) normal user, (ii) influencer user, and (iii) celebrity user. The marker information is defined based on the number of followers (degree) of a given user: (i) degree lower than the median (normal user), (ii) degree higher than or equal to the median but less than the $95^{th}$ percentile (influencer user), and (iii) degree higher than or equal to the $95^{th}$ percentile (celebrity user). The dataset consists of over two million events with an average sequence length of 209 events. The dataset has an imbalanced class distribution: 50.6\% of event markers are normal users, 45\% are influencer users, while only 4.4\% are celebrity users. For experiments, data pertaining to 1.4M events was used for training, while the remaining 60K events formed the test set.\n\nTable \ref{tab:protocol} presents the six protocols used to evaluate the proposed SSL-MTPP algorithm. The training set (consisting of 1.4M events) is split into a labeled set and an unlabeled set for each protocol. The labeled set contains data varying from 10K (P-1) to 0.7M (P-6), while the remaining data forms the unlabeled set. For experiments, the proposed SSL-MTPP algorithm is trained for each protocol, and comparison has been performed with the baseline supervised MTPP model trained in the traditional supervised manner with labeled data only.\n\n\n\\begin{table*}[!t]\n \\centering\n \\caption{Performance of the proposed semi-supervised learning algorithm with varying amount of labeled data during training.}\n \\begin{tabular}{|c|c||l|c|c|c|}\n \\hline\n \\textbf{Protocol} & \\textbf{Labeled Data} & \\textbf{Model} & \\textbf{Avg. 
Precision (\\%)} & \\textbf{Macro-F1 (\\%)} & \\textbf{Micro-F1 (\\%)} \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-1} & \\multirow{2}{*}{10K} & Native Supervised MTPP & 38.84 & 39.40 & 58.98 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 39.10 & 39.48 & 59.40\\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-2} & \\multirow{2}{*}{20K} & Native Supervised MTPP & 38.77 & 39.52 & 58.78 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 67.95 & 40.77 & 59.50 \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-3} & \\multirow{2}{*}{30K} & Native Supervised MTPP & 43.92 & 40.26 & 58.03 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 68.07 & 40.72 & 59.56\\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-4} & \\multirow{2}{*}{50K} & Native Supervised MTPP & 44.91 & 40.74 & 57.61 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 68.35 & 40.71 & 59.59 \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-5} & \\multirow{2}{*}{140K (10\\%)} & Native Supervised MTPP & 45.08 & 37.88 & 56.86 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP & 68.42 & 40.73 & 59.79 \\\\\n \\hline\n \\hline\n\n \\multirow{2}{*}{P-6} & \\multirow{2}{*}{0.7M (50\\%)} & Native Supervised MTPP & 66.74 & 40.47 & 57.76\\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 69.49 & 40.09 & 59.22 \\\\\n \\hline\n\n \\end{tabular}\n \\label{tab:res}\n\\end{table*}\n\n\n\\subsection{Results and Analysis}\n\nTable \\ref{tab:res} presents the results obtained on the Retweet dataset. The proposed SSL-MTPP algorithm has been evaluated on different protocols containing varying amount of labeled data (Table \\ref{tab:protocol}). In the literature, most of the research has focused on reporting the Macro-F1 and Micro-F1 performance metrics. As part of this research, we observe that these might not be the most appropriate metrics to judge a model's performance, especially under the scenario of imbalanced per-class data. 
To this end, along with the Macro-F1 and Micro-F1 values, we also report the average precision of each model. \n\nAs can be observed from Table \ref{tab:res}, limited variation is observed for the Macro-F1 and Micro-F1 values for the proposed and native supervised MTPP models. For the SSL-MTPP model, the Micro-F1 values lie in the range of $59.22\% - 59.59\%$, regardless of the amount of labeled data, and the Macro-F1 values lie in the range of $39.48\% - 40.77\%$, thus demonstrating limited variation. On the other hand, the average precision lies in the range of $39.10\% - 69.49\%$ with varying labeled data. Similar behavior is observed for the native supervised MTPP model as well, where limited variations are observed for the Macro-F1 and Micro-F1 values, while a higher range is observed for the average precision metric. The consistent behavior across the two models suggests that average precision is a better metric for comparing performance in the setup of imbalanced testing data across classes. \n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=3.3in]{barPrec.png}\n \\caption{Average precision of the proposed SSL-MTPP model and the native supervised MTPP model on different protocols. The proposed algorithm presents improved performance across protocols. }\n \\label{fig:prec}\n\\end{figure}\n\n\nFigure \ref{fig:prec} presents the average precision (\%) of the proposed SSL-MTPP model and the native supervised MTPP model for six protocols. Across the protocols, the proposed SSL-MTPP model demonstrates improved performance as compared to the baseline model. In P-1, where only 10K labeled data is available, the SSL-MTPP model obtains an average precision of 39.10\%, presenting an improvement over the baseline model (38.84\%). Larger improvements are observed for P-2 to P-5, where at least 20K labeled data is available for training. 
For example, in P-4, the SSL-MTPP model obtains an average precision of 68.35\% demonstrating an improvement of almost 24\% as compared to the baseline model. Improvement is also observed for the Macro-F1 and Micro-F1 values across protocols. Further, relatively less improvement is seen when 0.7M labeled data is available for training (P-6), where the proposed SSL-MTPP model achieves an average precision of 69.49\% (as compared to 66.74\% of the baseline model), thus suggesting higher improvement when limited training data is available for training. The above behavior appears intuitive in nature as well, since as the amount of labeled data increases, the baseline model is able to learn better, thus reducing the gap in improvement.\n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=3.3in]{bar.png}\n \\caption{Effect of varying lambda values ($\\lambda$ in Equation \\ref{eq:sum}) on the average precision performance. Experiments have been performed with the SSL-MTPP algorithm for P-3 (30K labeled training data).}\n \\label{fig:bar}\n\\end{figure}\n\nExperiments have also been performed to analyze the effect of the weight parameter ($\lambda$ in Equation \ref{eq:sum}) for understanding the effect of the fusion of the supervised and unsupervised representations. Figure \ref{fig:bar} presents the average precision obtained on P-3 (30K labeled training data) with varying $\lambda$ values. Best performance is obtained with a value of $0.1$, while a drop in performance is seen with a very small ($0.001$) and a very large ($10$) value. A small value reduces the contribution of the unsupervised representation, while a very large value offsets the contribution of the supervised embedding thus resulting in a drop in performance. \n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the labeled data size for Protocol P-1 in table_1?", "Descriptive_question2": "What is the Avg. Precision (%) of the Proposed Semi-Supervised MTPP for Protocol P-6 in table_1?", "Reasoning_question1": "How does the performance of the Proposed Semi-Supervised MTPP compare to the Native Supervised MTPP in terms of Avg. Precision (%) across different protocols in table_1?", "Reasoning_question2": "Why might the Avg. Precision (%) show more variation compared to Macro-F1 (%) and Micro-F1 (%) metrics for both models in table_1?", "Descriptive_answer1": "10K", "Descriptive_answer2": "69.49", "Reasoning_answer1": "Let's analyze the Avg. Precision (%) performance of the Proposed Semi-Supervised MTPP compared to the Native Supervised MTPP across the protocols in the table. Starting with P-1, the Semi-Supervised model achieves a precision of 39.10%, slightly higher than the Native model's 38.84%, showing a modest improvement. For P-2, the Semi-Supervised model jumps to 67.95%, a significant leap over the Native model's 38.77%, indicating a substantial advantage with just 20K labeled data. This trend continues in P-3, with the Semi-Supervised model at 68.07% versus 43.92% for the Native model, and in P-4, where it's 68.35% against 44.91%. In P-5, the gap remains wide, with the Semi-Supervised model at 68.42% and the Native at 45.08%. However, by P-6, with 0.7M labeled data, the difference narrows, as the Semi-Supervised model achieves 69.49% compared to the Native model's 66.74%. 
Overall, the Semi-Supervised model consistently outperforms the Native model across all protocols, with the most significant improvements observed in P-2 to P-5 when labeled data is limited, suggesting that the semi-supervised approach leverages unlabeled data effectively to boost performance.", "Reasoning_answer2": "To understand why Avg. Precision (%) shows more variation compared to Macro-F1 (%) and Micro-F1 (%) for both models, let's consider the nature of these metrics and the context of the dataset. First, Avg. Precision focuses on the model's ability to rank positive instances correctly, which can be more sensitive to changes in the amount of labeled data as it reflects the model's discriminative power, especially in imbalanced datasets like this one (with only 4.4% celebrity user events). Looking at the table, Avg. Precision for the Semi-Supervised model ranges widely from 39.10% in P-1 to 69.49% in P-6, mirroring the increase in labeled data. In contrast, Macro-F1, which equally weights all classes, ranges only from 39.48% to 40.77%, and Micro-F1, which is dominated by majority classes, varies minimally from 59.22% to 59.79%. This limited variation in F1 scores suggests they are less responsive to improvements in handling minority classes or benefiting from additional data. Avg. Precision likely shows more variation because it better captures the model's improved ranking and detection capabilities as labeled data increases, while F1 metrics are constrained by class imbalance and structural limitations in reflecting nuanced performance changes." }, { "paper_id": "2107.07729.json", "table_id": "table_2", "table_content": "\\begin{table*}[!t]\n \\centering\n \\caption{Performance of the proposed semi-supervised learning algorithm with varying amount of labeled data during training.}\n \\begin{tabular}{|c|c||l|c|c|c|}\n \\hline\n \\textbf{Protocol} & \\textbf{Labeled Data} & \\textbf{Model} & \\textbf{Avg. 
Precision (\\%)} & \\textbf{Macro-F1 (\\%)} & \\textbf{Micro-F1 (\\%)} \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-1} & \\multirow{2}{*}{10K} & Native Supervised MTPP & 38.84 & 39.40 & 58.98 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 39.10 & 39.48 & 59.40\\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-2} & \\multirow{2}{*}{20K} & Native Supervised MTPP & 38.77 & 39.52 & 58.78 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 67.95 & 40.77 & 59.50 \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-3} & \\multirow{2}{*}{30K} & Native Supervised MTPP & 43.92 & 40.26 & 58.03 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 68.07 & 40.72 & 59.56\\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-4} & \\multirow{2}{*}{50K} & Native Supervised MTPP & 44.91 & 40.74 & 57.61 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 68.35 & 40.71 & 59.59 \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-5} & \\multirow{2}{*}{140K (10\\%)} & Native Supervised MTPP & 45.08 & 37.88 & 56.86 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP & 68.42 & 40.73 & 59.79 \\\\\n \\hline\n \\hline\n\n \\multirow{2}{*}{P-6} & \\multirow{2}{*}{0.7M (50\\%)} & Native Supervised MTPP & 66.74 & 40.47 & 57.76\\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 69.49 & 40.09 & 59.22 \\\\\n \\hline\n\n \\end{tabular}\n \\label{tab:res}\n\\end{table*}", "caption": "Performance of the proposed semi-supervised learning algorithm with varying amount of labeled data during training.", "label": "tab:res", "section_info": "4 Experiments and Analysis\n\\section{Experiments and Analysis}\n\\label{sec:experiment}\nThe proposed SSL-MTPP algorithm has been evaluated under varying amount of labeled training data. Comparison has been performed with the baseline/native supervised MTPP model which follows a similar architecture as the proposed SSL-MTPP model without the unsupervised branch (described above). Details regarding the dataset, protocols, and results are given below. 
\n\n\n\\subsection{Dataset and Protocol}\n\nExperiments have been performed on the Retweet dataset \cite{zhao2015seismic}, which is formed through the Seismic dataset. The dataset contains multiple sequences of retweets, where each sequence corresponds to information regarding the retweets on a particular tweet. Each sequence contains information regarding the event time (retweet) and the corresponding marker information. Here, the marker refers to the type of user (based on the number of followers) who has retweeted. Three categories of marker are provided: (i) normal user, (ii) influencer user, and (iii) celebrity user. The marker information is defined based on the number of followers (degree) of a given user: (i) degree lower than the median (normal user), (ii) degree higher than or equal to the median but less than the $95^{th}$ percentile (influencer user), and (iii) degree higher than or equal to the $95^{th}$ percentile (celebrity user). The dataset consists of over two million events with an average sequence length of 209 events. The dataset has an imbalanced class distribution: 50.6\% of event markers are normal users, 45\% are influencer users, while only 4.4\% are celebrity users. For experiments, data pertaining to 1.4M events was used for training, while the remaining 60K events formed the test set.\n\nTable \ref{tab:protocol} presents the six protocols used to evaluate the proposed SSL-MTPP algorithm. The training set (consisting of 1.4M events) is split into a labeled set and an unlabeled set for each protocol. The labeled set contains data varying from 10K (P-1) to 0.7M (P-6), while the remaining data forms the unlabeled set. 
For experiments, the proposed SSL-MTPP algorithm is trained for each protocol, and comparison has been performed with the baseline supervised MTPP model trained in the traditional supervised manner with labeled data only.\n\n\n\\begin{table*}[!t]\n \\centering\n \\caption{Performance of the proposed semi-supervised learning algorithm with varying amount of labeled data during training.}\n \\begin{tabular}{|c|c||l|c|c|c|}\n \\hline\n \\textbf{Protocol} & \\textbf{Labeled Data} & \\textbf{Model} & \\textbf{Avg. Precision (\\%)} & \\textbf{Macro-F1 (\\%)} & \\textbf{Micro-F1 (\\%)} \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-1} & \\multirow{2}{*}{10K} & Native Supervised MTPP & 38.84 & 39.40 & 58.98 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 39.10 & 39.48 & 59.40\\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-2} & \\multirow{2}{*}{20K} & Native Supervised MTPP & 38.77 & 39.52 & 58.78 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 67.95 & 40.77 & 59.50 \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-3} & \\multirow{2}{*}{30K} & Native Supervised MTPP & 43.92 & 40.26 & 58.03 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 68.07 & 40.72 & 59.56\\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-4} & \\multirow{2}{*}{50K} & Native Supervised MTPP & 44.91 & 40.74 & 57.61 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 68.35 & 40.71 & 59.59 \\\\\n \\hline\n \\hline\n \\multirow{2}{*}{P-5} & \\multirow{2}{*}{140K (10\\%)} & Native Supervised MTPP & 45.08 & 37.88 & 56.86 \\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP & 68.42 & 40.73 & 59.79 \\\\\n \\hline\n \\hline\n\n \\multirow{2}{*}{P-6} & \\multirow{2}{*}{0.7M (50\\%)} & Native Supervised MTPP & 66.74 & 40.47 & 57.76\\\\\n \\cline{3-6}\n & & Proposed Semi-Supervised MTPP& 69.49 & 40.09 & 59.22 \\\\\n \\hline\n\n \\end{tabular}\n \\label{tab:res}\n\\end{table*}\n\n\n\\subsection{Results and Analysis}\n\nTable \\ref{tab:res} presents the results obtained on the Retweet dataset. 
The proposed SSL-MTPP algorithm has been evaluated on different protocols containing varying amounts of labeled data (Table \ref{tab:protocol}). In the literature, most of the research has focused on reporting the Macro-F1 and Micro-F1 performance metrics. As part of this research, we observe that these might not be the most appropriate metrics to judge a model's performance, especially under the scenario of imbalanced per-class data. To this end, along with the Macro-F1 and Micro-F1 values, we also report the average precision of each model. \n\nAs can be observed from Table \ref{tab:res}, limited variation is observed for the Macro-F1 and Micro-F1 values for the proposed and native supervised MTPP models. For the SSL-MTPP model, the Micro-F1 values lie in the range of $59.22\% - 59.59\%$, regardless of the amount of labeled data, and the Macro-F1 values lie in the range of $39.48\% - 40.77\%$, thus demonstrating limited variation. On the other hand, the average precision lies in the range of $39.10\% - 69.49\%$ with varying labeled data. Similar behavior is observed for the native supervised MTPP model as well, where limited variations are observed for the Macro-F1 and Micro-F1 values, while a higher range is observed for the average precision metric. The consistent behavior across the two models suggests that average precision is a better metric for comparing performance in the setup of imbalanced testing data across classes. \n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=3.3in]{barPrec.png}\n \\caption{Average precision of the proposed SSL-MTPP model and the native supervised MTPP model on different protocols. The proposed algorithm presents improved performance across protocols. }\n \\label{fig:prec}\n\\end{figure}\n\n\nFigure \ref{fig:prec} presents the average precision (\%) of the proposed SSL-MTPP model and the native supervised MTPP model for six protocols. 
Across the protocols, the proposed SSL-MTPP model demonstrates improved performance as compared to the baseline model. In P-1, where only 10K labeled data is available, the SSL-MTPP model obtains an average precision of 39.10\\%, presenting an improvement over the baseline model (38.84\\%). Larger improvements are observed for P-2 to P-5, where at least 20K labeled data is available for training. For example, in P-4, the SSL-MTPP model obtains an average precision of 68.35\\% demonstrating an improvement of almost 24\\% as compared to the baseline model. Improvement is also observed for the Macro-F1 and Micro-F1 values across protocols. Further, relatively less improvement is seen when 0.7M labeled data is available for training (P-6), where the proposed SSL-MTPP model achieves an average precision of 69.49\\% (as compared to 66.74\\% of the baseline model), thus suggesting higher improvement when limited training data is available for training. The above behavior appears intuitive in nature as well, since as the amount of labeled data increases, the baseline model is able to learn better, thus reducing the gap in improvement.\n\n\\begin{figure}[t]\n \\centering \n \\includegraphics[width=3.3in]{bar.png}\n \\caption{Effect of varying lambda values ($\\lambda$ in Equation \\ref{eq:sum}) on the average precision performance. Experiments have been performed with the SSL-MTPP algorithm for P-3 (30K labeled training data).}\n \\label{fig:bar}\n\\end{figure}\n\nExperiments have also been performed to analyze the effect of the weight parameter ($\\lambda$ in Equation \\ref{eq:sum}) for understanding the effect of the fusion of the supervised and unsupervised representations. Figure \\ref{fig:bar} presents the average precision obtained on P-3 (30K labeled training data) with varying $\\lambda$ values. Best performance is obtained with a value of $0.1$, while a drop in performance is seen with a very small ($0.001$) and a very large ($10$) value. 
A small value reduces the contribution of the unsupervised representation, while a very large value offsets the contribution of the supervised embedding thus resulting in a drop in performance. \n\n\n\n\n\n\n\n", "Descriptive_question1": "What is the average precision for the Proposed Semi-Supervised MTPP model under protocol P-1 in table_2?", "Descriptive_question2": "How much labeled data is used in protocol P-5 as shown in table_2?", "Reasoning_question1": "Why does the average precision metric show a wider range of variation compared to Macro-F1 and Micro-F1 metrics across different protocols in table_2?", "Reasoning_question2": "How does the performance improvement of the Proposed Semi-Supervised MTPP model over the Native Supervised MTPP model change with increasing labeled data in table_2?", "Descriptive_answer1": "39.10%", "Descriptive_answer2": "140K", "Reasoning_answer1": "The average precision metric shows a wider range of variation compared to Macro-F1 and Micro-F1 metrics across different protocols in the table because it is more sensitive to the imbalanced nature of the dataset. As noted in the context, the Retweet dataset has an imbalanced class distribution (50.6% normal users, 45% influencer users, and only 4.4% celebrity users), which impacts performance metrics differently. 
Macro-F1 and Micro-F1, which are commonly used metrics, tend to mask per-class imbalances by averaging performance across classes or focusing on overall performance, resulting in limited variation (e.g., Macro-F1 for SSL-MTPP ranges from 39.48% to 40.77%, and Micro-F1 from 59.22% to 59.59%). In contrast, average precision directly reflects the model's ability to handle such imbalances by prioritizing correct predictions for minority classes, leading to a broader range of values (e.g., 39.10% to 69.49% for SSL-MTPP). This sensitivity to class imbalance and the increasing availability of labeled data for learning contributes to the wider variation observed in average precision.", "Reasoning_answer2": "The performance improvement of the Proposed Semi-Supervised MTPP model over the Native Supervised MTPP model varies with increasing labeled data, showing a trend of higher improvement with limited data and diminishing gains as labeled data increases. Starting with P-1 (10K labeled data), the SSL-MTPP model achieves a modest improvement in average precision (39.10% vs. 38.84%, a difference of 0.26%). This improvement significantly increases in P-2 (20K labeled data) with a jump to 67.95% vs. 38.77% (a difference of 29.18%), indicating that semi-supervised learning benefits greatly from even a small increase in labeled data when starting from a low base. The trend continues in P-3 (30K), P-4 (50K), and P-5 (140K), with improvements of 24.15% (68.07% vs. 43.92%), 23.44% (68.35% vs. 44.91%), and 23.34% (68.42% vs. 45.08%) respectively, showing substantial but slightly decreasing relative gains. However, by P-6 (0.7M labeled data), the improvement drops to just 2.75% (69.49% vs. 66.74%), suggesting that as labeled data becomes abundant, the baseline supervised model learns more effectively, reducing the advantage of the semi-supervised approach. 
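The per-protocol improvement gaps quoted above can be recomputed directly from the reported average-precision numbers:

```python
# Average precision (%) per protocol: (SSL-MTPP, native supervised baseline),
# taken from the figures quoted in the surrounding text.
results = {
    "P-1": (39.10, 38.84),
    "P-2": (67.95, 38.77),
    "P-3": (68.07, 43.92),
    "P-4": (68.35, 44.91),
    "P-5": (68.42, 45.08),
    "P-6": (69.49, 66.74),
}

# Gap in percentage points; it peaks once >= 20K labels are available and
# then shrinks sharply at P-6 (0.7M labels), where the baseline catches up.
gaps = {p: round(ssl - base, 2) for p, (ssl, base) in results.items()}
print(gaps)
```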
This behavior is intuitive, as semi-supervised learning leverages unlabeled data more effectively when labeled data is scarce, but its relative benefit diminishes with sufficient labeled data." }, { "paper_id": "2212.10452.json", "table_id": "table_1", "table_content": "\\begin{table}[h]\n\t\\centering\n\t\\caption{Quantitative sequence database}\n\t\\label{table1}\n\t\\begin{tabular}{|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{SID} & \\textbf{Quantitative sequence} & \\textbf{SU} \\\\\n\t\t\\hline \n\t\t\\(s_{1}\\) & $<$[(\\textit{b}, 2)(\\textit{d}, 1)], [(\\textit{g}, 1)], [(\\textit{f}, 1)]$>$ & 11 \\\\ \n\t\t\\hline\n\t\t\\(s_{2}\\) & $<$[(\\textit{d}, 1)], [(\\textit{g}, 1)]$>$ & 2 \\\\ \n\t\t\\hline \n\t\t\\(s_{3}\\) & $<$[(\\textit{a}, 1)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{c}, 2)], [(\\textit{d}, 1)]$>$ & 12 \\\\\n\t\t\\hline \n\t\t\\(s_{4}\\) & $<$[(\\textit{a}, 2)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\t\\(s_{5}\\) & $<$[(\\textit{d}, 3)], [(\\textit{b}, 1)], [(\\textit{a}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\\end{tabular}\n\\end{table}", "caption": "Quantitative sequence database", "label": "table1", "section_info": "3 Preliminaries and Problem Statement\n\\section{Preliminaries and Problem Statement}\n\\label{sec:preliminaries}\n\nIn this section, we first introduce and define the basic notations and concepts related to utility occupancy mining on sequence data. The problem of high utility-occupancy sequential pattern mining is then formulated.\n\n\\subsection{Notations and Concepts}\n\nGiven a finite set $I$ = \\{$i_{1}$, $i_{2}$, $\\cdots$, $i_{m}$\\} containing $m$ distinct items, a quantitative itemset $c$  is a non-empty set and can be defined as $c$ = [($i_1$, $q_1$)($i_2$, $q_2$)$\\cdots$($i_n$, $q_n$)], where $q_j$ is the quality value for $i_j$. 
Each item and its associated quality (internal utility) together comprise the elements of the quantitative itemset $c$. The items in the quantitative itemset $c$ form a subset of $I$. An itemset $w$ is the non-empty set obtained by dropping the quality information from $c$; in this case we say that $w$ matches $c$, denoted as $w$ $\\sim$ $c$. To simplify the description of some definitions in this paper, we assume that all items in a quantitative itemset are sorted alphabetically. A quantitative sequence is denoted as $s$ and defined as $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$. $s$ is an ordered list containing one or more quantitative itemsets, and the order in which the quantitative itemsets appear can represent the chronological relationships of realistic applications. Similarly, $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ denotes $s$ with the quantity information removed; we say that $v$ matches $s$, denoted as $v$ $\\sim$ $s$. For the sake of brevity, a quantitative itemset and a quantitative sequence are also termed a $q$-itemset and a $q$-sequence, respectively. A quantitative sequence database $\\mathcal{D}$ is a collection of triples $<$\\textit{SID}, \\textit{qs}, \\textit{SU}$>$, where \\textit{qs} is a $q$-sequence, \\textit{SID} is the unique identifier of \\textit{qs}, and \\textit{SU} is the total utility of \\textit{qs}.
Furthermore, each item $i$ $\\in$ $\\mathcal{D}$ has its own profit value (called external utility), denoted as $p$($i$).\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Quantitative sequence database}\n\t\\label{table1}\n\t\\begin{tabular}{|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{SID} & \\textbf{Quantitative sequence} & \\textbf{SU} \\\\\n\t\t\\hline \n\t\t\\(s_{1}\\) & $<$[(\\textit{b}, 2)(\\textit{d}, 1)], [(\\textit{g}, 1)], [(\\textit{f}, 1)]$>$ & 11 \\\\ \n\t\t\\hline\n\t\t\\(s_{2}\\) & $<$[(\\textit{d}, 1)], [(\\textit{g}, 1)]$>$ & 2 \\\\ \n\t\t\\hline \n\t\t\\(s_{3}\\) & $<$[(\\textit{a}, 1)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{c}, 2)], [(\\textit{d}, 1)]$>$ & 12 \\\\\n\t\t\\hline \n\t\t\\(s_{4}\\) & $<$[(\\textit{a}, 2)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\t\\(s_{5}\\) & $<$[(\\textit{d}, 3)], [(\\textit{b}, 1)], [(\\textit{a}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[h]\n\t\\caption{External utility table}\n\t\\label{table2}\n\t\\centering\n\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\\hline\n\t\t\\textbf{Item}\t & \\textit{a}\t& \\textit{b}\t& \\textit{c}\t& \\textit{d}\t& \\textit{e}\t& \\textit{f} & \\textit{g} \\\\ \\hline \n\t\t\\textbf{\\textit{Unit utility}}\t& 3 & 2 & 2 & 1 & 3 & 5 & 1 \\\\ \\hline\n\t\\end{tabular}\n\\end{table}\n\nThe example $q$-sequence database and external utility table that will be used in the following are shown in Tables \\ref{table1} and \\ref{table2}. We can see that this database has five $q$-sequences and seven different items. [($b$, 2)($d$, 1)] is the first $q$-itemset in $q$-sequence $s_1$, containing two items, $b$ and $d$. According to Table \\ref{table2}, the external utilities of items $b$ and $d$ are 2 and 1, respectively.
In addition, $<$[$b$$d$]$>$ matches $<$[($b$, 2)($d$, 1)]$>$.\n\n\n\\begin{definition}\n\t\\rm For an item $i$ in a $q$-itemset $c$, its utility can be denoted as $u$($i$, $c$) and is defined as $u$($i$, $c$) = $q$($i$, $c$) $\\times$ $p$($i$), where $q$($i$, $c$) is the internal utility of $i$ in $c$ and $p$($i$) is the external utility of $i$. We use $u$($c$) to denote the sum of utilities of all items in $c$, and it can be defined as $u$($c$) = $\\sum\\limits_{i \\in c}u(i, c)$. As for a $q$-sequence $s$, its utility can be denoted as $u$($s$) and is defined as $u$($s$) = $\\sum\\limits_{c \\in s}u(c)$. Moreover, given a $q$-sequence database $\\mathcal{D}$, its utility can be denoted as $u$($\\mathcal{D}$) and is defined as $u$($\\mathcal{D}$) = $\\sum\\limits_{s \\in \\mathcal{D}}u(s)$.\n\\end{definition}\n\nFor example, the utility of item $b$ in the first $q$-itemset $c_1$ of $s_1$ is equal to 4, because $u$($b$, $c_1$) = 2 $\\times$ 2 = 4; the utilities of the three $q$-itemsets in $s_1$ are 5, 1, and 5, respectively. Thus, the \\textit{SU} of $s_1$ can be calculated as $u$($s_1$) = 5 + 1 + 5 = 11; the total utility of this example database $\\mathcal{D}$ is calculated as $u$($\\mathcal{D}$) = $\\sum_{s_i \\in \\mathcal{D}}$ $u$($s_i$) = 11 + 2 + 12 + 13 + 13 = 51.\n\n\\begin{definition}\n\t\\rm Given two itemsets $w$ and $w^\\prime$, if all the items of $w$ appear in $w^\\prime$, we say that $w^\\prime$ contains $w$, and is denoted as $w$ $\\subseteq$ $w^\\prime$. Similarly, for two $q$-itemsets $c$ and $c^\\prime$, if all the items of $c$ appear in $c^\\prime$ and have the same quality, we say that $c^\\prime$ contains $c$, which is denoted as $c$ $\\subseteq$ $c^\\prime$.\n\\end{definition}\n\nFor instance, the itemset [$c$$d$$e$] contains the itemset [$c$$e$]. And the $q$-itemset [($c$, 4)($e$, 2)] is contained in [($c$, 4)($d$, 3)($e$, 2)], but not in [($c$, 3)($e$, 3)].
This is because the quality of $c$ differs between the two $q$-itemsets [($c$, 3)($e$, 3)] and [($c$, 4)($d$, 3)($e$, 2)].\n\n\n\\begin{definition}\n\t\\rm Given two sequences $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ and $v^\\prime$ = $<$$w^\\prime_1$, $w^\\prime_2$, $\\cdots$, $w^\\prime_{l^\\prime}$$>$ with $l$ $\\le$ $l^\\prime$, if there exists an integer list 1 $\\le$ $k_1$ $\\textless$ $k_2$ $\\textless$ $\\cdots$ $\\textless$ $k_l$ $\\le$ $l^\\prime$ such that $w_j$ $\\subseteq$ $w^\\prime_{k_j}$ for 1 $\\le$ $j$ $\\le$ $l$, we say that $v^\\prime$ contains $v$, denoted as $v$ $\\subseteq$ $v^\\prime$. Likewise, for two $q$-sequences $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$ and $s^\\prime$ = $<$$c^\\prime_1$, $c^\\prime_2$, $\\cdots$, $c^\\prime_{l^\\prime}$$>$, $s^\\prime$ contains $s$ if there exists an integer list 1 $\\le$ $k_1$ $\\textless$ $k_2$ $\\textless$ $\\cdots$ $\\textless$ $k_l$ $\\le$ $l^\\prime$ such that $c_j$ $\\subseteq$ $c^\\prime_{k_j}$ for 1 $\\le$ $j$ $\\le$ $l$, which is denoted as $s$ $\\subseteq$ $s^\\prime$. In this paper, if a sequence $t$ matches a $q$-sequence $s_k$ that satisfies $s_k$ $\\subseteq$ $s$, we also write $t$ $\\subseteq$ $s$ instead of $t$ $\\sim$ $s_k$ $\\land$ $s_k$ $\\subseteq$ $s$.\n\\end{definition}\n\nFor example, the $q$-sequence $s_1$ contains $<$[($b$, 2)($d$, 1)]$>$ and $<$[($g$, 1)], [($f$, 1)]$>$, while $<$[($b$, 2)($d$, 2)]$>$ and $<$[($g$, 1)($f$, 1)]$>$ are not contained in $s_1$. \n\n\\begin{definition}\n\t\\rm A sequence $t$ may have multiple matches in a $q$-sequence $s$. We use $u$($t$, $s$) to denote the utility of $t$ in $s$, defined as $u$($t$, $s$) = \\textit{max}\\{$u$($s^\\prime$) $\\vert$ $t$ $\\sim$ $s^\\prime$ $\\land$ $s^\\prime$ $\\subseteq$ $s$\\}.
In addition, its support can be denoted as \\textit{sup}($t$) and is defined as \\textit{sup}($t$) = $\\vert$\\{$s$ $\\vert$ $s$ $\\in$ $\\mathcal{D}$ $\\land$ $t$ $\\subseteq$ $s$\\}$\\vert$, that is, the number of $q$-sequences in $\\mathcal{D}$ that contain $t$.\n\\end{definition}\n\nFor example, the sequence $t$ = $<$[$a$$b$], [$c$]$>$ has two matches in the $q$-sequence $s_3$, and so its utility in $s_3$ can be calculated as $u$($<$[$a$$b$], [$c$]$>$, $s_3$) = \\textit{max}\\{$u$($<$[($a$, 1)($b$, 1)], [($c$, 1)]$>$), $u$($<$[($a$, 1)($b$, 1)], [($c$, 2)]$>$)\\} = \\textit{max}\\{7, 9\\} = 9. And $t$ has a support of 2 because both $s_3$ and $s_4$ contain a match of $t$.\n\nIn this paper, the concept of utility occupancy \\cite{gan2019huopm} is incorporated into sequence data. Utility occupancy is a flexible measure that can be used to identify patterns with a higher contribution in sequences. Since there is no previous work on this topic, we are the first to define the relevant concepts.\n\n\\begin{definition}\n\t\\rm In a $q$-sequence $s$, the utility occupancy of a sequence $t$, denoted as \\textit{uo}($t$, $s$), is defined as $\\textit{uo}(t, s)$ = $\\frac{u(t, s)}{u(s)}$. Note that $t$ may have more than one match in $s$.
Then the utility occupancy of $t$ at position $p$ in $s$ can be denoted as \\textit{uo}($t$, $s$, $p$) and is defined as follows.\n\t$$\n\t\\begin{aligned}\n\t\\textit{uo}(t, s, p) = \\frac{\\textit{max}\\{u(t, s^\\prime) \\vert t \\sim s^\\prime \\land s^\\prime \\subseteq s \\land s^\\prime \\textrm{ ends at position } p\\}}{u(s)}.\n\t\\end{aligned}\n\t$$\t\n\tThe total utility occupancy of $t$ in a $q$-sequence database $\\mathcal{D}$, denoted as \\textit{uo}($t$), is defined as follows.\t\n\t$$\n\t\\begin{aligned}\n\t\\textit{uo}(t) = \\frac{\\sum\\limits_{t \\subseteq s \\land s \\in \\mathcal{D}}\\textit{uo}(t, s)}{\\textit{sup}(t)}.\n\t\\end{aligned}\n\t$$\n\\end{definition}\n\nFor example, the utility occupancies of the sequence $<$[$a$], [$c$]$>$ in $s_3$, $s_4$, and $s_5$ are \\textit{uo}($<$[$a$], [$c$]$>$, $s_3$) = \\textit{max}(\\{5, 7\\}) / 12 = 0.583, \\textit{uo}($<$[$a$], [$c$]$>$, $s_4$) = 8 / 13 = 0.615, and \\textit{uo}($<$[$a$], [$c$]$>$, $s_5$) = 5 / 13 = 0.385, respectively. Thus, the total utility occupancy of the sequence $<$[$a$], [$c$]$>$ in the entire $\\mathcal{D}$ is equal to \\textit{uo}($<$[$a$], [$c$]$>$) = (0.583 + 0.615 + 0.385) / 3 = 0.528.\n\n\\begin{definition}\n\t\\rm In a $q$-sequence $s$ with $l$ $q$-itemsets, the remaining utility occupancy of a sequence $t$ at position $p$ can be denoted as \\textit{ruo}($t$, $s$, $p$), and is defined as follows.\n\t$$\n\t\\begin{aligned}\n\t\\textit{ruo}(t, s, p) = \\frac{u(s_{\\succ p})}{u(s)},\n\t\\end{aligned}\n\t$$\n\twhere $s_{\\succ p}$ denotes the remainder of $s$ after position $p$, i.e., the items that follow the matched items in the $p$-th $q$-itemset together with the $q$-itemsets $c_{p+1}, \\cdots, c_l$.\n\\end{definition}\n\n\n\n\\begin{definition}\n\t\\rm Given two thresholds, a minimum support threshold \\textit{minsup} (0 $\\textless$ \\textit{minsup} $\\le$ 1) and a minimum utility occupancy threshold \\textit{minuo} (0 $\\textless$ \\textit{minuo} $\\le$ 1), a sequential pattern $t$ with high support and high utility occupancy in a $q$-sequence database $\\mathcal{D}$ is called a HUOSP.
That is, $t$ satisfies \\textit{sup}($t$) $\\ge$ \\textit{minsup} and \\textit{uo}($t$) $\\ge$ \\textit{minuo}.\n\\end{definition}\n\nFor example, the remaining utility occupancy of the sequence $<$[$a$], [$c$]$>$ in $s_3$ at position 2 is equal to \\textit{ruo}($<$[$a$], [$c$]$>$, $s_3$, 2) = (4 + 1) / 12 = 0.417. And the remaining utility occupancy of the sequence $<$[$a$]$>$ in $s_4$ at position 1 is equal to \\textit{ruo}($<$[$a$]$>$, $s_4$, 1) = (2 + 2 + 3) / 13 = 0.538. With \\textit{minsup} set to 2 and \\textit{minuo} set to 0.4, all found HUOSPs are shown in Table \\ref{table_huosp}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{All found HUOSPs in the example database $\\mathcal{D}$}\n\t\\label{table_huosp}\n\t\\begin{tabular}{|c|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{ID} & \\textbf{HUOSP} & \\textbf{Support} & \\textbf{Utility occupancy}\\\\\n\t\t\\hline \n\t\t\\(p_{1}\\) & $<$[$a$$b$]$>$ & 2 & 0.516 \\\\ \n\t\t\\hline\n\t\t\\(p_{2}\\) & $<$[$a$$b$], [$c$]$>$ & 2 & 0.76 \\\\ \n\t\t\\hline\n\t\t\\(p_{3}\\) & $<$[$a$], [$c$]$>$ & 3 & 0.528 \\\\ \n\t\t\\hline\n\t\t\\(p_{4}\\) & $<$[$a$], [$c$], [$e$]$>$ & 2 & 0.731 \\\\ \n\t\t\\hline\n\t\t\\(p_{5}\\) & $<$[$a$], [$e$]$>$ & 2 & 0.577 \\\\ \n\t\t\\hline\n\t\t\\(p_{6}\\) & $<$[$b$], [$c$], [$e$]$>$ & 2 & 0.538 \\\\ \n\t\t\\hline\n\t\t\\(p_{7}\\) & $<$[$d$], [$g$]$>$ & 2 & 0.59 \\\\ \n\t\t\\hline\n\t\\end{tabular}\n\\end{table}\n\n\n\\subsection{Problem Statement}\n\nBased on the above definitions and concepts, we formulate the problem of mining HUOSPs as follows.
Given a quantitative sequence database $\\mathcal{D}$, a utility table with external utilities for each item, and two thresholds \\textit{minsup} (0 $\\textless$ \\textit{minsup} $\\le$ 1) and \\textit{minuo} (0 $\\textless$ \\textit{minuo} $\\le$ 1), the goal of high utility-occupancy sequential pattern mining is to discover all HUOSPs, i.e., all sequential patterns whose support and utility occupancy are no less than \\textit{minsup} and \\textit{minuo}, respectively.\n\n\n", "Descriptive_question1": "What is the total utility (SU) of sequence s1 in table_1?", "Descriptive_question2": "Which sequence in table_1 has the lowest total utility (SU)?", "Reasoning_question1": "Why might sequence s2 in table_1 have a significantly lower total utility (SU) compared to other sequences?", "Reasoning_question2": "What trends can be observed in the quantitative sequences of table_1 regarding the frequency of item 'a' across different sequences?", "Descriptive_answer1": "11", "Descriptive_answer2": "s2", "Reasoning_answer1": "Sequence s2 has a significantly
lower total utility (SU) of 2 compared to other sequences in table_1 because it contains only two quantitative itemsets with minimal internal utilities and items that likely have low external utilities. Looking at the sequence $<$[(\textit{d}, 1)], [(\textit{g}, 1)]$>$, each itemset has only one item with a quantity of 1. From the external utility table, item 'd' has a unit utility of 1, and item 'g' also has a unit utility of 1. Therefore, the utility for each itemset is 1 * 1 = 1, resulting in a total utility of 1 + 1 = 2. In contrast, other sequences like s1, s3, s4, and s5 have more itemsets and/or items with higher quantities or higher external utilities (e.g., item 'a' with unit utility 3, or item 'f' with unit utility 5), leading to much higher total utilities ranging from 11 to 13. The limited number of itemsets and low utility values of items in s2 directly contribute to its lower SU.", "Reasoning_answer2": "Analyzing the quantitative sequences in table_1, we can observe specific trends regarding the frequency and placement of item 'a' across the sequences. First, item 'a' appears in three out of the five sequences: s3, s4, and s5, indicating it is not as ubiquitous as some other items but still significant. In s3, 'a' appears in the first itemset with a quantity of 1, alongside 'b'. In s4, it also appears in the first itemset but with a higher quantity of 2, again with 'b', suggesting a potential pattern or association with 'b' in initial itemsets. In s5, 'a' appears later in the sequence in the third itemset with a quantity of 1, indicating variability in its position. Comparing this to other items like 'd', which appears in four sequences (s1, s2, s3, s5), or 'b', which appears in four sequences (s1, s3, s4, s5), 'a' is less frequent but often positioned early in the sequence when present (s3 and s4). 
This trend might suggest that 'a' is associated with initiating sequences or specific contexts in combination with other items like 'b', though its presence is not consistent across all sequences, hinting at selective occurrence based on specific conditions or patterns in the data." }, { "paper_id": "2212.10452.json", "table_id": "table_2", "table_content": "\\begin{table}[h]\n\t\\caption{External utility table}\n\t\\label{table2}\n\t\\centering\n\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\\hline\n\t\t\\textbf{Item}\t & \\textit{a}\t& \\textit{b}\t& \\textit{c}\t& \\textit{d}\t& \\textit{e}\t& \\textit{f} & \\textit{g} \\\\ \\hline \n\t\t\\textbf{\\textit{Unit utility}}\t& 3 & 2 & 2 & 1 & 3 & 5 & 1 \\\\ \\hline\n\t\\end{tabular}\n\\end{table}", "caption": "External utility table", "label": "table2", "section_info": "3 Preliminaries and Problem Statement\n\\section{Preliminaries and Problem Statement}\n\\label{sec:preliminaries}\n\nIn this section, we first introduce and define the basic notations and concepts related to utility occupancy mining on sequence data. The problem of high utility-occupancy sequential pattern mining is then formulated.\n\n\\subsection{Notations and Concepts}\n\nGiven a finite set $I$ = \\{$i_{1}$, $i_{2}$, $\\cdots$, $i_{m}$\\} containing $m$ distinct items, a quantitative itemset $c$ is a non-empty set and can be defined as $c$ = [($i_1$, $q_1$)($i_2$, $q_2$)$\\cdots$($i_n$, $q_n$)], where $q_j$ is the quality value for $i_j$. Each item and its associated quality (internal utility) together comprise the elements of the quantitative itemset $c$. The items in the quantitative itemset $c$ form a subset of $I$. An itemset $w$ is the non-empty set obtained by dropping the quality information from $c$; in this case we say that $w$ matches $c$, denoted as $w$ $\\sim$ $c$. To simplify the description of some definitions in this paper, we assume that all items in a quantitative itemset are sorted alphabetically.
A quantitative sequence is denoted as $s$ and defined as $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$. $s$ is an ordered list containing one or more quantitative itemsets, and the order in which the quantitative itemsets appear can represent the chronological relationship of realistic applications. $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ denotes $s$ without its quantity information; in this case, $v$ is said to match $s$, which is denoted as $v$ $\\sim$ $s$. For the sake of illustration, quantitative itemset and quantitative sequence can also be termed as $q$-itemset and $q$-sequence. Regarding a quantitative sequence database $\\mathcal{D}$, it is a collection of triples $<$\\textit{SID}, \\textit{qs}, \\textit{SU}$>$, where \\textit{qs} is a $q$-sequence, \\textit{SID} is the unique identifier of \\textit{qs}, and \\textit{SU} is the total utility of \\textit{qs}. Furthermore, each item $i$ such that $i$ $\\in$ $\\mathcal{D}$ has its own profit value (called external utility), and can be denoted as $p$($i$).\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Quantitative sequence database}\n\t\\label{table1}\n\t\\begin{tabular}{|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{SID} & \\textbf{Quantitative sequence} & \\textbf{SU} \\\\\n\t\t\\hline \n\t\t\\(s_{1}\\) & $<$[(\\textit{b}, 2)(\\textit{d}, 1)], [(\\textit{g}, 1)], [(\\textit{f}, 1)]$>$ & 11 \\\\ \n\t\t\\hline\n\t\t\\(s_{2}\\) & $<$[(\\textit{d}, 1)], [(\\textit{g}, 1)]$>$ & 2 \\\\ \n\t\t\\hline \n\t\t\\(s_{3}\\) & $<$[(\\textit{a}, 1)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{c}, 2)], [(\\textit{d}, 1)]$>$ & 12 \\\\\n\t\t\\hline \n\t\t\\(s_{4}\\) & $<$[(\\textit{a}, 2)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\t\\(s_{5}\\) & $<$[(\\textit{d}, 3)], [(\\textit{b}, 1)], [(\\textit{a}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[h]\n\t\\caption{External utility 
table}\n\t\\label{table2}\n\t\\centering\n\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\\hline\n\t\t\\textbf{Item}\t & \\textit{a}\t& \\textit{b}\t& \\textit{c}\t& \\textit{d}\t& \\textit{e}\t& \\textit{f} & \\textit{g} \\\\ \\hline \n\t\t\\textbf{\\textit{Unit utility}}\t& 3 & 2 & 2 & 1 & 3 & 5 & 1 \\\\ \\hline\n\t\\end{tabular}\n\\end{table}\n\nThe example $q$-sequence database and external utility table that will be used in the following are shown in Tables \\ref{table1} and \\ref{table2}. We can see that this database has five $q$-sequences and seven different items. [($b$, 2)($d$, 1)] is the first $q$-itemset in $q$-sequence $s_1$, containing two items, $b$ and $d$. According to Table \\ref{table2}, the external utilities of items $b$ and $d$ are 2 and 1, respectively. In addition, $<$[$b$$d$]$>$ matches $<$[($b$, 2) ($d$, 1)]$>$.\n\n\n\\begin{definition}\n\t\\rm For an item $i$ in a $q$-itemset $c$, its utility can be denoted as $u$($i$, $c$) and is defined as $u$($i$, $c$) = $q$($i$, $c$) $\\times$ $p$($i$, $c$) where $q$($i$, $c$) is the internal utility of $i$ in $c$ and $p$($i$, $c$) is the external utility of $i$. We use $u$($c$) to denote the sum of utilities of all items in $c$, and it can be defined as $u$($c$) = $\\sum\\limits_{i \\in c}u(i, c)$. As for a $q$-sequence $s$, its utility can be denoted as $u$($s$) and is defined as $u$($s$) = $\\sum\\limits_{c \\in s}u(c)$. Moreover, given a $q$-sequence database $\\mathcal{D}$, its utility can be denoted as $u$($\\mathcal{D}$) and is defined as $u$($\\mathcal{D}$) = $\\sum\\limits_{s \\in \\mathcal{D}}u(s)$.\n\\end{definition}\n\nFor example, the utility of item $b$ in $s_1$ is equal to 4, because $u$($b$, $s_1$) = 2 $\\times$ 2 = 4; the utilities of the three $q$-itemsets in $s_1$ are 5, 1, and 5, respectively. 
Thus, the \\textit{SU} of $s_1$ can be calculated as $u$($s_1$) = 5 + 1 + 5 = 11; the total utility of this example database $\\mathcal{D}$ is calculated as $u$($\\mathcal{D}$) = $\\sum_{s_i \\in \\mathcal{D}}$ $u$($s_i$) = 11 + 2 + 12 + 13 + 13 = 51.\n\n\\begin{definition}\n\t\\rm Given two itemsets $w$ and $w^\\prime$, if all the items of $w$ appear in $w^\\prime$, we say that $w^\\prime$ contains $w$, which is denoted as $w$ $\\subseteq$ $w^\\prime$. Similarly, for two $q$-itemsets $c$ and $c^\\prime$, if all the items of $c$ appear in $c^\\prime$ and have the same quality, we say that $c^\\prime$ contains $c$, which is denoted as $c$ $\\subseteq$ $c^\\prime$.\n\\end{definition}\n\nFor instance, the itemset [$c$$d$$e$] contains the itemset [$c$$e$]. The $q$-itemset [($c$, 4)($e$, 2)] is contained in [($c$, 4)($d$, 3)($e$, 2)], but not in [($c$, 3)($e$, 3)], because the quality of $c$ in the two $q$-itemsets [($c$, 3)($e$, 3)] and [($c$, 4)($d$, 3)($e$, 2)] is different.\n\n\n\\begin{definition}\n\t\\rm Given two sequences $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ and $v^\\prime$ = $<$$w^\\prime_1$, $w^\\prime_2$, $\\cdots$, $w^\\prime_{l^\\prime}$$>$, if there exists an integer list (1 $\\le$ $k_1$ $\\le$ $k_2$ $\\le$ $\\cdots$ $\\le$ $k_l$ $\\le$ $l^\\prime$) such that $w_j$ $\\subseteq$ $w^\\prime_{k_j}$, 1 $\\le$ $j$ $\\le$ $l$, we say that $v^\\prime$ contains $v$, which is denoted as $v$ $\\subseteq$ $v^\\prime$. Similarly, for two $q$-sequences $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$ and $s^\\prime$ = $<$$c^\\prime_1$, $c^\\prime_2$, $\\cdots$, $c^\\prime_{l^\\prime}$$>$, $s^\\prime$ contains $s$ if there exists an integer list (1 $\\le$ $k_1$ $\\le$ $k_2$ $\\le$ $\\cdots$ $\\le$ $k_l$ $\\le$ $l^\\prime$) such that $c_j$ $\\subseteq$ $c^\\prime_{k_j}$, 1 $\\le$ $j$ $\\le$ $l$, which is denoted as $s$ $\\subseteq$ $s^\\prime$. 
In this paper, if a sequence $t$ matches a $q$-sequence $s_k$ and also satisfies $s_k$ $\\subseteq$ $s$, then this can also be denoted as $t$ $\\subseteq$ $s$ instead of $t$ $\\sim$ $s_k$ $\\land$ $s_k$ $\\subseteq$ $s$.\n\\end{definition}\n\nFor example, the $q$-sequence $s_1$ contains $<$[($b$, 2)($d$, 1)]$>$ and $<$[($g$, 1)], [($f$, 1)]$>$, while $<$[($b$, 2)($d$, 2)]$>$ and $<$[($g$, 1)($f$, 1)]$>$ are not contained in $s_1$. \n\n\\begin{definition}\n\t\\rm A sequence $t$ may have multiple matches in a $q$-sequence $s$. We use $u$($t$, $s$) to denote the actual utility of $t$ in $s$, defined as $u$($t$, $s$) = \\textit{max}\\{$u$($s^\\prime$) $\\vert$ $t$ $\\sim$ $s^\\prime$ $\\land$ $s^\\prime$ $\\subseteq$ $s$\\}. Additionally, the utility of $t$ in the $q$-sequence database $\\mathcal{D}$ can be denoted as $u$($t$) and is defined as $u$($t$) = $\\sum\\limits_{t \\subseteq s \\land s \\in \\mathcal{D}}u(t, s)$. In addition, its support can be denoted as \\textit{sup}($t$) and is defined as \\textit{sup}($t$) = $\\vert$ \\{$s$ $\\vert$ $s$ $\\in$ $\\mathcal{D}$ $\\land$ $t$ $\\subseteq$ $s$\\} $\\vert$, that is, the number of $q$-sequences of $\\mathcal{D}$ that contain $t$.\n\\end{definition}\n\nFor example, the sequence $t$ = $<$[$a$$b$], [$c$]$>$ has two matches in the $q$-sequence $s_3$, and so its utility in $s_3$ can be calculated as $u$($t$, $s_3$) = \\textit{max}\\{$u$($<$[($a$, 1)($b$, 1)], [($c$, 1)]$>$), $u$($<$[($a$, 1)($b$, 1)], [($c$, 2)]$>$)\\} = \\textit{max}\\{7, 9\\} = 9. And $t$ has a support of 2 because $s_3$ and $s_4$ both contain instances that $t$ matches.\n\nIn this paper, the concept of utility occupancy \\cite{gan2019huopm} is incorporated into sequence data. Utility occupancy is a flexible measure that can be used to identify patterns with a higher contribution in sequences. 
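The worked example above ($u$($<$[$a$$b$], [$c$]$>$, $s_3$) = max\{7, 9\} = 9 with support 2) can be checked with a short brute-force sketch. This is illustrative code, not the paper's algorithm: the dict-based encoding of $q$-itemsets and the `match_utilities` helper are assumptions made for the example.

```python
# Example q-sequence database from Table 1; each q-itemset is {item: quantity}.
DB = {
    "s1": [{"b": 2, "d": 1}, {"g": 1}, {"f": 1}],
    "s2": [{"d": 1}, {"g": 1}],
    "s3": [{"a": 1, "b": 1}, {"c": 1}, {"c": 2}, {"d": 1}],
    "s4": [{"a": 2, "b": 1}, {"c": 1}, {"e": 1}],
    "s5": [{"d": 3}, {"b": 1}, {"a": 1}, {"c": 1}, {"e": 1}],
}
# External utilities p(i) from Table 2.
P = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 3, "f": 5, "g": 1}

def match_utilities(pattern, seq):
    """Yield u(s') for every match s' of `pattern` (a list of item lists) in `seq`."""
    def rec(j, start):
        # j: next pattern itemset to place; start: first usable q-itemset of seq.
        if j == len(pattern):
            yield 0
            return
        for k in range(start, len(seq)):
            if all(item in seq[k] for item in pattern[j]):
                u_here = sum(seq[k][i] * P[i] for i in pattern[j])
                for rest in rec(j + 1, k + 1):
                    yield u_here + rest
    yield from rec(0, 0)

def u_t_s(pattern, seq):
    """u(t, s): maximum utility over all matches, or None if t is not contained."""
    us = list(match_utilities(pattern, seq))
    return max(us) if us else None

t = [["a", "b"], ["c"]]
per_seq = {sid: u_t_s(t, s) for sid, s in DB.items()}
support = sum(1 for v in per_seq.values() if v is not None)
print(per_seq["s3"], support)  # → 9 2
```

The max over \{7, 9\} comes from the two choices of the [$c$] itemset in $s_3$; only $s_3$ and $s_4$ contain $t$, giving support 2.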
Since there is no previous work on this topic, we are the first to define the relevant concepts.\n\n\\begin{definition}\n\t\\rm In a $q$-sequence $s$, the utility occupancy of a sequence $t$, denoted as \\textit{uo}($t$, $s$), is defined as $\\textit{uo}(t, s)$ = $\\frac{u(t, s)}{u(s)}$. Note that $t$ may have more than one match in $s$. Then the utility occupancy of $t$ at position $p$ in $s$, where the considered matches of $t$ end at the $p$-th $q$-itemset of $s$, can be denoted as \\textit{uo}($t$, $s$, $p$) and is defined as follows.\n\t$$\n\t\\begin{aligned}\n\t\\textit{uo}(t, s, p) = \\frac{\\textit{max}\\{u(s^\\prime) \\vert t \\sim s^\\prime \\land s^\\prime \\subseteq s\\}}{u(s)}.\n\t\\end{aligned}\n\t$$\t\n\tThe total utility occupancy of $t$ in a $q$-sequence database $\\mathcal{D}$, denoted as \\textit{uo}($t$), is defined as follows.\t\n\t$$\n\t\\begin{aligned}\n\t\\textit{uo}(t) = \\frac{\\sum\\limits_{t \\subseteq s \\land s \\in \\mathcal{D}}\\textit{uo}(t, s)}{\\textit{sup}(t)}.\n\t\\end{aligned}\n\t$$\n\\end{definition}\n\nFor example, the utility occupancies of the sequence $<$[$a$], [$c$]$>$ in $s_3$, $s_4$, and $s_5$ are \\textit{uo}($<$[$a$], [$c$]$>$, $s_3$) = \\textit{max}(\\{5, 7\\}) / 12 = 0.583, \\textit{uo}($<$[$a$], [$c$]$>$, $s_4$) = 8 / 13 = 0.615, and \\textit{uo}($<$[$a$], [$c$]$>$, $s_5$) = 5 / 13 = 0.385, respectively. 
Thus, the total utility occupancy of the sequence $<$[$a$], [$c$]$>$ in the entire $\\mathcal{D}$ is equal to \\textit{uo}($<$[$a$], [$c$]$>$) = (0.583 + 0.615 + 0.385) / 3 = 0.528.\n\n\\begin{definition}\n\t\\rm In a $q$-sequence $s$ with $l$ $q$-itemsets, the remaining utility occupancy of a sequence $t$ at position $p$ can be denoted as \\textit{ruo}($t$, $s$, $p$), and is defined as follows.\n\t$$\n\t\\begin{aligned}\n\t\\textit{ruo}(t, s, p) = \\frac{u(\\textit{rest}(t, s, p))}{u(s)},\n\t\\end{aligned}\n\t$$\n\twhere $\\textit{rest}(t, s, p)$ denotes the $q$-items of $s$ remaining after the match of $t$ that ends at position $p$, i.e., the unmatched items in $c_p$ together with all items in $c_{p+1}$, $\\cdots$, $c_l$.\n\\end{definition}\n\n\n\n\\begin{definition}\n\t\\rm Considering two thresholds, including a minimum support threshold \\textit{minsup} (0 $\\textless$ \\textit{minsup} $\\le$ 1) and a minimum utility occupancy threshold \\textit{minuo} (0 $\\textless$ \\textit{minuo} $\\le$ 1), a sequential pattern $t$ with high support and high utility occupancy in a $q$-sequence database $\\mathcal{D}$ is called a HUOSP. Here, it satisfies \\textit{sup}($t$) $\\ge$ \\textit{minsup} and \\textit{uo}($t$) $\\ge$ \\textit{minuo}.\n\\end{definition}\n\nFor example, the remaining utility occupancy of the sequence $<$[$a$], [$c$]$>$ in $s_3$ at position 2 is equal to \\textit{ruo}($<$[$a$], [$c$]$>$, $s_3$, 2) = (4 + 1) / 12 = 0.417. And the remaining utility occupancy of the sequence $<$[$a$]$>$ in $s_4$ at position 1 is equal to \\textit{ruo}($<$[$a$]$>$, $s_4$, 1) = (2 + 2 + 3) / 13 = 0.538. 
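The utility-occupancy numbers above (0.583, 0.615, and 0.385, averaging to 0.528 over the three supporting $q$-sequences) can likewise be reproduced with a brute-force sketch. The data layout and helper names are illustrative assumptions, not the paper's implementation.

```python
# Example q-sequence database from Table 1 and external utilities from Table 2.
DB = {
    "s1": [{"b": 2, "d": 1}, {"g": 1}, {"f": 1}],
    "s2": [{"d": 1}, {"g": 1}],
    "s3": [{"a": 1, "b": 1}, {"c": 1}, {"c": 2}, {"d": 1}],
    "s4": [{"a": 2, "b": 1}, {"c": 1}, {"e": 1}],
    "s5": [{"d": 3}, {"b": 1}, {"a": 1}, {"c": 1}, {"e": 1}],
}
P = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 3, "f": 5, "g": 1}

def u_seq(seq):
    """u(s): total utility of a q-sequence."""
    return sum(q * P[i] for c in seq for i, q in c.items())

def u_t_s(pattern, seq):
    """u(t, s): maximum utility over all matches of `pattern`, or None if absent."""
    def rec(j, start):
        if j == len(pattern):
            yield 0
            return
        for k in range(start, len(seq)):
            if all(i in seq[k] for i in pattern[j]):
                u_here = sum(seq[k][i] * P[i] for i in pattern[j])
                for rest in rec(j + 1, k + 1):
                    yield u_here + rest
    us = list(rec(0, 0))
    return max(us) if us else None

def uo_total(pattern):
    """uo(t): average of uo(t, s) = u(t, s)/u(s) over the sequences containing t."""
    ratios = [u_t_s(pattern, s) / u_seq(s)
              for s in DB.values() if u_t_s(pattern, s) is not None]
    return sum(ratios) / len(ratios)

t = [["a"], ["c"]]
print(round(uo_total(t), 3))  # → 0.528
```

Filtering patterns by `len(ratios) >= minsup` and `uo_total(t) >= minuo` would then reproduce the HUOSP selection used in the running example.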
Under the setting of \\textit{minsup} to 2 and \\textit{minuo} to 0.4, all found HUOSPs are shown in Table \\ref{table_huosp}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{All found HUOSPs in the example database $\\mathcal{D}$}\n\t\\label{table_huosp}\n\t\\begin{tabular}{|c|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{ID} & \\textbf{HUOSP} & \\textbf{Support} & \\textbf{Utility occupancy}\\\\\n\t\t\\hline \n\t\t\\(p_{1}\\) & $<$[$a$$b$]$>$ & 2 & 0.516 \\\\ \n\t\t\\hline\n\t\t\\(p_{2}\\) & $<$[$a$$b$], [$c$]$>$ & 2 & 0.76 \\\\ \n\t\t\\hline\n\t\t\\(p_{3}\\) & $<$[$a$], [$c$]$>$ & 3 & 0.528 \\\\ \n\t\t\\hline\n\t\t\\(p_{4}\\) & $<$[$a$], [$c$], [$e$]$>$ & 2 & 0.731 \\\\ \n\t\t\\hline\n\t\t\\(p_{5}\\) & $<$[$a$], [$e$]$>$ & 2 & 0.577 \\\\ \n\t\t\\hline\n\t\t\\(p_{6}\\) & $<$[$b$], [$c$], [$e$]$>$ & 2 & 0.538 \\\\ \n\t\t\\hline\n\t\t\\(p_{7}\\) & $<$[$d$], [$g$]$>$ & 2 & 0.59 \\\\ \n\t\t\\hline\n\t\\end{tabular}\n\\end{table}\n\n\n\\subsection{Problem Statement}\n\nAfter the above definitions and concepts are given, we formulate the problem of mining HUOSPs as follows. Given a quantitative sequence database $\\mathcal{D}$, a utility table with external utilities for each item, and two thresholds \\textit{minsup} (0 $\\textless$ \\textit{minsup} $\\le$ 1) and \\textit{minuo} (0 $\\textless$ \\textit{minuo} $\\le$ 1), the goal of high utility-occupancy sequential pattern mining is to discover all HUOSPs whose support and utility occupancy are no less than \\textit{minsup} and \\textit{minuo}, respectively.\n\n\n", "Descriptive_question1": "What is the unit utility value of item 'a' in table_2?", "Descriptive_question2": "Which item has the highest unit utility in table_2?", "Reasoning_question1": "Why might item 'f' have a significantly higher unit utility compared to other items in table_2?", "Reasoning_question2": "How does the variation in unit utility values across items in table_2 impact the calculation of total utility in a quantitative sequence database?", "Descriptive_answer1": "3", "Descriptive_answer2": "f", "Reasoning_answer1": "Item 'f' has a unit utility of 5, which is 
significantly higher than most other items in table_2, such as 'd' and 'g' with 1, or 'b' and 'c' with 2. This could be due to 'f' being a rarer or more valuable item in the context of the experiment or application. Alternatively, it might reflect a higher profit margin or demand associated with 'f' as determined by the external utility framework. The higher value suggests that 'f' contributes more to the total utility per unit compared to other items, possibly indicating its strategic importance in the sequence data mining process.", "Reasoning_answer2": "The variation in unit utility values across items in table_2, ranging from 1 for 'd' and 'g' to 5 for 'f', directly affects the calculation of total utility in a quantitative sequence database. Since the utility of an item in a sequence is calculated as the product of its internal utility (quantity) and external utility (unit utility from table_2), items with higher unit utility values like 'f' will contribute disproportionately more to the total utility of a sequence compared to items with lower values like 'd'. This means that sequences containing items with higher unit utilities will have a greater impact on the overall utility of the database, influencing metrics like utility occupancy and the identification of high utility-occupancy sequential patterns (HUOSPs). Consequently, this variation emphasizes the importance of certain items over others in strategic decision-making or pattern mining." 
}, { "paper_id": "2212.10452.json", "table_id": "table_3", "table_content": "\\begin{table}[h]\n\t\\centering\n\t\\caption{All found HUOSPs in the example database $\\mathcal{D}$}\n\t\\label{table_huosp}\n\t\\begin{tabular}{|c|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{ID} & \\textbf{HUOSP} & \\textbf{Support} & \\textbf{Utility occupancy}\\\\\n\t\t\\hline \n\t\t\\(p_{1}\\) & $<$[$a$$b$]$>$ & 2 & 0.516 \\\\ \n\t\t\\hline\n\t\t\\(p_{2}\\) & $<$[$a$$b$], [$c$]$>$ & 2 & 0.76 \\\\ \n\t\t\\hline\n\t\t\\(p_{3}\\) & $<$[$a$], [$c$]$>$ & 3 & 0.528 \\\\ \n\t\t\\hline\n\t\t\\(p_{4}\\) & $<$[$a$], [$c$], [$e$]$>$ & 2 & 0.731 \\\\ \n\t\t\\hline\n\t\t\\(p_{5}\\) & $<$[$a$], [$e$]$>$ & 2 & 0.577 \\\\ \n\t\t\\hline\n\t\t\\(p_{6}\\) & $<$[$b$], [$c$], [$e$]$>$ & 2 & 0.538 \\\\ \n\t\t\\hline\n\t\t\\(p_{7}\\) & $<$[$d$], [$g$]$>$ & 2 & 0.59 \\\\ \n\t\t\\hline\n\t\\end{tabular}\n\\end{table}", "caption": "All found HUOSPs in the example database $\\mathcal{D}$", "label": "table_huosp", "section_info": "3 Preliminaries and Problem Statement\n\\section{Preliminaries and Problem Statement}\n\\label{sec:preliminaries}\n\nIn this section, we first introduce and define the basic notations and concepts related to utility occupancy mining on sequence data. The problem of high utility-occupancy sequential pattern mining is then formulated.\n\n\\subsection{Notations and Concepts}\n\nGiven a finite set $I$ = \\{$i_{1}$, $i_{2}$, $\\cdots$, $i_{m}$\\} containing $m$ distinct items, a quantitative itemset $c$  is a non-empty set and can be defined as $c$ = [($i_1$, $q_1$)($i_2$, $q_2$)$\\cdots$($i_n$, $q_n$)], where $q_j$ is the quality value for $i_j$. Each item and its associated quality (internal utility) together comprise the elements of the quantitative itemset $c$. The items in the quantitative itemset $c$ is a subset of $I$. An itemset $w$ is a non-empty set with no quality information for $c$, which is called that $w$ matches $c$, and is denoted as $w$ $\\sim$ $c$. 
To simplify the description of some definitions in this paper, we assume that all items in a quantitative itemset are sorted alphabetically. A quantitative sequence is denoted as $s$ and defined as $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$. $s$ is an ordered list containing one or more quantitative itemsets, and the order in which the quantitative itemsets appear can represent the chronological relationship of realistic applications. $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ denotes $s$ without its quantity information; in this case, $v$ is said to match $s$, which is denoted as $v$ $\\sim$ $s$. For the sake of illustration, quantitative itemset and quantitative sequence can also be termed as $q$-itemset and $q$-sequence. Regarding a quantitative sequence database $\\mathcal{D}$, it is a collection of triples $<$\\textit{SID}, \\textit{qs}, \\textit{SU}$>$, where \\textit{qs} is a $q$-sequence, \\textit{SID} is the unique identifier of \\textit{qs}, and \\textit{SU} is the total utility of \\textit{qs}. 
Furthermore, each item $i$ such that $i$ $\\in$ $\\mathcal{D}$ has its own profit value (called external utility), and can be denoted as $p$($i$).\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Quantitative sequence database}\n\t\\label{table1}\n\t\\begin{tabular}{|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{SID} & \\textbf{Quantitative sequence} & \\textbf{SU} \\\\\n\t\t\\hline \n\t\t\\(s_{1}\\) & $<$[(\\textit{b}, 2)(\\textit{d}, 1)], [(\\textit{g}, 1)], [(\\textit{f}, 1)]$>$ & 11 \\\\ \n\t\t\\hline\n\t\t\\(s_{2}\\) & $<$[(\\textit{d}, 1)], [(\\textit{g}, 1)]$>$ & 2 \\\\ \n\t\t\\hline \n\t\t\\(s_{3}\\) & $<$[(\\textit{a}, 1)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{c}, 2)], [(\\textit{d}, 1)]$>$ & 12 \\\\\n\t\t\\hline \n\t\t\\(s_{4}\\) & $<$[(\\textit{a}, 2)(\\textit{b}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\t\\(s_{5}\\) & $<$[(\\textit{d}, 3)], [(\\textit{b}, 1)], [(\\textit{a}, 1)], [(\\textit{c}, 1)], [(\\textit{e}, 1)]$>$ & 13 \\\\\n\t\t\\hline\n\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[h]\n\t\\caption{External utility table}\n\t\\label{table2}\n\t\\centering\n\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\\hline\n\t\t\\textbf{Item}\t & \\textit{a}\t& \\textit{b}\t& \\textit{c}\t& \\textit{d}\t& \\textit{e}\t& \\textit{f} & \\textit{g} \\\\ \\hline \n\t\t\\textbf{\\textit{Unit utility}}\t& 3 & 2 & 2 & 1 & 3 & 5 & 1 \\\\ \\hline\n\t\\end{tabular}\n\\end{table}\n\nThe example $q$-sequence database and external utility table that will be used in the following are shown in Tables \\ref{table1} and \\ref{table2}. We can see that this database has five $q$-sequences and seven different items. [($b$, 2)($d$, 1)] is the first $q$-itemset in $q$-sequence $s_1$, containing two items, $b$ and $d$. According to Table \\ref{table2}, the external utility of items $b$ and $d$ are 2 and 1, respectively. 
In addition, $<$[$b$$d$]$>$ matches $<$[($b$, 2) ($d$, 1)]$>$.\n\n\n\\begin{definition}\n\t\\rm For an item $i$ in a $q$-itemset $c$, its utility can be denoted as $u$($i$, $c$) and is defined as $u$($i$, $c$) = $q$($i$, $c$) $\\times$ $p$($i$, $c$) where $q$($i$, $c$) is the internal utility of $i$ in $c$ and $p$($i$, $c$) is the external utility of $i$. We use $u$($c$) to denote the sum of utilities of all items in $c$, and it can be defined as $u$($c$) = $\\sum\\limits_{i \\in c}u(i, c)$. As for a $q$-sequence $s$, its utility can be denoted as $u$($s$) and is defined as $u$($s$) = $\\sum\\limits_{c \\in s}u(c)$. Moreover, given a $q$-sequence database $\\mathcal{D}$, its utility can be denoted as $u$($\\mathcal{D}$) and is defined as $u$($\\mathcal{D}$) = $\\sum\\limits_{s \\in \\mathcal{D}}u(s)$.\n\\end{definition}\n\nFor example, the utility of the item $b$ is equal to 4, because $u$($b$, $s_1$) = 2 $\\times$ 2 = 4; the utilities of three $q$-itemsets in $s_1$ are 5, 1, and 5, respectively. Thus, the \\textit{SU} of $s_1$ can be calculated as $u$($s_1$) = 5 + 1 + 5 = 11; the total utility of this example database $\\mathcal{D}$ is calculated as $u$($\\mathcal{D}$) = $\\sum_{s_i \\in \\mathcal{D}}$ $u$($s_i$) = 11 + 2 + 12 + 13 + 13 = 51.\n\n\\begin{definition}\n\t\\rm Given two itemsets $w$ and $w^\\prime$, if all the items of $w$ appear in $w^\\prime$, we say that $w^\\prime$ contains $w$, and is denoted as $w$ $\\subseteq$ $w^\\prime$. Similarly, for two $q$-itemset $c$ and $c^\\prime$, if all the items of $c$ appear in $c^\\prime$ and have the same quality, we say that $c^\\prime$ contains $c$, which is denoted as $c$ $\\subseteq$ $c^\\prime$.\n\\end{definition}\n\nFor instance, the itemset [$c$$d$$e$] contains the itemset [$c$$e$]. And the $q$-itemset [($c$, 4)($e$, 2)] is contained in [($c$, 4)($d$, 3)($e$, 2)], but not in [($c$, 3)($e$, 3)]. 
This is because the quality of $c$ differs between the two $q$-itemsets [($c$, 3)($e$, 3)] and [($c$, 4)($d$, 3)($e$, 2)].\n\n\n\\begin{definition}\n\t\\rm Given two sequences $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ and $v^\\prime$ = $<$$w^\\prime_1$, $w^\\prime_2$, $\\cdots$, $w^\\prime_{l^\\prime}$$>$ with $l$ $\\le$ $l^\\prime$, if there exists an integer list 1 $\\le$ $k_1$ $\\textless$ $k_2$ $\\textless$ $\\cdots$ $\\textless$ $k_l$ $\\le$ $l^\\prime$ such that $w_j$ $\\subseteq$ $w^\\prime_{k_j}$ for 1 $\\le$ $j$ $\\le$ $l$, we say that $v^\\prime$ contains $v$, denoted as $v$ $\\subseteq$ $v^\\prime$. Similarly, for two $q$-sequences $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$ and $s^\\prime$ = $<$$c^\\prime_1$, $c^\\prime_2$, $\\cdots$, $c^\\prime_{l^\\prime}$$>$, we say that $s^\\prime$ contains $s$, denoted as $s$ $\\subseteq$ $s^\\prime$, if there exists an integer list 1 $\\le$ $k_1$ $\\textless$ $k_2$ $\\textless$ $\\cdots$ $\\textless$ $k_l$ $\\le$ $l^\\prime$ such that $c_j$ $\\subseteq$ $c^\\prime_{k_j}$ for 1 $\\le$ $j$ $\\le$ $l$. In this paper, if a sequence $t$ matches a $q$-sequence $s_k$ that satisfies $s_k$ $\\subseteq$ $s$, then we write $t$ $\\subseteq$ $s$ as shorthand for $t$ $\\sim$ $s_k$ $\\land$ $s_k$ $\\subseteq$ $s$.\n\\end{definition}\n\nFor example, the $q$-sequence $s_1$ contains $<$[($b$, 2)($d$, 1)]$>$ and $<$[($g$, 1)], [($f$, 1)]$>$, while $<$[($b$, 2)($d$, 2)]$>$ and $<$[($g$, 1)($f$, 1)]$>$ are not contained in $s_1$. \n\n\\begin{definition}\n\t\\rm A sequence $t$ may have multiple matches in a $q$-sequence $s$. We use $u$($t$, $s$) to denote the actual utility of $t$ in $s$, defined as $u$($t$, $s$) = \\textit{max}\\{$u$($s^\\prime$) $\\vert$ $t$ $\\sim$ $s^\\prime$ $\\land$ $s^\\prime$ $\\subseteq$ $s$\\}. Additionally, the utility of $t$ in the $q$-sequence database $\\mathcal{D}$ is denoted as $u$($t$) and defined as $u$($t$) = $\\sum\\limits_{t \\subseteq s \\land s \\in \\mathcal{D}}u(t, s)$. 
In addition, the support of $t$ is denoted as \\textit{sup}($t$) and defined as \\textit{sup}($t$) = $\\vert \\{s \\vert s \\in \\mathcal{D} \\land t \\subseteq s\\} \\vert$, that is, the number of $q$-sequences of $\\mathcal{D}$ that contain $t$.\n\\end{definition}\n\nFor example, the sequence $t$ = $<$[$a$$b$], [$c$]$>$ has two matches in the $q$-sequence $s_3$, and so its utility in $s_3$ can be calculated as $u$($t$, $s_3$) = \\textit{max}\\{$u$($<$[($a$, 1)($b$, 1)], [($c$, 1)]$>$), $u$($<$[($a$, 1)($b$, 1)], [($c$, 2)]$>$)\\} = \\textit{max}\\{7, 9\\} = 9. And $t$ has a support of 2 because $s_3$ and $s_4$ both contain matches of $t$.\n\nIn this paper, the concept of utility occupancy \\cite{gan2019huopm} is incorporated into sequence data. Utility occupancy is a flexible measure that can be used to identify patterns with a higher contribution in sequences. Since there is no previous work on this topic, we are the first to define the relevant concepts.\n\n\\begin{definition}\n\t\\rm In a $q$-sequence $s$, the utility occupancy of a sequence $t$, denoted as \\textit{uo}($t$, $s$), is defined as $\\textit{uo}(t, s)$ = $\\frac{u(t, s)}{u(s)}$. Note that $t$ may have more than one match in $s$. 
Then the utility occupancy of $t$ at position $p$ in $s$ can be denoted as \\textit{uo}($t$, $s$, $p$) and is defined as follows, where only the matches $s^\\prime$ that end at the $p$-th $q$-itemset of $s$ are considered.\n\t$$\n\t\\begin{aligned}\n\t\\textit{uo}(t, s, p) = \\frac{\\textit{max}\\{u(s^\\prime) \\vert t \\sim s^\\prime \\land s^\\prime \\subseteq s\\}}{u(s)}.\n\t\\end{aligned}\n\t$$\t\n\tThe total utility occupancy of $t$ in a $q$-sequence database $\\mathcal{D}$, denoted as \\textit{uo}($t$), is defined as follows.\t\n\t$$\n\t\\begin{aligned}\n\t\\textit{uo}(t) = \\frac{\\sum\\limits_{t \\subseteq s \\land s \\in \\mathcal{D}}\\textit{uo}(t, s)}{\\textit{sup}(t)}.\n\t\\end{aligned}\n\t$$\n\\end{definition}\n\nFor example, the utility occupancies of the sequence $<$[$a$], [$c$]$>$ in $s_3$, $s_4$, and $s_5$ are \\textit{uo}($<$[$a$], [$c$]$>$, $s_3$) = \\textit{max}(\\{5, 7\\}) / 12 = 0.583, \\textit{uo}($<$[$a$], [$c$]$>$, $s_4$) = 8 / 13 = 0.615, and \\textit{uo}($<$[$a$], [$c$]$>$, $s_5$) = 5 / 13 = 0.385, respectively. Thus, the total utility occupancy of the sequence $<$[$a$], [$c$]$>$ in the entire $\\mathcal{D}$ is equal to \\textit{uo}($<$[$a$], [$c$]$>$) = (0.583 + 0.615 + 0.385) / 3 = 0.528.\n\n\\begin{definition}\n\t\\rm In a $q$-sequence $s$ with $l$ $q$-itemsets, let $s_{/(t,p)}$ denote the remaining part of $s$ after a match of $t$ that ends at position $p$, that is, the items following the last matched item in the $p$-th $q$-itemset together with the $q$-itemsets $c_{p+1}$, $\\cdots$, $c_l$. The remaining utility occupancy of a sequence $t$ at position $p$ can be denoted as \\textit{ruo}($t$, $s$, $p$), and is defined as follows.\n\t$$\n\t\\begin{aligned}\n\t\\textit{ruo}(t, s, p) = \\frac{u(s_{/(t,p)})}{u(s)}.\n\t\\end{aligned}\n\t$$\n\\end{definition}\n\n\n\n\\begin{definition}\n\t\\rm Considering two thresholds, namely a minimum support threshold \\textit{minsup} (0 $\\textless$ \\textit{minsup} $\\le$ 1) and a minimum utility occupancy threshold \\textit{minuo} (0 $\\textless$ \\textit{minuo} $\\le$ 1), a sequential pattern $t$ with high support and high utility occupancy in a $q$-sequence database $\\mathcal{D}$ is called a HUOSP. 
That is, $t$ satisfies \\textit{sup}($t$) $\\ge$ \\textit{minsup} and \\textit{uo}($t$) $\\ge$ \\textit{minuo}.\n\\end{definition}\n\nFor example, the remaining utility occupancy of the sequence $<$[$a$], [$c$]$>$ in $s_3$ at position 2 is equal to \\textit{ruo}($<$[$a$], [$c$]$>$, $s_3$, 2) = (4 + 1) / 12 = 0.417. And the remaining utility occupancy of the sequence $<$[$a$]$>$ in $s_4$ at position 1 is equal to \\textit{ruo}($<$[$a$]$>$, $s_4$, 1) = (2 + 2 + 3) / 13 = 0.538. With \\textit{minsup} set to 2 (an absolute support count, equivalent to a ratio of 0.4 over the five $q$-sequences) and \\textit{minuo} set to 0.4, all discovered HUOSPs are shown in Table \\ref{table_huosp}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{All found HUOSPs in the example database $\\mathcal{D}$}\n\t\\label{table_huosp}\n\t\\begin{tabular}{|c|c|c|c|} \n\t\t\\hline \n\t\t\\textbf{ID} & \\textbf{HUOSP} & \\textbf{Support} & \\textbf{Utility occupancy}\\\\\n\t\t\\hline \n\t\t\\(p_{1}\\) & $<$[$a$$b$]$>$ & 2 & 0.516 \\\\ \n\t\t\\hline\n\t\t\\(p_{2}\\) & $<$[$a$$b$], [$c$]$>$ & 2 & 0.76 \\\\ \n\t\t\\hline\n\t\t\\(p_{3}\\) & $<$[$a$], [$c$]$>$ & 3 & 0.528 \\\\ \n\t\t\\hline\n\t\t\\(p_{4}\\) & $<$[$a$], [$c$], [$e$]$>$ & 2 & 0.731 \\\\ \n\t\t\\hline\n\t\t\\(p_{5}\\) & $<$[$a$], [$e$]$>$ & 2 & 0.577 \\\\ \n\t\t\\hline\n\t\t\\(p_{6}\\) & $<$[$b$], [$c$], [$e$]$>$ & 2 & 0.538 \\\\ \n\t\t\\hline\n\t\t\\(p_{7}\\) & $<$[$d$], [$g$]$>$ & 2 & 0.59 \\\\ \n\t\t\\hline\n\t\\end{tabular}\n\\end{table}\n\n\n\\subsection{Problem Statement}\n\nGiven the above definitions and concepts, we formulate the problem of mining HUOSPs as follows. 
Given a quantitative sequence database $\\mathcal{D}$, a utility table with the external utility of each item, and two thresholds \\textit{minsup} (0 $\\textless$ \\textit{minsup} $\\le$ 1) and \\textit{minuo} (0 $\\textless$ \\textit{minuo} $\\le$ 1), the goal of high utility-occupancy sequential pattern mining is to discover all HUOSPs, i.e., all sequential patterns whose support and utility occupancy are no less than \\textit{minsup} and \\textit{minuo}, respectively.\n\n\n\\subsection{Notations and Concepts}\n\nGiven a finite set $I$ = \\{$i_{1}$, $i_{2}$, $\\cdots$, $i_{m}$\\} containing $m$ distinct items, a quantitative itemset $c$ is a non-empty set defined as $c$ = [($i_1$, $q_1$)($i_2$, $q_2$)$\\cdots$($i_n$, $q_n$)], where $q_j$ is the quality value for $i_j$. Each item and its associated quality (internal utility) together comprise the elements of the quantitative itemset $c$. The items in the quantitative itemset $c$ form a subset of $I$. An itemset $w$ is the non-empty set obtained from $c$ by dropping the quality information; in this case, we say that $w$ matches $c$, denoted as $w$ $\\sim$ $c$. To simplify the description of some definitions in this paper, we assume that all items in a quantitative itemset are sorted alphabetically. A quantitative sequence is denoted as $s$ and defined as $s$ = $<$$c_1$, $c_2$, $\\cdots$, $c_l$$>$. $s$ is an ordered list containing one or more quantitative itemsets, and the order in which the quantitative itemsets appear can represent the chronological relationships of realistic applications. Similarly, $v$ = $<$$w_1$, $w_2$, $\\cdots$, $w_l$$>$ denotes $s$ with the quantity information removed; we say that $v$ matches $s$, denoted as $v$ $\\sim$ $s$. For the sake of brevity, a quantitative itemset and a quantitative sequence are also termed a $q$-itemset and a $q$-sequence, respectively. 
Regarding a quantitative sequence database $\\mathcal{D}$, it is a collection of triples $<$\\textit{SID}, \\textit{qs}, \\textit{SU}$>$, where \\textit{qs} is a $q$-sequence, \\textit{SID} is the unique identifier of \\textit{qs}, and \\textit{SU} is the total utility of \\textit{qs}. 
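To make the utility and utility-occupancy definitions above concrete, the following self-contained Java sketch (written for illustration here, not the authors' SUMU implementation; all class and method names are ours) computes $u$($c$), $u$($s$), $u$($\\mathcal{D}$), and \\textit{uo}($t$) on the example database of Tables \\ref{table1} and \\ref{table2}. For simplicity, the match search only handles patterns whose itemsets each contain a single item, which suffices for the pattern $<$[$a$], [$c$]$>$.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch: utility and utility occupancy on the running example.
public class UtilityOccupancyDemo {

    // External utility p(i) of each item, from Table 2.
    static final Map<Character, Integer> P = Map.of(
            'a', 3, 'b', 2, 'c', 2, 'd', 1, 'e', 3, 'f', 5, 'g', 1);

    // u(c): utility of a q-itemset = sum of quantity * external utility.
    public static int u(Map<Character, Integer> c) {
        int sum = 0;
        for (Map.Entry<Character, Integer> e : c.entrySet())
            sum += e.getValue() * P.get(e.getKey());
        return sum;
    }

    // u(s): utility of a q-sequence = sum over its q-itemsets.
    public static int u(List<Map<Character, Integer>> s) {
        int sum = 0;
        for (Map<Character, Integer> c : s) sum += u(c);
        return sum;
    }

    // u(t, s) for a pattern t of single-item itemsets: maximum utility over
    // all matches of t in s, or Integer.MIN_VALUE if t does not occur in s.
    public static int matchUtility(List<Character> t, List<Map<Character, Integer>> s) {
        return best(t, 0, s, 0);
    }

    private static int best(List<Character> t, int i, List<Map<Character, Integer>> s, int j) {
        if (i == t.size()) return 0;              // whole pattern matched
        int bestU = Integer.MIN_VALUE;
        for (int k = j; k < s.size(); k++) {      // try each later q-itemset
            Integer q = s.get(k).get(t.get(i));
            if (q == null) continue;              // item t[i] absent from itemset k
            int rest = best(t, i + 1, s, k + 1);
            if (rest == Integer.MIN_VALUE) continue;
            bestU = Math.max(bestU, q * P.get(t.get(i)) + rest);
        }
        return bestU;
    }

    // uo(t): average of uo(t, s) = u(t, s) / u(s) over supporting q-sequences.
    public static double uo(List<Character> t, List<List<Map<Character, Integer>>> db) {
        double sum = 0;
        int sup = 0;
        for (List<Map<Character, Integer>> s : db) {
            int m = matchUtility(t, s);
            if (m == Integer.MIN_VALUE) continue;
            sum += (double) m / u(s);
            sup++;
        }
        return sum / sup;
    }

    public static void main(String[] args) {
        // The five q-sequences of Table 1.
        List<List<Map<Character, Integer>>> db = List.of(
            List.of(Map.of('b', 2, 'd', 1), Map.of('g', 1), Map.of('f', 1)),  // s1
            List.of(Map.of('d', 1), Map.of('g', 1)),                          // s2
            List.of(Map.of('a', 1, 'b', 1), Map.of('c', 1), Map.of('c', 2),
                    Map.of('d', 1)),                                          // s3
            List.of(Map.of('a', 2, 'b', 1), Map.of('c', 1), Map.of('e', 1)),  // s4
            List.of(Map.of('d', 3), Map.of('b', 1), Map.of('a', 1),
                    Map.of('c', 1), Map.of('e', 1)));                         // s5

        int total = 0;
        for (List<Map<Character, Integer>> s : db) total += u(s);
        System.out.println(total);                              // u(D) = 51
        System.out.printf("%.3f%n", uo(List.of('a', 'c'), db)); // uo = 0.528
    }
}
```

Running it reproduces the worked examples above: the sequence utilities 11, 2, 12, 13, and 13 sum to $u$($\\mathcal{D}$) = 51, and \\textit{uo}($<$[$a$], [$c$]$>$) evaluates to 0.528.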
\\section{Experiments} \\label{sec:experiments}\n\nWe selected both real and synthetic datasets to conduct related experiments. The proposed SUMU algorithm is the first approach to mining sequential patterns with a utility occupancy measure. Thus, there is no suitable algorithm for comparison. We mainly focused on verifying the efficiency of the proposed upper bounds and pruning strategies and the effectiveness of SUMU. The code related to SUMU is programmed using the Java language and developed in Eclipse. 
Our extensive experiments were conducted on a desktop computer equipped with an i7-12700F 2.10 GHz CPU and 16 GB of RAM. The experimental details and results are shown below.\n\n\\subsection{Experimental Setup and Datasets}\n\nThree real datasets (including Bible, FIFA, and Sign) and three synthetic datasets (including Syn10k, Syn20k, and Syn40k) were used in the experiments. The real datasets are often used in the evaluation of pattern mining algorithms and can be accessed from the website SPMF\\footnote{\\url{http://www.philippe-fournier-viger.com/spmf/}}. The synthetic datasets were generated by the IBM Quest Synthetic Data Generator \\cite{QSD}. Each dataset has its own characteristics and can represent a specific type of data in practical applications. The characteristics of these datasets are described below.\n\n$ \\bullet $ \\textit{\\textbf{Bible}} contains 13,905 items and 36,369 sequences, which are transformed from the book Bible. Its average sequence length is 21.64.\n\n$ \\bullet $ \\textit{\\textbf{FIFA}} contains 2,990 items and 20,450 sequences derived from the website of FIFA World Cup 98. Its average sequence length is 36.23.\n\n$ \\bullet $ \\textit{\\textbf{Sign}} is a small but dense dataset of sign language utterances, with 267 items and 730 sequences. Its average sequence length is 27.11.\n\n$ \\bullet $ \\textit{\\textbf{Syn10k}} is a synthetic dataset with 10,000 sequence records. It has 7,312 distinct items, and its average sequence length is 26.97.\n\n$ \\bullet $ \\textit{\\textbf{Syn20k}} is a synthetic dataset with 20,000 sequence records. It has 7,442 distinct items, and its average sequence length is 26.84.\n\n$ \\bullet $ \\textit{\\textbf{Syn40k}} is a synthetic dataset with 40,000 sequence records. It has 7,537 distinct items, and its average sequence length is 26.84.\n\nTo better evaluate the proposed SUMU algorithm, several variants of SUMU have also been designed. 
Therefore, the experimental results can better show the capabilities of the designed upper bounds and pruning strategies. In our experiments, the proposed SUMU algorithm with upper bounds \\textit{PEUO} and \\textit{RSUO} is denoted as SUMU$_\\textit{simple}$. It means that only Strategies \\ref{strategy2} and \\ref{strategy3} are used in SUMU$_\\textit{simple}$. On the basis of SUMU$_\\textit{simple}$, if unpromising items are filtered out (with Strategy \\ref{strategy1}) before generating HUOSPs, then this variant of SUMU is denoted as SUMU$_\\textit{PEUO}$. To analyze the performance gap between \\textit{PEUO} and \\textit{TPUO}, and between \\textit{RSUO} and \\textit{TSUO}, we also designed another variant of SUMU (with Strategies \\ref{strategy1}, \\ref{strategy4}, and \\ref{strategy5}), denoted as SUMU$_\\textit{TPUO}$. In addition, on the basis of SUMU$_\\textit{PEUO}$, the fourth variant, namely SUMU$_\\textit{PES}$, is designed to evaluate the two upper bounds on the support measure. These variants of SUMU are compared to comprehensively evaluate the effectiveness and efficiency of SUMU.\n\n\\subsection{Pattern Analysis}\n\nIn this section, we mainly discuss how the number of HUOSPs changes as \\textit{minsup} or \\textit{minuo} is varied. The results for various \\textit{minsup} values under a fixed \\textit{minuo} are shown in Table \\ref{patterns_minsup}. Likewise, the results for various \\textit{minuo} values under a fixed \\textit{minsup} are shown in Table \\ref{patterns_minuo}. For each dataset, we use \\textit{minsup}$_1$, \\textit{minsup}$_2$ (or \\textit{minuo}$_1$, \\textit{minuo}$_2$), and so on to indicate progressively larger settings of the parameter \\textit{minsup} (or \\textit{minuo}). 
For instance, in our experiments, for the Bible dataset, the six parameters on \\textit{minsup} are set to 300, 400, 500, 600, 700, and 800; and the six parameters on \\textit{minuo} are set to 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}.\n\n\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minsup}}\n\t\\label{patterns_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\\\ \\hline\n\t\t\tFIFA, \\textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\\\ \\hline\n\t\t\tSign, \\textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minuo} = 0.1& 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & 
$\\textit{minuo}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minsup} = 500 & 11,721 & 11,668 & 11,390 & 10,433 & 8,037 & 5,012 \\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nFrom Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}, it is clear that the number of generated HUOSPs on each dataset differs considerably as \\textit{minsup} or \\textit{minuo} is adjusted. In particular, the number of generated HUOSPs on the synthetic datasets is higher than that on the real datasets. This is because each itemset of the synthetic datasets contains multiple items and can thus form more candidate patterns. Furthermore, as \\textit{minsup} decreases, the number of HUOSPs increases rapidly. For example, the difference between the number of patterns generated under \\textit{minsup}$_1$ and under \\textit{minsup}$_2$ is larger than the difference between \\textit{minsup}$_2$ and \\textit{minsup}$_3$. This phenomenon is reasonable and also occurs in frequent itemset mining and sequential pattern mining. The utility occupancy measure, however, behaves differently: the number of generated HUOSPs increases only gradually as \\textit{minuo} is decreased, because the set of HUOSPs generated by the SUMU algorithm does not vary much under smaller \\textit{minuo} settings. In fact, similar situations can also be found in the HUOPM algorithm \\cite{gan2019huopm}. 
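The threshold behavior discussed above can be illustrated with a small Java sketch (ours, not part of the SUMU miner; the class name and toy data handling are assumptions for illustration): it counts, among the seven patterns of the running example in Table \\ref{table_huosp}, those satisfying the HUOSP condition \\textit{sup}($t$) $\\ge$ \\textit{minsup} and \\textit{uo}($t$) $\\ge$ \\textit{minuo}, showing how raising either threshold shrinks the result set. As in that example, \\textit{minsup} is treated as an absolute support count.

```java
import java.util.List;

// Toy illustration: counting HUOSPs under different threshold settings,
// given already-computed support and utility-occupancy values.
public class HuospThresholds {

    public record Pattern(String seq, int support, double uo) {}

    // HUOSP check: sup(t) >= minsup and uo(t) >= minuo.
    public static long count(List<Pattern> patterns, int minsup, double minuo) {
        return patterns.stream()
                .filter(p -> p.support() >= minsup && p.uo() >= minuo)
                .count();
    }

    public static void main(String[] args) {
        // The seven patterns of the running example (Table 3).
        List<Pattern> patterns = List.of(
                new Pattern("<[ab]>", 2, 0.516),
                new Pattern("<[ab], [c]>", 2, 0.76),
                new Pattern("<[a], [c]>", 3, 0.528),
                new Pattern("<[a], [c], [e]>", 2, 0.731),
                new Pattern("<[a], [e]>", 2, 0.577),
                new Pattern("<[b], [c], [e]>", 2, 0.538),
                new Pattern("<[d], [g]>", 2, 0.59));

        System.out.println(count(patterns, 2, 0.4));  // 7: all patterns pass
        System.out.println(count(patterns, 2, 0.55)); // 4: raising minuo prunes
        System.out.println(count(patterns, 3, 0.4));  // 1: raising minsup prunes more
    }
}
```

Even on this toy data, the counts drop monotonically as either threshold grows, mirroring the trends in Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}.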
\n\n\\subsection{Efficiency Analysis}\n\nIn this subsection, we conducted extensive experiments to evaluate the performance of the different upper bounds and pruning strategies used in SUMU. The results in terms of runtime for various \\textit{minsup} and \\textit{minuo} settings are shown in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}. And the results in terms of candidate patterns for various \\textit{minsup} and \\textit{minuo} settings are shown in Tables \\ref{candidates_minsup} and \\ref{candidates_minuo}.\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminsup.pdf}\n\t\\caption{Running time under various \\textit{minsup} and a fixed \\textit{minuo} = 0.1.}\n\t\\label{runtime_minsup}\n\\end{figure}\n\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 
80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 
151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\n\n\n\nFrom Fig. \\ref{runtime_minsup} and Table \\ref{candidates_minsup}, under different \\textit{minsup} settings, we can clearly see that SUMU$_\\textit{simple}$ has the worst runtime on the Bible and FIFA datasets, while SUMU$_\\textit{TPUO}$ is the slowest on Sign, Syn20k, and Syn40k. SUMU$_\\textit{simple}$ and SUMU$_\\textit{TPUO}$ perform similarly on Syn10k, although when \\textit{minsup} is set to 10, the runtime of SUMU$_\\textit{TPUO}$ exceeds that of SUMU$_\\textit{simple}$. In addition, the variant SUMU$_\\textit{PES}$, which uses four upper bounds (\\textit{PEUO}, \\textit{RSUO}, \\textit{PES}, and \\textit{RSS}), achieves the best performance on all datasets, and the variant SUMU$_\\textit{PEUO}$ has the second-shortest runtime. SUMU$_\\textit{PES}$ also generates the fewest candidate patterns, whereas SUMU$_\\textit{simple}$ prunes the least. These results are as expected. 
From the experiments under different \textit{minsup} and a fixed \textit{minuo}, we can draw the following conclusions.\n\n\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item SUMU$_\\textit{PES}$ adopts enough pruning strategies to significantly reduce the number of candidate patterns while achieving the shortest runtime. Compared to the other variants of SUMU, SUMU$_\\textit{PES}$ generates far fewer candidate patterns. However, although it generates many times fewer candidate patterns than the other variants, its overall runtime is only a few times shorter. This is because many of the unpromising candidate patterns generated by the other variants are also ignored in the subsequent program steps.\n\t\n\t\\item The difference between SUMU$_\\textit{simple}$ and SUMU$_\\textit{PEUO}$ demonstrates that pruning Strategy \\ref{strategy1} is ineffective on the synthetic datasets. This is because \\textit{minsup} is set to relatively small values, and thus few unpromising items appear in the sequence dataset.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, computing them makes SUMU$_\\textit{TPUO}$ slower than SUMU$_\\textit{PEUO}$. For a candidate pattern, SUMU$_\\textit{PEUO}$ can compute the upper bounds \\textit{PEUO} and \\textit{RSUO} in linear time, whereas computing \\textit{TSUO} requires multiple sorting operations, which is a costly process. In addition, SUMU$_\\textit{TPUO}$ does not prune any additional candidate patterns on many datasets (including Bible, FIFA, and Sign), and even on the remaining datasets it reduces the number of candidate patterns only by a very small amount.\n\\end{enumerate}\n\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminuo.pdf}\n\t\\caption{Running time under various \\textit{minuo} and a fixed \\textit{minsup}. (a) Bible, \\textit{minsup} = 500. (b) FIFA, \\textit{minsup} = 4,000. (c) Sign, \\textit{minsup} = 70. 
(d) Syn10k, \\textit{minsup} = 14. (e) Syn20k, \\textit{minsup} = 24. (f) Syn40k, \\textit{minsup} = 34.}\n\t\\label{runtime_minuo}\n\\end{figure}\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minuo}}\n\t\\label{candidates_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minsup} = 500}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,921,104 & 672,801 & 390,245 & 264,424 & 195,445 & 153,171 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minsup} = 4000}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 280,551 & 178,428 & 144,686 & 102,302 & 69,806 & 48,009 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minsup} = 70}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,588,257 & 1,393,525 & 1,233,358 & 1,101,274 & 988,726 & 894,110 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,284,130 & 1,129,929 & 1,002,217 & 895,976 & 806,047 & 730,904 
\\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,284,130 & 1,129,928 & 1,002,216 & 895,923 & 805,911 & 730,644 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 126,752 & 126,743 & 126,716 & 126,652 & 126,478 & 126,240 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minsup} = 14}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,119,006 & 2,916,190 & 2,399,859 & 2,021,727 & 1,747,076 & 1,534,885 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,727,340 & 2,790,561 & 2,301,336 & 1,948,286 & 1,680,334 & 1,485,466 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,727,339 & 2,790,547 & 2,301,264 & 1,948,081 & 1,679,416 & 1,482,724 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 537,899 & 537,774 & 537,328 & 536,225 & 533,228 & 526,147 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minsup} = 24}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 5,991,112 & 5,044,232 & 4,517,900 & 4,112,343 & 3,773,772 & 3,423,454 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 5,706,748 & 4,904,484 & 4,430,131 & 4,041,273 & 3,712,031 & 3,357,823 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 5,706,744 & 4,904,420 & 4,429,752 & 4,040,166 & 3,709,467 & 3,352,790 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 841,369 & 840,970 & 838,452 & 830,080 & 812,793 & 785,422 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minsup} = 34}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 36,517,396 & 34,219,382 & 31,977,235 & 29,399,324 & 26,176,962 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 36,148,568 & 33,947,656 & 31,690,858 & 29,135,445 & 25,872,139 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 36,148,106 & 33,944,569 & 31,678,889 & 29,102,992 & 25,803,949 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 7,452,659 & 7,440,512 & 
7,403,338 & 7,327,730 & 7,200,566 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\nFurthermore, from Fig. \\ref{runtime_minuo} and Table \\ref{candidates_minuo}, under different \\textit{minuo} settings, we can clearly observe that SUMU$_\\textit{PES}$ is the fastest variant of SUMU. On the Sign, Syn10k, Syn20k, and Syn40k datasets, some fluctuations occur for all the variants of SUMU, but the overall trend is still clear. Regardless of which dataset is processed, the runtime curve of SUMU$_\\textit{PES}$ remains the smoothest as \\textit{minuo} is adjusted. In particular, the number of candidate patterns it generates on the Bible and FIFA datasets does not change at all. From the experiments under different \\textit{minuo} and a fixed \\textit{minsup}, we can draw the following conclusions.\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item Unlike the experiments that tune \\textit{minsup}, the runtime of each variant of SUMU is not much affected by the setting of \\textit{minuo}. On the Bible and FIFA datasets, the runtimes of SUMU$_\\textit{PEUO}$, SUMU$_\\textit{TPUO}$, and SUMU$_\\textit{PES}$ hardly increase when \\textit{minuo} decreases, while the numbers of candidate patterns for SUMU$_\\textit{PEUO}$ and SUMU$_\\textit{TPUO}$ increase substantially. This suggests that the support measure plays a greater role than the utility occupancy measure in determining the program runtime.\n\t\n\t\\item SUMU$_\\textit{PES}$ still achieves the shortest runtime thanks to the effective pruning strategies it uses. Moreover, on each dataset, it does not generate many more candidate patterns as \\textit{minuo} decreases; the upper bounds \\textit{PES} and \\textit{RSS} already prune many invalid candidate patterns.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, they still do not prune many irrelevant candidate patterns for SUMU$_\\textit{TPUO}$ as \\textit{minuo} decreases. 
\t\n\\end{enumerate}\n\n\n\\subsection{Memory Evaluation}\n\nThe memory consumption of the SUMU variants is similar and fluctuates, so we report the approximate memory consumption on the different datasets and investigate the reasons for the disparities based on program design details. The experimental results regarding memory consumption are shown in Table \\ref{memory}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Memory consumption}\n\t\\label{memory}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Setting}} & \\multicolumn{6}{c}{\\textbf{Approximate memory consumed (MB)}} \\\\ \\cline{2-7} \n\t\t\t& Bible & FIFA & Sign & Syn10k & Syn20k & Syn40k \\\\ \\hline\n\t\t\tfixed \\textit{minuo} & 1,000 $\\sim$ 1,400 & 1,400 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 800 & 600 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\tfixed \\textit{minsup} & 1,000 $\\sim$ 1,400 & 1,500 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 600 & 300 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nSince every variant of SUMU uses both the UOL-Chain and the UO-Table, their memory consumption differs little, and the differences are within a reasonable range. The variants of SUMU employ different numbers of pruning strategies, and thus differ somewhat in the auxiliary data structures they use. If pruning Strategy \\ref{strategy1} is used, unpromising items must be filtered out; to identify them, the program uses a hash table to record the support of each item. In contrast, if all items are used directly, it suffices for the program to use a single list to record the items that occur in the sequence database. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{TPUO}}$ is that they use different upper bounds and pruning strategies. 
Calculating the \textit{PEUO} and \textit{RSUO} of a pattern is relatively simple: the program quickly scans the UOL-Chain of the pattern and accumulates the corresponding values. However, calculating the \textit{TPUO} and \textit{TSUO} of a pattern requires several \textit{minsup}-sized priority queues. This yields tighter upper bound values, but also consumes additional memory space. As for SUMU$_{\textit{PES}}$, it uses the additional upper bounds \textit{PES} and \textit{RSS} (i.e., it also adopts pruning Strategies \ref{strategy6} and \ref{strategy7}) compared to SUMU$_{\textit{PEUO}}$. This means that, during pattern extension, the program needs the associated hash tables to decide which candidate patterns satisfy the upper bounds \textit{PES} and \textit{RSS}. It would seem that the more upper bounds and pruning strategies are used, the more memory is consumed. Nevertheless, in the experiments, effective pruning strategies avoid building unnecessary UOL-Chains and UO-Tables because some candidate patterns are never generated, which in turn saves memory. Therefore, the memory consumption of each variant of SUMU is roughly equal.\n\n\\subsection{Scalability}\n\nIn this section, we selected five synthetic datasets to evaluate the scalability of each variant of SUMU. The dataset size increases from 10k to 50k sequence records, in increments of 10k. We set a relative support for the experiments; i.e., \\textit{minsup} was set to 10, 20, 30, 40, and 50 for the five synthetic datasets, respectively. In addition, \\textit{minuo} is set to 0.1 in order to generate more HUOSPs. We analyze scalability in terms of runtime and candidate pattern generation, and the experimental results are shown in Fig. \\ref{scalability}.\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.4]{figs/scalability.pdf}\n\t\\caption{Scalability of the compared variants of SUMU}\n\t\\label{scalability}\n\\end{figure}\n\nFrom Fig. 
\\ref{scalability}, it is clear that the runtime of each variant of SUMU grows as the size of the processed dataset increases. This is consistent with our expectation that larger datasets carry more candidate patterns, increasing the processing difficulty. The use of UOL-Chain and UO-Table gives every variant of SUMU the same overall trend, with only differences in efficiency. The difference between the SUMU variants is clear, with SUMU$_{\\textit{PES}}$ performing best and SUMU$_{\\textit{TPUO}}$ performing worst. For SUMU$_{\\textit{PES}}$, there is no such rapid growth in candidate patterns, while the other variants of SUMU generate a large number of them; therefore, SUMU$_{\\textit{PES}}$ performs well when handling large-scale datasets. In particular, the large number of sorting operations required to calculate the tighter upper bounds causes SUMU$_{\\textit{TPUO}}$ to perform poorly. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{simple}}$ illustrates the effectiveness of pruning Strategy \\ref{strategy1}.\n\n\n\\subsection{Pattern Analysis}\n\nIn this section, we mainly discuss how the number of HUOSPs changes as \\textit{minsup} or \\textit{minuo} changes. The results for various \\textit{minsup} under a fixed \\textit{minuo} are shown in Table \\ref{patterns_minsup}. Likewise, the results for various \\textit{minuo} under a fixed \\textit{minsup} are shown in Table \\ref{patterns_minuo}. For each dataset, we use \\textit{minsup}$_1$, \\textit{minsup}$_2$ (or \\textit{minuo}$_1$, \\textit{minuo}$_2$), and so on to denote successively larger settings of the parameter \\textit{minsup} (or \\textit{minuo}). For instance, in our experiments on the Bible dataset, the six \\textit{minsup} values are set to 300, 400, 500, 600, 700, and 800, and the six \\textit{minuo} values are set to 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. 
\\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}.\n\n\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minsup}}\n\t\\label{patterns_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\\\ \\hline\n\t\t\tFIFA, \\textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\\\ \\hline\n\t\t\tSign, \\textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minuo} = 0.1 & 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minsup} = 500 & 11,721 & 11,668 & 11,390 & 10,433 & 8,037 & 5,012 \\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 
6,134 & 3,136 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nFrom Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}, it is clear that the number of generated HUOSPs on each dataset changes considerably as \\textit{minsup} or \\textit{minuo} is adjusted. In particular, the number of generated HUOSPs on the synthetic datasets is higher than that on the real datasets. This is because, in the synthetic datasets, each itemset contains multiple items and can therefore form more candidate patterns. Furthermore, as \\textit{minsup} decreases step by step, the number of HUOSPs increases rapidly; for example, the difference between the pattern counts under \\textit{minsup}$_1$ and \\textit{minsup}$_2$ is larger than the difference between those under \\textit{minsup}$_2$ and \\textit{minsup}$_3$. This phenomenon is reasonable and also occurs in frequent itemset mining and sequential pattern mining. The utility occupancy measure, however, behaves differently: the number of generated HUOSPs increases only gradually as \\textit{minuo} is decreased, because the set of HUOSPs produced by SUMU does not vary much across smaller \\textit{minuo} settings. Similar behavior can also be found in the HUOPM algorithm \\cite{gan2019huopm}. 
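The cost-versus-tightness trade-off between the linear-pass bounds (\textit{PEUO}/\textit{RSUO}) and the top-\textit{minsup} bounds (\textit{TPUO}/\textit{TSUO}) discussed in the efficiency analysis can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the SUMU implementation: the function names and the flat list of per-sequence utility-occupancy values are hypothetical, whereas SUMU reads these values from a pattern's UOL-Chain and maintains \textit{minsup}-sized priority queues incrementally.

```python
import heapq

def loose_upper_bound(values, minsup):
    # Linear-pass bound (PEUO/RSUO-style sketch): one scan that
    # accumulates every per-sequence value, divided by minsup.
    # Cost: O(n) time, O(1) extra memory.
    return sum(values) / minsup

def tight_upper_bound(values, minsup):
    # Top-k bound (TPUO/TSUO-style sketch): average only the minsup
    # largest values, here via a bounded heap. Never looser than the
    # linear-pass bound (for non-negative values), but costs
    # O(n log minsup) time and O(minsup) extra memory.
    return sum(heapq.nlargest(minsup, values)) / minsup

# Toy per-sequence utility-occupancy values for one candidate pattern.
values = [0.5, 0.4, 0.3, 0.2]
print(loose_upper_bound(values, 2))  # 0.7
print(tight_upper_bound(values, 2))  # 0.45
```

The tighter bound lets more candidates be pruned against \textit{minuo}, but each evaluation is more expensive, which is consistent with SUMU$_\textit{TPUO}$ running slower than SUMU$_\textit{PEUO}$ despite pruning slightly more candidates.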
\n\n", "Descriptive_question1": "What is the number of patterns generated for the Bible dataset at minsup_1 in table_4?", "Descriptive_question2": "Which dataset has the highest number of patterns at minsup_1 in table_4?", "Reasoning_question1": "Why do synthetic datasets like Syn40k show a higher number of patterns compared to real datasets like Bible in table_4?", "Reasoning_question2": "How does the trend in the number of patterns generated change as minsup increases across different datasets in table_4?", "Descriptive_answer1": "21,442", "Descriptive_answer2": "Syn40k", "Reasoning_answer1": "Synthetic datasets like Syn40k show a higher number of patterns compared to real datasets like Bible because they are designed with characteristics that facilitate the generation of more candidate patterns. Specifically, synthetic datasets often contain itemsets with multiple items, which inherently allows for the formation of a larger number of sequential patterns. In contrast, real datasets such as Bible may have more constrained or context-specific data structures with fewer item combinations per sequence. As seen in the experimental context, synthetic datasets like Syn10k, Syn20k, and Syn40k have much larger pattern counts (e.g., Syn40k with 7,223,421 patterns at minsup_1) compared to Bible (21,442 patterns at minsup_1), supporting the idea that their synthetic nature, with controlled and denser data generation, results in more HUOSPs as mentioned in the pattern analysis section.", "Reasoning_answer2": "As minsup increases across different datasets in table_4, the trend in the number of patterns generated consistently shows a decrease. This trend can be observed by examining the pattern counts from minsup_1 to minsup_6 for each dataset. For instance, in the Bible dataset, the number of patterns drops from 21,442 at minsup_1 to 2,211 at minsup_6. Similarly, for Syn40k, it decreases from 7,223,421 to 770,393 over the same range. 
The reasoning behind this trend is rooted in the definition of minsup as a minimum support threshold; increasing minsup means that only patterns appearing more frequently in the dataset qualify as HUOSPs, thus reducing the total count of qualifying patterns. This inverse relationship is consistent with frequent itemset and sequential pattern mining principles, as noted in the experimental analysis, where a higher threshold naturally filters out less frequent patterns across both real and synthetic datasets." }, { "paper_id": "2212.10452.json", "table_id": "table_5", "table_content": "\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\ \\hline{Bible}, \\textit{minsup} = 500 & 11721 & 11668 & 11390 & 10433 & 8037 & 5012\n\t\t\t\\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}", "caption": "Number of patterns generated by varying \\textit{minuo}", "label": "patterns_minuo", "section_info": "5 Experiments\n\\section{Experiments} \\label{sec:experiments}\n\nWe selected both real and synthetic datasets to 
conduct related experiments. The proposed SUMU algorithm is the first approach to mining sequential patterns with a utility occupancy measure; thus, there is no suitable algorithm for comparison. We mainly focused on verifying the efficiency of the proposed upper bounds and pruning strategies and the effectiveness of SUMU. The SUMU code is implemented in the Java language and was developed in Eclipse. Our extensive experiments were conducted on a computer equipped with an i7-12700F 2.10 GHz CPU and 16 GB of RAM. The experimental details and results are shown below.\n\n\\subsection{Experimental Setup and Datasets}\n\nThree real datasets (Bible, FIFA, and Sign) and three synthetic datasets (Syn10k, Syn20k, and Syn40k) were used in the experiments. The real datasets are often used in the evaluation of pattern mining algorithms and can be accessed from the SPMF website\\footnote{\\url{http://www.philippe-fournier-viger.com/spmf/}}. The synthetic datasets were generated by the IBM Quest Synthetic Data Generator \\cite{QSD}. Each dataset has its own characteristics and can represent a specific type of data in practical applications. The characteristics of these datasets are described below.\n\n$ \\bullet $ \\textit{\\textbf{Bible}} contains 13,905 items and 36,369 sequences, which are transformed from the Bible. Its average sequence length is 21.64.\n\n$ \\bullet $ \\textit{\\textbf{FIFA}} contains 2,990 items and 20,450 sequences derived from the website of the FIFA World Cup 98. Its average sequence length is 36.23.\n\n$ \\bullet $ \\textit{\\textbf{Sign}} is a small but dense dataset of sign language utterances, with 267 items and 730 sequences. Its average sequence length is 27.11.\n\n$ \\bullet $ \\textit{\\textbf{Syn10k}} is a synthetic dataset with 10,000 sequence records. 
It has 7,312 distinct items, and its average sequence length is 26.97.\n\n$ \\bullet $ \\textit{\\textbf{Syn20k}} is a synthetic dataset with 20,000 sequence records. It has 7,442 distinct items, and its average sequence length is 26.84.\n\n$ \\bullet $ \\textit{\\textbf{Syn40k}} is a synthetic dataset with 40,000 sequence records. It has 7,537 distinct items, and its average sequence length is 26.84.\n\nTo better evaluate the proposed SUMU algorithm, several variants of SUMU have also been designed, so that the experimental results can better show the capabilities of the designed upper bounds and pruning strategies. In our experiments, the proposed SUMU algorithm with upper bounds \\textit{PEUO} and \\textit{RSUO} is denoted as SUMU$_\\textit{simple}$. This means that only Strategies \\ref{strategy2} and \\ref{strategy3} are used in SUMU$_\\textit{simple}$. On the basis of SUMU$_\\textit{simple}$, if unpromising items are filtered out (with Strategy \\ref{strategy1}) before generating HUOSPs, then this variant of SUMU is denoted as SUMU$_\\textit{PEUO}$. To analyze the performance gap between \\textit{PEUO} and \\textit{TPUO}, and between \\textit{RSUO} and \\textit{TSUO}, in the experiments we also designed another variant of SUMU (with Strategies \\ref{strategy1}, \\ref{strategy4}, and \\ref{strategy5}), denoted as SUMU$_\\textit{TPUO}$. In addition, on the basis of SUMU$_\\textit{PEUO}$, the fourth variant, namely SUMU$_\\textit{PES}$, is designed to evaluate the two upper bounds on the support measure. These variants of SUMU are compared to comprehensively evaluate the effectiveness and efficiency of SUMU.\n\n\\subsection{Pattern Analysis}\n\nIn this section, we mainly discuss how the number of HUOSPs changes as \\textit{minsup} or \\textit{minuo} changes. The results for various \\textit{minsup} under a fixed \\textit{minuo} are shown in Table \\ref{patterns_minsup}. 
Likewise, the results for various \\textit{minuo} and under a fixed \\textit{minsup} are shown in Table \\ref{patterns_minuo}. For each dataset, we use \\textit{minsup}$_1$, \\textit{minsup}$_2$ (or \\textit{minuo}$_1$, \\textit{minuo}$_2$), and so on to indicate that we increasingly adjust the parameter \\textit{minsup} (or \\textit{minuo}). For instance, in our experiments, for the Bible dataset, the six parameters on \\textit{minsup} are set to 300, 400, 500, 600, 700, and 800; and the six parameters on \\textit{minuo} are set to 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}.\n\n\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minsup}}\n\t\\label{patterns_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\\\ \\hline\n\t\t\tFIFA, \\textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\\\ \\hline\n\t\t\tSign, \\textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minuo} = 0.1& 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying 
\\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\ \\hline{Bible}, \\textit{minsup} = 500 & 11721 & 11668 & 11390 & 10433 & 8037 & 5012\n\t\t\t\\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nFrom Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}, it is clear that the number of generated HUOSPs on each dataset is quite different as \\textit{minsup} or \\textit{minuo} is adjusted. Particularly, the number of generated HUOSPs on the synthetic datasets is higher than that on the real datasets. This is because for these synthetic datasets, each of their itemsets contains multiple items and can format more candidate patterns. Furthermore, as \\textit{minsup} decreases by interval, the number of HUOSPs increases rapidly. For example, the difference between the number of patterns generated by \\textit{minsup}$_1$ and the number of patterns generated under \\textit{minsup}$_2$ is smaller than the difference between \\textit{minsup}$_2$ and \\textit{minsup}$_1$. This phenomenon is reasonable and also occurs in frequent itemset mining or sequential pattern mining. 
The utility occupancy measure, however, behaves differently: the number of generated HUOSPs increases only gradually as \\textit{minuo} is decreased, because the HUOSPs generated by the SUMU algorithm do not vary much for smaller \\textit{minuo} settings. In fact, a similar situation can also be found in the HUOPM algorithm \\cite{gan2019huopm}. \n\n\\subsection{Efficiency Analysis}\n\nIn this subsection, we conducted extensive experiments to evaluate the performance of the different upper bounds and pruning strategies used in SUMU. The results in terms of runtime for various \\textit{minsup} and \\textit{minuo} settings are shown in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}. The results in terms of candidate patterns for various \\textit{minsup} and \\textit{minuo} settings are shown in Tables \\ref{candidates_minsup} and \\ref{candidates_minuo}.\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminsup.pdf}\n\t\\caption{Running time under various \\textit{minsup} and a fixed \\textit{minuo} = 0.1.}\n\t\\label{runtime_minsup}\n\\end{figure}\n\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 
\\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 
\\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\n\n\n\nFrom Fig. \\ref{runtime_minsup} and Table \\ref{candidates_minsup}, under different \\textit{minsup} settings, we can clearly see that the runtime of the variant SUMU$_\\textit{simple}$ is the worst on the datasets Bible and FIFA; the runtime of the variant SUMU$_\\textit{TPUO}$ is the worst on the datasets Sign, Syn20k, and Syn40k. SUMU$_\\textit{simple}$ and SUMU$_\\textit{TPUO}$ perform similarly on Syn10k. However, when \\textit{minsup} is set to 10, the runtime of SUMU$_\\textit{TPUO}$ exceeds that of SUMU$_\\textit{simple}$. In addition, the variant SUMU$_\\textit{PES}$, which uses four upper bounds (\\textit{PEUO}, \\textit{RSUO}, \\textit{PES}, and \\textit{RSS}), achieves the best performance on all datasets, and the variant SUMU$_\\textit{PEUO}$ is the second fastest. 
SUMU$_\\textit{PES}$ generates the fewest candidate patterns, while SUMU$_\\textit{simple}$ reduces them the least. The results of our experiments are as expected. From the experiments under different \\textit{minsup} and a fixed \\textit{minuo}, we can draw the following conclusions.\n\n\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item SUMU$_\\textit{PES}$ adopts a sufficient number of pruning strategies to significantly reduce candidate patterns while achieving the shortest runtime. Compared to the other variants of SUMU, SUMU$_\\textit{PES}$ generates far fewer candidate patterns. Although it generates many times fewer candidate patterns than the other variants, its overall performance is only a few times better. This is because a lot of unpromising candidate patterns are also ignored in the subsequent program steps.\n\t\n\t\\item The difference between SUMU$_\\textit{simple}$ and SUMU$_\\textit{PEUO}$ demonstrates that the pruning Strategy \\ref{strategy1} is ineffective on synthetic datasets. This is because \\textit{minsup} is set to relatively small values, and thus there are not many unpromising items appearing in the sequence dataset.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, their calculation makes SUMU$_\\textit{TPUO}$ take longer than SUMU$_\\textit{PEUO}$. For a candidate pattern, SUMU$_\\textit{PEUO}$ is able to compute the upper bounds \\textit{PEUO} and \\textit{RSUO} in linear time, whereas the calculation of \\textit{TSUO} requires multiple sorting operations, which is a more complex process. In addition, SUMU$_\\textit{TPUO}$ does not reduce any candidate pattern on many datasets (including Bible, FIFA, and Sign). 
Even if it works on the few remaining datasets, it only reduces the number of candidate patterns by a particularly small amount.\n\\end{enumerate}\n\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminuo.pdf}\n\t\\caption{Running time under various \\textit{minuo} and a fixed \\textit{minsup}. (a) Bible, \\textit{minsup} = 500. (b) FIFA, \\textit{minsup} = 4,000. (c) Sign, \\textit{minsup} = 70. (d) Syn10k, \\textit{minsup} = 14. (e) Syn20k, \\textit{minsup} = 24. (f) Syn40k, \\textit{minsup} = 34.}\n\t\\label{runtime_minuo}\n\\end{figure}\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minuo}}\n\t\\label{candidates_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minsup} = 500}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,921,104 & 672,801 & 390,245 & 264,424 & 195,445 & 153,171 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minsup} = 4000}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 280,551 & 178,428 & 144,686 & 102,302 & 69,806 & 48,009 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 
\\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minsup} = 70}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,588,257 & 1,393,525 & 1,233,358 & 1,101,274 & 988,726 & 894,110 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,284,130 & 1,129,929 & 1,002,217 & 895,976 & 806,047 & 730,904 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,284,130 & 1,129,928 & 1,002,216 & 895,923 & 805,911 & 730,644 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 126,752 & 126,743 & 126,716 & 126,652 & 126,478 & 126,240 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minsup} = 14}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,119,006 & 2,916,190 & 2,399,859 & 2,021,727 & 1,747,076 & 1,534,885 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,727,340 & 2,790,561 & 2,301,336 & 1,948,286 & 1,680,334 & 1,485,466 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,727,339 & 2,790,547 & 2,301,264 & 1,948,081 & 1,679,416 & 1,482,724 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 537,899 & 537,774 & 537,328 & 536,225 & 533,228 & 526,147 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minsup} = 24}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 5,991,112 & 5,044,232 & 4,517,900 & 4,112,343 & 3,773,772 & 3,423,454 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 5,706,748 & 4,904,484 & 4,430,131 & 4,041,273 & 3,712,031 & 3,357,823 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 5,706,744 & 4,904,420 & 4,429,752 & 4,040,166 & 3,709,467 & 3,352,790 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 841,369 & 840,970 & 838,452 & 830,080 & 812,793 & 785,422 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minsup} = 34}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 
39,435,204 & 36,517,396 & 34,219,382 & 31,977,235 & 29,399,324 & 26,176,962 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 36,148,568 & 33,947,656 & 31,690,858 & 29,135,445 & 25,872,139 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 36,148,106 & 33,944,569 & 31,678,889 & 29,102,992 & 25,803,949 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 7,452,659 & 7,440,512 & 7,403,338 & 7,327,730 & 7,200,566 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\nFurthermore, from Fig. \\ref{runtime_minuo} and Table \\ref{candidates_minuo}, under different \\textit{minuo} settings, we can clearly observe that SUMU$_\\textit{PES}$ is the fastest variant of SUMU. On the datasets Sign, Syn10k, Syn20k, and Syn40k, there are some fluctuations that occur in all the variants of SUMU, but the overall trend is still clear. Regardless of which dataset is processed, the runtime curve of SUMU$_\\textit{PES}$ becomes smoother as \\textit{minuo} is adjusted. In particular, the number of candidate patterns generated by SUMU$_\\textit{PES}$ on the Bible and FIFA datasets does not change at all. From the experiments under different \\textit{minuo} and a fixed \\textit{minsup}, we can draw the following conclusions.\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item Unlike the experiments tuning \\textit{minsup}, the runtime of each variant of SUMU is not much affected by the setting of \\textit{minuo}. On the datasets Bible and FIFA, the runtimes of SUMU$_\\textit{PEUO}$, SUMU$_\\textit{TPUO}$, and SUMU$_\\textit{PES}$ hardly increase when \\textit{minuo} decreases, while the numbers of candidate patterns for SUMU$_\\textit{PEUO}$ and SUMU$_\\textit{TPUO}$ increase substantially. 
This suggests that the support measure plays a greater role in determining the program runtime than the utility occupancy measure.\n\t\n\t\\item SUMU$_\\textit{PES}$ still achieves the fastest runtime due to the most reasonable pruning strategies it uses. Moreover, on each dataset, as \\textit{minuo} decreases, it does not generate many more candidate patterns. The upper bounds \\textit{PES} and \\textit{RSS} already make it possible to eliminate many invalid candidate patterns.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, as \\textit{minuo} decreases, they still do not reduce many irrelevant candidate patterns for SUMU$_\\textit{TPUO}$. \t\n\\end{enumerate}\n\n\n\\subsection{Memory Evaluation}\n\nThe memory consumption of the SUMU variants is close and fluctuates, so we report the approximate memory consumption on the different datasets. We also investigate the reasons for the disparities in memory consumption based on program design details. The experimental results regarding memory consumption are shown in Table \\ref{memory}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Memory consumption}\n\t\\label{memory}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{} & \\multicolumn{6}{c}{\\textbf{Approximate memory consumed (MB)}} \\\\ \\cline{2-7} \n\t\t\t& Bible & FIFA & Sign & Syn10k & Syn20k & Syn40k \\\\ \\hline\n\t\t\tfixed \\textit{minuo} & 1,000 $\\sim$ 1,400 & 1,400 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 800 & 600 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\tfixed \\textit{minsup} & 1,000 $\\sim$ 1,400 & 1,500 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 600 & 300 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nSince each variant of SUMU uses both UOL-Chain and UO-Table, the differences in their memory consumption are small and within a reasonable range. 
The variants of SUMU employ different numbers of pruning strategies, and thus they differ somewhat in the auxiliary data structures they use. If the pruning Strategy \\ref{strategy1} is used, then unpromising items should be filtered out. To find out which items are unpromising, the program utilizes a hash table to record the support of each item. In contrast, if all items are used directly, it is sufficient for the program to use a single list to record those items that occur in the sequence database. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{TPUO}}$ is that they use different upper bounds and pruning strategies. The calculation of \\textit{PEUO} and \\textit{RSUO} for a pattern is relatively simple: the program quickly scans the UOL-Chain of the pattern and accumulates the corresponding values. However, for the calculation of \\textit{TPUO} and \\textit{TSUO} of a pattern, several \\textit{minsup}-sized priority queues are required. This allows computing tighter upper bound values, but also consumes additional memory space. As for SUMU$_{\\textit{PES}}$, it additionally uses the upper bounds \\textit{PES} and \\textit{RSS} (adopting pruning Strategies \\ref{strategy6} and \\ref{strategy7}) compared to SUMU$_{\\textit{PEUO}}$. This means that, during pattern extension, the program needs the associated hash tables to decide which candidate patterns satisfy the upper bounds \\textit{PES} and \\textit{RSS}. It seems that the more upper bounds and pruning strategies are used, the more memory is consumed. Nevertheless, in the experiments, effective pruning strategies avoid building unnecessary UOL-Chains and UO-Tables because some candidate patterns are never generated, which also saves memory. Therefore, the memory consumption of each variant of SUMU is roughly equal.\n\n\\subsection{Scalability}\n\nIn this section, five synthetic datasets were selected to evaluate the scalability of each variant of SUMU. 
The dataset size increases from 10k to 50k sequence records, in steps of 10k. We set a relative support for the experiments, i.e., \\textit{minsup} was set to 10, 20, 30, 40, and 50 for the five synthetic datasets, respectively. In addition, \\textit{minuo} is set to 0.1 in order to generate more HUOSPs. We analyze the scalability in terms of runtime and candidate pattern generation, and the experimental results are shown in Fig. \\ref{scalability}.\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.4]{figs/scalability.pdf}\n\t\\caption{Scalability of the compared variants of SUMU}\n\t\\label{scalability}\n\\end{figure}\n\nFrom Fig. \\ref{scalability}, it is clear that the runtime of each variant of SUMU grows as the size of the processed dataset increases. This is consistent with our expectation that larger datasets carry more candidate patterns, increasing the processing difficulty. The use of UOL-Chain and UO-Table makes the trend of each variant of SUMU the same, with only differences in efficiency. The difference between all SUMU variants is clear, with SUMU$_{\\textit{PES}}$ performing best and SUMU$_{\\textit{TPUO}}$ performing worst. For SUMU$_{\\textit{PES}}$, there is no such rapid growth of candidate patterns, while the other variants of SUMU generate a large number of candidate patterns; therefore, SUMU$_{\\textit{PES}}$ performs well when handling large-scale datasets. The large number of sorting operations required for the calculation of tighter upper bounds, in particular, causes SUMU$_{\\textit{TPUO}}$ to perform poorly. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{simple}}$ illustrates the effectiveness of the pruning Strategy \\ref{strategy1}.\n\n\n5.2 Pattern Analysis\n\\subsection{Pattern Analysis}\n\nIn this section, we mainly discuss how the number of HUOSPs changes as \\textit{minsup} or \\textit{minuo} changes. 
The results for various \\textit{minsup} values under a fixed \\textit{minuo} are shown in Table \\ref{patterns_minsup}. Likewise, the results for various \\textit{minuo} values under a fixed \\textit{minsup} are shown in Table \\ref{patterns_minuo}. For each dataset, we use \\textit{minsup}$_1$, \\textit{minsup}$_2$ (or \\textit{minuo}$_1$, \\textit{minuo}$_2$), and so on to indicate that we progressively increase the parameter \\textit{minsup} (or \\textit{minuo}). For instance, in our experiments, for the Bible dataset, the six parameters on \\textit{minsup} are set to 300, 400, 500, 600, 700, and 800; and the six parameters on \\textit{minuo} are set to 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}.\n\n\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minsup}}\n\t\\label{patterns_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\\\ \\hline\n\t\t\tFIFA, \\textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\\\ \\hline\n\t\t\tSign, \\textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minuo} = 0.1 & 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\\\ \\hline\n\t\t\t\\hline 
\n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minsup} = 500 & 11721 & 11668 & 11390 & 10433 & 8037 & 5012 \\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nFrom Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}, it is clear that the number of generated HUOSPs on each dataset changes considerably as \\textit{minsup} or \\textit{minuo} is adjusted. In particular, the number of generated HUOSPs on the synthetic datasets is higher than that on the real datasets. This is because, for these synthetic datasets, each itemset contains multiple items and can thus form more candidate patterns. Furthermore, as \\textit{minsup} decreases by a fixed interval, the number of HUOSPs increases rapidly. 
For example, the difference between the numbers of patterns generated under \\textit{minsup}$_1$ and \\textit{minsup}$_2$ is larger than the difference between the numbers generated under \\textit{minsup}$_2$ and \\textit{minsup}$_3$. This phenomenon is reasonable and also occurs in frequent itemset mining and sequential pattern mining. The utility occupancy measure, however, behaves differently: the number of generated HUOSPs increases only gradually as \\textit{minuo} is decreased, because the HUOSPs generated by the SUMU algorithm do not vary much for smaller \\textit{minuo} settings. In fact, a similar situation can also be found in the HUOPM algorithm \\cite{gan2019huopm}. \n\n", "Descriptive_question1": "What is the number of patterns generated for the Bible dataset at minuo_1 in table_5?", "Descriptive_question2": "What is the number of patterns generated for FIFA dataset at minuo_6 in table_5?", "Reasoning_question1": "Why does the number of patterns generated in table_5 decrease as minuo increases across all datasets?", "Reasoning_question2": "How does the trend of pattern generation in table_5 for synthetic datasets compare to real datasets when varying minuo, and what might explain this difference?", "Descriptive_answer1": "11721", "Descriptive_answer2": "20", "Reasoning_answer1": "The decrease in the number of patterns generated as minuo increases across all datasets in Table 5 can be explained through a logical progression. First, minuo represents a utility occupancy threshold, which is a measure used to filter patterns based on their utility value relative to the dataset. As minuo increases, this threshold becomes stricter, meaning fewer patterns will meet the criteria to be considered valid or significant under the SUMU algorithm. Consequently, with a higher minuo value, more patterns are excluded because their utility occupancy does not satisfy the elevated requirement. 
This results in a consistent reduction in the number of qualifying patterns across datasets like Bible, FIFA, Sign, Syn10k, Syn20k, and Syn40k. Additionally, the text indicates that the number of HUOSPs (High Utility Occupancy Sequential Patterns) does not vary much for smaller minuo settings but still gradually decreases as minuo rises, reinforcing the idea that increasing minuo systematically filters out more patterns.", "Reasoning_answer2": "When comparing the trend of pattern generation in Table 5 for synthetic datasets (Syn10k, Syn20k, Syn40k) versus real datasets (Bible, FIFA, Sign) as minuo varies, a clear difference emerges through a step-by-step analysis. First, observing the data, synthetic datasets consistently generate a significantly higher number of patterns compared to real datasets at each minuo level. For example, at minuo_1, Syn40k has 7,223,421 patterns while Bible has only 11,721. As minuo increases to minuo_6, Syn40k drops to 1,777,904, still vastly higher than Bible’s 5,012. The rate of decrease appears more pronounced in synthetic datasets due to the larger initial numbers, but proportionally, both types show a similar declining trend. The key difference lies in the scale: synthetic datasets start with and maintain higher pattern counts. This can be explained by the nature of the datasets as described in the text. Synthetic datasets are designed to have itemsets with multiple items, allowing for the formation of more candidate patterns. In contrast, real datasets may have more constrained or specific item distributions due to their real-world origin, limiting the number of patterns that can be generated. Additionally, synthetic datasets likely have controlled characteristics that favor pattern diversity, leading to more combinations meeting the minuo threshold initially, whereas real datasets like FIFA or Sign may reflect more irregular or sparse data distributions, inherently producing fewer patterns." 
}, { "paper_id": "2212.10452.json", "table_id": "table_6", "table_content": "\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 
390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}", 
"caption": "Number of candidate patterns generated by varying \\textit{minsup}", "label": "candidates_minsup", "section_info": "5 Experiments\n\\section{Experiments} \\label{sec:experiments}\n\nWe selected both real and synthetic datasets to conduct related experiments. The proposed SUMU algorithm is the first approach to mining sequential patterns with a utility occupancy measure. Thus, there is no suitable algorithm for comparison. We mainly focused on verifying the efficiency of the proposed upper bounds and pruning strategies and the effectiveness of SUMU. The code related to SUMU is programmed using the Java language and developed in Eclipse. Our extensive experiments were conducted on a bare computer equipped with an i7-12700F 2.10 GHz CPU and 16 GB of RAM. The experimental details and results are shown below.\n\n\\subsection{Experimental Setup and Datasets}\n\nThree real datasets (including Bible, FIFA, and Sign) and three synthetic datasets (including Syn10k, Syn20k, and Syn40k) were used in the experiments. The real datasets are often used in the evaluation of pattern mining algorithms and can be accessed from the website SPMF\\footnote{\\url{http://www.philippe-fournier-viger.com/spmf/}}. The synthetic datasets were generated by the IBM Quest Synthetic Data Generator \\cite{QSD}. Each dataset has its own characteristics and can represent a specific type of data in practical applications. The characteristics of these datasets are described below.\n\n$ \\bullet $ \\textit{\\textbf{Bible}} contains 13,905 items and 36,369 sequences, which are transformed from the book Bible. Its average sequence length is 21.64.\n\n$ \\bullet $ \\textit{\\textbf{FIFA}} contains 2,990 items and 20,450 sequences derived from the website of FIFA World Cup 98. Its average sequence length is 36.23.\n\n$ \\bullet $ \\textit{\\textbf{Sign}} is a small but dense dataset of sign language utterances, with 267 items and 730 sequences. 
Its average sequence length is 27.11.\n\n$ \\bullet $ \\textit{\\textbf{Syn10k}} is a synthetic dataset with 10,000 sequence records. It has 7,312 distinct items, and its average sequence length is 26.97.\n\n$ \\bullet $ \\textit{\\textbf{Syn20k}} is a synthetic dataset with 20,000 sequence records. It has 7,442 distinct items, and its average sequence length is 26.84.\n\n$ \\bullet $ \\textit{\\textbf{Syn40k}} is a synthetic dataset with 40,000 sequence records. It has 7,537 distinct items, and its average sequence length is 26.84.\n\nTo better evaluate the proposed SUMU algorithm, several variants of SUMU have also been designed, so that the experimental results can better show the capabilities of the designed upper bounds and pruning strategies. In our experiments, the proposed SUMU algorithm with the upper bounds \\textit{PEUO} and \\textit{RSUO} is denoted as SUMU$_\\textit{simple}$. This means that only Strategies \\ref{strategy2} and \\ref{strategy3} are used in SUMU$_\\textit{simple}$. On the basis of SUMU$_\\textit{simple}$, if unpromising items are filtered out (with Strategy \\ref{strategy1}) before generating HUOSPs, then this variant of SUMU is denoted as SUMU$_\\textit{PEUO}$. To analyze the performance gap between \\textit{PEUO} and \\textit{TPUO}, and between \\textit{RSUO} and \\textit{TSUO}, we also designed another variant of SUMU (with Strategies \\ref{strategy1}, \\ref{strategy4}, and \\ref{strategy5}), denoted as SUMU$_\\textit{TPUO}$. In addition, on the basis of SUMU$_\\textit{PEUO}$, the fourth variant, namely SUMU$_\\textit{PES}$, is designed to evaluate the two upper bounds on the support measure. These variants of SUMU are compared to comprehensively evaluate the effectiveness and efficiency of SUMU.\n\n\\subsection{Pattern Analysis}\n\nIn this subsection, we mainly discuss how the number of HUOSPs changes as \\textit{minsup} or \\textit{minuo} changes. 
The results for various \\textit{minsup} settings under a fixed \\textit{minuo} are shown in Table \\ref{patterns_minsup}. Likewise, the results for various \\textit{minuo} settings under a fixed \\textit{minsup} are shown in Table \\ref{patterns_minuo}. For each dataset, we use \\textit{minsup}$_1$, \\textit{minsup}$_2$ (or \\textit{minuo}$_1$, \\textit{minuo}$_2$), and so on to denote the increasing settings of the parameter \\textit{minsup} (or \\textit{minuo}). For instance, in our experiments, for the Bible dataset, the six settings of \\textit{minsup} are 300, 400, 500, 600, 700, and 800; and the six settings of \\textit{minuo} are 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}.\n\n\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minsup}}\n\t\\label{patterns_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\\\ \\hline\n\t\t\tFIFA, \\textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\\\ \\hline\n\t\t\tSign, \\textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minuo} = 0.1 & 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\\\ \\hline\n\t\t\t\\hline 
\n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minsup} = 500 & 11,721 & 11,668 & 11,390 & 10,433 & 8,037 & 5,012 \\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nFrom Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}, it is clear that the number of generated HUOSPs on each dataset is quite different as \\textit{minsup} or \\textit{minuo} is adjusted. In particular, the number of generated HUOSPs on the synthetic datasets is higher than that on the real datasets. This is because each itemset in these synthetic datasets contains multiple items and can thus form more candidate patterns. Furthermore, as \\textit{minsup} decreases step by step, the number of HUOSPs increases rapidly. 
For example, the difference between the number of patterns generated under \\textit{minsup}$_2$ and that under \\textit{minsup}$_3$ is smaller than the difference between \\textit{minsup}$_1$ and \\textit{minsup}$_2$. This phenomenon is reasonable and also occurs in frequent itemset mining and sequential pattern mining. The utility occupancy measure behaves differently, however: the number of generated HUOSPs increases only gradually as \\textit{minuo} is decreased, because the HUOSPs generated by the SUMU algorithm do not vary much under smaller \\textit{minuo} settings. Similar behavior can also be observed in the HUOPM algorithm \\cite{gan2019huopm}. \n\n\\subsection{Efficiency Analysis}\n\nIn this subsection, we conducted extensive experiments to evaluate the performance of the different upper bounds and pruning strategies used in SUMU. The results in terms of runtime for various \\textit{minsup} and \\textit{minuo} settings are shown in Fig. \\ref{runtime_minsup} and Fig. \\ref{runtime_minuo}. 
The results in terms of candidate patterns for various \\textit{minsup} and \\textit{minuo} settings are shown in Tables \\ref{candidates_minsup} and \\ref{candidates_minuo}.\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminsup.pdf}\n\t\\caption{Running time under various \\textit{minsup} and a fixed \\textit{minuo} = 0.1.}\n\t\\label{runtime_minsup}\n\\end{figure}\n\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} 
= 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 
\\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\n\n\n\nFrom Fig. \\ref{runtime_minsup} and Table \\ref{candidates_minsup}, under different \\textit{minsup} settings, we can clearly see that the runtime of the variant SUMU$_\\textit{simple}$ is the worst on the datasets Bible and FIFA, while the runtime of the variant SUMU$_\\textit{TPUO}$ is the worst on the datasets Sign, Syn20k, and Syn40k. SUMU$_\\textit{simple}$ and SUMU$_\\textit{TPUO}$ perform similarly on Syn10k; however, when \\textit{minsup} is set to 10, the runtime of SUMU$_\\textit{TPUO}$ exceeds that of SUMU$_\\textit{simple}$. In addition, the variant SUMU$_\\textit{PES}$, which uses four upper bounds (\\textit{PEUO}, \\textit{RSUO}, \\textit{PES}, and \\textit{RSS}), achieves the best performance on all datasets, and the variant SUMU$_\\textit{PEUO}$ requires the second-least runtime. SUMU$_\\textit{PES}$ reduces candidate pattern generation the most, while SUMU$_\\textit{simple}$ reduces it the least. The results of our experiment are as expected. From the experiments under different \\textit{minsup} and a fixed \\textit{minuo}, we can draw the following conclusions.\n\n\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item SUMU$_\\textit{PES}$ adopts a sufficient number of pruning strategies to significantly reduce candidate patterns while achieving the shortest runtime. Compared to the other variants of SUMU, SUMU$_\\textit{PES}$ generates far fewer candidate patterns. However, although the number of candidate patterns is many times smaller than for the other variants, the overall runtime is not more than a few times better. 
This is because the unpromising candidate patterns are also discarded cheaply in the subsequent program steps.\n\t\n\t\\item The difference between SUMU$_\\textit{simple}$ and SUMU$_\\textit{PEUO}$ demonstrates that pruning Strategy \\ref{strategy1} is ineffective on the synthetic datasets. This is because \\textit{minsup} is set to relatively small values, and thus there are not many unpromising items in the sequence dataset.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, their calculation makes SUMU$_\\textit{TPUO}$ take longer than SUMU$_\\textit{PEUO}$. For a candidate pattern, SUMU$_\\textit{PEUO}$ is able to compute the upper bounds \\textit{PEUO} and \\textit{RSUO} in linear time. In contrast, computing \\textit{TSUO} requires multiple sorting operations, which is a more complex process. In addition, SUMU$_\\textit{TPUO}$ does not reduce any candidate patterns on many datasets (including Bible, FIFA, and Sign). Even on the remaining datasets where it does work, it reduces the number of candidate patterns by only a particularly small amount.\n\\end{enumerate}\n\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminuo.pdf}\n\t\\caption{Running time under various \\textit{minuo} and a fixed \\textit{minsup}. (a) Bible, \\textit{minsup} = 500. (b) FIFA, \\textit{minsup} = 4,000. (c) Sign, \\textit{minsup} = 70. (d) Syn10k, \\textit{minsup} = 14. (e) Syn20k, \\textit{minsup} = 24. 
(f) Syn40k, \\textit{minsup} = 34.}\n\t\\label{runtime_minuo}\n\\end{figure}\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minuo}}\n\t\\label{candidates_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minsup} = 500}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,921,104 & 672,801 & 390,245 & 264,424 & 195,445 & 153,171 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minsup} = 4000}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 280,551 & 178,428 & 144,686 & 102,302 & 69,806 & 48,009 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minsup} = 70}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,588,257 & 1,393,525 & 1,233,358 & 1,101,274 & 988,726 & 894,110 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,284,130 & 1,129,929 & 1,002,217 & 895,976 & 806,047 & 730,904 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,284,130 & 1,129,928 & 1,002,216 & 
895,923 & 805,911 & 730,644 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 126,752 & 126,743 & 126,716 & 126,652 & 126,478 & 126,240 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minsup} = 14}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,119,006 & 2,916,190 & 2,399,859 & 2,021,727 & 1,747,076 & 1,534,885 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,727,340 & 2,790,561 & 2,301,336 & 1,948,286 & 1,680,334 & 1,485,466 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,727,339 & 2,790,547 & 2,301,264 & 1,948,081 & 1,679,416 & 1,482,724 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 537,899 & 537,774 & 537,328 & 536,225 & 533,228 & 526,147 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minsup} = 24}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 5,991,112 & 5,044,232 & 4,517,900 & 4,112,343 & 3,773,772 & 3,423,454 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 5,706,748 & 4,904,484 & 4,430,131 & 4,041,273 & 3,712,031 & 3,357,823 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 5,706,744 & 4,904,420 & 4,429,752 & 4,040,166 & 3,709,467 & 3,352,790 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 841,369 & 840,970 & 838,452 & 830,080 & 812,793 & 785,422 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minsup} = 34}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 36,517,396 & 34,219,382 & 31,977,235 & 29,399,324 & 26,176,962 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 36,148,568 & 33,947,656 & 31,690,858 & 29,135,445 & 25,872,139 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 36,148,106 & 33,944,569 & 31,678,889 & 29,102,992 & 25,803,949 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 7,452,659 & 7,440,512 & 7,403,338 & 7,327,730 & 7,200,566 
\\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\nFurthermore, from Fig. \\ref{runtime_minuo} and Table \\ref{candidates_minuo}, under different \\textit{minuo} settings, we can clearly observe that SUMU$_\\textit{PES}$ is the fastest variant of SUMU. On the datasets Sign, Syn10k, Syn20k, and Syn40k, some fluctuations occur for all the variants of SUMU, but the overall trend is still clear. Regardless of which dataset is processed, the runtime curve of SUMU$_\\textit{PES}$ becomes smoother as \\textit{minuo} is adjusted. In particular, the number of candidate patterns generated by SUMU$_\\textit{PES}$ on the Bible and FIFA datasets does not change at all. From the experiments under different \\textit{minuo} and a fixed \\textit{minsup}, we can draw the following conclusions.\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item Unlike the experiments that tune \\textit{minsup}, the runtime of each variant of SUMU is not much affected by the setting of \\textit{minuo}. On the datasets Bible and FIFA, the runtimes of SUMU$_\\textit{PEUO}$, SUMU$_\\textit{TPUO}$, and SUMU$_\\textit{PES}$ hardly increase when \\textit{minuo} decreases, while the numbers of candidate patterns for SUMU$_\\textit{PEUO}$ and SUMU$_\\textit{TPUO}$ increase substantially. This suggests that the support measure plays a greater role in determining the program runtime than the utility occupancy measure.\n\t\n\t\\item SUMU$_\\textit{PES}$ still achieves the fastest runtime due to the more effective pruning strategies it uses. Moreover, on each dataset, as \\textit{minuo} decreases, it does not generate many more candidate patterns. The upper bounds \\textit{PES} and \\textit{RSS} already make it possible to prune many invalid candidate patterns.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, as \\textit{minuo} decreases, they still do not prune many irrelevant candidate patterns for SUMU$_\\textit{TPUO}$. 
\t\n\\end{enumerate}\n\n\n\\subsection{Memory Evaluation}\n\nThe memory consumption of each variant of SUMU is close and fluctuates, so we present the approximate memory consumption on the different datasets. We investigate the reasons for the disparities in memory consumption based on program design details. The experimental results regarding memory consumption are shown in Table \\ref{memory}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Memory consumption}\n\t\\label{memory}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{} & \\multicolumn{6}{c}{\\textbf{Approximate memory consumed (MB)}} \\\\ \\cline{2-7} \n\t\t\t& Bible & FIFA & Sign & Syn10k & Syn20k & Syn40k \\\\ \\hline\n\t\t\tfixed \\textit{minuo} & 1,000 $\\sim$ 1,400 & 1,400 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 800 & 600 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\tfixed \\textit{minsup} & 1,000 $\\sim$ 1,400 & 1,500 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 600 & 300 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nSince each variant of SUMU uses both the UOL-Chain and UO-Table, their memory consumption differs little, and the difference is within a reasonable range. The variants of SUMU employ different numbers of pruning strategies, and thus they differ somewhat in the use of auxiliary data structures. If pruning Strategy \\ref{strategy1} is used, then unpromising items must be filtered out. To find out which items are unpromising, the program utilizes a hash table to record the support of each item. In contrast, if all items are used directly, it is sufficient for the program to use a single list to record the items that occur in the sequence database. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{TPUO}}$ is that they use different upper bounds and pruning strategies. 
Calculating the \\textit{PEUO} and \\textit{RSUO} of a pattern is relatively simple: the program quickly scans the pattern's UOL-Chain and accumulates the corresponding values. However, for the calculation of the \\textit{TPUO} and \\textit{TSUO} of a pattern, several \\textit{minsup}-sized priority queues are required. This allows tighter upper bound values to be computed, but also consumes additional memory space. As for SUMU$_{\\textit{PES}}$, it additionally uses the upper bounds \\textit{PES} and \\textit{RSS} (i.e., adopts the additional pruning Strategies \\ref{strategy6} and \\ref{strategy7}) compared to SUMU$_{\\textit{PEUO}}$. This means that during pattern extension, the program needs associated hash tables to decide which candidate patterns satisfy the upper bounds \\textit{PES} and \\textit{RSS}. It seems that the more upper bounds and pruning strategies are used, the more memory is consumed. Nevertheless, in the experiments, effective pruning strategies avoid building unnecessary UOL-Chains and UO-Tables for candidate patterns that are never generated, which also saves memory. Therefore, the memory consumption of each variant of SUMU is roughly equal.\n\n\\subsection{Scalability}\n\nIn this subsection, we selected five synthetic datasets to evaluate the scalability of each variant of SUMU. The dataset size increases from 10k to 50k sequence records, in increments of 10k. We set a relative support for the experiments, i.e., \\textit{minsup} was set to 10, 20, 30, 40, and 50 for the five synthetic datasets, respectively. In addition, \\textit{minuo} was set to 0.1 in order to generate more HUOSPs. We analyze the scalability in terms of runtime and candidate pattern generation, and the experimental results are shown in Fig. \\ref{scalability}.\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.4]{figs/scalability.pdf}\n\t\\caption{Scalability of the compared variants of SUMU}\n\t\\label{scalability}\n\\end{figure}\n\nFrom Fig. 
\\ref{scalability}, it is clear that the runtime of each variant of SUMU grows as the size of the processed dataset increases. This is consistent with our expectation that larger datasets carry more candidate patterns, which increases the processing difficulty. The use of the UOL-Chain and UO-Table makes the trend of each variant of SUMU the same, with only differences in efficiency. The difference between the SUMU variants is clear, with SUMU$_{\\textit{PES}}$ performing best and SUMU$_{\\textit{TPUO}}$ performing worst. For SUMU$_{\\textit{PES}}$, candidate patterns do not grow as rapidly, whereas the other variants of SUMU generate a large number of candidate patterns; therefore, SUMU$_{\\textit{PES}}$ performs well when handling large-scale datasets. In particular, the large number of sorting operations required to calculate the tighter upper bounds causes SUMU$_{\\textit{TPUO}}$ to perform poorly. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{simple}}$ illustrates the effectiveness of pruning Strategy \\ref{strategy1}.
\\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\nFurthermore, from Fig. \\ref{runtime_minuo} and Table \\ref{candidates_minuo}, under different \\textit{minuo} settings, we can clearly observe that SUMU$_\\textit{PES}$ is the fastest variant of SUMU. On the datasets Sign, Syn10k, Syn20k, and Syn40k, some fluctuations occur for all the variants of SUMU, but the overall trend is still clear. Regardless of which dataset is processed, the runtime curve of SUMU$_\\textit{PES}$ becomes smoother as \\textit{minuo} is adjusted. In particular, the number of candidate patterns it generates on the Bible and FIFA datasets does not change at all. From the experiments under different \\textit{minuo} and a fixed \\textit{minsup}, we can draw the following conclusions.\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item Unlike the experiments that tune \\textit{minsup}, the runtime of each variant of SUMU is not much affected by the setting of \\textit{minuo}. On the datasets Bible and FIFA, the runtimes of SUMU$_\\textit{PEUO}$, SUMU$_\\textit{TPUO}$, and SUMU$_\\textit{PES}$ hardly increase when \\textit{minuo} decreases, even though the numbers of candidate patterns for SUMU$_\\textit{PEUO}$ and SUMU$_\\textit{TPUO}$ increase substantially. This suggests that the support measure plays a greater role in determining the program runtime than the utility occupancy measure.\n\t\n\t\\item SUMU$_\\textit{PES}$ still achieves the fastest runtime because it uses the most effective set of pruning strategies. Moreover, on each dataset, it does not generate many more candidate patterns as \\textit{minuo} decreases. The upper bounds \\textit{PES} and \\textit{RSS} already make it possible to eliminate many invalid candidate patterns.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, as \\textit{minuo} decreases, they still do not remove many irrelevant candidate patterns for SUMU$_\\textit{TPUO}$. 
\t\n\\end{enumerate}\n\n\n", "Descriptive_question1": "What is the number of candidate patterns generated by SUMU_simple for the Bible dataset at minsup_1 in table_6?", "Descriptive_question2": "Which variant of SUMU generated the least number of candidate patterns for the FIFA dataset at minsup_6 in table_6?", "Reasoning_question1": "Why does SUMU_PES consistently generate fewer candidate patterns compared to other variants across all datasets in table_6, and what does this suggest about its pruning strategies?", "Reasoning_question2": "How does the trend in the number of candidate patterns generated change as minsup increases from minsup_1 to minsup_6 for the Syn40k dataset in table_6, and what might explain this behavior?", "Descriptive_answer1": "464,804", "Descriptive_answer2": "SUMU_PES", "Reasoning_answer1": "SUMU_PES consistently generates fewer candidate patterns compared to other variants across all datasets in table_6 because it employs a comprehensive set of pruning strategies, including upper bounds like PES and RSS alongside PEUO and RSUO. This allows it to effectively filter out unpromising candidate patterns early in the process. Looking at the data, for instance, in the Bible dataset at minsup_1, SUMU_PES generates only 35,367 patterns compared to 464,804 by SUMU_simple, a significant reduction. This trend holds across all datasets and minsup levels, as seen in FIFA at minsup_6 where SUMU_PES generates just 296 patterns against 18,504 by SUMU_simple. This suggests that SUMU_PES's pruning strategies are highly effective in minimizing the search space, likely by leveraging tighter constraints and more aggressive elimination of irrelevant patterns, leading to better efficiency without compromising the identification of meaningful patterns.", "Reasoning_answer2": "For the Syn40k dataset in table_6, as minsup increases from minsup_1 to minsup_6, the number of candidate patterns generated by all SUMU variants shows a consistent downward trend. 
Specifically, for SUMU_simple, the count decreases from 39,435,204 at minsup_1 to 6,268,961 at minsup_6; for SUMU_PEUO, it drops from 38,953,081 to 5,995,403; for SUMU_TPUO, from 38,953,054 to 5,995,389; and for SUMU_PES, from 7,454,317 to 850,635. This behavior can be explained by the nature of the minsup parameter, which sets a minimum support threshold for patterns to be considered frequent. As minsup increases, fewer patterns meet this threshold, resulting in a reduced number of candidate patterns being generated. This aligns with the general principle in pattern mining where higher support thresholds filter out more patterns, focusing only on those that appear more frequently in the dataset." }, { "paper_id": "2212.10452.json", "table_id": "table_7", "table_content": "\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minuo}}\n\t\\label{candidates_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minsup} = 500}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,921,104 & 672,801 & 390,245 & 264,424 & 195,445 & 153,171 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minsup} = 4000}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 280,551 & 178,428 & 144,686 & 102,302 & 69,806 & 48,009 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 
38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minsup} = 70}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,588,257 & 1,393,525 & 1,233,358 & 1,101,274 & 988,726 & 894,110 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,284,130 & 1,129,929 & 1,002,217 & 895,976 & 806,047 & 730,904 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,284,130 & 1,129,928 & 1,002,216 & 895,923 & 805,911 & 730,644 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 126,752 & 126,743 & 126,716 & 126,652 & 126,478 & 126,240 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minsup} = 14}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,119,006 & 2,916,190 & 2,399,859 & 2,021,727 & 1,747,076 & 1,534,885 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,727,340 & 2,790,561 & 2,301,336 & 1,948,286 & 1,680,334 & 1,485,466 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,727,339 & 2,790,547 & 2,301,264 & 1,948,081 & 1,679,416 & 1,482,724 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 537,899 & 537,774 & 537,328 & 536,225 & 533,228 & 526,147 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minsup} = 24}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 5,991,112 & 5,044,232 & 4,517,900 & 4,112,343 & 3,773,772 & 3,423,454 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 5,706,748 & 4,904,484 & 4,430,131 & 4,041,273 & 3,712,031 & 3,357,823 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 5,706,744 & 4,904,420 & 4,429,752 & 4,040,166 & 3,709,467 & 3,352,790 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 841,369 & 840,970 & 838,452 & 830,080 & 
812,793 & 785,422 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minsup} = 34}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 36,517,396 & 34,219,382 & 31,977,235 & 29,399,324 & 26,176,962 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 36,148,568 & 33,947,656 & 31,690,858 & 29,135,445 & 25,872,139 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 36,148,106 & 33,944,569 & 31,678,889 & 29,102,992 & 25,803,949 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 7,452,659 & 7,440,512 & 7,403,338 & 7,327,730 & 7,200,566 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}", "caption": "Number of candidate patterns generated by varying \\textit{minuo}", "label": "candidates_minuo", "section_info": "5 Experiments\n\\section{Experiments} \\label{sec:experiments}\n\nWe selected both real and synthetic datasets to conduct related experiments. The proposed SUMU algorithm is the first approach to mining sequential patterns with a utility occupancy measure. Thus, there is no suitable algorithm for comparison. We mainly focused on verifying the efficiency of the proposed upper bounds and pruning strategies and the effectiveness of SUMU. The code related to SUMU is programmed using the Java language and developed in Eclipse. Our extensive experiments were conducted on a bare computer, which is equipped with an i7-12700F 2.10 GHz CPU and 16 GB of RAM. The experimental details and results are shown below.\n\n\\subsection{Experimental Setup and Datasets}\n\nThree real datasets (including Bible, FIFA, and Sign) and three synthetic datasets (including Syn10k, Syn20k, and Syn40k) were used in the experiments. The real datasets are often used in the evaluation of pattern mining algorithms and can be accessed from the website SPMF\\footnote{\\url{http://www.philippe-fournier-viger.com/spmf/}}. 
The synthetic datasets were generated by the IBM Quest Synthetic Data Generator \\cite{QSD}. Each dataset has its own characteristics and can represent a specific type of data in practical applications. The characteristics of these datasets are described below.\n\n$ \\bullet $ \\textit{\\textbf{Bible}} contains 13,905 items and 36,369 sequences, transformed from the text of the Bible. Its average sequence length is 21.64.\n\n$ \\bullet $ \\textit{\\textbf{FIFA}} contains 2,990 items and 20,450 sequences derived from the website of the FIFA World Cup 98. Its average sequence length is 36.23.\n\n$ \\bullet $ \\textit{\\textbf{Sign}} is a small but dense dataset of sign language utterances, with 267 items and 730 sequences. Its average sequence length is 27.11.\n\n$ \\bullet $ \\textit{\\textbf{Syn10k}} is a synthetic dataset with 10,000 sequence records. It has 7,312 distinct items, and its average sequence length is 26.97.\n\n$ \\bullet $ \\textit{\\textbf{Syn20k}} is a synthetic dataset with 20,000 sequence records. It has 7,442 distinct items, and its average sequence length is 26.84.\n\n$ \\bullet $ \\textit{\\textbf{Syn40k}} is a synthetic dataset with 40,000 sequence records. It has 7,537 distinct items, and its average sequence length is 26.84.\n\nTo better evaluate the proposed SUMU algorithm, several variants of SUMU have also been designed, so that the experimental results can better show the capabilities of the designed upper bounds and pruning strategies. In our experiments, the proposed SUMU algorithm with upper bounds \\textit{PEUO} and \\textit{RSUO} is denoted as SUMU$_\\textit{simple}$. This means that only Strategies \\ref{strategy2} and \\ref{strategy3} are used in SUMU$_\\textit{simple}$. On the basis of SUMU$_\\textit{simple}$, if unpromising items are filtered out (with Strategy \\ref{strategy1}) before generating HUOSPs, then this variant of SUMU is denoted as SUMU$_\\textit{PEUO}$. 
To analyze the performance gap between \\textit{PEUO} and \\textit{TPUO}, and between \\textit{RSUO} and \\textit{TSUO}, in the experiments we also designed another variant of SUMU (with Strategies \\ref{strategy1}, \\ref{strategy4}, and \\ref{strategy5}), denoted as SUMU$_\\textit{TPUO}$. In addition, on the basis of SUMU$_\\textit{PEUO}$, the fourth variant, namely SUMU$_\\textit{PES}$, is designed to evaluate the two upper bounds on the support measure. These variants of SUMU are compared to comprehensively evaluate the effectiveness and efficiency of SUMU.\n\n\\subsection{Pattern Analysis}\n\nIn this section, we mainly discuss how the number of HUOSPs changes as \\textit{minsup} or \\textit{minuo} is adjusted. The results for various \\textit{minsup} under a fixed \\textit{minuo} are shown in Table \\ref{patterns_minsup}. Likewise, the results for various \\textit{minuo} under a fixed \\textit{minsup} are shown in Table \\ref{patterns_minuo}. For each dataset, we use \\textit{minsup}$_1$, \\textit{minsup}$_2$ (or \\textit{minuo}$_1$, \\textit{minuo}$_2$), and so on to denote increasingly larger settings of the parameter \\textit{minsup} (or \\textit{minuo}). For instance, in our experiments, for the Bible dataset, the six settings of \\textit{minsup} are 300, 400, 500, 600, 700, and 800; and the six settings of \\textit{minuo} are 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. \\ref{runtime_minsup} and Fig. 
\\ref{runtime_minuo}.\n\n\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minsup}}\n\t\\label{patterns_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\\\ \\hline\n\t\t\tFIFA, \\textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\\\ \\hline\n\t\t\tSign, \\textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\\\ \\hline\n\t\t\tSyn10k, \\textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minuo} = 0.1 & 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\n\\begin{table}[H]\n\t\\centering\n\t\\caption{Number of patterns generated by varying \\textit{minuo}}\n\t\\label{patterns_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{\\textbf{Dataset}} & \\multicolumn{6}{c}{\\# \\textbf{patterns}} \\\\ \\cline{2-7} \n\t\t\t& $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\ \\hline\n\t\t\tBible, \\textit{minsup} = 500 & 11,721 & 11,668 & 11,390 & 10,433 & 8,037 & 5,012 \\\\ \\hline\n\t\t\tFIFA, \\textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\\\ \\hline\n\t\t\tSign, \\textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\\\ 
\\hline\n\t\t\tSyn10k, \\textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\\\ \\hline\n\t\t\tSyn20k, \\textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\\\ \\hline\n\t\t\tSyn40k, \\textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nFrom Tables \\ref{patterns_minsup} and \\ref{patterns_minuo}, it is clear that the number of generated HUOSPs on each dataset differs considerably as \\textit{minsup} or \\textit{minuo} is adjusted. In particular, the number of generated HUOSPs on the synthetic datasets is higher than that on the real datasets. This is because, for these synthetic datasets, each itemset contains multiple items and can thus form more candidate patterns. Furthermore, as \\textit{minsup} decreases by a fixed interval, the number of HUOSPs increases rapidly. For example, the difference between the numbers of patterns generated under \\textit{minsup}$_1$ and \\textit{minsup}$_2$ is larger than the difference between \\textit{minsup}$_2$ and \\textit{minsup}$_3$. This phenomenon is reasonable and also occurs in frequent itemset mining and sequential pattern mining. The utility occupancy measure, however, behaves differently: the number of generated HUOSPs increases only gradually as \\textit{minuo} is decreased, because the HUOSPs generated by the SUMU algorithm do not vary much for smaller \\textit{minuo} settings. Similar behavior can also be found in the HUOPM algorithm \\cite{gan2019huopm}. \n\n\\subsection{Efficiency Analysis}\n\nIn this subsection, we conducted extensive experiments to evaluate the performance of the different upper bounds and pruning strategies used in SUMU. The results in terms of runtime for various \\textit{minsup} and \\textit{minuo} settings are shown in Fig. \\ref{runtime_minsup} and Fig. 
\\ref{runtime_minuo}. And the results in terms of candidate patterns for various \\textit{minsup} and \\textit{minuo} settings are shown in Tables \\ref{candidates_minsup} and \\ref{candidates_minuo}.\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminsup.pdf}\n\t\\caption{Running time under various \\textit{minsup} and a fixed \\textit{minuo} = 0.1.}\n\t\\label{runtime_minsup}\n\\end{figure}\n\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 
\\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& 
{SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\n\n\n\nFrom Fig. \\ref{runtime_minsup} and Table \\ref{candidates_minsup}, under different \\textit{minsup} settings, we can clearly see that the runtime of the variant SUMU$_\\textit{simple}$ is the worst on the datasets Bible and FIFA, while the runtime of the variant SUMU$_\\textit{TPUO}$ is the worst on the datasets Sign, Syn20k, and Syn40k. SUMU$_\\textit{simple}$ and SUMU$_\\textit{TPUO}$ perform similarly on Syn10k. However, when \\textit{minsup} is set to 10, the runtime of SUMU$_\\textit{TPUO}$ exceeds that of SUMU$_\\textit{simple}$. In addition, the variant SUMU$_\\textit{PES}$, which uses four upper bounds (\\textit{PEUO}, \\textit{RSUO}, \\textit{PES}, and \\textit{RSS}), achieves the best performance on all datasets, and the variant SUMU$_\\textit{PEUO}$ has the second shortest runtime. SUMU$_\\textit{PES}$ reduces candidate pattern generation the most, while SUMU$_\\textit{simple}$ reduces it the least. These results match our expectations. From the experiments under different \\textit{minsup} and a fixed \\textit{minuo}, we can draw the following conclusions.\n\n\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item SUMU$_\\textit{PES}$ adopts a sufficient number of pruning strategies to significantly reduce candidate patterns while achieving the shortest runtime. Compared to the other variants of SUMU, SUMU$_\\textit{PES}$ generates far fewer candidate patterns. 
Although the number of candidate patterns is many times smaller than for the other variants, the overall performance is not more than a few times better. This is because many of the pruned candidate patterns are unpromising ones that would have been discarded anyway in the subsequent steps of the program.\n\t\n\t\\item The difference between SUMU$_\\textit{simple}$ and SUMU$_\\textit{PEUO}$ demonstrates that the pruning Strategy \\ref{strategy1} is ineffective on synthetic datasets. This is because \\textit{minsup} is set to relatively small values, and thus there are not many unpromising items appearing in the sequence dataset.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, their calculation makes SUMU$_\\textit{TPUO}$ take longer than SUMU$_\\textit{PEUO}$. For a candidate pattern, SUMU$_\\textit{PEUO}$ is able to compute the upper bounds \\textit{PEUO} and \\textit{RSUO} in linear time, whereas computing \\textit{TSUO} requires multiple sorting operations, which is a more complex process. In addition, SUMU$_\\textit{TPUO}$ does not reduce any candidate patterns on many datasets (including Bible, FIFA, and Sign). Even on the few remaining datasets where it has an effect, it reduces the number of candidate patterns by only a particularly small amount.\n\\end{enumerate}\n\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminuo.pdf}\n\t\\caption{Running time under various \\textit{minuo} and a fixed \\textit{minsup}. (a) Bible, \\textit{minsup} = 500. (b) FIFA, \\textit{minsup} = 4,000. (c) Sign, \\textit{minsup} = 70. (d) Syn10k, \\textit{minsup} = 14. (e) Syn20k, \\textit{minsup} = 24. 
(f) Syn40k, \textit{minsup} = 34.}\n\t\label{runtime_minuo}\n\end{figure}\n\n\n\subsection{Memory Evaluation}\n\nThe memory consumption of the SUMU variants is close and fluctuates, so we report approximate ranges for each dataset. We then examine the reasons for the remaining disparities based on the design details of the program. The experimental results regarding memory consumption are shown in Table \ref{memory}.\n\n\begin{table}[h]\n\t\centering\n\t\caption{Memory consumption}\n\t\label{memory}\n\t\resizebox{\columnwidth}{!}{\n\t\t\begin{tabular}{c|cccccc}\n\t\t\t\hline \hline\n\t\t\t\multirow{2}{*}{} & \multicolumn{6}{c}{\textbf{Approximate memory consumed (MB)}} \\ \cline{2-7} \n\t\t\t& Bible & FIFA & Sign & Syn10k & Syn20k & Syn40k \\ \hline\n\t\t\tfixed \textit{minuo} & 1,000 $\sim$ 1,400 & 1,400 $\sim$ 1,700 & 200 $\sim$ 400 & 300 $\sim$ 800 & 600 $\sim$ 800 & 1,200 $\sim$ 1,500 \\ \hline\n\t\t\tfixed \textit{minsup} & 1,000 $\sim$ 1,400 & 1,500 $\sim$ 1,700 & 200 $\sim$ 400 & 300 $\sim$ 600 & 300 $\sim$ 800 & 1,200 $\sim$ 1,500 \\ \hline\n\t\t\t\hline \n\t\t\end{tabular}\n\t}\n\end{table}\n\n\nSince every variant of SUMU uses both the UOL-Chain and the UO-Table, their memory footprints differ little, and the differences stay within a reasonable range. The variants employ different numbers of pruning strategies, and thus they differ somewhat in the auxiliary data structures they use. If pruning Strategy \ref{strategy1} is used, unpromising items must be filtered out; to identify them, the program maintains a hash table recording the support of each item. Otherwise, if all items are used directly, a single list recording the items that occur in the sequence database suffices. The difference between SUMU$_{\textit{PEUO}}$ and SUMU$_{\textit{TPUO}}$ is that they use different upper bounds and pruning strategies. 
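The support-counting step behind Strategy \ref{strategy1} can be sketched as follows. This is an illustrative Java sketch, not the paper's actual implementation (class and method names are our own): one scan of the sequence database fills a hash table with each item's support, and items whose support falls below \textit{minsup} are then filtered out.

```java
import java.util.*;

// Illustrative sketch (hypothetical names, not the paper's actual code) of the
// support counting behind Strategy 1: one scan of the sequence database fills a
// hash table with each item's support, then items below minsup are dropped.
public class SupportFilter {

    // A sequence is a list of itemsets; each sequence contributes at most 1
    // to the support of every item it contains.
    public static Map<String, Integer> itemSupports(List<List<Set<String>>> db) {
        Map<String, Integer> support = new HashMap<>();
        for (List<Set<String>> sequence : db) {
            Set<String> seen = new HashSet<>();
            for (Set<String> itemset : sequence) {
                seen.addAll(itemset);
            }
            for (String item : seen) {
                support.merge(item, 1, Integer::sum);
            }
        }
        return support;
    }

    // Keep only the promising items, i.e. those with support >= minsup.
    public static Set<String> promisingItems(List<List<Set<String>>> db, int minsup) {
        Set<String> promising = new TreeSet<>();
        for (Map.Entry<String, Integer> e : itemSupports(db).entrySet()) {
            if (e.getValue() >= minsup) {
                promising.add(e.getKey());
            }
        }
        return promising;
    }

    public static void main(String[] args) {
        List<List<Set<String>>> db = List.of(
                List.of(Set.of("a", "b"), Set.of("c")),
                List.of(Set.of("a"), Set.of("b")),
                List.of(Set.of("a"), Set.of("d")));
        System.out.println(itemSupports(db));       // supports: a=3, b=2, c=1, d=1
        System.out.println(promisingItems(db, 2));  // [a, b]
    }
}
```

With the hash table in place, the promising-item check is a single map lookup per item, which is why Strategy \ref{strategy1} costs little beyond the one extra database scan.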
Calculating \textit{PEUO} and \textit{RSUO} for a pattern is relatively simple: the program quickly scans the pattern's UOL-Chain and accumulates the corresponding values. However, calculating \textit{TPUO} and \textit{TSUO} for a pattern requires several \textit{minsup}-sized priority queues. This yields tighter upper-bound values, but it also consumes additional memory space. As for SUMU$_{\textit{PES}}$, it additionally uses the upper bounds \textit{PES} and \textit{RSS} (adopting pruning Strategies \ref{strategy6} and \ref{strategy7}) compared with SUMU$_{\textit{PEUO}}$. This means that, during pattern extension, the program needs the associated hash tables to decide which candidate patterns satisfy the upper bounds \textit{PES} and \textit{RSS}. Thus, the more upper bounds and pruning strategies are used, the more memory tends to be consumed. Nevertheless, in the experiments, effective pruning strategies also save memory by avoiding unnecessary UOL-Chain and UO-Table builds for candidate patterns that are never generated. Therefore, the memory consumption of the SUMU variants is roughly equal.\n\n\subsection{Scalability}\n\nWe selected five synthetic datasets to evaluate the scalability of each variant of SUMU. The dataset size increases from 10k to 50k sequence records, in steps of 10k. We kept the support threshold proportional to the dataset size, i.e., \textit{minsup} was set to 10, 20, 30, 40, and 50 for the five synthetic datasets, respectively. In addition, \textit{minuo} is set to 0.1 in order to generate more HUOSPs. We analyze the scalability in terms of runtime and candidate pattern generation, and the experimental results are shown in Fig. \ref{scalability}.\n\n\begin{figure}[h]\n\t\centering\n\t\includegraphics[trim=0 0 0 0,clip,scale=0.4]{figs/scalability.pdf}\n\t\caption{Scalability of the compared variants of SUMU}\n\t\label{scalability}\n\end{figure}\n\nFrom Fig. 
\ref{scalability}, it is clear that the runtime of each variant of SUMU grows as the size of the processed dataset increases. This is consistent with our expectation that larger datasets carry more candidate patterns, which increases the processing difficulty. Because all variants use the UOL-Chain and UO-Table, they follow the same overall trend and differ only in efficiency. The differences between the variants are nevertheless clear, with SUMU$_{\textit{PES}}$ performing best and SUMU$_{\textit{TPUO}}$ performing worst. SUMU$_{\textit{PES}}$ avoids the rapid growth of candidate patterns that the other variants exhibit, and it therefore performs well when handling large-scale datasets. In particular, the large number of sorting operations required to calculate the tighter upper bounds causes SUMU$_{\textit{TPUO}}$ to perform poorly. The difference between SUMU$_{\textit{PEUO}}$ and SUMU$_{\textit{simple}}$ illustrates the effectiveness of pruning Strategy \ref{strategy1}.\n\n\n5.3 Efficiency Analysis\n\subsection{Efficiency Analysis}\n\nIn this subsection, we conduct extensive experiments to evaluate the performance of the different upper bounds and pruning strategies used in SUMU. The results in terms of runtime for various \textit{minsup} and \textit{minuo} settings are shown in Fig. \ref{runtime_minsup} and Fig. 
And the results in terms of candidate patterns for various \\textit{minsup} and \\textit{minuo} settings are shown in Tables \\ref{candidates_minsup} and \\ref{candidates_minuo}.\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminsup.pdf}\n\t\\caption{Running time under various \\textit{minsup} and a fixed \\textit{minuo} = 0.1.}\n\t\\label{runtime_minsup}\n\\end{figure}\n\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} 
= 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 
\\\n\t\t\t\cline{2-2}\n\t\t\t& {SUMU$_\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\n\t\t\t\cline{2-2}\n\t\t\t& {SUMU$_\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\n\t\t\t\hline\n\t\t\t\n\t\t\t\hline\n\t\t\end{tabular}\n\t}\n\t\n\end{table}\n\n\n\n\nFrom Fig. \ref{runtime_minsup} and Table \ref{candidates_minsup}, under different \textit{minsup} settings, we can clearly see that SUMU$_\textit{simple}$ has the worst runtime on the Bible and FIFA datasets, while SUMU$_\textit{TPUO}$ has the worst runtime on the Sign, Syn20k, and Syn40k datasets. SUMU$_\textit{simple}$ and SUMU$_\textit{TPUO}$ perform similarly on Syn10k; however, when \textit{minsup} is set to 10, the runtime of SUMU$_\textit{TPUO}$ exceeds that of SUMU$_\textit{simple}$. In addition, the variant SUMU$_\textit{PES}$, which uses four upper bounds (\textit{PEUO}, \textit{RSUO}, \textit{PES}, and \textit{RSS}), achieves the best performance on all datasets, and SUMU$_\textit{PEUO}$ is the second fastest. SUMU$_\textit{PES}$ reduces candidate pattern generation the most, while SUMU$_\textit{simple}$ reduces it the least. These results match our expectations. From the experiments under different \textit{minsup} and a fixed \textit{minuo}, we can draw the following conclusions.\n\n\n\n\begin{enumerate}[label=(\arabic*)]\n\t\item SUMU$_\textit{PES}$ adopts a sufficient number of pruning strategies to significantly reduce candidate patterns while achieving the shortest runtime. Compared to the other variants of SUMU, SUMU$_\textit{PES}$ generates far fewer candidate patterns. Although its number of candidate patterns is many times smaller than that of the other variants, its overall runtime is not better by the same factor. 
This is because many of the unpromising candidate patterns are also discarded at little cost in the subsequent steps of the other variants.\n\t\n\t\item The difference between SUMU$_\textit{simple}$ and SUMU$_\textit{PEUO}$ demonstrates that pruning Strategy \ref{strategy1} is less effective on the synthetic datasets. This is because \textit{minsup} is set to relatively small values, and thus few unpromising items appear in the sequence datasets.\n\t\n\t\item Although \textit{TPUO} and \textit{TSUO} are tighter upper bounds, their calculation makes SUMU$_\textit{TPUO}$ take longer than SUMU$_\textit{PEUO}$. For a candidate pattern, SUMU$_\textit{PEUO}$ can compute the upper bounds \textit{PEUO} and \textit{RSUO} in linear time, whereas \textit{TSUO} requires multiple sorting operations, which is a more complex process. In addition, SUMU$_\textit{TPUO}$ does not reduce any candidate patterns on several datasets (including Bible, FIFA, and Sign), and even on the remaining datasets it reduces the number of candidate patterns only by a very small amount.\n\end{enumerate}\n\n\n\n\begin{figure}[h]\n\t\centering\n\t\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminuo.pdf}\n\t\caption{Running time under various \textit{minuo} and a fixed \textit{minsup}. (a) Bible, \textit{minsup} = 500. (b) FIFA, \textit{minsup} = 4,000. (c) Sign, \textit{minsup} = 70. (d) Syn10k, \textit{minsup} = 14. (e) Syn20k, \textit{minsup} = 24. 
(f) Syn40k, \\textit{minsup} = 34.}\n\t\\label{runtime_minuo}\n\\end{figure}\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minuo}}\n\t\\label{candidates_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minsup} = 500}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,921,104 & 672,801 & 390,245 & 264,424 & 195,445 & 153,171 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minsup} = 4000}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 280,551 & 178,428 & 144,686 & 102,302 & 69,806 & 48,009 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minsup} = 70}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,588,257 & 1,393,525 & 1,233,358 & 1,101,274 & 988,726 & 894,110 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,284,130 & 1,129,929 & 1,002,217 & 895,976 & 806,047 & 730,904 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,284,130 & 1,129,928 & 1,002,216 & 
895,923 & 805,911 & 730,644 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 126,752 & 126,743 & 126,716 & 126,652 & 126,478 & 126,240 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minsup} = 14}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,119,006 & 2,916,190 & 2,399,859 & 2,021,727 & 1,747,076 & 1,534,885 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,727,340 & 2,790,561 & 2,301,336 & 1,948,286 & 1,680,334 & 1,485,466 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,727,339 & 2,790,547 & 2,301,264 & 1,948,081 & 1,679,416 & 1,482,724 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 537,899 & 537,774 & 537,328 & 536,225 & 533,228 & 526,147 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minsup} = 24}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 5,991,112 & 5,044,232 & 4,517,900 & 4,112,343 & 3,773,772 & 3,423,454 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 5,706,748 & 4,904,484 & 4,430,131 & 4,041,273 & 3,712,031 & 3,357,823 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 5,706,744 & 4,904,420 & 4,429,752 & 4,040,166 & 3,709,467 & 3,352,790 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 841,369 & 840,970 & 838,452 & 830,080 & 812,793 & 785,422 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minsup} = 34}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 36,517,396 & 34,219,382 & 31,977,235 & 29,399,324 & 26,176,962 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 36,148,568 & 33,947,656 & 31,690,858 & 29,135,445 & 25,872,139 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 36,148,106 & 33,944,569 & 31,678,889 & 29,102,992 & 25,803,949 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 7,452,659 & 7,440,512 & 7,403,338 & 7,327,730 & 7,200,566 
\\\n\t\t\t\hline\n\t\t\t\n\t\t\t\hline\n\t\t\end{tabular}\n\t}\n\t\n\end{table}\n\nFurthermore, from Fig. \ref{runtime_minuo} and Table \ref{candidates_minuo}, under different \textit{minuo} settings we can clearly observe that SUMU$_\textit{PES}$ is the fastest variant of SUMU. On the Sign, Syn10k, Syn20k, and Syn40k datasets, some fluctuations occur for all variants of SUMU, but the overall trend is still clear. Regardless of which dataset is processed, the runtime curve of SUMU$_\textit{PES}$ remains the smoothest as \textit{minuo} is adjusted. In particular, the number of candidate patterns it generates on the Bible and FIFA datasets does not change at all. From the experiments under different \textit{minuo} and a fixed \textit{minsup}, we can draw the following conclusions.\n\n\begin{enumerate}[label=(\arabic*)]\n\t\item Unlike the experiments that tune \textit{minsup}, the runtime of each variant of SUMU is not much affected by the setting of \textit{minuo}. On the Bible and FIFA datasets, the runtimes of SUMU$_\textit{PEUO}$, SUMU$_\textit{TPUO}$, and SUMU$_\textit{PES}$ hardly increase when \textit{minuo} decreases, even though the numbers of candidate patterns for SUMU$_\textit{PEUO}$ and SUMU$_\textit{TPUO}$ increase substantially. This suggests that the support measure plays a greater role than the utility occupancy measure in determining the program runtime.\n\t\n\t\item SUMU$_\textit{PES}$ still achieves the fastest runtime owing to its more effective pruning strategies. Moreover, on each dataset, it does not generate many more candidate patterns as \textit{minuo} decreases; the upper bounds \textit{PES} and \textit{RSS} already eliminate many invalid candidate patterns.\n\t\n\t\item Although \textit{TPUO} and \textit{TSUO} are tighter upper bounds, they still do not prune many irrelevant candidate patterns for SUMU$_\textit{TPUO}$ as \textit{minuo} decreases. 
\t\n\\end{enumerate}\n\n\n", "Descriptive_question1": "What is the number of candidate patterns generated by SUMU_simple for the Bible dataset at minuo_1 in table_7?", "Descriptive_question2": "Which SUMU variant generated 1,099 candidate patterns for the FIFA dataset across all minuo levels in table_7?", "Reasoning_question1": "Why does the number of candidate patterns for SUMU_PES remain constant across different minuo levels for the Bible and FIFA datasets in table_7?", "Reasoning_question2": "How does the effectiveness of pruning strategies impact the number of candidate patterns generated by different SUMU variants as minuo varies in table_7?", "Descriptive_answer1": "1,921,104", "Descriptive_answer2": "SUMU_PES", "Reasoning_answer1": "The number of candidate patterns for SUMU_PES remains constant across different minuo levels for the Bible and FIFA datasets in table_7 because SUMU_PES employs additional upper bounds (PES and RSS) and pruning strategies that effectively filter out unpromising patterns early in the process. Let's break this down: First, unlike other variants, SUMU_PES uses a comprehensive set of pruning strategies that are not solely dependent on the minuo parameter. These strategies likely focus on the support measure and other constraints that remain fixed in these experiments (as minsup is constant). Second, for datasets like Bible and FIFA, which may have specific characteristics such as smaller size or distinct pattern distributions compared to synthetic datasets, these pruning strategies are particularly effective at eliminating invalid candidate patterns regardless of minuo changes. 
Therefore, as minuo varies from minuo_1 to minuo_6, the number of candidate patterns stays constant at 11,721 for Bible and 1,099 for FIFA, indicating that the pruning is robust against changes in the utility occupancy threshold.", "Reasoning_answer2": "The effectiveness of pruning strategies significantly impacts the number of candidate patterns generated by different SUMU variants as minuo varies in table_7, and this can be understood through a step-by-step analysis. First, consider SUMU_simple, which uses minimal pruning strategies (only Strategies 2 and 3). Without advanced filtering, it generates a high number of candidate patterns that decrease as minuo increases (e.g., for Bible, from 1,921,104 at minuo_1 to 153,171 at minuo_6), reflecting that higher minuo thresholds naturally filter more patterns but still leave many unpruned due to the lack of additional strategies. Next, SUMU_PEUO and SUMU_TPUO incorporate Strategy 1 (filtering unpromising items) alongside their respective upper bounds, leading to fewer candidate patterns than SUMU_simple (e.g., for Bible, SUMU_PEUO drops from 1,102,717 at minuo_1 to 91,903 at minuo_6). However, SUMU_TPUO, despite using tighter upper bounds (TPUO and TSUO), does not significantly reduce patterns further compared to SUMU_PEUO, suggesting the computational complexity of tighter bounds offers minimal pruning benefits here. Finally, SUMU_PES, with the most comprehensive pruning (including Strategies 6 and 7 with upper bounds PES and RSS), drastically reduces candidate patterns and often maintains a constant count across minuo levels (e.g., 11,721 for Bible across all minuo), as its strategies effectively eliminate invalid patterns irrespective of minuo changes. Thus, the more effective and numerous the pruning strategies, the greater the reduction in candidate patterns, with SUMU_PES demonstrating the strongest impact due to its robust filtering mechanisms." 
}, { "paper_id": "2212.10452.json", "table_id": "table_8", "table_content": "\\begin{table}[h]\n\t\\centering\n\t\\caption{Memory consumption}\n\t\\label{memory}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{} & \\multicolumn{6}{c}{\\textbf{Approximate memory consumed (MB)}} \\\\ \\cline{2-7} \n\t\t\t& Bible & FIFA & Sign & Syn10k & Syn20k & Syn40k \\\\ \\hline\n\t\t\tfixed \\textit{minuo} & 1,000 $\\sim$ 1,400 & 1,400 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 800 & 600 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\tfixed \\textit{minsup} & 1,000 $\\sim$ 1,400 & 1,500 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 600 & 300 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}", "caption": "Memory consumption", "label": "memory", "section_info": "5 Experiments\n\\section{Experiments} \\label{sec:experiments}\n\nWe selected both real and synthetic datasets to conduct the experiments. The proposed SUMU algorithm is the first approach to mining sequential patterns with a utility occupancy measure; thus, there is no suitable algorithm for comparison. We mainly focused on verifying the efficiency of the proposed upper bounds and pruning strategies and the effectiveness of SUMU. The SUMU code is written in Java and developed in Eclipse. Our extensive experiments were conducted on a bare-metal computer equipped with an i7-12700F 2.10 GHz CPU and 16 GB of RAM. The experimental details and results are shown below.\n\n\\subsection{Experimental Setup and Datasets}\n\nThree real datasets (including Bible, FIFA, and Sign) and three synthetic datasets (including Syn10k, Syn20k, and Syn40k) were used in the experiments. The real datasets are often used in the evaluation of pattern mining algorithms and can be accessed from the website SPMF\\footnote{\\url{http://www.philippe-fournier-viger.com/spmf/}}. 
The synthetic datasets were generated by the IBM Quest Synthetic Data Generator \cite{QSD}. Each dataset has its own characteristics and represents a specific type of data in practical applications. The characteristics of these datasets are described below.\n\n$ \bullet $ \textit{\textbf{Bible}} contains 13,905 items and 36,369 sequences, transformed from the book of the Bible. Its average sequence length is 21.64.\n\n$ \bullet $ \textit{\textbf{FIFA}} contains 2,990 items and 20,450 sequences derived from the website of the FIFA World Cup 98. Its average sequence length is 36.23.\n\n$ \bullet $ \textit{\textbf{Sign}} is a small but dense dataset of sign language utterances, with 267 items and 730 sequences. Its average sequence length is 27.11.\n\n$ \bullet $ \textit{\textbf{Syn10k}} is a synthetic dataset with 10,000 sequence records. It has 7,312 distinct items, and its average sequence length is 26.97.\n\n$ \bullet $ \textit{\textbf{Syn20k}} is a synthetic dataset with 20,000 sequence records. It has 7,442 distinct items, and its average sequence length is 26.84.\n\n$ \bullet $ \textit{\textbf{Syn40k}} is a synthetic dataset with 40,000 sequence records. It has 7,537 distinct items, and its average sequence length is 26.84.\n\nTo better evaluate the proposed SUMU algorithm, several variants of SUMU were also designed, so that the experimental results can better show the capabilities of the designed upper bounds and pruning strategies. In our experiments, the proposed SUMU algorithm with the upper bounds \textit{PEUO} and \textit{RSUO} alone is denoted as SUMU$_\textit{simple}$; that is, only Strategies \ref{strategy2} and \ref{strategy3} are used in SUMU$_\textit{simple}$. On the basis of SUMU$_\textit{simple}$, if unpromising items are filtered out (with Strategy \ref{strategy1}) before generating HUOSPs, then this variant of SUMU is denoted as SUMU$_\textit{PEUO}$. 
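One distinction among the variants is how their upper bounds are computed: \textit{PEUO} and \textit{RSUO} can be accumulated in a single linear scan of a pattern's UOL-Chain, while tighter bounds such as \textit{TPUO} and \textit{TSUO} rely on \textit{minsup}-sized priority queues. The following is a minimal Java sketch of the underlying top-\textit{k} idea; the names are hypothetical, and we assume the per-sequence bound contributions are already available as plain numbers, which is a simplification of the actual UOL-Chain layout.

```java
import java.util.PriorityQueue;

// Illustrative sketch (hypothetical names): summing the k largest values of a
// stream with a k-sized min-heap, as a stand-in for the minsup-sized priority
// queues behind tighter bounds such as TPUO/TSUO. Runs in O(n log k) rather
// than requiring a full sort of all n contributions.
public class TopKSum {

    public static double topKSum(double[] values, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // natural-order min-heap
        for (double v : values) {
            heap.offer(v);
            if (heap.size() > k) {
                heap.poll(); // evict the smallest of the k+1 kept values
            }
        }
        double sum = 0.0;
        for (double v : heap) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Per-sequence bound contributions of one hypothetical pattern.
        double[] contributions = {0.4, 0.9, 0.1, 0.7, 0.3};
        // With minsup = 3, only the 3 largest contributions (0.9, 0.7, 0.4) count.
        System.out.println(topKSum(contributions, 3));
    }
}
```

Dividing such a top-\textit{minsup} sum by \textit{minsup} bounds the average contribution over any set of at least \textit{minsup} supporting sequences; the precise definitions of \textit{TPUO} and \textit{TSUO} are given earlier in the paper.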
To analyze the gaps between \textit{PEUO} and \textit{TPUO}, and between \textit{RSUO} and \textit{TSUO}, we also designed another variant of SUMU (with Strategies \ref{strategy1}, \ref{strategy4}, and \ref{strategy5}), denoted as SUMU$_\textit{TPUO}$. In addition, on the basis of SUMU$_\textit{PEUO}$, a fourth variant, namely SUMU$_\textit{PES}$, is designed to evaluate the two upper bounds on the support measure. These variants are compared to comprehensively evaluate the effectiveness and efficiency of SUMU.\n\n\subsection{Pattern Analysis}\n\nIn this section, we mainly discuss how the number of HUOSPs changes as \textit{minsup} or \textit{minuo} changes. The results for various \textit{minsup} under a fixed \textit{minuo} are shown in Table \ref{patterns_minsup}. Likewise, the results for various \textit{minuo} under a fixed \textit{minsup} are shown in Table \ref{patterns_minuo}. For each dataset, we use \textit{minsup}$_1$, \textit{minsup}$_2$ (or \textit{minuo}$_1$, \textit{minuo}$_2$), and so on to denote successively larger settings of the parameter \textit{minsup} (or \textit{minuo}). For instance, for the Bible dataset, the six settings of \textit{minsup} are 300, 400, 500, 600, 700, and 800, and the six settings of \textit{minuo} are 0.01, 0.03, 0.05, 0.07, 0.09, and 0.11. The detailed parameter settings can be observed in Fig. \ref{runtime_minsup} and Fig. 
\ref{runtime_minuo}.\n\n\n\n\n\begin{table}[H]\n\t\centering\n\t\caption{Number of patterns generated by varying \textit{minsup}}\n\t\label{patterns_minsup}\n\t\resizebox{\columnwidth}{!}{\n\t\t\begin{tabular}{c|cccccc}\n\t\t\t\hline \hline\n\t\t\t\multirow{2}{*}{\textbf{Dataset}} & \multicolumn{6}{c}{\# \textbf{patterns}} \\ \cline{2-7} \n\t\t\t& $\textit{minsup}_{1}$ & $\textit{minsup}_{2}$ & $\textit{minsup}_{3}$ & $\textit{minsup}_{4}$ & $\textit{minsup}_{5}$ & $\textit{minsup}_{6}$ \\ \hline\n\t\t\tBible, \textit{minuo} = 0.1 & 21,442 & 11,008 & 6,527 & 4,290 & 2,993 & 2,211 \\ \hline\n\t\t\tFIFA, \textit{minuo} = 0.1 & 1,162 & 259 & 87 & 38 & 14 & 7 \\ \hline\n\t\t\tSign, \textit{minuo} = 0.1 & 147,517 & 74,532 & 40,936 & 23,879 & 14,521 & 9,165 \\ \hline\n\t\t\tSyn10k, \textit{minuo} = 0.1 & 5,732,182 & 1,311,583 & 488,651 & 165,915 & 96,636 & 76,824 \\ \hline\n\t\t\tSyn20k, \textit{minuo} = 0.1 & 3,751,369 & 1,470,986 & 766,501 & 325,895 & 178,157 & 124,254 \\ \hline\n\t\t\tSyn40k, \textit{minuo} = 0.1 & 7,223,421 & 5,144,928 & 3,710,872 & 2,087,673 & 1,142,202 & 770,393 \\ \hline\n\t\t\t\hline \n\t\t\end{tabular}\n\t}\n\end{table}\n\n\n\begin{table}[H]\n\t\centering\n\t\caption{Number of patterns generated by varying \textit{minuo}}\n\t\label{patterns_minuo}\n\t\resizebox{\columnwidth}{!}{\n\t\t\begin{tabular}{c|cccccc}\n\t\t\t\hline \hline\n\t\t\t\multirow{2}{*}{\textbf{Dataset}} & \multicolumn{6}{c}{\# \textbf{patterns}} \\ \cline{2-7} \n\t\t\t& $\textit{minuo}_{1}$ & $\textit{minuo}_{2}$ & $\textit{minuo}_{3}$ & $\textit{minuo}_{4}$ & $\textit{minuo}_{5}$ & $\textit{minuo}_{6}$ \\ \hline\n\t\t\tBible, \textit{minsup} = 500 & 11,721 & 11,668 & 11,390 & 10,433 & 8,037 & 5,012 \\ \hline\n\t\t\tFIFA, \textit{minsup} = 4,000 & 1,093 & 870 & 499 & 212 & 69 & 20 \\ \hline\n\t\t\tSign, \textit{minsup} = 70 & 40,936 & 28,375 & 18,330 & 11,087 & 6,134 & 3,136 \\ 
\hline\n\t\t\tSyn10k, \textit{minsup} = 14 & 488,651 & 435,881 & 367,058 & 287,760 & 204,788 & 130,351 \\ \hline\n\t\t\tSyn20k, \textit{minsup} = 24 & 766,501 & 660,716 & 513,359 & 355,981 & 217,649 & 117,495 \\ \hline\n\t\t\tSyn40k, \textit{minsup} = 34 & 7,223,421 & 6,737,579 & 5,831,242 & 4,550,068 & 3,092,664 & 1,777,904 \\ \hline\n\t\t\t\hline \n\t\t\end{tabular}\n\t}\n\end{table}\n\n\nFrom Tables \ref{patterns_minsup} and \ref{patterns_minuo}, it is clear that the number of generated HUOSPs differs considerably across datasets as \textit{minsup} or \textit{minuo} is adjusted. In particular, the number of generated HUOSPs on the synthetic datasets is higher than on the real datasets. This is because each itemset in the synthetic datasets contains multiple items and can thus form more candidate patterns. Furthermore, as \textit{minsup} decreases step by step, the number of HUOSPs increases rapidly. For example, the difference between the numbers of patterns generated under \textit{minsup}$_1$ and \textit{minsup}$_2$ is larger than that between \textit{minsup}$_2$ and \textit{minsup}$_3$. This phenomenon is reasonable and also occurs in frequent itemset mining and sequential pattern mining. The utility occupancy measure, however, behaves differently: the number of generated HUOSPs increases only gradually as \textit{minuo} is decreased, because the HUOSPs generated by SUMU do not vary much for smaller \textit{minuo} settings. Similar behavior can be found in the HUOPM algorithm \cite{gan2019huopm}.\n\n\subsection{Efficiency Analysis}\n\nIn this subsection, we conduct extensive experiments to evaluate the performance of the different upper bounds and pruning strategies used in SUMU. The results in terms of runtime for various \textit{minsup} and \textit{minuo} settings are shown in Fig. \ref{runtime_minsup} and Fig. 
\\ref{runtime_minuo}. And the results in terms of candidate patterns for various \\textit{minsup} and \\textit{minuo} settings are shown in Tables \\ref{candidates_minsup} and \\ref{candidates_minuo}.\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminsup.pdf}\n\t\\caption{Running time under various \\textit{minsup} and a fixed \\textit{minuo} = 0.1.}\n\t\\label{runtime_minsup}\n\\end{figure}\n\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minsup}}\n\t\\label{candidates_minsup}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minsup}_{1}$ & $\\textit{minsup}_{2}$ & $\\textit{minsup}_{3}$ & $\\textit{minsup}_{4}$ & $\\textit{minsup}_{5}$ & $\\textit{minsup}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 464,804 & 263,807 & 171,872 & 123,398 & 94,177 & 75,449 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 321,866 & 168,372 & 103,179 & 68,755 & 48,108 & 35,632 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 35,367 & 18,999 & 11,721 & 7,967 & 5,737 & 4,354 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minuo} = 0.1}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 678,816 & 268,081 & 115,283 & 57,603 & 31,384 & 18,504 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 214,531 & 80,469 & 35,373 & 17,818 & 9,930 & 5,545 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 14,710 & 5,787 & 2,399 & 1,099 & 557 & 296 
\\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,237,763 & 2,494,589 & 1,588,257 & 1,061,989 & 742,966 & 538,042 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,668,153 & 2,131,189 & 1,284,130 & 834,756 & 553,160 & 390,477 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 372,610 & 208,839 & 126,752 & 81,340 & 54,695 & 38,095 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 34,672,439 & 10,131,127 & 4,119,006 & 1,870,268 & 1,126,628 & 802,480 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 32,762,145 & 9,533,071 & 3,727,340 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 32,762,136 & 9,533,068 & 3,727,339 & 1,661,839 & 974,145 & 725,941 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 5,968,170 & 1,412,210 & 537,899 & 194,708 & 115,473 & 89,559 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 23,741,985 & 11,371,864 & 5,991,112 & 3,132,183 & 2,004,809 & 1,497,083 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 23,391,275 & 11,167,493 & 5,706,748 & 2,943,750 & 1,845,769 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 23,391,272 & 11,167,489 & 5,706,744 & 2,943,747 & 1,845,768 & 1,364,875 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 3,916,112 & 1,578,521 & 841,369 & 377,140 & 215,095 & 151,514 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minuo} = 0.1}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 29,989,061 & 21,471,028 & 14,398,082 & 8,602,677 & 6,268,961 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& 
{SUMU$_\\textit{PEUO}$} & 38,953,081 & 29,551,866 & 21,110,384 & 14,105,707 & 8,252,704 & 5,995,403 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 29,551,845 & 21,110,361 & 14,105,693 & 8,252,697 & 5,995,389 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 5,326,206 & 3,857,950 & 2,206,183 & 1,239,766 & 850,635 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\n\n\n\nFrom Fig. \\ref{runtime_minsup} and Table \\ref{candidates_minsup}, under different \\textit{minsup} settings, we can clearly see that the runtime of the variant SUMU$_\\textit{simple}$ is the worst on the datasets Bible and FIFA, while the runtime of the variant SUMU$_\\textit{TPUO}$ is the worst on the datasets Sign, Syn20k, and Syn40k. SUMU$_\\textit{simple}$ and SUMU$_\\textit{TPUO}$ perform similarly on Syn10k; however, when \\textit{minsup} is set to 10, the runtime of SUMU$_\\textit{TPUO}$ exceeds that of SUMU$_\\textit{simple}$. In addition, the variant SUMU$_\\textit{PES}$, which uses all four upper bounds (\\textit{PEUO}, \\textit{RSUO}, \\textit{PES}, and \\textit{RSS}), achieves the best performance on all datasets, and the variant SUMU$_\\textit{PEUO}$ is the second fastest. SUMU$_\\textit{PES}$ also reduces candidate pattern generation the most, while SUMU$_\\textit{simple}$ reduces it the least. These results are as expected. From the experiments under different \\textit{minsup} and a fixed \\textit{minuo}, we can draw the following conclusions.\n\n\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item SUMU$_\\textit{PES}$ adopts a sufficient number of pruning strategies to significantly reduce candidate patterns while achieving the shortest runtime. Compared to the other variants of SUMU, SUMU$_\\textit{PES}$ generates far fewer candidate patterns. 
Although it generates many times fewer candidate patterns than the other variants, its overall runtime is only a few times shorter. This is because many of the pruned candidates are unpromising patterns that would, in any case, have been discarded in the subsequent steps of the program.\n\t\n\t\\item The difference between SUMU$_\\textit{simple}$ and SUMU$_\\textit{PEUO}$ demonstrates that pruning Strategy \\ref{strategy1} is ineffective on the synthetic datasets. This is because \\textit{minsup} is set to relatively small values on these datasets, and thus few unpromising items appear in the sequence dataset.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, their calculation makes SUMU$_\\textit{TPUO}$ take longer than SUMU$_\\textit{PEUO}$. For a candidate pattern, SUMU$_\\textit{PEUO}$ is able to compute the upper bounds \\textit{PEUO} and \\textit{RSUO} in linear time, whereas \\textit{TSUO} requires multiple sorting operations, which is a costly process. In addition, SUMU$_\\textit{TPUO}$ does not reduce any candidate patterns on many datasets (including Bible, FIFA, and Sign). Even on the remaining datasets where it does take effect, it reduces the number of candidate patterns by only a very small amount.\n\\end{enumerate}\n\n\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.32]{figs/runtimeminuo.pdf}\n\t\\caption{Running time under various \\textit{minuo} and a fixed \\textit{minsup}. (a) Bible, \\textit{minsup} = 500. (b) FIFA, \\textit{minsup} = 4,000. (c) Sign, \\textit{minsup} = 70. (d) Syn10k, \\textit{minsup} = 14. (e) Syn20k, \\textit{minsup} = 24. 
(f) Syn40k, \\textit{minsup} = 34.}\n\t\\label{runtime_minuo}\n\\end{figure}\n\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Number of candidate patterns generated by varying \\textit{minuo}}\n\t\\label{candidates_minuo}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{|c|c|c|c|c|c|c|c|}\n\t\t\t\\hline \\textbf{Dataset} & \\textbf{Result} & $\\textit{minuo}_{1}$ & $\\textit{minuo}_{2}$ & $\\textit{minuo}_{3}$ & $\\textit{minuo}_{4}$ & $\\textit{minuo}_{5}$ & $\\textit{minuo}_{6}$ \\\\\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{4}{*}{\\shortstack{Bible\\\\ \\textit{minsup} = 500}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,921,104 & 672,801 & 390,245 & 264,424 & 195,445 & 153,171 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,102,717 & 443,242 & 248,731 & 162,948 & 117,785 & 91,903 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 & 11,721 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{FIFA \\\\ \\textit{minsup} = 4000}}\n\t\t\t& {SUMU$_\\textit{simple}$} & 280,551 & 178,428 & 144,686 & 102,302 & 69,806 & 48,009 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,098 & 30,487 & 25,968 & 22,128 & 19,040 & 16,766 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 & 1,099 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Sign\\\\ \\textit{minsup} = 70}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 1,588,257 & 1,393,525 & 1,233,358 & 1,101,274 & 988,726 & 894,110 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 1,284,130 & 1,129,929 & 1,002,217 & 895,976 & 806,047 & 730,904 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 1,284,130 & 1,129,928 & 1,002,216 & 
895,923 & 805,911 & 730,644 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 126,752 & 126,743 & 126,716 & 126,652 & 126,478 & 126,240 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn10k \\\\ \\textit{minsup} = 14}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 4,119,006 & 2,916,190 & 2,399,859 & 2,021,727 & 1,747,076 & 1,534,885 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 3,727,340 & 2,790,561 & 2,301,336 & 1,948,286 & 1,680,334 & 1,485,466 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 3,727,339 & 2,790,547 & 2,301,264 & 1,948,081 & 1,679,416 & 1,482,724 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 537,899 & 537,774 & 537,328 & 536,225 & 533,228 & 526,147 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn20k \\\\ \\textit{minsup} = 24}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 5,991,112 & 5,044,232 & 4,517,900 & 4,112,343 & 3,773,772 & 3,423,454 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 5,706,748 & 4,904,484 & 4,430,131 & 4,041,273 & 3,712,031 & 3,357,823 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 5,706,744 & 4,904,420 & 4,429,752 & 4,040,166 & 3,709,467 & 3,352,790 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 841,369 & 840,970 & 838,452 & 830,080 & 812,793 & 785,422 \\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\multirow{4}{*}{\\shortstack{Syn40k \\\\ \\textit{minsup} = 34}} \n\t\t\t& {SUMU$_\\textit{simple}$} & 39,435,204 & 36,517,396 & 34,219,382 & 31,977,235 & 29,399,324 & 26,176,962 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PEUO}$} & 38,953,081 & 36,148,568 & 33,947,656 & 31,690,858 & 29,135,445 & 25,872,139 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{TPUO}$} & 38,953,054 & 36,148,106 & 33,944,569 & 31,678,889 & 29,102,992 & 25,803,949 \\\\\n\t\t\t\\cline{2-2}\n\t\t\t& {SUMU$_\\textit{PES}$} & 7,454,317 & 7,452,659 & 7,440,512 & 7,403,338 & 7,327,730 & 7,200,566 
\\\\\n\t\t\t\\hline\n\t\t\t\n\t\t\t\\hline\n\t\t\\end{tabular}\n\t}\n\t\n\\end{table}\n\nFurthermore, from Fig. \\ref{runtime_minuo} and Table \\ref{candidates_minuo}, under different \\textit{minuo} settings, we can clearly observe that SUMU$_\\textit{PES}$ is the fastest variant of SUMU. On the datasets Sign, Syn10k, Syn20k, and Syn40k, some fluctuations occur for all the variants of SUMU, but the overall trend is still clear. Regardless of which dataset is processed, the runtime curve of SUMU$_\\textit{PES}$ becomes smoother as \\textit{minuo} is adjusted. In particular, the number of candidate patterns generated by SUMU$_\\textit{PES}$ on the Bible and FIFA datasets does not change at all. From the experiments under different \\textit{minuo} and a fixed \\textit{minsup}, we can draw the following conclusions.\n\n\\begin{enumerate}[label=(\\arabic*)]\n\t\\item Unlike the experiments in which \\textit{minsup} was tuned, the runtime of each variant of SUMU is not much affected by the setting of \\textit{minuo}. On the datasets Bible and FIFA, the runtimes of SUMU$_\\textit{PEUO}$, SUMU$_\\textit{TPUO}$, and SUMU$_\\textit{PES}$ hardly increase when \\textit{minuo} decreases, even though the numbers of candidate patterns for SUMU$_\\textit{PEUO}$ and SUMU$_\\textit{TPUO}$ increase substantially. This suggests that the support measure plays a greater role in determining the program runtime than the utility occupancy measure.\n\t\n\t\\item SUMU$_\\textit{PES}$ still achieves the fastest runtime owing to the effective pruning strategies it uses. Moreover, on each dataset, it does not generate many more candidate patterns as \\textit{minuo} decreases: the upper bounds \\textit{PES} and \\textit{RSS} already eliminate many invalid candidate patterns.\n\t\n\t\\item Although \\textit{TPUO} and \\textit{TSUO} are tighter upper bounds, as \\textit{minuo} decreases they still do not eliminate many irrelevant candidate patterns for SUMU$_\\textit{TPUO}$. 
\t\n\\end{enumerate}\n\n\n\\subsection{Memory Evaluation}\n\nThe memory consumption of the variants of SUMU is similar and fluctuates slightly; we therefore report the approximate memory consumption on each dataset and investigate the reasons for the remaining disparities based on program design details. The experimental results regarding memory consumption are shown in Table \\ref{memory}.\n\n\\begin{table}[h]\n\t\\centering\n\t\\caption{Memory consumption}\n\t\\label{memory}\n\t\\resizebox{\\columnwidth}{!}{\n\t\t\\begin{tabular}{c|cccccc}\n\t\t\t\\hline \\hline\n\t\t\t\\multirow{2}{*}{} & \\multicolumn{6}{c}{\\textbf{Approximate memory consumed (MB)}} \\\\ \\cline{2-7} \n\t\t\t& Bible & FIFA & Sign & Syn10k & Syn20k & Syn40k \\\\ \\hline\n\t\t\tfixed \\textit{minuo} & 1,000 $\\sim$ 1,400 & 1,400 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 800 & 600 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\tfixed \\textit{minsup} & 1,000 $\\sim$ 1,400 & 1,500 $\\sim$ 1,700 & 200 $\\sim$ 400 & 300 $\\sim$ 600 & 300 $\\sim$ 800 & 1,200 $\\sim$ 1,500 \\\\ \\hline\n\t\t\t\\hline \n\t\t\\end{tabular}\n\t}\n\\end{table}\n\n\nSince each variant of SUMU uses both UOL-Chain and UO-Table, their memory consumption differs little, and the differences are within a reasonable range. The variants of SUMU employ different numbers of pruning strategies, and thus they differ somewhat in their use of auxiliary data structures. If pruning Strategy \\ref{strategy1} is used, then unpromising items must be filtered out. To find out which items are unpromising, the program utilizes a hash table to record the support of each item. Otherwise, if all items are used directly, it is sufficient for the program to use a single list to record the items that occur in the sequence database. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{TPUO}}$ is that they use different upper bounds and pruning strategies. 
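The support-based filtering behind pruning Strategy \ref{strategy1} described above can be sketched as follows. This is a minimal illustration only, assuming a plain list-of-itemset-sequences layout; the function name and data layout are hypothetical and not the authors' implementation.

```python
from collections import defaultdict

def filter_unpromising_items(sequences, minsup):
    """Record each item's support (number of sequences containing it) in a
    hash table, then drop items whose support is below minsup, as described
    for pruning Strategy 1.  `sequences` is a list of sequences, each a list
    of itemsets (lists of items).  Sketch only; names are hypothetical."""
    support = defaultdict(int)
    for seq in sequences:
        # Count each distinct item at most once per sequence.
        for item in {i for itemset in seq for i in itemset}:
            support[item] += 1
    promising = {i for i, s in support.items() if s >= minsup}
    # Rebuild the database keeping only promising items (empty itemsets
    # are left in place for simplicity in this sketch).
    return [[[i for i in itemset if i in promising] for itemset in seq]
            for seq in sequences]
```

Without Strategy \ref{strategy1}, a single list of the items occurring in the database would suffice, as noted in the text.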
Computing the \\textit{PEUO} and \\textit{RSUO} of a pattern is relatively simple: the program quickly scans the UOL-Chain of the pattern and accumulates the corresponding values. However, for the calculation of the \\textit{TPUO} and \\textit{TSUO} of a pattern, several \\textit{minsup}-sized priority queues are required. This allows tighter upper bound values to be computed, but it also consumes additional memory space. As for SUMU$_{\\textit{PES}}$, it additionally uses the upper bounds \\textit{PES} and \\textit{RSS} (i.e., adopts pruning Strategies \\ref{strategy6} and \\ref{strategy7}) compared to SUMU$_{\\textit{PEUO}}$. This means that, during pattern extension, the program needs the associated hash tables to decide which candidate patterns satisfy the upper bounds \\textit{PES} and \\textit{RSS}. It may seem that the more upper bounds and pruning strategies are used, the more memory is consumed. Nevertheless, effective pruning strategies also save memory, since candidate patterns that are never generated never require a UOL-Chain or UO-Table to be built. Therefore, the memory consumption of each variant of SUMU is roughly equal.\n\n\\subsection{Scalability}\n\nThis subsection uses five synthetic datasets to evaluate the scalability of each variant of SUMU. The dataset size increases from 10k to 50k sequence records, in increments of 10k. We set a relative support for the experiments, i.e., \\textit{minsup} was set to 10, 20, 30, 40, and 50 for the five synthetic datasets, respectively. In addition, \\textit{minuo} is set to 0.1 in order to generate more HUOSPs. We analyze the scalability in terms of runtime and candidate pattern generation, and the experimental results are shown in Fig. \\ref{scalability}.\n\n\\begin{figure}[h]\n\t\\centering\n\t\\includegraphics[trim=0 0 0 0,clip,scale=0.4]{figs/scalability.pdf}\n\t\\caption{Scalability of the compared variants of SUMU}\n\t\\label{scalability}\n\\end{figure}\n\nFrom Fig. 
\\ref{scalability}, it is clear that the runtime of each variant of SUMU grows as the size of the processed dataset increases. This is consistent with our expectation that larger datasets carry more candidate patterns, increasing the processing difficulty. The use of UOL-Chain and UO-Table makes the trend of each variant of SUMU the same, with only differences in efficiency. The difference between the SUMU variants is clear, with SUMU$_{\\textit{PES}}$ performing best and SUMU$_{\\textit{TPUO}}$ performing worst. Unlike the other variants of SUMU, which generate large numbers of candidate patterns, SUMU$_{\\textit{PES}}$ exhibits no such rapid growth of candidate patterns and therefore performs well when handling large-scale datasets. The large number of sorting operations required for the calculation of the tighter upper bounds, in particular, causes SUMU$_{\\textit{TPUO}}$ to perform poorly. The difference between SUMU$_{\\textit{PEUO}}$ and SUMU$_{\\textit{simple}}$ illustrates the effectiveness of pruning Strategy \\ref{strategy1}.\n\n", "Descriptive_question1": "What is the approximate memory consumption range for the Bible dataset in table_8 under fixed minuo?", "Descriptive_question2": "Which dataset shows the lowest memory consumption range in table_8 under fixed minsup?", "Reasoning_question1": "Why might the memory consumption for FIFA be higher than for Sign in table_8, despite both being real datasets?", "Reasoning_question2": "How does the variation in memory consumption between fixed minuo and fixed minsup settings in table_8 reflect the impact of pruning strategies on memory usage?", "Descriptive_answer1": "1,000 ~ 1,400", "Descriptive_answer2": "Sign", "Reasoning_answer1": "The higher memory consumption for FIFA compared to Sign in Table 8 can be attributed to the differences in dataset characteristics and complexity. First, FIFA has a significantly larger number of sequences (20,450) and items (2,990) compared to Sign (730 sequences and 267 items), as noted in the dataset descriptions. This larger dataset size likely requires more memory for storing data structures like UOL-Chain and UO-Table. 
Second, FIFA's average sequence length (36.23) is longer than Sign's (27.11), potentially leading to more complex pattern generation and thus higher memory usage during processing. Additionally, the experimental results show FIFA's memory range is 1,400 ~ 1,700 MB (fixed minuo) and 1,500 ~ 1,700 MB (fixed minsup), while Sign's range is consistently 200 ~ 400 MB, indicating that the computational overhead for FIFA is inherently greater due to its scale and density, regardless of the pruning strategies applied.", "Reasoning_answer2": "The variation in memory consumption between fixed minuo and fixed minsup settings in Table 8 highlights the nuanced impact of pruning strategies on memory usage. For most datasets, such as Bible and Sign, the memory consumption remains identical across both settings (e.g., Bible at 1,000 ~ 1,400 MB and Sign at 200 ~ 400 MB), suggesting that the core data structures like UOL-Chain and UO-Table dominate memory usage, and the parameter settings (minuo or minsup) have minimal differential effect. However, slight variations appear in datasets like FIFA (1,400 ~ 1,700 MB under fixed minuo vs. 1,500 ~ 1,700 MB under fixed minsup) and Syn10k (300 ~ 800 MB vs. 300 ~ 600 MB), indicating that specific pruning strategies tied to these parameters can influence auxiliary data structures, such as hash tables for filtering unpromising items under Strategy 1. The text explains that while more pruning strategies might intuitively increase memory due to additional structures, effective pruning can also reduce memory by avoiding unnecessary builds of candidate patterns and associated data structures. This balance results in the observed minor fluctuations, showing that pruning strategies have a limited but discernible impact on memory consumption depending on dataset characteristics and parameter settings." 
}, { "paper_id": "1812.04423.json", "table_id": "table_1", "table_content": "\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}", "caption": "Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.", "label": "tab:2d", "section_info": "4 Numerical Experiments\n\\section{Numerical Experiments}\n\\label{sec:num}\nIn this section, we present several numerical experiments in both 2D and 3D to verify the result in Theorem~\\ref{thm:aux} on the performance of the proposed preconditioners. In all these tests, we use a 2-sweep symmetric Gauss-Seidel smoother. The stopping criterion for the PCG algorithm is $\\|r_{k}\\| / \\|r_{0}\\| <10^{-12}$, where $r_{k}= f-Au_{k}$ is the residual. For the coarse solver, we use the AMG algorithm implemented in $i$FEM~\\cite{Chen.L2008}. \n\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. 
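The PCG iteration with the relative-residual stopping rule $\|r_{k}\|/\|r_{0}\| < 10^{-12}$ quoted above can be sketched as follows. This is a generic pure-Python illustration on a small dense SPD matrix, not the authors' code; the preconditioner action `apply_B` stands in for any of the preconditioners compared in these experiments (identity if omitted).

```python
import math

def pcg(A, b, apply_B=None, tol=1e-12, maxit=1000):
    """Preconditioned conjugate gradients for an SPD matrix A (dense,
    list-of-lists), stopping when ||r_k|| / ||r_0|| < tol as in the text.
    Returns the approximate solution and the iteration count.  Sketch only."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    x = [0.0] * n
    r = list(b)                        # r0 = b - A*x0 with x0 = 0
    r0 = math.sqrt(dot(r, r))
    if r0 == 0.0:
        return x, 0
    z = apply_B(r) if apply_B else list(r)   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for k in range(maxit):
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) / r0 < tol:  # relative-residual stopping rule
            return x, k + 1
        z = apply_B(r) if apply_B else list(r)
        rz_next = dot(r, z)
        beta = rz_next / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_next
    return x, maxit
```

The reported iteration counts in the tables correspond to the `k + 1` returned here; a good preconditioner keeps that count nearly constant as the mesh is refined.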
Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. The VEM discretization is defined on the polytopal mesh (cf. Figure~\\ref{fig:poly2d}), while the auxiliary space, using the standard conforming $\\P_{1}$ finite element discretization, is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). \n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTable~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space preconditioner using the conforming FEM. As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, and the condition number $\\cK(B_{{\\rm fict}}A)$ also increases slightly as the mesh is refined. \nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygon element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer in each polygonal element is the magnitude of the coefficient). 
\n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings are different in different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ remain nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes, which violate assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example with 100 polygons). As we can observe from this figure, the aspect ratios of some polygons are quite high -- thus the partition is no longer quasi-uniform. 
Similar to before, we use the Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers (the number of PCG iterations) for the different preconditioners. As we can see from this table, both the additive and multiplicative auxiliary space preconditioners remain robust with respect to the problem size. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) on a 2D Voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n\\subsection{3D Example}\nNow we consider the model problem \\eqref{eqn:model} in a 3D cubic domain $\\Omega =[0,1]^{3}$. We subdivide the domain into hexahedral elements (cubes) with mesh size $h$ at each level. The VEM discretization is defined on the hexahedral mesh. 
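The additive auxiliary-space preconditioner compared above combines a smoother on the VEM space with a correction transported from the auxiliary conforming space; a common generic form is $B_{\rm add} = S + P A_{c}^{-1} P^{T}$, with $P$ the transfer operator and $A_{c}$ the auxiliary-space operator. The following dense pure-Python sketch illustrates that assumed form only; the function names are hypothetical and this is not the authors' implementation.

```python
def additive_preconditioner(smoother, P, coarse_solve):
    """Action of a generic additive auxiliary-space preconditioner,
    r -> S r + P A_c^{-1} P^T r.  `smoother` applies S on the fine space,
    `P` (n x nc, list-of-lists) maps auxiliary-space vectors to the fine
    space, and `coarse_solve` applies A_c^{-1}.  Sketch only."""
    def apply_B(r):
        n, nc = len(P), len(P[0])
        s = smoother(r)                                   # smoother contribution
        # Restrict the residual to the auxiliary space: rc = P^T r.
        rc = [sum(P[i][j] * r[i] for i in range(n)) for j in range(nc)]
        ec = coarse_solve(rc)                             # auxiliary-space solve
        # Prolongate the correction back and add the two contributions.
        corr = [sum(P[i][j] * ec[j] for j in range(nc)) for i in range(n)]
        return [si + ci for si, ci in zip(s, corr)]
    return apply_B
```

In the setting of these experiments, `smoother`, `P`, and `coarse_solve` would correspond to the symmetric Gauss-Seidel sweeps, the VEM-to-conforming-FEM transfer, and the AMG coarse solver mentioned earlier.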
For the auxiliary space, we further divide each hexahedron into six tetrahedra to construct the auxiliary mesh and to define the $\\P_{1}$ conforming finite element discretization on this auxiliary mesh (see, for example, Figure~\\ref{fig:jump3d}).\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIn this example, we test various discontinuous coefficient settings. Let $\\Omega_{1} =[0.25,0.5]^{3}$ and $\\Omega_{2} = [0.5,0.75]^{3}$ (see Figure~\\ref{fig:jump3d}). We set the coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}}= \\kappa_{1}= 10^{k}$ (with $k=-6, -4, -2, 0, 2, 4, 6$) and $\\kappa|_{\\Omega\\setminus (\\Omega_{1}\\cup\\Omega_{2})} = 1$.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/touchingcubesmesh.jpg}\n \\caption{{\\footnotesize 3D uniform mesh with jump coefficients}}\n \\label{fig:jump3d}\n\\end{center}\n \\end{figure}\nTable~\\ref{tab:3d} presents the estimated condition numbers of the preconditioned systems for different choices of $\\kappa_{1}$ and mesh size.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n{\\scriptsize\n\\begin{table}\n\\caption{Estimated condition numbers (number of PCG iterations) in 3D. 
The coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}} =\\kappa_{1}= 10^{k}$ for various choices of $k$, and $\\kappa|_{\\Omega\\setminus(\\Omega_{1}\\cup \\Omega_{2})} =1.$ }\n\\begin{center}\\begin{tabular}{c|c||c|c|c|c|c|c}\n\\hline\n $\\kappa_{1}$ & $ h$ & $2^{-2}$ & $2^{-3}$ & $2^{-4}$ & $2^{-5}$ & $2^{-6}$ & $2^{-7}$ \\\\\n\\hline\\hline\n\\multirow{5}{*}{$10^{-6}$} & $\\mathcal{K}(A)$& 1.15e6 (8) & 8.76e6 (28) & 6.94e7 (56) & 5.54e8 (110) & 4.43e9 (215) & 3.54e10 (420) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (43) & 9.57e1 (71) & 3.81e2 (118) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (10) & 1.35 (11) & 1.73 (15) & 1.92 (17) & 1.98 (17) & 1.99 (16) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-4}$} & $\\mathcal{K}(A)$& 1.15e4 (7) & 8.76e4 (26) & 6.94e5 (51) & 5.54e6 (99) & 4.43e7 (194) & 3.54e8 (379) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (38) & 9.57e1 (63) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (9) & 1.35 (11) & 1.73 (14) & 1.92 (15) & 1.98 (15) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-2}$} & $\\mathcal{K}(A)$& 1.15e2 (7) & 8.76e2 (24) & 6.94e3 (46) & 5.54e4 (90) & 4.43e5 (175) & 3.54e6 (346) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) 
\\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{1} & $\\mathcal{K}(A)$& 4.44 (6) & 1.74e1 (21) & 6.94e1 (40) & 5.54e2 (78) & 4.43e3 (153) & 3.54e4 (302)\\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{2}$} & $\\mathcal{K}(A)$& 3.88e2 (6) & 2.00e2 (22) & 9.98e1 (44) & 2.77e2 (80) & 1.11e3 (143) & 4.43e3 (273) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{4}$} & $\\mathcal{K}(A)$& 3.88e4 (6) & 2.00e4 (22) & 9.98e3 (47) & 5.00e3 (89) & 2.50e3 (163) & 4.43e3 (295) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{6}$} & $\\mathcal{K}(A)$ & 3.88e6 (9) & 2.00e6 (22) & 9.99e5 (51) & 5.00e5 (96) & 2.50e5 (180) & 
1.25e5 (331) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$ & 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$ & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\n\\hline\n\\end{tabular}\n\\label{tab:3d}\n\\end{center}\n\\end{table}\n}\nAs we can see from Table~\\ref{tab:3d}, the condition number of $A$ depends on both the coefficient $\\kappa$ and the mesh size. On the other hand, both the fictitious space preconditioner and the auxiliary space preconditioners (additive or multiplicative) are efficient and robust with respect to jumps in the coefficient $\\kappa$ and the mesh size. These results justify Theorem~\\ref{thm:aux} and Corollary~\\ref{cor:fict}.\n\n4.1 2D Examples\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. The VEM discretization is defined on the polytopal mesh (cf. Figure~\\ref{fig:poly2d}), while the auxiliary space using the standard conforming $\\P_{1}$ finite element discretization is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). 
\n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTable~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space preconditioner using the conforming FEM. As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, while the condition number $\\cK(B_{{\\rm fict}}A)$ increases only slightly. 
\nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygon element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer shown in each polygonal element is the magnitude of the coefficient). \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings are different in different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ remain nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes that violate the assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example of 100 polygons). As we can observe from this figure, the aspect ratios of some polygons are quite high -- thus the partition is no longer quasi-uniform. As before, we use the Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers and the numbers of PCG iterations for different preconditioners. As we can see from this table, both additive and multiplicative auxiliary space preconditioners are still robust with respect to the problem size. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n", "Descriptive_question1": "What is the estimated condition number for K(A) with 10 polytopal elements in table_1?", "Descriptive_question2": "How many PCG iterations are needed for K(B_mul A) with 10^5 polytopal elements in table_1?", "Reasoning_question1": "How does the condition number of K(B_add A) change as the number of polytopal elements increases from 10 to 10^5 in table_1, and what does this suggest about its scalability?", "Reasoning_question2": "Comparing the performance of K(B_sgs A) and K(B_mul A) in table_1, which preconditioner appears more effective for larger mesh sizes, and why?", "Descriptive_answer1": "3.45", "Descriptive_answer2": "6", "Reasoning_answer1": "As the number of polytopal elements increases from 10 to 10^5 in table_1, the condition number of K(B_add A) rises slightly from 1.53 to 2.00. This modest increase indicates that the condition number remains nearly stable despite a significant growth in mesh size. Starting with the value at 10 elements (1.53), it incrementally increases to 1.71 at 10^2, 1.94 at 10^3, 1.99 at 10^4, and finally reaches 2.00 at 10^5. 
This trend suggests excellent scalability, as the condition number does not grow proportionally with the mesh size, implying that the additive preconditioner effectively controls the system's conditioning across varying scales.", "Reasoning_answer2": "Comparing K(B_sgs A) and K(B_mul A) in table_1, K(B_mul A) appears more effective for larger mesh sizes. For K(B_sgs A), the condition number increases significantly from 1.07 at 10 elements to 3.17e3 at 10^5 elements, indicating a substantial rise in computational effort as reflected by the PCG iterations increasing from 6 to 318. In contrast, K(B_mul A) maintains a nearly constant condition number, starting at 1.06 at 10 elements and only slightly varying to 1.02 at 10^5 elements, with PCG iterations decreasing from 8 to 6. This stability in condition number and lower iteration count at larger mesh sizes suggest that the multiplicative preconditioner handles the increasing complexity of finer meshes much better, likely due to its ability to more effectively reduce the spectral radius of the preconditioned system." 
}, { "paper_id": "1812.04423.json", "table_id": "table_2", "table_content": "\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}", "caption": "Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.", "label": "tab:2djump", "section_info": "4 Numerical Experiments\n\\section{Numerical Experiments}\n\\label{sec:num}\nIn this section, we present several numerical experiments in both 2D and 3D to verify the result in Theorem~\\ref{thm:aux} on the performance of the proposed preconditioners. In all these tests, we use a 2-sweep symmetric Gauss-Seidel smoother. The stopping criterion is $\\|r_{k}\\| / \\|r_{0}\\| <10^{-12}$ for the PCG algorithm, where $r_{k}= f-Au_{k}$ is the residual. For the coarse solver, we use the AMG algorithm implemented in $i$FEM~\\cite{Chen.L2008}. \n\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. 
The VEM discretization is defined on the polytopal mesh (cf. Figure~\\ref{fig:poly2d}), while the auxiliary space using the standard conforming $\\P_{1}$ finite element discretization is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). \n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTables~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space 
preconditioner using the conforming FEM. As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, while the condition number $\\cK(B_{{\\rm fict}}A)$ increases only slightly. \nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygon element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer shown in each polygonal element is the magnitude of the coefficient). \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings are different in different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ remain nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes that violate the assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example of 100 polygons). As we can observe from this figure, the aspect ratios of some polygons are quite high -- thus the partition is no longer quasi-uniform. As before, we use the Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers and the numbers of PCG iterations for different preconditioners. As we can see from this table, both additive and multiplicative auxiliary space preconditioners are still robust with respect to the problem size. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) on a 2D Voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n\\subsection{3D Example}\nNow we consider the model problem \\eqref{eqn:model} in a 3D cubic domain $\\Omega =[0,1]^{3}$. We subdivide the domain into hexahedral elements (cubes) with mesh size $h$ at each level. The VEM discretization is defined on this hexahedral mesh. For the auxiliary space, we further divide each hexahedron into six tetrahedra to construct the auxiliary mesh and to define the $\\P_{1}$ conforming finite element discretization on this auxiliary mesh (see, for example, Figure~\\ref{fig:jump3d}).\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIn this example, we test various discontinuous coefficient settings. Let $\\Omega_{1} =[0.25,0.5]^{3}$ and $\\Omega_{2} = [0.5,0.75]^{3}$ (see Figure~\\ref{fig:jump3d}). 
We set the coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}}= \\kappa_{1}= 10^{k}$ (with $k=-6, -4, -2, 0, 2, 4, 6$) and $\\kappa|_{\\Omega\\setminus (\\Omega_{1}\\cup\\Omega_{2})} = 1$.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/touchingcubesmesh.jpg}\n \\caption{{\\footnotesize 3D uniform mesh with jump coefficients}}\n \\label{fig:jump3d}\n\\end{center}\n \\end{figure}\nTable~\\ref{tab:3d} presents the estimated condition number of the preconditioned systems with respect to different choice of $\\kappa_{1}$ and mesh size.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n{\\scriptsize\n\\begin{table}\n\\caption{Estimated condition numbers (number of PCG iterations) in 3D. The coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}} =\\kappa_{1}= 10^{k}$ for various choices of $k$, and $\\kappa|_{\\Omega\\setminus(\\Omega_{1}\\cup \\Omega_{2})} =1.$ }\n\\begin{center}\\begin{tabular}{c|c||c|c|c|c|c|c}\n\\hline\n $\\kappa_{1}$ & $ h$ & $2^{-2}$ & $2^{-3}$ & $2^{-4}$ & $2^{-5}$ & $2^{-6}$ & $2^{-7}$ \\\\\n\\hline\\hline\n\\multirow{5}{*}{$10^{-6}$} & $\\mathcal{K}(A)$& 1.15e6 (8) & 8.76e6 (28) & 6.94e7 (56) & 5.54e8 (110) & 4.43e9 (215) & 3.54e10 (420) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (43) & 9.57e1 (71) & 3.81e2 (118) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (10) & 1.35 (11) & 1.73 (15) & 1.92 (17) & 1.98 (17) & 1.99 (16) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-4}$} & $\\mathcal{K}(A)$& 1.15e4 (7) & 8.76e4 (26) & 6.94e5 (51) & 5.54e6 (99) & 4.43e7 (194) & 3.54e8 (379) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (38) & 9.57e1 (63) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 
1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (9) & 1.35 (11) & 1.73 (14) & 1.92 (15) & 1.98 (15) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-2}$} & $\\mathcal{K}(A)$& 1.15e2 (7) & 8.76e2 (24) & 6.94e3 (46) & 5.54e4 (90) & 4.43e5 (175) & 3.54e6 (346) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{1} & $\\mathcal{K}(A)$& 4.44 (6) & 1.74e1 (21) & 6.94e1 (40) & 5.54e2 (78) & 4.43e3 (153) & 3.54e4 (302)\\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{2}$} & $\\mathcal{K}(A)$& 3.88e2 (6) & 2.00e2 (22) & 9.98e1 (44) & 2.77e2 (80) & 1.11e3 (143) & 4.43e3 (273) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) 
\\\\\\hline\\hline\n\\multirow{5}{*}{$10^{4}$} & $\\mathcal{K}(A)$& 3.88e4 (6) & 2.00e4 (22) & 9.98e3 (47) & 5.00e3 (89) & 2.50e3 (163) & 4.43e3 (295) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{6}$} & $\\mathcal{K}(A)$ & 3.88e6 (9) & 2.00e6 (22) & 9.99e5 (51) & 5.00e5 (96) & 2.50e5 (180) & 1.25e5 (331) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$ & 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$ & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\n\\hline\n\\end{tabular}\n\\label{tab:3d}\n\\end{center}\n\\end{table}\n}\nAs we can see from Table~\\ref{tab:3d}, the condition number of $A$ depends on both the coefficient $\\kappa$ and the mesh size. On the other hand, both the fictitious space preconditioner and the auxiliary space preconditioners (additive or multiplicative) are efficient and robust with respect to jumps in the coefficient $\\kappa$ and the mesh size. These results justify Theorem~\\ref{thm:aux} and Corollary~\\ref{cor:fict}.\n\n4.1 2D Examples\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. 
Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. The VEM discretization is defined on the polytopal mesh (cf. Figure~\\ref{fig:poly2d}), while the auxiliary space using the standard conforming $\\P_{1}$ finite element discretization is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). \n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTable~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space preconditioner using the conforming FEM. As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, while the condition number $\\cK(B_{{\\rm fict}}A)$ increases only slightly. \nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygon element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer shown in each polygonal element is the magnitude of the coefficient). 
\n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings are different in different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes, which violate the assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example with 100 polygons). As we can observe from this figure, the aspect ratios of some polygons are quite high -- thus the partition is no longer quasi-uniform. 
As before, we use a Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers and the number of PCG iterations for the different preconditioners. As we can see from this table, both the additive and multiplicative auxiliary space preconditioners remain robust with respect to the problem size. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n", "Descriptive_question1": "What is the condition number for K(A) with 10 polytopal elements in table_2?", "Descriptive_question2": "How many PCG iterations are required for K(B_mul A) with 10^5 polytopal elements in table_2?", "Reasoning_question1": "Why do the condition numbers for K(A) and K(B_sgs A) fail to converge after a certain number of polytopal elements in table_2, and what does this suggest about their performance with jump coefficients?", "Reasoning_question2": "Compare the trend in condition numbers for K(B_add A) and K(B_mul A) as the number of 
polytopal elements increases in table_2. What does this indicate about the effectiveness of additive versus multiplicative preconditioners in handling jump coefficients?", "Descriptive_answer1": "2.44", "Descriptive_answer2": "17", "Reasoning_answer1": "The condition numbers for K(A) and K(B_sgs A) fail to converge after a certain number of polytopal elements (specifically, at 10^3 for K(A) and 10^4 for K(B_sgs A)) in table_2 because of the dramatic increase in condition numbers as the mesh is refined and jump coefficients are introduced. Observing the data, K(A) at 10^2 elements has a condition number of 2.73e6 with 578 iterations, and it fails to converge beyond 10^3 elements (indicated by '-'). Similarly, K(B_sgs A) shows a sharp rise from 3.90e2 at 10^2 elements to 3.93e3 at 10^3 elements, failing to converge at 10^4 elements. This suggests that both systems are highly sensitive to the discontinuities introduced by jump coefficients, leading to ill-conditioned matrices that require an excessive number of iterations (beyond the threshold of 1200) to reach the stopping criterion. This indicates poor performance and lack of robustness in handling jump coefficients, as the preconditioners struggle to stabilize the system under these conditions.", "Reasoning_answer2": "Comparing the trends in condition numbers for K(B_add A) and K(B_mul A) as the number of polytopal elements increases in table_2, we see distinct patterns. For K(B_add A), the condition number starts at 1.54 with 10 elements and gradually increases to 3.80 at 10^5 elements, showing a moderate rise (an increase of about 2.26 over five orders of magnitude). The number of iterations also rises slightly from 9 to 26. In contrast, K(B_mul A) begins at 1.06 with 10 elements and increases to only 1.88 at 10^5 elements, a much smaller rise of 0.82. Its iterations also increase minimally from 6 to 17. 
This indicates that both preconditioners are nearly uniformly bounded, but K(B_mul A) demonstrates superior stability with a flatter trend in condition numbers. This suggests that the multiplicative preconditioner is more effective in handling jump coefficients, as it maintains lower condition numbers and requires fewer iterations, providing better robustness and efficiency compared to the additive preconditioner in this context." }, { "paper_id": "1812.04423.json", "table_id": "table_3", "table_content": "\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}", "caption": "Estimated condition numbers (number of PCG iterations) in 2D voronoi polygonal mesh.", "label": "tab:2dvoronoi", "section_info": "4 Numerical Experiments\n\\section{Numerical Experiments}\n\\label{sec:num}\nIn this section, we present several numerical experiments in both 2D and 3D to verify the result in Theorem~\\ref{thm:aux} on the performance of the proposed preconditioners. In all these tests, we use 2-sweeps symmetric Gauss-Seidel smoother. The stopping criteria is $\\|r_{k}\\| / \\|r_{0}\\| <10^{-12}$ for the PCG algorithm, where $r_{k}= f-Au_{k}$ is the residual. 
For the coarse solver, we use the AMG algorithm implemented in $i$FEM~\\cite{Chen.L2008}. \n\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. The VEM discretization is defined on the polytopal mesh (cf. Figure~\\ref{fig:poly2d}), while the auxiliary space using the standard conforming $\\P_{1}$ finite element discretization is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). \n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTable~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space preconditioner using the conforming FEM. As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, while the condition number $\\cK(B_{{\\rm fict}}A)$ increases only slightly. \nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygonal element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer in each polygonal element indicates the magnitude of the coefficient). 
\n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings are different in different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes, which violate the assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example with 100 polygons). As we can observe from this figure, the aspect ratios of some polygons are quite high -- thus the partition is no longer quasi-uniform. 
As before, we use a Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers and the number of PCG iterations for the different preconditioners. As we can see from this table, both the additive and multiplicative auxiliary space preconditioners remain robust with respect to the problem size. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n\\subsection{3D Example}\nNow we consider the model problem \\eqref{eqn:model} in a 3D cubic domain $\\Omega =[0,1]^{3}$. We subdivide the domain into hexahedral elements (cubes) with mesh size $h$ at each level. The VEM discretization is defined on the hexahedral mesh. 
For the auxiliary space, we further divide each hexahedron into six tetrahedra to construct the auxiliary mesh and to define the $\\P_{1}$ conforming finite element discretization on this auxiliary mesh (see, for example, Figure~\\ref{fig:jump3d}).\n\nIn this example, we test various discontinuous coefficient settings. Let $\\Omega_{1} =[0.25,0.5]^{3}$ and $\\Omega_{2} = [0.5,0.75]^{3}$ (see Figure~\\ref{fig:jump3d}). We set the coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}}= \\kappa_{1}= 10^{k}$ (with $k=-6, -4, -2, 0, 2, 4, 6$) and $\\kappa|_{\\Omega\\setminus (\\Omega_{1}\\cup\\Omega_{2})} = 1$.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/touchingcubesmesh.jpg}\n \\caption{{\\footnotesize 3D uniform mesh with jump coefficients}}\n \\label{fig:jump3d}\n\\end{center}\n \\end{figure}\nTable~\\ref{tab:3d} presents the estimated condition numbers of the preconditioned systems with respect to different choices of $\\kappa_{1}$ and mesh size.\n\n{\\scriptsize\n\\begin{table}\n\\caption{Estimated condition numbers (number of PCG iterations) in 3D. 
The coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}} =\\kappa_{1}= 10^{k}$ for various choices of $k$, and $\\kappa|_{\\Omega\\setminus(\\Omega_{1}\\cup \\Omega_{2})} =1.$ }\n\\begin{center}\\begin{tabular}{c|c||c|c|c|c|c|c}\n\\hline\n $\\kappa_{1}$ & $ h$ & $2^{-2}$ & $2^{-3}$ & $2^{-4}$ & $2^{-5}$ & $2^{-6}$ & $2^{-7}$ \\\\\n\\hline\\hline\n\\multirow{5}{*}{$10^{-6}$} & $\\mathcal{K}(A)$& 1.15e6 (8) & 8.76e6 (28) & 6.94e7 (56) & 5.54e8 (110) & 4.43e9 (215) & 3.54e10 (420) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (43) & 9.57e1 (71) & 3.81e2 (118) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (10) & 1.35 (11) & 1.73 (15) & 1.92 (17) & 1.98 (17) & 1.99 (16) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-4}$} & $\\mathcal{K}(A)$& 1.15e4 (7) & 8.76e4 (26) & 6.94e5 (51) & 5.54e6 (99) & 4.43e7 (194) & 3.54e8 (379) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (38) & 9.57e1 (63) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (9) & 1.35 (11) & 1.73 (14) & 1.92 (15) & 1.98 (15) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-2}$} & $\\mathcal{K}(A)$& 1.15e2 (7) & 8.76e2 (24) & 6.94e3 (46) & 5.54e4 (90) & 4.43e5 (175) & 3.54e6 (346) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) 
\\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{1} & $\\mathcal{K}(A)$& 4.44 (6) & 1.74e1 (21) & 6.94e1 (40) & 5.54e2 (78) & 4.43e3 (153) & 3.54e4 (302)\\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{2}$} & $\\mathcal{K}(A)$& 3.88e2 (6) & 2.00e2 (22) & 9.98e1 (44) & 2.77e2 (80) & 1.11e3 (143) & 4.43e3 (273) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{4}$} & $\\mathcal{K}(A)$& 3.88e4 (6) & 2.00e4 (22) & 9.98e3 (47) & 5.00e3 (89) & 2.50e3 (163) & 4.43e3 (295) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{6}$} & $\\mathcal{K}(A)$ & 3.88e6 (9) & 2.00e6 (22) & 9.99e5 (51) & 5.00e5 (96) & 2.50e5 (180) & 
1.25e5 (331) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$ & 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$ & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\n\\hline\n\\end{tabular}\n\\label{tab:3d}\n\\end{center}\n\\end{table}\n}\nAs we can see from Table~\\ref{tab:3d}, the condition number of $A$ depends on both the coefficient $\\kappa$ and the mesh size. On the other hand, both the fictitious space preconditioner and the auxiliary space preconditioners (additive or multiplicative) are efficient and robust with respect to jumps in the coefficient $\\kappa$ and the mesh size. These results justify Theorem~\\ref{thm:aux} and Corollary~\\ref{cor:fict}.\n\n4.1 2D Examples\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. The VEM discretization is defined on the polytopal mesh (cf. Figure~\\ref{fig:poly2d}), while the auxiliary space using the standard conforming $\\P_{1}$ finite element discretization is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). 
\n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTable~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07(6) & 3.78 (15) &3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67(36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space preconditioner using the conforming FEM. As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, while the condition number $\\cK(B_{{\\rm fict}}A)$ increases only slightly. 
\nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygonal element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer in each polygonal element indicates the magnitude of the coefficient). \n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings are different in different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18(5) & 3.90e2 (26) &3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes, which violate the assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example with 100 polygons). As we can observe from this figure, the aspect ratios of some polygons are quite high -- thus the partition is no longer quasi-uniform. As before, we use a Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers and the number of PCG iterations for the different preconditioners. As we can see from this table, both the additive and multiplicative auxiliary space preconditioners remain robust with respect to the problem size. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n", "Descriptive_question1": "What is the estimated condition number for K(A) with 10 polytopal elements in table_3?", "Descriptive_question2": "How many PCG iterations are required for K(B_mul A) with 10^5 polytopal elements in table_3?", "Reasoning_question1": "How does the condition number of K(B_add A) change as the number of polytopal elements increases in table_3, and what might this indicate about the preconditioner's performance?", "Reasoning_question2": "Comparing K(A) and K(B_mul A) in table_3, which preconditioner appears more effective in maintaining a stable condition number across different mesh sizes, and why?", "Descriptive_answer1": "4.76", "Descriptive_answer2": "10", "Reasoning_answer1": "Looking at the condition numbers for K(B_add A) in table_3, I observe that they start at 1.58 for 10 elements, increase to 1.72 for 10^2, then to 3.09 for 10^3, slightly up to 3.16 for 10^4, and decrease to 1.91 for 10^5. This shows a general trend of slight increase up to 10^4 elements, with a small drop at 10^5. The fluctuations are relatively small, ranging between 1.58 and 3.16. 
This limited variation suggests that the additive preconditioner is fairly robust, as the condition number does not grow significantly with mesh refinement. This might indicate that the preconditioner effectively handles the increasing complexity of the system, maintaining numerical stability across different mesh sizes.", "Reasoning_answer2": "Comparing K(A) and K(B_mul A) in table_3, I notice that for K(A), the condition number starts at 4.76 for 10 elements and increases dramatically to 6.89e1 at 10^2, 6.59e2 at 10^3, and 6.49e3 at 10^4, failing to converge at 10^5. In contrast, for K(B_mul A), the condition numbers are much lower, starting at 1.32 for 10 elements, peaking at 2.25 for 10^2, and then decreasing to 1.48, 1.29, and 1.14 for 10^3, 10^4, and 10^5 respectively. This shows that K(B_mul A) maintains a stable and low condition number across all mesh sizes, with minimal variation. The multiplicative preconditioner appears far more effective because it prevents the condition number from growing with mesh refinement, indicating better control over the system's numerical behavior and improved convergence properties of the PCG algorithm." }, { "paper_id": "1812.04423.json", "table_id": "table_4", "table_content": "\\begin{table}\n\\caption{Estimated condition numbers (number of PCG iterations) in 3D. 
The coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}} =\\kappa_{1}= 10^{k}$ for various choices of $k$, and $\\kappa|_{\\Omega\\setminus(\\Omega_{1}\\cup \\Omega_{2})} =1.$ }\n\\begin{center}\\begin{tabular}{c|c||c|c|c|c|c|c}\n\\hline\n $\\kappa_{1}$ & $ h$ & $2^{-2}$ & $2^{-3}$ & $2^{-4}$ & $2^{-5}$ & $2^{-6}$ & $2^{-7}$ \\\\\n\\hline\\hline\n\\multirow{5}{*}{$10^{-6}$} & $\\mathcal{K}(A)$& 1.15e6 (8) & 8.76e6 (28) & 6.94e7 (56) & 5.54e8 (110) & 4.43e9 (215) & 3.54e10 (420) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (43) & 9.57e1 (71) & 3.81e2 (118) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (10) & 1.35 (11) & 1.73 (15) & 1.92 (17) & 1.98 (17) & 1.99 (16) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-4}$} & $\\mathcal{K}(A)$& 1.15e4 (7) & 8.76e4 (26) & 6.94e5 (51) & 5.54e6 (99) & 4.43e7 (194) & 3.54e8 (379) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (38) & 9.57e1 (63) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (9) & 1.35 (11) & 1.73 (14) & 1.92 (15) & 1.98 (15) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-2}$} & $\\mathcal{K}(A)$& 1.15e2 (7) & 8.76e2 (24) & 6.94e3 (46) & 5.54e4 (90) & 4.43e5 (175) & 3.54e6 (346) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) 
\\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{1} & $\\mathcal{K}(A)$& 4.44 (6) & 1.74e1 (21) & 6.94e1 (40) & 5.54e2 (78) & 4.43e3 (153) & 3.54e4 (302)\\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{2}$} & $\\mathcal{K}(A)$& 3.88e2 (6) & 2.00e2 (22) & 9.98e1 (44) & 2.77e2 (80) & 1.11e3 (143) & 4.43e3 (273) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{4}$} & $\\mathcal{K}(A)$& 3.88e4 (6) & 2.00e4 (22) & 9.98e3 (47) & 5.00e3 (89) & 2.50e3 (163) & 4.43e3 (295) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{6}$} & $\\mathcal{K}(A)$ & 3.88e6 (9) & 2.00e6 (22) & 9.99e5 (51) & 5.00e5 (96) & 2.50e5 (180) & 
1.25e5 (331) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$ & 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$ & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\n\\hline\n\\end{tabular}\n\\label{tab:3d}\n\\end{center}\n\\end{table}", "caption": "Estimated condition numbers (number of PCG iterations) in 3D. The coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}} =\\kappa_{1}= 10^{k}$ for various choices of $k$, and $\\kappa|_{\\Omega\\setminus(\\Omega_{1}\\cup \\Omega_{2})} =1.$ ", "label": "tab:3d", "section_info": "4 Numerical Experiments\n\\section{Numerical Experiments}\n\\label{sec:num}\nIn this section, we present several numerical experiments in both 2D and 3D to verify the result in Theorem~\\ref{thm:aux} on the performance of the proposed preconditioners. In all these tests, we use 2-sweeps symmetric Gauss-Seidel smoother. The stopping criteria is $\\|r_{k}\\| / \\|r_{0}\\| <10^{-12}$ for the PCG algorithm, where $r_{k}= f-Au_{k}$ is the residual. For the coarse solver, we use the AMG algorithm implemented in $i$FEM~\\cite{Chen.L2008}. \n\n\\subsection{2D Examples}\nIn the first example, we consider the model problem \\eqref{eqn:model} in the unit square $\\Omega = [0,1]^{2}$ with constant coefficient $\\kappa =1$. Figure~\\ref{fig:poly2d} is an example of the polytopal mesh of the unit square domain (with 100 elements) generated using \\mcode{PolyMesher} \\cite{Talischi.C;Paulino.G;Pereira.A;Menezes.I2012}, and Figure~\\ref{fig:tri2d} is the corresponding Delaunay triangular mesh. The VEM discretization is defined on the polytopal mesh (cf. 
Figure~\\ref{fig:poly2d}), while the auxiliary space, using the standard conforming $\\P_{1}$ finite element discretization, is defined on the corresponding triangular mesh (cf. Figure~\\ref{fig:tri2d}). \n\\begin{figure}[htbp]\n\\centering\n\t\\parbox{0.45\\textwidth}{\n \\includegraphics[width=0.4\\textwidth]{figures/polytopalmesh100.pdf}\n \\caption{Polygonal Mesh $\\cT_{h}$ of the Unit Square Domain (100 Elements)}\n \\label{fig:poly2d}}\n \\quad\n \\begin{minipage}{0.45\\textwidth}\n \\includegraphics[width=0.89\\textwidth]{figures/trianglemesh100}\n \\caption{The Corresponding Delaunay Triangle Mesh $\\cT_{h}^{c}$}\n \\label{fig:tri2d}\n \\end{minipage}\n\\end{figure}\n\n\nTable~\\ref{tab:2d} shows the estimated condition numbers (the number of PCG iterations) for the additive and multiplicative preconditioned systems. \n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with constant coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 3.45 (9)& 3.86e1 (41)& 3.80e2 (117)& 3.88e3 (351)& 4.07e4 (1100) \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.07 (6) & 3.78 (15) & 3.20e1 (37) & 3.17e2 (104) & 3.17e3 (318)\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 2.92 (8) & 5.75 (26) & 7.53 (29) & 8.73 (32) & 9.67 (36) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.53 (9)& 1.71 (14)& 1.94 (14)& 1.99 (14)& 2.00 (13) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (8)& 1.21 (10)& 1.04 (7)& 1.02 (6)& 1.02 (6) \\\\\\hline\n\\end{tabular}\n\\label{tab:2d}\n\\end{table}\n}\nFor comparison, we also include the estimated condition numbers $\\cK(A)$, $\\cK(B_{{\\rm sgs}}A)$ and $\\cK(B_{{\\rm fict}}A)$, where $B_{{\\rm sgs}}$ is the (2-sweep) symmetric Gauss-Seidel preconditioner (same below) and $B_{{\\rm fict}}$ is the fictitious space preconditioner using the conforming FEM. 
As we can observe from this table, the condition numbers $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase as the mesh is refined, while the condition number $\\cK(B_{{\\rm fict}}A)$ increases only slightly under refinement. \nOn the other hand, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are uniformly bounded. \n\n\nIn the second test, we consider the problem with jump coefficients. The coefficients $\\kappa$ are generated randomly on each polygonal element (see Figure~\\ref{fig:jump2d} for an example of the coefficient distribution with 100 elements; the integer shown in each polygonal element is the magnitude of the coefficient). \n\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/polyjump100.pdf}\n \\caption{Random Jump Coefficients $10^{k}$ (100 Elements)}\n \\label{fig:jump2d}\n\\end{center}\n \\end{figure}\nNote that the coefficient settings differ between the different polytopal meshes. Table~\\ref{tab:2djump} shows the estimated condition numbers (the number of PCG iterations). Here, ``-'' means the PCG algorithm failed to converge after 1200 iterations. As we can see from this table, while $\\cK(A)$ and $\\cK(B_{{\\rm sgs}}A)$ increase dramatically, the condition numbers $\\mathcal{K}(B_{{\\rm add}}A)$ and $\\mathcal{K}(B_{{\\rm mul}}A)$ are nearly uniformly bounded. These observations verify the estimate given in Theorem~\\ref{thm:aux}. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) in 2D with jump coefficients.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n $\\mathcal{K}(A)$ & 2.44 (11)& 2.73e6 (578)& - & - & - \\\\\\hline\n $\\mathcal{K}(B_{\\rm sgs}A)$ & 1.18 (5) & 3.90e2 (26) & 3.93e3 (409) & - & -\\\\\\hline\n $\\mathcal{K}(B_{\\rm fict}A)$& 3.27 (8) & 6.94 (33) & 6.42 (36) & 11.6 (44) & 13.6 (53) \\\\\\hline\n $\\mathcal{K}(B_{\\rm add}A)$ & 1.54 (9)& 3.51 (20)& 3.60 (25)& 3.67 (25)& 3.80 (26) \\\\\\hline\n $\\mathcal{K}(B_{\\rm mul}A)$ & 1.06 (6)& 1.74 (15)& 1.82 (16)& 1.84 (16)& 1.88 (17) \\\\\\hline\n\\end{tabular}\n\\label{tab:2djump}\n\\end{table}\n}\n\nIn the third test, we consider the performance of the preconditioners for Voronoi meshes, which violate assumption ({\\bf A}) (see Figure~\\ref{fig:2dvoronoi} for an example with 100 polygons). As we can observe from this figure, the aspect ratio of some polygons is quite high -- thus the partition is no longer quasi-uniform. As before, we use the Delaunay triangulation of this mesh to construct the auxiliary space.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/voronoi100.pdf}\n \\caption{Voronoi mesh (100 Elements)}\n \\label{fig:2dvoronoi}\n\\end{center}\n \\end{figure}\n \n Table~\\ref{tab:2dvoronoi} shows the estimated condition numbers, with the number of PCG iterations, for the different preconditioners. As we can see from this table, both the additive and multiplicative auxiliary space preconditioners remain robust with respect to the problem size. 
\n{\\footnotesize\n\\begin{table}\n\n\\caption{Estimated condition numbers (number of PCG iterations) on a 2D Voronoi polygonal mesh.}\n\\begin{tabular}{c||c|c|c|c|c}\n\\hline\n \\# Polytopal Elements & 10 & $10^{2}$ & $10^{3}$ & $10^{4}$ & $10^{5}$ \n\\\\\n\n\n\\hline\\hline\n$\\mathcal{K}(A)$& 4.76 (9) & 6.89e1 (52) & 6.59e2 (171) & 6.49e3 (537) & - \\\\\\hline\n$\\mathcal{K}(B_{\\rm sgs}A)$& 1.13 (6) & 4.93 (17) & 3.81e1 (45) & 3.57e2 (134) & 3.40e3 (400) \\\\\\hline\n$\\mathcal{K}(B_{\\rm fict}A)$& 4.66 (9) & 7.92 (34) & 2.04e1 (43) & 2.32e1 (46) & 1.62e1 (52) \\\\\\hline\n$\\mathcal{K}(B_{\\rm add}A)$& 1.58 (9) & 1.72 (16) & 3.09 (18) & 3.16 (19) & 1.91 (17) \\\\\\hline\n$\\mathcal{K}(B_{\\rm mul}A)$& 1.32 (11) & 2.25 (16) & 1.48 (13) & 1.29 (12) & 1.14 (10) \\\\\\hline\n\\end{tabular}\n\\label{tab:2dvoronoi}\n\\end{table}\n}\n\n\\subsection{3D Example}\nNow we consider the model problem \\eqref{eqn:model} in the 3D cubic domain $\\Omega =[0,1]^{3}$. We subdivide the domain into hexahedral elements (cubes) with mesh size $h$ at each level. The VEM discretization is defined on this hexahedral mesh. For the auxiliary space, we further divide each cube into six tetrahedra to construct the auxiliary mesh, on which the $\\P_{1}$ conforming finite element discretization is defined (see, for example, Figure~\\ref{fig:jump3d}).\n\nIn this example, we test various discontinuous coefficient settings. Let $\\Omega_{1} =[0.25,0.5]^{3}$ and $\\Omega_{2} = [0.5,0.75]^{3}$ (see Figure~\\ref{fig:jump3d}). 
We set the coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}}= \\kappa_{1}= 10^{k}$ (with $k=-6, -4, -2, 0, 2, 4, 6$) and $\\kappa|_{\\Omega\\setminus (\\Omega_{1}\\cup\\Omega_{2})} = 1$.\n\\begin{figure}[h]\n\n\\begin{center}\n \\includegraphics[width=0.45\\textwidth]{figures/touchingcubesmesh.jpg}\n \\caption{{\\footnotesize 3D uniform mesh with jump coefficients}}\n \\label{fig:jump3d}\n\\end{center}\n \\end{figure}\nTable~\\ref{tab:3d} presents the estimated condition numbers of the preconditioned systems for different choices of $\\kappa_{1}$ and mesh size.\n\n{\\scriptsize\n\\begin{table}\n\\caption{Estimated condition numbers (number of PCG iterations) in 3D. The coefficient $\\kappa|_{\\Omega_{1}\\cup\\Omega_{2}} =\\kappa_{1}= 10^{k}$ for various choices of $k$, and $\\kappa|_{\\Omega\\setminus(\\Omega_{1}\\cup \\Omega_{2})} =1.$ }\n\\begin{center}\\begin{tabular}{c|c||c|c|c|c|c|c}\n\\hline\n $\\kappa_{1}$ & $ h$ & $2^{-2}$ & $2^{-3}$ & $2^{-4}$ & $2^{-5}$ & $2^{-6}$ & $2^{-7}$ \\\\\n\\hline\\hline\n\\multirow{5}{*}{$10^{-6}$} & $\\mathcal{K}(A)$& 1.15e6 (8) & 8.76e6 (28) & 6.94e7 (56) & 5.54e8 (110) & 4.43e9 (215) & 3.54e10 (420) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (43) & 9.57e1 (71) & 3.81e2 (118) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (10) & 1.35 (11) & 1.73 (15) & 1.92 (17) & 1.98 (17) & 1.99 (16) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-4}$} & $\\mathcal{K}(A)$& 1.15e4 (7) & 8.76e4 (26) & 6.94e5 (51) & 5.54e6 (99) & 4.43e7 (194) & 3.54e8 (379) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (23) & 2.44e1 (38) & 9.57e1 (63) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.44 (11) & 1.41 (10) & 1.39 (9) & 
1.37 (8) & 1.33 (7) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (9) & 1.35 (11) & 1.73 (14) & 1.92 (15) & 1.98 (15) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{-2}$} & $\\mathcal{K}(A)$& 1.15e2 (7) & 8.76e2 (24) & 6.94e3 (46) & 5.54e4 (90) & 4.43e5 (175) & 3.54e6 (346) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.37 (8) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (4) \\\\\\hline\\hline\n\\multirow{5}{*}{1} & $\\mathcal{K}(A)$& 4.44 (6) & 1.74e1 (21) & 6.94e1 (40) & 5.54e2 (78) & 4.43e3 (153) & 3.54e4 (302)\\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{2}$} & $\\mathcal{K}(A)$& 3.88e2 (6) & 2.00e2 (22) & 9.98e1 (44) & 2.77e2 (80) & 1.11e3 (143) & 4.43e3 (273) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) 
\\\\\\hline\\hline\n\\multirow{5}{*}{$10^{4}$} & $\\mathcal{K}(A)$& 3.88e4 (6) & 2.00e4 (22) & 9.98e3 (47) & 5.00e3 (89) & 2.50e3 (163) & 4.43e3 (295) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$& 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$& 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\\hline\n\\multirow{5}{*}{$10^{6}$} & $\\mathcal{K}(A)$ & 3.88e6 (9) & 2.00e6 (22) & 9.99e5 (51) & 5.00e5 (96) & 2.50e5 (180) & 1.25e5 (331) \\\\\n\t&$\\mathcal{K}(B_{\\rm sgs}A)$& 1.10 (7) & 2.11 (12) & 6.54 (21) & 2.44e1 (35) & 9.57e1 (62) & 3.81e2 (115) \\\\\n\t&$\\mathcal{K}(B_{\\rm fict}A)$& 1.32 (6) & 1.45 (11) & 1.41 (10) & 1.39 (9) & 1.33 (7) & 1.28 (6) \\\\\n\t&$\\mathcal{K}(B_{\\rm add}A)$ & 1.14 (8) & 1.36 (10) & 1.73 (13) & 1.92 (14) & 1.98 (14) & 1.99 (13) \\\\\n\t&$\\mathcal{K}(B_{\\rm mul}A)$ & 1.00 (5) & 1.00 (5) & 1.00 (5) & 1.00 (4) & 1.00 (4) & 1.00 (3) \\\\\\hline\n\\hline\n\\end{tabular}\n\\label{tab:3d}\n\\end{center}\n\\end{table}\n}\nAs we can see from Table~\\ref{tab:3d}, the condition number of $A$ depends on both the coefficient $\\kappa$ and the mesh size. On the other hand, both the fictitious space preconditioner and the auxiliary space preconditioners (additive or multiplicative) are efficient and robust with respect to jumps in the coefficient $\\kappa$ and the mesh size. These results justify Theorem~\\ref{thm:aux} and Corollary~\\ref{cor:fict}.\n\n", "Descriptive_question1": "What is the condition number for K(A) when kappa1 is 10^{-6} and h is 2^{-2} in table_4?", "Descriptive_question2": "What is the number of PCG iterations for K(B_mul A) when kappa1 is 1 and h is 2^{-7} in table_4?", "Reasoning_question1": "How does the condition number of K(A) change with increasing mesh size (decreasing h) for a fixed kappa1 value of 10^{-6} in table_4, and what does this suggest about the system's behavior?", "Reasoning_question2": "Compare the robustness of K(B_add A) and K(B_mul A) with respect to changes in kappa1 and mesh size in table_4. Which preconditioner appears more stable, and why?", "Descriptive_answer1": "1.15e6", "Descriptive_answer2": "3", "Reasoning_answer1": "Looking at the condition number of K(A) for a fixed kappa1 value of 10^{-6} across different mesh sizes (h) in table_4, I observe a clear trend. 
As h decreases from 2^{-2} to 2^{-7} (meaning the mesh becomes finer), the condition number increases dramatically: from 1.15e6 at h=2^{-2} to 8.76e6 at h=2^{-3}, then to 6.94e7 at h=2^{-4}, 5.54e8 at h=2^{-5}, 4.43e9 at h=2^{-6}, and finally to 3.54e10 at h=2^{-7}. This indicates an exponential growth in the condition number as the mesh is refined. The increasing number of PCG iterations (from 8 to 420) alongside the condition number further supports this observation. This behavior suggests that the system becomes increasingly ill-conditioned with finer meshes, likely due to a larger number of degrees of freedom and greater sensitivity to numerical errors, making it harder to solve without preconditioning.", "Reasoning_answer2": "To compare the robustness of K(B_add A) and K(B_mul A) with respect to changes in kappa1 and mesh size in table_4, I first examine K(B_add A). For different kappa1 values (from 10^{-6} to 10^{6}) and across mesh sizes (h from 2^{-2} to 2^{-7}), the condition numbers for K(B_add A) vary slightly, ranging between 1.14 and 1.99. The number of PCG iterations also remains relatively low, between 8 and 17, showing minimal fluctuation. Now, looking at K(B_mul A), the condition number is consistently 1.00 across all kappa1 values and mesh sizes, with PCG iterations ranging only from 3 to 5, indicating even less variation. Comparing the two, while K(B_add A) shows a small increase in condition number as the mesh refines (e.g., from 1.14 to 1.99 for kappa1=10^{-6}), K(B_mul A) remains completely stable at 1.00. This stability in condition number and the consistently lower number of iterations for K(B_mul A) suggest that it is more robust to changes in both kappa1 (coefficient jumps) and mesh size. The reason likely lies in the multiplicative nature of the preconditioner, which may better handle the coupling of different scales or discontinuities in the system compared to the additive approach." 
}, { "paper_id": "2212.10548.json", "table_id": "table_1", "table_content": "\\begin{table*}[htb]\n \\centering\n \\adjustbox{max width=0.75\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Avg} \\\\ \\midrule\n & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 77.0 & 73.3 & 72.4 & 73.3 & 75.3 & 68.4 & 86.6 & 77.7 \\\\\nFastAlign \\cite{fastalign} & 75.0 & 72.9 & 76.9 & 70.2 & 77.0 & 67.0 & 85.7 & 77.4 \\\\\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & 86.7 & 86.3 & 87.7 & 85.4 & 87.4 & 81.3 & 84.1 & 85.3 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 91.5 & 91.1 & 93.7 & 87.3 & 90.7 & 83.1 & 54.8 & 78.0 \\\\ \\midrule\nXLM-RoBERTa-xl \\cite{xlmr} & 80.2 & 76.2 & 74.5 & 73.9 & 68.3 & 73.9 & 66.5 & 71.8 \\\\\nSpan Translation & 66.5 & 46.3 & 58.7 & 68.8 & 63.5 & 69.2 & 21.6 & 48.7 \\\\ \\midrule\nT-Projection & \\textbf{95.1} & \\textbf{92.3} & \\textbf{95.0} & \\textbf{93.6} & \\textbf{94.0} & \\textbf{87.2} & \\textbf{96.0} & \\textbf{93.9} \\\\ \\bottomrule\n\\end{tabular}\n}\n \\caption{F1 scores for annotation projection in the OTE, NER and Argument Mining tasks.}\n \\label{tab:Results}\n\\end{table*}", "caption": "F1 scores for annotation projection in the OTE, NER and Argument Mining tasks.", "label": "tab:Results", "section_info": "5 Intrinsic Evaluation\n\\section{Intrinsic Evaluation} \\label{sec:Results}\n\nIn this section we present a set of experiments to evaluate \n T-Projection with respect to current state-of-the-art approaches for annotation projection. We also analyze separately the performance of the \\emph{candidate generation} and \\emph{candidate selection}\nsteps. 
\n\nFor the OTE task we train T-Projection and XLM-RoBERTa with the English\nABSA 2016 training set. We also train the word alignment systems (excluding SimAlign, which is an\nunsupervised method) using the English training set together with the respective\ntranslations as parallel corpora. We augment the parallel data with 50,000\nrandom parallel sentences from ParaCrawl v8 \\cite{espla-etal-2019-paracrawl}. Models are evaluated with respect to the manually labeled projections generated by \\citet{garcia-ferrero-etal-2022-model}. \nAs the Europarl-based NER dataset \\cite{agerri-etal-2018-building} provides\nonly test data for each language, T-Projection and XLM-RoBERTa are trained\nusing the full English CoNLL 2003 dataset\n\\cite{tjong-kim-sang-de-meulder-2003-introduction} together with the labeled\nEnglish Europarl test data. The word alignment models are in turn trained with\nthe parallel sentences from the Europarl-based NER data plus 50,000\nparallel sentences extracted from Europarl v8 \\cite{DBLP:conf/mtsummit/Koehn05}. We evaluate the model with respect to the manual annotations provided by \\citet{agerri-etal-2018-building}.\nWith respect to Argument Mining, we use the Neoplasm training set from the\nAbstRCT dataset to train T-Projection and XLM-RoBERTa, adding its Spanish translation as parallel\ncorpora for the word alignment systems. As this is a medical text corpus, the\nparallel corpus is complemented with 50,000 parallel sentences\nfrom the WMT19 Biomedical Translation Task \\cite{bawden-etal-2019-findings}. 
We evaluate the models with respect to the manually projected labels by \\citet{DBLP:journals/corr/abs-2301-10527}.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\\subsection{Annotation Projection Quality}\n\n\\begin{table*}[htb]\n \\centering\n \\adjustbox{max width=0.75\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Avg} \\\\ \\midrule\n & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 77.0 & 73.3 & 72.4 & 73.3 & 75.3 & 68.4 & 86.6 & 77.7 \\\\\nFastAlign \\cite{fastalign} & 75.0 & 72.9 & 76.9 & 70.2 & 77.0 & 67.0 & 85.7 & 77.4 \\\\\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & 86.7 & 86.3 & 87.7 & 85.4 & 87.4 & 81.3 & 84.1 & 85.3 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 91.5 & 91.1 & 93.7 & 87.3 & 90.7 & 83.1 & 54.8 & 78.0 \\\\ \\midrule\nXLM-RoBERTa-xl \\cite{xlmr} & 80.2 & 76.2 & 74.5 & 73.9 & 68.3 & 73.9 & 66.5 & 71.8 \\\\\nSpan Translation & 66.5 & 46.3 & 58.7 & 68.8 & 63.5 & 69.2 & 21.6 & 48.7 \\\\ \\midrule\nT-Projection & \\textbf{95.1} & \\textbf{92.3} & \\textbf{95.0} & \\textbf{93.6} & \\textbf{94.0} & \\textbf{87.2} & \\textbf{96.0} & \\textbf{93.9} \\\\ \\bottomrule\n\\end{tabular}\n}\n \\caption{F1 scores for annotation projection in the OTE, NER and Argument Mining tasks.}\n \\label{tab:Results}\n\\end{table*}\n\nTable \\ref{tab:Results} reports the results of the automatically projected\ndatasets generated by each projection method with respect to the\nhuman-projected versions of those datasets. The systems based on word\nalignments obtain good results across the board, especially those using\nlanguage models, namely, SimAlign and AWESOME. In particular, AWESOME achieves\ngood results for OTE and NER but very poor in AM. 
Manual\ninspection of the projections found that AWESOME struggles to align\narticles and prepositions that are included in long sequences.\n\nXLM-RoBERTa-xl shows strong zero-shot cross-lingual performance. However, the\ngenerated datasets are of lower quality than the ones generated by the\nword-alignment systems. The results of the Span Translation approach are quite\ndisappointing, especially when dealing with the long sequences of the AM task. \nTranslating the labeled spans independently produces translations\nthat, in many cases, cannot be located in the target sentence. \n\nOur T-Projection method achieves the best results for every task and language.\nIn OTE, it outperforms every other method by more than 2 points in F1 score\naveraged across the three languages. This suggests that T-Projection robustly\nprojects labeled spans into machine-translated data. The NER evaluation is\nslightly different because the parallel data was translated by human experts.\nIn this setting, T-Projection clearly improves AWESOME's results by 4.7 points,\nwhich constitutes a significant leap in the quality of the generated datasets. \n\n\nDespite the fact that the word\nalignment systems have been trained using Europarl domain-specific data, and that\nmost of the training data used for T-Projection comes from the CoNLL-2003\ndataset (news domain) plus very few annotated sentences (699) from Europarl,\nT-Projection still clearly obtains the best results in NER label projection. This\nsuggests that our system can also be applied in out-of-domain settings. \n\nFinally, T-Projection obtains the overall highest scores for Argument Mining,\nwhich means that our approach is particularly good at projecting long sequences.\nThus, T-Projection outperforms the second-best model by\n9.4 points in F1 score. 
In fact, the 96.0 F1 result obtained indicates that\nT-Projection correctly projects almost all the examples in the dataset.\n\nIf we look at the average over the three tasks and five languages, T-Projection improves\nthe results of the second-best system, SimAlign, by 8.6 points in F1 score.\nThese results constitute a huge improvement over all the previous annotation projection\napproaches. To the best of our knowledge, these are by a wide margin the best\nannotation projection results published for sequence labeling.\n\n\\subsection{The Role of the Candidates}\n\n\n\\begin{table}[htb]\n \\centering\n \\adjustbox{max width=0.99\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Avg} \\\\ \\midrule\n & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nT-Projection & 95.1 & 92.3 & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & 93.9 \\\\ \\midrule\n\\begin{tabular}[c]{@{}l@{}}Ngrams +\\\\ Candidate \\\\ Selection\\end{tabular} & 89.7 & 86.1 & 93.8 & 83.8 & 79.3 & 73.3 & 73.5 & 80.7 \\\\ \\hdashline[3pt/6pt]\n\\begin{tabular}[c]{@{}l@{}}mT5 +\\\\ Most Probable \\\\ Candidate\\end{tabular} & 83.7 & 87.2 & 85.3 & 79.5 & 82.8 & 72.3 & 90.9 & 84.8 \\\\ \\hdashline[3pt/6pt]\n\\begin{tabular}[c]{@{}l@{}}mT5 +\\\\ Upper bound\\end{tabular} & 98.6 & 97.0 & 97.9 & 98.0 & 98.5 & 94.0 & 99.3 & 98.0 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores for different candidate generation and candidate selection\nmethods.}\n\\label{tab:CandidateResults}\n\\end{table}\n\n\nWe perform a set of experiments to measure the relevance and performance of the\n\\emph{candidate generation} and \\emph{candidate selection} steps. First, we replace mT5 with\nan ngram-based candidate generation approach. 
We consider as candidate spans\nevery possible ngram with size $1..sentence\_length$ (i.e., \textit{\"Serves\",\n\"really\", \"good\", \"sushi\", \"Serves really\"...}). Table\n\ref{tab:CandidateResults} shows that this approach results in lower\nperformance compared with our technique using mT5. Ngrams are\nmuch noisier than the candidates generated by mT5; most of them\nare very similar to each other, which makes selecting the right candidate considerably more challenging. Thus, this experiment demonstrates that our mT5 candidate\ngeneration approach is crucial for obtaining good performance.\n\nWe also replace the \emph{candidate selection} method with the \emph{most probable\ncandidate}. In other words, we only use the most probable beam generated by\nmT5 to label the target sentence. When using mT5 by itself, it obtains\ncompetitive results, close to those of the word alignment systems in\nTable \ref{tab:Results}. Still, the average performance drops by 9.2 points.\nThis further confirms that both the \emph{candidate generation} and\n\emph{selection} steps are crucial for the T-Projection method. \n\nIn a final experiment we define an upper bound for \emph{candidate selection}\nby assuming that our model always selects the correct projection\nwhenever it is contained among the generated candidates. The upper bound achieves an average F1 score of\n98.0, which confirms that the correct candidate is almost\nalways among the 100 candidates generated by mT5. \n\n\begin{table*}[htb]\n \adjustbox{max width=0.95\linewidth}{\n\begin{tabular}{@{}lllnkeqq@{}}\n\toprule\nLanguage & No. 
of Speakers & Lang family & \\multicolumn{1}{c}{Fine-tune$_{en}$} & \\multicolumn{1}{c}{AWESOME+EN} & \\multicolumn{1}{c}{EasyProject+EN} & \\multicolumn{1}{c}{T-Projection} & \\multicolumn{1}{c}{T-Projection+EN} \\\\ \\midrule\nHausa & 63M & Afro-Asiatic /Chadic & 71.7 & \\textbf{72.7} & 72.2 & \\textbf{72.7} & 72.0 \\\\\nIgbo & 27M & NC / Volta-Niger & 59.3 & 63.5 & 65.6 & 71.4 & \\textbf{71.6} \\\\\nChichewa & 14M & English-Creole & \\textbf{79.5} & 75.1 & 75.3 & 77.2 & 77.8 \\\\\nchiShona & 12M & NC / Bantu & 35.2 & 69.5 & 55.9 & \\textbf{74.9} & 74.3 \\\\\nKiswahili & 98M & NC / Bantu & \\textbf{87.7} & 82.4 & 83.6 & 84.5 & 84.1 \\\\\nisiXhosa & 9M & NC / Bantu & 24.0 & 61.7 & 71.1 & \\textbf{72.3} & 71.7 \\\\\nYoruba & 42M & NC / Volta-Niger & 36.0 & 38.1 & 36.8 & \\textbf{42.7} & 42.1 \\\\\nisiZulu & 27M & NC / Bantu & 43.9 & 68.9 & \\textbf{73.0} & 66.7 & 64.9 \\\\ \\midrule\nAVG & & & 54.7 & 66.5 & 66.7 & \\textbf{70.3} & 69.8 \\\\ \\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores on MasakhaNER2.0 for mDebertaV3 trained with projected annotations from different systems. 
\"+EN\" denotes concatenation of the automatically generated target language dataset with the source English dataset.}\n\\label{tab:MasakhaNER2}\n\\end{table*}\n\n5.1 Annotation Projection Quality\n\\subsection{Annotation Projection Quality}\n\n\\begin{table*}[htb]\n \\centering\n \\adjustbox{max width=0.75\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Avg} \\\\ \\midrule\n & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 77.0 & 73.3 & 72.4 & 73.3 & 75.3 & 68.4 & 86.6 & 77.7 \\\\\nFastAlign \\cite{fastalign} & 75.0 & 72.9 & 76.9 & 70.2 & 77.0 & 67.0 & 85.7 & 77.4 \\\\\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & 86.7 & 86.3 & 87.7 & 85.4 & 87.4 & 81.3 & 84.1 & 85.3 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 91.5 & 91.1 & 93.7 & 87.3 & 90.7 & 83.1 & 54.8 & 78.0 \\\\ \\midrule\nXLM-RoBERTa-xl \\cite{xlmr} & 80.2 & 76.2 & 74.5 & 73.9 & 68.3 & 73.9 & 66.5 & 71.8 \\\\\nSpan Translation & 66.5 & 46.3 & 58.7 & 68.8 & 63.5 & 69.2 & 21.6 & 48.7 \\\\ \\midrule\nT-Projection & \\textbf{95.1} & \\textbf{92.3} & \\textbf{95.0} & \\textbf{93.6} & \\textbf{94.0} & \\textbf{87.2} & \\textbf{96.0} & \\textbf{93.9} \\\\ \\bottomrule\n\\end{tabular}\n}\n \\caption{F1 scores for annotation projection in the OTE, NER and Argument Mining tasks.}\n \\label{tab:Results}\n\\end{table*}\n\nTable \\ref{tab:Results} reports the results of the automatically projected\ndatasets generated by each projection method with respect to the\nhuman-projected versions of those datasets. The systems based on word\nalignments obtain good results across the board, especially those using\nlanguage models, namely, SimAlign and AWESOME. 
In particular, AWESOME achieves\ngood results for OTE and NER but performs very poorly in AM. Manual\ninspection of the projections revealed that AWESOME struggles to align\narticles and prepositions within long sequences.\n\nXLM-RoBERTa-xl shows a strong zero-shot cross-lingual performance. However, the\ngenerated datasets are of lower quality than the ones generated by the\nword-alignment systems. The results of the Span Translation approach are quite\ndisappointing, especially when dealing with the long sequences of the AM task. \nTranslating the labeled spans independently produces translations\nthat, in many cases, cannot be located in the target sentence. \n\nOur T-Projection method achieves the best results for every task and language.\nIn OTE, it outperforms every other method by more than 2 points in F1 score\naveraged across the three languages. This suggests that T-Projection robustly\nprojects labeled spans into machine-translated data. The NER evaluation is\nslightly different because the parallel data was translated by human experts.\nIn this setting, T-Projection clearly improves AWESOME's results by 4.7 points,\nwhich constitutes a significant leap in the quality of the generated datasets. \n\n\nDespite the fact that the word\nalignment systems have been trained using Europarl domain-specific data, and that\nmost of the training data used for T-Projection comes from the CoNLL-2003\ndataset (news domain) plus very few annotated sentences (699) from Europarl,\nT-Projection still clearly obtains the best results in NER label projection. This\nsuggests that our system can also be applied in out-of-domain settings. \n\nFinally, T-Projection obtains the overall highest scores for Argument Mining,\nwhich means that our approach is particularly good at projecting long sequences.\nThus, T-Projection outperforms the second-best model by\n9.4 points in F1 score. 
In fact, the 96.0 F1 result obtained indicates that\nT-Projection correctly projects almost all the examples in the dataset.\n\nIf we look at the average over the three tasks and five languages, T-Projection improves\non the results of the second-best system, SimAlign, by 8.6 F1 points.\nThese results constitute a substantial improvement over all previous annotation projection\napproaches. To the best of our knowledge, these are by a wide margin the best\nannotation projection results published for sequence labeling.\n\n5.2 The Role of the Candidates\n\subsection{The Role of the Candidates}\n\n\n\begin{table}[htb]\n \centering\n \adjustbox{max width=0.99\linewidth}{\n\begin{tabular}{@{}lnnnkkkeq@{}}\n\toprule\n & \multicolumn{3}{c}{OTE} & \multicolumn{3}{c}{NER} & \multicolumn{1}{c}{AM} & \multicolumn{1}{c}{Avg} \\ \midrule\n & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{FR} & \multicolumn{1}{c}{RU} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{DE} & \multicolumn{1}{c}{IT} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{} \\ \midrule\nT-Projection & 95.1 & 92.3 & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & 93.9 \\ \midrule\n\begin{tabular}[c]{@{}l@{}}Ngrams +\\ Candidate \\ Selection\end{tabular} & 89.7 & 86.1 & 93.8 & 83.8 & 79.3 & 73.3 & 73.5 & 80.7 \\ \hdashline[3pt/6pt]\n\begin{tabular}[c]{@{}l@{}}mT5 +\\ Most Probable \\ Candidate\end{tabular} & 83.7 & 87.2 & 85.3 & 79.5 & 82.8 & 72.3 & 90.9 & 84.8 \\ \hdashline[3pt/6pt]\n\begin{tabular}[c]{@{}l@{}}mT5 +\\ Upper bound\end{tabular} & 98.6 & 97.0 & 97.9 & 98.0 & 98.5 & 94.0 & 99.3 & 98.0 \\\n\bottomrule\n\end{tabular}\n}\n\caption{F1 scores for different candidate generation and candidate selection\nmethods.}\n\label{tab:CandidateResults}\n\end{table}\n\n\nWe perform a set of experiments to measure the relevance and performance of the\n\emph{candidate generation} and \emph{candidate selection} tasks. 
First, we replace mT5 with\nan ngram-based candidate generation approach. We consider as candidate spans\nevery possible ngram with size $1..sentence\_length$ (i.e., \textit{\"Serves\",\n\"really\", \"good\", \"sushi\", \"Serves really\"...}). Table\n\ref{tab:CandidateResults} shows that this approach results in lower\nperformance compared with our technique using mT5. Ngrams are\nmuch noisier than the candidates generated by mT5; most of them\nare very similar to each other, which makes selecting the right candidate considerably more challenging. Thus, this experiment demonstrates that our mT5 candidate\ngeneration approach is crucial for obtaining good performance.\n\nWe also replace the \emph{candidate selection} method with the \emph{most probable\ncandidate}. In other words, we only use the most probable beam generated by\nmT5 to label the target sentence. When using mT5 by itself, it obtains\ncompetitive results, close to those of the word alignment systems in\nTable \ref{tab:Results}. Still, the average performance drops by 9.2 points.\nThis further confirms that both the \emph{candidate generation} and\n\emph{selection} steps are crucial for the T-Projection method. \n\nIn a final experiment we define an upper bound for \emph{candidate selection}\nby assuming that our model always selects the correct projection\nwhenever it is contained among the generated candidates. The upper bound achieves an average F1 score of\n98.0, which confirms that the correct candidate is almost\nalways among the 100 candidates generated by mT5. \n\n\begin{table*}[htb]\n \adjustbox{max width=0.95\linewidth}{\n\begin{tabular}{@{}lllnkeqq@{}}\n\toprule\nLanguage & No. 
of Speakers & Lang family & \multicolumn{1}{c}{Fine-tune$_{en}$} & \multicolumn{1}{c}{AWESOME+EN} & \multicolumn{1}{c}{EasyProject+EN} & \multicolumn{1}{c}{T-Projection} & \multicolumn{1}{c}{T-Projection+EN} \\ \midrule\nHausa & 63M & Afro-Asiatic / Chadic & 71.7 & \textbf{72.7} & 72.2 & \textbf{72.7} & 72.0 \\\nIgbo & 27M & NC / Volta-Niger & 59.3 & 63.5 & 65.6 & 71.4 & \textbf{71.6} \\\nChichewa & 14M & English-Creole & \textbf{79.5} & 75.1 & 75.3 & 77.2 & 77.8 \\\nchiShona & 12M & NC / Bantu & 35.2 & 69.5 & 55.9 & \textbf{74.9} & 74.3 \\\nKiswahili & 98M & NC / Bantu & \textbf{87.7} & 82.4 & 83.6 & 84.5 & 84.1 \\\nisiXhosa & 9M & NC / Bantu & 24.0 & 61.7 & 71.1 & \textbf{72.3} & 71.7 \\\nYoruba & 42M & NC / Volta-Niger & 36.0 & 38.1 & 36.8 & \textbf{42.7} & 42.1 \\\nisiZulu & 27M & NC / Bantu & 43.9 & 68.9 & \textbf{73.0} & 66.7 & 64.9 \\ \midrule\nAVG & & & 54.7 & 66.5 & 66.7 & \textbf{70.3} & 69.8 \\ \bottomrule\n\end{tabular}\n}\n\caption{F1 scores on MasakhaNER2.0 for mDebertaV3 trained with projected annotations from different systems. \"+EN\" denotes concatenation of the automatically generated target language dataset with the source English dataset.}\n\label{tab:MasakhaNER2}\n\end{table*}\n\n6 Extrinsic Evaluation\n\section{Extrinsic Evaluation}\label{sec:ExtrinsicEval}\n\nIn this section we evaluate T-Projection in a real-world low-resource scenario, namely, Named Entity Recognition in African languages. We compare the results obtained by training on the NER datasets automatically generated by T-Projection with those obtained using two state-of-the-art label projection systems: AWESOME (the second-best NER system in Table \ref{tab:Results}) and EasyProject. \nWe use exactly the same settings as \citet{DBLP:journals/corr/abs-2211-15613}. For each target language in MasakhaNER2.0, we first translate the English CoNLL dataset using the NLLB-200 3-billion-parameter model. 
Next, we project the English labels into the target language. It should be noted that EasyProject performs both of these processes in a single step. Subsequently, we train an mDebertaV3 \cite{DBLP:conf/iclr/HeLGC21} model using the automatically generated datasets for each target language. Finally, this model is evaluated on the gold MasakhaNER2.0 test data. We only evaluate the 8 languages in MasakhaNER2.0 supported by mT5. We focus on named entities referring to Person, Location and Organization. \n\n\nTable~\ref{tab:MasakhaNER2} presents the results of the evaluated models on the gold MasakhaNER2.0 test sets. For T-Projection, we present the results of training with the automatically generated data for the target language only, as well as with the original English CoNLL data concatenated with the automatically generated data for each target language. For the other systems, we only show the former results, as this was the only metric reported by previous work. In order to train and evaluate the NER models we apply the same hyperparameter settings and code as the authors of EasyProject.\n\nThe results show that T-Projection achieves superior performance for seven out of the eight languages. Our model demonstrates a more pronounced performance difference in agglutinative languages such as Igbo and Shona. As outlined in Section \ref{sec:Results}, our model produces superior alignments compared to AWESOME. On the other hand, we found that EasyProject, which utilizes markers for simultaneous translation and projection, introduces translation artifacts that hinder the performance of the downstream model. These artifacts are particularly noticeable in agglutinative languages, as EasyProject tends to separate words. For instance, in the case of Shona, consider the English sentence \textit{\"[Germany]'s representative to the [European Union]'s veterinary committee [Werner Zwingmann]\"}. 
Our system produces the Shona sentence \textit{\"Mumiriri [weGermany] kukomiti yemhuka [yeEuropean Union] [Werner Zwingmann]\"}, while EasyProject produces \textit{\"Mumiriri we [Germany] ku [European Union] komiti yezvokurapa mhuka [Werner Zwingmann]\"}. When training mDebertaV3 with data generated by T-Projection, which preserves the agglutinated words, we achieve better results \nthan with EasyProject, which introduces artifacts by separating agglutinated words during translation and projection. Our system is outperformed only in Zulu; however, on average, we improve the results by 3.6 F1 points. In contrast to previous work, our experiments revealed that concatenating English and translated data did not yield better results, potentially due to the superior quality of the data generated by T-Projection. \n\nTo the best of our knowledge, these are the best zero-shot results achieved for MasakhaNER2.0, underscoring the significant benefits of T-Projection for NLP tasks in low-resource languages.\n\n\n\n\n", "Descriptive_question1": "What is the F1 score for T-Projection in the OTE task for Spanish in table_1?", "Descriptive_question2": "Which method achieved the highest average F1 score across all tasks in table_1?", "Reasoning_question1": "Why does T-Projection outperform other methods like AWESOME in the Argument Mining task in table_1?", "Reasoning_question2": "How does the performance of T-Projection in NER across different languages in table_1 suggest its applicability in out-of-domain settings?", "Descriptive_answer1": "95.1", "Descriptive_answer2": "T-Projection", "Reasoning_answer1": "T-Projection outperforms other methods like AWESOME in the Argument Mining task due to its ability to handle long sequences effectively. As observed in the table, T-Projection achieves an F1 score of 96.0 for AM in Spanish, significantly higher than AWESOME's 54.8. 
The context explains that AWESOME struggles with aligning articles and prepositions in long sequences, leading to poorer performance in tasks like AM, which often involve complex and extended text spans. In contrast, T-Projection's methodology, which likely involves sophisticated candidate generation and selection steps, allows it to project labels more accurately across these sequences, resulting in superior performance and indicating that it is almost correctly projecting all examples in the dataset.", "Reasoning_answer2": "The performance of T-Projection in NER across different languages in table_1 suggests its strong applicability in out-of-domain settings due to its consistently high F1 scores despite varying training data domains. For instance, T-Projection achieves scores of 93.6 in Spanish, 94.0 in German, and 87.2 in Italian, outperforming other methods by a significant margin. The context highlights that while word alignment systems were trained on Europarl domain-specific data, T-Projection's training data primarily comes from the CoNLL-2003 news domain with minimal Europarl sentences. This discrepancy in training domains does not hinder T-Projection's performance; instead, it demonstrates a robust improvement over AWESOME by 4.7 points on average. This indicates that T-Projection can generalize well across domains, making it highly effective for tasks where training and application data may not perfectly align." 
}, { "paper_id": "2212.10548.json", "table_id": "table_2", "table_content": "\\begin{table}[htb]\n \\centering\n \\adjustbox{max width=0.99\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Avg} \\\\ \\midrule\n & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nT-Projection & 95.1 & 92.3 & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & 93.9 \\\\ \\midrule\n\\begin{tabular}[c]{@{}l@{}}Ngrams +\\\\ Candidate \\\\ Selection\\end{tabular} & 89.7 & 86.1 & 93.8 & 83.8 & 79.3 & 73.3 & 73.5 & 80.7 \\\\ \\hdashline[3pt/6pt]\n\\begin{tabular}[c]{@{}l@{}}mT5 +\\\\ Most Probable \\\\ Candidate\\end{tabular} & 83.7 & 87.2 & 85.3 & 79.5 & 82.8 & 72.3 & 90.9 & 84.8 \\\\ \\hdashline[3pt/6pt]\n\\begin{tabular}[c]{@{}l@{}}mT5 +\\\\ Upper bound\\end{tabular} & 98.6 & 97.0 & 97.9 & 98.0 & 98.5 & 94.0 & 99.3 & 98.0 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores for different candidate generation and candidate selection\nmethods.}\n\\label{tab:CandidateResults}\n\\end{table}", "caption": "F1 scores for different candidate generation and candidate selection\nmethods.", "label": "tab:CandidateResults", "section_info": "5 Intrinsic Evaluation\n\\section{Intrinsic Evaluation} \\label{sec:Results}\n\nIn this section we present a set of experiments to evaluate \n T-Projection with respect to current state-of-the-art approaches for annotation projection. We also analyze separately the performance of the \\emph{candidate generation} and \\emph{candidate selection}\nsteps. \n\nFor the OTE task we train T-Projection and XLM-RoBERTa with the English\nABSA 2016 training set. 
We also train the word alignment systems (excluding SimAlign, which is an\nunsupervised method) using the English training set together with the respective\ntranslations as parallel corpora. We augment the parallel data with 50,000\nrandom parallel sentences from ParaCrawl v8 \cite{espla-etal-2019-paracrawl}. Models are evaluated with respect to the manually labeled projections generated by \citet{garcia-ferrero-etal-2022-model}. \nAs the Europarl-based NER dataset \cite{agerri-etal-2018-building} provides\nonly test data for each language, T-Projection and XLM-RoBERTa are trained\nusing the full English CoNLL 2003 dataset\n\cite{tjong-kim-sang-de-meulder-2003-introduction} together with the labeled\nEnglish Europarl test data. The word alignment models are in turn trained with\nthe parallel sentences from the Europarl-based NER data plus 50,000\nparallel sentences extracted from Europarl v8 \cite{DBLP:conf/mtsummit/Koehn05}. We evaluate the model with respect to the manual annotations provided by \citet{agerri-etal-2018-building}.\nWith respect to Argument Mining, we use the Neoplasm training set from the\nAbstRCT dataset to train T-Projection and XLM-RoBERTa, adding its Spanish translation as a parallel\ncorpus for the word alignment systems. As this is a medical text corpus, the\nparallel corpus is complemented with 50,000 parallel sentences\nfrom the WMT19 Biomedical Translation Task \cite{bawden-etal-2019-findings}. 
We evaluate the models with respect to the manually projected labels by \citet{DBLP:journals/corr/abs-2301-10527}.\n\n\subsection{Annotation Projection Quality}\n\n\begin{table*}[htb]\n \centering\n \adjustbox{max width=0.75\linewidth}{\n\begin{tabular}{@{}lnnnkkkeq@{}}\n\toprule\n & \multicolumn{3}{c}{OTE} & \multicolumn{3}{c}{NER} & \multicolumn{1}{c}{AM} & \multicolumn{1}{c}{Avg} \\ \midrule\n & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{FR} & \multicolumn{1}{c}{RU} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{DE} & \multicolumn{1}{c}{IT} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{} \\ \midrule\nGiza++ \cite{och-ney-2003-systematic} & 77.0 & 73.3 & 72.4 & 73.3 & 75.3 & 68.4 & 86.6 & 77.7 \\\nFastAlign \cite{fastalign} & 75.0 & 72.9 & 76.9 & 70.2 & 77.0 & 67.0 & 85.7 & 77.4 \\\nSimAlign \cite{jalili-sabet-etal-2020-simalign} & 86.7 & 86.3 & 87.7 & 85.4 & 87.4 & 81.3 & 84.1 & 85.3 \\\nAWESOME \cite{DBLP:conf/eacl/DouN21} & 91.5 & 91.1 & 93.7 & 87.3 & 90.7 & 83.1 & 54.8 & 78.0 \\ \midrule\nXLM-RoBERTa-xl \cite{xlmr} & 80.2 & 76.2 & 74.5 & 73.9 & 68.3 & 73.9 & 66.5 & 71.8 \\\nSpan Translation & 66.5 & 46.3 & 58.7 & 68.8 & 63.5 & 69.2 & 21.6 & 48.7 \\ \midrule\nT-Projection & \textbf{95.1} & \textbf{92.3} & \textbf{95.0} & \textbf{93.6} & \textbf{94.0} & \textbf{87.2} & \textbf{96.0} & \textbf{93.9} \\ \bottomrule\n\end{tabular}\n}\n \caption{F1 scores for annotation projection in the OTE, NER and Argument Mining tasks.}\n \label{tab:Results}\n\end{table*}\n\nTable \ref{tab:Results} reports the results of the automatically projected\ndatasets generated by each projection method with respect to the\nhuman-projected versions of those datasets. The systems based on word\nalignments obtain good results across the board, especially those using\nlanguage models, namely, SimAlign and AWESOME. In particular, AWESOME achieves\ngood results for OTE and NER but performs very poorly in AM. 
Manual\ninspection of the projections revealed that AWESOME struggles to align\narticles and prepositions within long sequences.\n\nXLM-RoBERTa-xl shows a strong zero-shot cross-lingual performance. However, the\ngenerated datasets are of lower quality than the ones generated by the\nword-alignment systems. The results of the Span Translation approach are quite\ndisappointing, especially when dealing with the long sequences of the AM task. \nTranslating the labeled spans independently produces translations\nthat, in many cases, cannot be located in the target sentence. \n\nOur T-Projection method achieves the best results for every task and language.\nIn OTE, it outperforms every other method by more than 2 points in F1 score\naveraged across the three languages. This suggests that T-Projection robustly\nprojects labeled spans into machine-translated data. The NER evaluation is\nslightly different because the parallel data was translated by human experts.\nIn this setting, T-Projection clearly improves AWESOME's results by 4.7 points,\nwhich constitutes a significant leap in the quality of the generated datasets. \n\n\nDespite the fact that the word\nalignment systems have been trained using Europarl domain-specific data, and that\nmost of the training data used for T-Projection comes from the CoNLL-2003\ndataset (news domain) plus very few annotated sentences (699) from Europarl,\nT-Projection still clearly obtains the best results in NER label projection. This\nsuggests that our system can also be applied in out-of-domain settings. \n\nFinally, T-Projection obtains the overall highest scores for Argument Mining,\nwhich means that our approach is particularly good at projecting long sequences.\nThus, T-Projection outperforms the second-best model by\n9.4 points in F1 score. 
In fact, the 96.0 F1 result obtained indicates that\nT-Projection correctly projects almost all the examples in the dataset.\n\nIf we look at the average over the three tasks and five languages, T-Projection improves\non the results of the second-best system, SimAlign, by 8.6 F1 points.\nThese results constitute a substantial improvement over all previous annotation projection\napproaches. To the best of our knowledge, these are by a wide margin the best\nannotation projection results published for sequence labeling.\n\n5.2 The Role of the Candidates\n\subsection{The Role of the Candidates}\n\n\n\begin{table}[htb]\n \centering\n \adjustbox{max width=0.99\linewidth}{\n\begin{tabular}{@{}lnnnkkkeq@{}}\n\toprule\n & \multicolumn{3}{c}{OTE} & \multicolumn{3}{c}{NER} & \multicolumn{1}{c}{AM} & \multicolumn{1}{c}{Avg} \\ \midrule\n & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{FR} & \multicolumn{1}{c}{RU} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{DE} & \multicolumn{1}{c}{IT} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{} \\ \midrule\nT-Projection & 95.1 & 92.3 & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & 93.9 \\ \midrule\n\begin{tabular}[c]{@{}l@{}}Ngrams +\\ Candidate \\ Selection\end{tabular} & 89.7 & 86.1 & 93.8 & 83.8 & 79.3 & 73.3 & 73.5 & 80.7 \\ \hdashline[3pt/6pt]\n\begin{tabular}[c]{@{}l@{}}mT5 +\\ Most Probable \\ Candidate\end{tabular} & 83.7 & 87.2 & 85.3 & 79.5 & 82.8 & 72.3 & 90.9 & 84.8 \\ \hdashline[3pt/6pt]\n\begin{tabular}[c]{@{}l@{}}mT5 +\\ Upper bound\end{tabular} & 98.6 & 97.0 & 97.9 & 98.0 & 98.5 & 94.0 & 99.3 & 98.0 \\\n\bottomrule\n\end{tabular}\n}\n\caption{F1 scores for different candidate generation and candidate selection\nmethods.}\n\label{tab:CandidateResults}\n\end{table}\n\n\nWe perform a set of experiments to measure the relevance and performance of the\n\emph{candidate generation} and \emph{candidate selection} tasks. First, we replace mT5 with\nan ngram-based candidate generation approach. 
We consider as candidate spans\nevery possible ngram with size $1..sentence\_length$ (i.e., \textit{\"Serves\",\n\"really\", \"good\", \"sushi\", \"Serves really\"...}). Table\n\ref{tab:CandidateResults} shows that this approach results in lower\nperformance compared with our technique using mT5. Ngrams are\nmuch noisier than the candidates generated by mT5; most of them\nare very similar to each other, which makes selecting the right candidate considerably more challenging. Thus, this experiment demonstrates that our mT5 candidate\ngeneration approach is crucial for obtaining good performance.\n\nWe also replace the \emph{candidate selection} method with the \emph{most probable\ncandidate}. In other words, we only use the most probable beam generated by\nmT5 to label the target sentence. When using mT5 by itself, it obtains\ncompetitive results, close to those of the word alignment systems in\nTable \ref{tab:Results}. Still, the average performance drops by 9.2 points.\nThis further confirms that both the \emph{candidate generation} and\n\emph{selection} steps are crucial for the T-Projection method. \n\nIn a final experiment we define an upper bound for \emph{candidate selection}\nby assuming that our model always selects the correct projection\nwhenever it is contained among the generated candidates. The upper bound achieves an average F1 score of\n98.0, which confirms that the correct candidate is almost\nalways among the 100 candidates generated by mT5. \n\n\begin{table*}[htb]\n \adjustbox{max width=0.95\linewidth}{\n\begin{tabular}{@{}lllnkeqq@{}}\n\toprule\nLanguage & No. 
of Speakers & Lang family & \\multicolumn{1}{c}{Fine-tune$_{en}$} & \\multicolumn{1}{c}{AWESOME+EN} & \\multicolumn{1}{c}{EasyProject+EN} & \\multicolumn{1}{c}{T-Projection} & \\multicolumn{1}{c}{T-Projection+EN} \\\\ \\midrule\nHausa & 63M & Afro-Asiatic /Chadic & 71.7 & \\textbf{72.7} & 72.2 & \\textbf{72.7} & 72.0 \\\\\nIgbo & 27M & NC / Volta-Niger & 59.3 & 63.5 & 65.6 & 71.4 & \\textbf{71.6} \\\\\nChichewa & 14M & English-Creole & \\textbf{79.5} & 75.1 & 75.3 & 77.2 & 77.8 \\\\\nchiShona & 12M & NC / Bantu & 35.2 & 69.5 & 55.9 & \\textbf{74.9} & 74.3 \\\\\nKiswahili & 98M & NC / Bantu & \\textbf{87.7} & 82.4 & 83.6 & 84.5 & 84.1 \\\\\nisiXhosa & 9M & NC / Bantu & 24.0 & 61.7 & 71.1 & \\textbf{72.3} & 71.7 \\\\\nYoruba & 42M & NC / Volta-Niger & 36.0 & 38.1 & 36.8 & \\textbf{42.7} & 42.1 \\\\\nisiZulu & 27M & NC / Bantu & 43.9 & 68.9 & \\textbf{73.0} & 66.7 & 64.9 \\\\ \\midrule\nAVG & & & 54.7 & 66.5 & 66.7 & \\textbf{70.3} & 69.8 \\\\ \\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores on MasakhaNER2.0 for mDebertaV3 trained with projected annotations from different systems. 
\"+EN\" denotes concatenation of the automatically generated target language dataset with the source English dataset.}\n\\label{tab:MasakhaNER2}\n\\end{table*}\n\n5.2 The Role of the Candidates\n\\subsection{The Role of the Candidates}\n\n\n\\begin{table}[htb]\n \\centering\n \\adjustbox{max width=0.99\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Avg} \\\\ \\midrule\n & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nT-Projection & 95.1 & 92.3 & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & 93.9 \\\\ \\midrule\n\\begin{tabular}[c]{@{}l@{}}Ngrams +\\\\ Candidate \\\\ Selection\\end{tabular} & 89.7 & 86.1 & 93.8 & 83.8 & 79.3 & 73.3 & 73.5 & 80.7 \\\\ \\hdashline[3pt/6pt]\n\\begin{tabular}[c]{@{}l@{}}mT5 +\\\\ Most Probable \\\\ Candidate\\end{tabular} & 83.7 & 87.2 & 85.3 & 79.5 & 82.8 & 72.3 & 90.9 & 84.8 \\\\ \\hdashline[3pt/6pt]\n\\begin{tabular}[c]{@{}l@{}}mT5 +\\\\ Upper bound\\end{tabular} & 98.6 & 97.0 & 97.9 & 98.0 & 98.5 & 94.0 & 99.3 & 98.0 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores for different candidate generation and candidate selection\nmethods.}\n\\label{tab:CandidateResults}\n\\end{table}\n\n\nWe perform a set of experiments to measure the relevance and performance of the\n\\emph{candidate generation} and \\emph{candidate selection} tasks. First, we replace mT5 with\nan ngram-based candidate generation approach. We consider as candidate spans\nevery possible ngram with size $1..sentence\\_length$ (i.e \\textit{\"Serves\",\n\"really\", \"good\", \"sushi\", \"Serves really\"...}). Table\n\\ref{tab:CandidateResults} shows that this approach results in lower\nperformance compared with our technique using mT5. 
Ngrams are\nmuch noisier than the candidates generated by mT5; most of them\nare very similar to each other, which makes selecting the right candidate considerably more challenging. Thus, this experiment demonstrates that our mT5 candidate\ngeneration approach is crucial for obtaining good performance.\n\nWe also replace the \emph{candidate selection} method with the \emph{most probable\ncandidate}. In other words, we only use the most probable beam generated by\nmT5 to label the target sentence. When using mT5 by itself, it obtains\ncompetitive results, close to those of the word alignment systems in\nTable \ref{tab:Results}. Still, the average performance drops by 9.2 points.\nThis further confirms that both the \emph{candidate generation} and\n\emph{selection} steps are crucial for the T-Projection method. \n\nIn a final experiment we define an upper bound for \emph{candidate selection}\nby assuming that our model always selects the correct projection\nwhenever it is contained among the generated candidates. The upper bound achieves an average F1 score of\n98.0, which confirms that the correct candidate is almost\nalways among the 100 candidates generated by mT5. \n\n\begin{table*}[htb]\n \adjustbox{max width=0.95\linewidth}{\n\begin{tabular}{@{}lllnkeqq@{}}\n\toprule\nLanguage & No. 
of Speakers & Lang family & \\multicolumn{1}{c}{Fine-tune$_{en}$} & \\multicolumn{1}{c}{AWESOME+EN} & \\multicolumn{1}{c}{EasyProject+EN} & \\multicolumn{1}{c}{T-Projection} & \\multicolumn{1}{c}{T-Projection+EN} \\\\ \\midrule\nHausa & 63M & Afro-Asiatic /Chadic & 71.7 & \\textbf{72.7} & 72.2 & \\textbf{72.7} & 72.0 \\\\\nIgbo & 27M & NC / Volta-Niger & 59.3 & 63.5 & 65.6 & 71.4 & \\textbf{71.6} \\\\\nChichewa & 14M & English-Creole & \\textbf{79.5} & 75.1 & 75.3 & 77.2 & 77.8 \\\\\nchiShona & 12M & NC / Bantu & 35.2 & 69.5 & 55.9 & \\textbf{74.9} & 74.3 \\\\\nKiswahili & 98M & NC / Bantu & \\textbf{87.7} & 82.4 & 83.6 & 84.5 & 84.1 \\\\\nisiXhosa & 9M & NC / Bantu & 24.0 & 61.7 & 71.1 & \\textbf{72.3} & 71.7 \\\\\nYoruba & 42M & NC / Volta-Niger & 36.0 & 38.1 & 36.8 & \\textbf{42.7} & 42.1 \\\\\nisiZulu & 27M & NC / Bantu & 43.9 & 68.9 & \\textbf{73.0} & 66.7 & 64.9 \\\\ \\midrule\nAVG & & & 54.7 & 66.5 & 66.7 & \\textbf{70.3} & 69.8 \\\\ \\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores on MasakhaNER2.0 for mDebertaV3 trained with projected annotations from different systems. 
\"+EN\" denotes concatenation of the automatically generated target language dataset with the source English dataset.}\n\\label{tab:MasakhaNER2}\n\\end{table*}\n\n", "Descriptive_question1": "What is the average F1 score for T-Projection in table_2?", "Descriptive_question2": "What is the F1 score for mT5 + Upper bound in the AM task for ES in table_2?", "Reasoning_question1": "How does the performance of Ngrams + Candidate Selection compare to mT5 + Most Probable Candidate across different tasks in table_2?", "Reasoning_question2": "What can be inferred about the effectiveness of the candidate generation and selection steps in T-Projection based on the results in table_2?", "Descriptive_answer1": "93.9", "Descriptive_answer2": "99.3", "Reasoning_answer1": "To compare the performance of Ngrams + Candidate Selection and mT5 + Most Probable Candidate across different tasks in table_2, I first look at the average F1 scores for each method. Ngrams + Candidate Selection has an average F1 score of 80.7, while mT5 + Most Probable Candidate scores 84.8, indicating a better overall performance by mT5 by 4.1 points. Breaking it down by task, for OTE, Ngrams + Candidate Selection scores 89.7 (ES), 86.1 (FR), and 93.8 (RU), averaging to 89.9, whereas mT5 + Most Probable Candidate scores 83.7 (ES), 87.2 (FR), and 85.3 (RU), averaging to 85.4, showing Ngrams outperforming mT5 by about 4.5 points in this task. For NER, Ngrams scores 83.8 (ES), 79.3 (DE), and 73.3 (IT), averaging to 78.8, while mT5 scores 79.5 (ES), 82.8 (DE), and 72.3 (IT), averaging to 78.2, a negligible difference of 0.6 points in favor of Ngrams. Finally, for AM (ES), Ngrams scores 73.5, significantly lower than mT5's 90.9, a difference of 17.4 points. 
Thus, while Ngrams performs better in OTE, mT5 excels notably in AM, contributing to its higher average score.", "Reasoning_answer2": "Analyzing the effectiveness of the candidate generation and selection steps in T-Projection from table_2, I observe that T-Projection achieves the highest average F1 score of 93.9, surpassing other methods. Comparing this to Ngrams + Candidate Selection, which scores 80.7, and mT5 + Most Probable Candidate, which scores 84.8, it’s evident that T-Projection’s combination of mT5 for candidate generation and its selection method is superior. The significant drop in performance when using Ngrams (down by 13.2 points on average) suggests that mT5 generates higher quality candidates, reducing noise and improving selection accuracy. Furthermore, using only the most probable candidate from mT5 results in a 9.1-point drop compared to T-Projection, highlighting the importance of the selection step in identifying the best candidate among those generated. Lastly, the mT5 + Upper bound score of 98.0 indicates that the correct candidate is almost always among the generated ones, reinforcing that mT5’s candidate generation is highly effective, and the primary challenge lies in perfecting the selection process. Therefore, both steps are critical to T-Projection’s success, with mT5 providing robust candidates and the selection mechanism ensuring the optimal choice." }, { "paper_id": "2212.10548.json", "table_id": "table_3", "table_content": "\\begin{table*}[htb]\n \\adjustbox{max width=0.95\\linewidth}{\n\\begin{tabular}{@{}lllnkeqq@{}}\n\\toprule\nLanguage & No. 
of Speakers & Lang family & \\multicolumn{1}{c}{Fine-tune$_{en}$} & \\multicolumn{1}{c}{AWESOME+EN} & \\multicolumn{1}{c}{EasyProject+EN} & \\multicolumn{1}{c}{T-Projection} & \\multicolumn{1}{c}{T-Projection+EN} \\\\ \\midrule\nHausa & 63M & Afro-Asiatic /Chadic & 71.7 & \\textbf{72.7} & 72.2 & \\textbf{72.7} & 72.0 \\\\\nIgbo & 27M & NC / Volta-Niger & 59.3 & 63.5 & 65.6 & 71.4 & \\textbf{71.6} \\\\\nChichewa & 14M & English-Creole & \\textbf{79.5} & 75.1 & 75.3 & 77.2 & 77.8 \\\\\nchiShona & 12M & NC / Bantu & 35.2 & 69.5 & 55.9 & \\textbf{74.9} & 74.3 \\\\\nKiswahili & 98M & NC / Bantu & \\textbf{87.7} & 82.4 & 83.6 & 84.5 & 84.1 \\\\\nisiXhosa & 9M & NC / Bantu & 24.0 & 61.7 & 71.1 & \\textbf{72.3} & 71.7 \\\\\nYoruba & 42M & NC / Volta-Niger & 36.0 & 38.1 & 36.8 & \\textbf{42.7} & 42.1 \\\\\nisiZulu & 27M & NC / Bantu & 43.9 & 68.9 & \\textbf{73.0} & 66.7 & 64.9 \\\\ \\midrule\nAVG & & & 54.7 & 66.5 & 66.7 & \\textbf{70.3} & 69.8 \\\\ \\bottomrule\n\\end{tabular}\n}\n\\caption{F1 scores on MasakhaNER2.0 for mDebertaV3 trained with projected annotations from different systems. \"+EN\" denotes concatenation of the automatically generated target language dataset with the source English dataset.}\n\\label{tab:MasakhaNER2}\n\\end{table*}", "caption": "F1 scores on MasakhaNER2.0 for mDebertaV3 trained with projected annotations from different systems. \"+EN\" denotes concatenation of the automatically generated target language dataset with the source English dataset.", "label": "tab:MasakhaNER2", "section_info": "6 Extrinsic Evaluation\n\\section{Extrinsic Evaluation}\\label{sec:ExtrinsicEval}\n\nIn this section we evaluate T-projection in a real-world low-resource scenario, namely, Named Entity Recognition in African Languages. 
We compare the results obtained by training on the NER datasets automatically generated by T-Projection with those automatically projected using two state-of-the-art label projection systems: AWESOME (the second-best NER system in Table \ref{tab:Results}) and EasyProject. \nWe use exactly the same settings as \citet{DBLP:journals/corr/abs-2211-15613}. For each target language in MasakhaNER2.0, we first translate the English CoNLL dataset using the NLLB-200 3 billion parameter model. Next, we project the English labels into the target language. It should be noted that EasyProject performs both of these processes in a single step. Subsequently, we train an mDebertaV3 \cite{DBLP:conf/iclr/HeLGC21} model using the automatically generated datasets for each target language. Finally, this model is evaluated on the gold MasakhaNER2.0 test data. We only evaluate the 8 languages in MasakhaNER2.0 supported by mT5. We focus on named entities referring to Person, Location and Organization. \n\n\nTable~\ref{tab:MasakhaNER2} presents the results of the evaluated models on the gold MasakhaNER2.0 test sets. For T-projection, we present the results of training with the automatically generated data for the target language only, and also by adding the original English CoNLL data concatenated with the automatically generated data for each target language. Regarding other systems, we only show the former results, as it was the only metric reported by previous work. In order to train and evaluate the NER models, we apply the same hyperparameter settings and code as the authors of EasyProject.\n\nThe results show that T-projection achieves superior performance for seven out of the eight languages. Our model demonstrates a more pronounced performance difference in agglutinative languages such as Igbo and Shona. As outlined in Section \ref{sec:Results}, our model produces superior alignments compared to AWESOME. 
On the other hand, we found that EasyProject, which utilizes markers for simultaneous translation and projection, introduces translation artifacts that hinder the performance of the downstream model. These artifacts are particularly noticeable in agglutinative languages, as EasyProject tends to separate words. For instance, in the case of Shona, consider the English sentence \textit{\"[Germany]'s representative to the [European Union]'s veterinary committee [Werner Zwingmann]\"}. Our system produces the Shona sentence \textit{\"Mumiriri [weGermany] kukomiti yemhuka [yeEuropean Union] [Werner Zwingmann]\"}, while EasyProject produces \textit{\"Mumiriri we [Germany] ku [European Union] komiti yezvokurapa mhuka [Werner Zwingmann]\"}. When training mDebertaV3 with T-Projection-generated data, which preserves the agglutinated words, we achieve better results \ncompared to EasyProject, which introduces artifacts by separating agglutinated words during translation and projection. Our system is only inferior in the Zulu language; however, on average, we improve the results by 3.6 F1 points. In contrast with previous work, our experiments revealed that concatenating English and translated data did not yield better results, potentially due to the superior quality of the data generated by T-Projection. 
\n\nTo the best of our knowledge, these are the best zero-shot results achieved for MasakhaNER2.0, underscoring the significant benefits of T-projection for NLP tasks in low-resource languages.\n\n\n\n\n", "Descriptive_question1": "What is the model used for training in table_3?", "Descriptive_question2": "Which dataset is concatenated with the target language data in table_3 denoted by '+EN'?", "Reasoning_question1": "Why does T-projection perform better in agglutinative languages like Igbo and Shona according to the results in table_3?", "Reasoning_question2": "How does the performance of T-projection compare to AWESOME and EasyProject across the languages evaluated in table_3?", "Descriptive_answer1": "mDebertaV3", "Descriptive_answer2": "English", "Reasoning_answer1": "T-Projection performs better in agglutinative languages like Igbo and Shona because it preserves the structure of agglutinated words during translation and label projection. According to the context, systems like EasyProject introduce artifacts by separating agglutinated words, which disrupts the linguistic structure and hinders the performance of the downstream model. For instance, in Shona, T-Projection maintains the integrity of complex word forms, as seen in the example sentence provided, leading to higher F1 scores (e.g., 74.9 for Shona and 71.4 for Igbo under T-Projection compared to lower scores with EasyProject and AWESOME). This preservation of linguistic structure results in better alignments and improved model performance for such languages.", "Reasoning_answer2": "T-Projection outperforms both AWESOME and EasyProject in most of the languages evaluated in Table 3. Analyzing the F1 scores, T-Projection achieves the highest scores in seven out of eight languages, with notable improvements in languages like Igbo (71.4 vs. 63.5 for AWESOME and 65.6 for EasyProject) and chiShona (74.9 vs. 69.5 for AWESOME and 55.9 for EasyProject). 
On average, T-Projection scores 70.3, surpassing AWESOME's 66.5 and EasyProject's 66.7. The context explains that T-Projection's superior performance is due to better alignments and fewer translation artifacts compared to EasyProject, which struggles with agglutinative structures, and AWESOME, which also falls short in alignment quality. The only exception is isiZulu, where EasyProject slightly outperforms with 73.0 compared to T-Projection's 66.7, but overall, T-Projection demonstrates a clear advantage with an average improvement of 3.6 F1 points." }, { "paper_id": "2212.10548.json", "table_id": "table_4", "table_content": "\\begin{table}[htb]\n \\centering\n \\adjustbox{max width=\\linewidth}{\n\\begin{tabular}{@{}llcnkeq@{}}\n\\toprule\n& Model & \\#Params & \\multicolumn{1}{c}{OTE} & \\multicolumn{1}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Average} \\\\ \\midrule\n\n\\multirow{3}{*}{ MT Size } & m2m100 & 418M & 92.3 & 91.7 & 95.5 & 93.1 \\\\\n& m2m100 & 1.2B & 94.0 & \\textbf{92.0} & 95.8 & \\textbf{93.9} \\\\\n& m2m100 & 12B & \\textbf{94.1} & 91.6 & 96.0 & \\textbf{93.9} \\\\ \\midrule\n\\multirow{4}{*}{ mT5 size } & mT5-small & 60M & 36.4 & 66.3 & 00.0 & 34.2 \\\\\n& mT5-base & 220M & 72.8 & 86.2 & 33.6 & 64.2 \\\\\n& mT5-large & 738M & 90.9 & 90.1 & 65.3 & 82.1 \\\\\n& mT5-xl & 3B & \\textbf{94.1} & \\textbf{91.6} & \\textbf{96.0} & \\textbf{93.9} \\\\\n\\bottomrule\n\\end{tabular}\n}\n \\caption{F1 scores of T-Projection when using translation and mT5 models of different size}\n \\label{tab:ModelSize}\n\\end{table}", "caption": "F1 scores of T-Projection when using translation and mT5 models of different size", "label": "tab:ModelSize", "section_info": "9 Model size vs Performance\n\\section{Model size vs Performance}\\label{sec:ModelSize}\n\n\\begin{table}[htb]\n \\centering\n \\adjustbox{max width=\\linewidth}{\n\\begin{tabular}{@{}llcnkeq@{}}\n\\toprule\n& Model & \\#Params & \\multicolumn{1}{c}{OTE} & \\multicolumn{1}{c}{NER} & 
\multicolumn{1}{c}{AM} & \multicolumn{1}{c}{Average} \\ \midrule\n\n\multirow{3}{*}{ MT Size } & m2m100 & 418M & 92.3 & 91.7 & 95.5 & 93.1 \\\n& m2m100 & 1.2B & 94.0 & \textbf{92.0} & 95.8 & \textbf{93.9} \\\n& m2m100 & 12B & \textbf{94.1} & 91.6 & 96.0 & \textbf{93.9} \\ \midrule\n\multirow{4}{*}{ mT5 size } & mT5-small & 60M & 36.4 & 66.3 & 00.0 & 34.2 \\\n& mT5-base & 220M & 72.8 & 86.2 & 33.6 & 64.2 \\\n& mT5-large & 738M & 90.9 & 90.1 & 65.3 & 82.1 \\\n& mT5-xl & 3B & \textbf{94.1} & \textbf{91.6} & \textbf{96.0} & \textbf{93.9} \\\n\bottomrule\n\end{tabular}\n}\n \caption{F1 scores of T-Projection when using translation and mT5 models of different size}\n \label{tab:ModelSize}\n\end{table}\n\nWe analyze the performance of T-Projection when using an mT5 model and a\ntranslation system with different numbers of parameters. Table \ref{tab:ModelSize}\nshows the average F1 performance across all the tasks and languages. First, we\nexperiment with M2M100 models of different sizes. The results show that the size of the\ntranslation model does not have a significant impact on the performance of T-Projection.\n\nHowever, the size of the mT5 model used does have a big impact on the final\nperformance of the system. Although for OTE and NER switching from a 3B to a\n738M parameter mT5 model produces competitive results, this is not the case when\napplied to AM. The overall trend is that results keep degrading as the number of\nparameters decreases. Summarizing, in order to achieve competitive\nperformance for every task, T-Projection requires an mT5 model with 3B parameters,\nalthough a 738M parameter model is still competitive for OTE and NER. 
\n\n\n\n\n\\begin{table*}[htb]\n \\centering\n \\adjustbox{max width=\\linewidth}{\n\\begin{tabular}{@{}lccnnnkkkeq@{}}\n\\toprule\n & & & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Average} \\\\ \\midrule\nSystem & Data Augmentation & Backbone & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 0 & mBERT & 76.2 & 73.8 & 78.2 & 71.4 & 66.6 & 65.7 & 86.4 & 76.8 \\\\\nFastAlign \\cite{fastalign} & 0 & mBERT & 72.3 & 70.4 & 74.6 & 60.3 & 64.0 & 57.5 & 84.0 & 72.4 \\\\\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & - & mBERT & 86.7 & 86.3 & 87.7 & 85.4 & 87.4 & 81.3 & 84.1 & 85.3 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 0 & mBERT & 88.9 & 89.8 & 91.2 & 86.1 & 89.4 & 83.0 & 57.1 & 77.8 \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 50000 & mBERT & 77.0 & 73.3 & 72.4 & 73.3 & 75.3 & 68.4 & 86.6 & 77.7 \\\\\nFastAlign \\cite{fastalign} & 50000 & mBERT & 75.0 & 72.9 & 76.9 & 70.2 & 77.0 & 67.0 & 85.7 & 77.4 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 50000 & mBERT & 91.5 & 91.1 & 93.7 & 87.3 & 90.7 & 83.1 & 54.8 & 78.0 \\\\ \\midrule\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & - & XLM-RoBERTa-xl & 86.2 & 86.1 & 89.5 & 85.8 & 88.4 & 81.2 & 76.9 & 83.1 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 50000 & XLM-RoBERTa-large & 86.1 & 86.1 & 87.4 & 87.2 & 87.5 & 83.1 & 54.8 & 75.8 \\\\ \\midrule\nT-Projection & - & - & \\textbf{95.1} & \\textbf{92.3} & \\textbf{95.0} & \\textbf{93.6} & \\textbf{94.0} & \\textbf{87.2} & \\textbf{96.0} & \\textbf{93.9} \\\\\n\\bottomrule\n\\end{tabular}\n}\n\n \\caption{Results of the different word alignment systems when we train with and without a data augmentation corpus and different backbone models}\n 
\\label{tab:WordAlignmentTune}\n\\end{table*}\n\n", "Descriptive_question1": "What is the highest average F1 score reported in table_4?", "Descriptive_question2": "Which mT5 model size achieved the lowest average F1 score in table_4?", "Reasoning_question1": "How does the size of the mT5 model impact the performance of T-Projection across different tasks in table_4?", "Reasoning_question2": "Why might the translation model size have less impact on T-Projection performance compared to the mT5 model size in table_4?", "Descriptive_answer1": "93.9", "Descriptive_answer2": "mT5-small", "Reasoning_answer1": "Analyzing the data in table_4, it is evident that the size of the mT5 model significantly impacts the performance of T-Projection across various tasks. For instance, starting with mT5-small (60M parameters), the average F1 score is only 34.2, with particularly poor performance in AM at 0.0. As the model size increases to mT5-base (220M), there is a substantial jump in performance, with the average F1 score rising to 64.2 and improvements across all tasks (OTE: 72.8, NER: 86.2, AM: 33.6). Further increasing to mT5-large (738M) results in an average F1 score of 82.1, showing competitive results in OTE (90.9) and NER (90.1), though AM performance (65.3) still lags. Finally, at mT5-xl (3B), the performance peaks with an average F1 score of 93.9, achieving the highest scores in all tasks (OTE: 94.1, NER: 91.6, AM: 96.0). This trend indicates a clear correlation: larger mT5 models consistently yield better performance, suggesting that increased parameter count enhances the model's capacity to handle complex tasks effectively.", "Reasoning_answer2": "The translation model size, as represented by the m2m100 variants in table_4, appears to have less impact on T-Projection performance compared to the mT5 model size, likely due to the differing roles these models play in the system. 
Looking at the m2m100 models, increasing from 418M to 1.2B parameters improves the average F1 score slightly from 93.1 to 93.9, and further increasing to 12B shows no additional average gain (stays at 93.9), with only marginal improvements in specific tasks like OTE (94.0 to 94.1) and AM (95.8 to 96.0). This suggests that beyond a certain threshold, additional parameters in the translation model do not significantly enhance performance, possibly because the translation task is less computationally intensive or already sufficiently optimized at smaller sizes. In contrast, the mT5 model size impacts performance dramatically, as seen in the jump from 34.2 (mT5-small) to 93.9 (mT5-xl), indicating that the mT5 model's role in processing or fine-tuning for specific tasks like OTE, NER, and AM requires greater computational capacity. Therefore, the mT5 model's size is more critical to overall performance, as it likely handles more complex downstream processing where parameter count directly correlates with capability." 
}, { "paper_id": "2212.10548.json", "table_id": "table_5", "table_content": "\\begin{table*}[htb]\n \\centering\n \\adjustbox{max width=\\linewidth}{\n\\begin{tabular}{@{}lccnnnkkkeq@{}}\n\\toprule\n & & & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Average} \\\\ \\midrule\nSystem & Data Augmentation & Backbone & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 0 & mBERT & 76.2 & 73.8 & 78.2 & 71.4 & 66.6 & 65.7 & 86.4 & 76.8 \\\\\nFastAlign \\cite{fastalign} & 0 & mBERT & 72.3 & 70.4 & 74.6 & 60.3 & 64.0 & 57.5 & 84.0 & 72.4 \\\\\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & - & mBERT & 86.7 & 86.3 & 87.7 & 85.4 & 87.4 & 81.3 & 84.1 & 85.3 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 0 & mBERT & 88.9 & 89.8 & 91.2 & 86.1 & 89.4 & 83.0 & 57.1 & 77.8 \\\\ \\midrule\nGiza++ \\cite{och-ney-2003-systematic} & 50000 & mBERT & 77.0 & 73.3 & 72.4 & 73.3 & 75.3 & 68.4 & 86.6 & 77.7 \\\\\nFastAlign \\cite{fastalign} & 50000 & mBERT & 75.0 & 72.9 & 76.9 & 70.2 & 77.0 & 67.0 & 85.7 & 77.4 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 50000 & mBERT & 91.5 & 91.1 & 93.7 & 87.3 & 90.7 & 83.1 & 54.8 & 78.0 \\\\ \\midrule\nSimAlign \\cite{jalili-sabet-etal-2020-simalign} & - & XLM-RoBERTa-xl & 86.2 & 86.1 & 89.5 & 85.8 & 88.4 & 81.2 & 76.9 & 83.1 \\\\\nAWESOME \\cite{DBLP:conf/eacl/DouN21} & 50000 & XLM-RoBERTa-large & 86.1 & 86.1 & 87.4 & 87.2 & 87.5 & 83.1 & 54.8 & 75.8 \\\\ \\midrule\nT-Projection & - & - & \\textbf{95.1} & \\textbf{92.3} & \\textbf{95.0} & \\textbf{93.6} & \\textbf{94.0} & \\textbf{87.2} & \\textbf{96.0} & \\textbf{93.9} \\\\\n\\bottomrule\n\\end{tabular}\n}\n\n \\caption{Results of the different word alignment systems when we train with and without a data augmentation corpus and different 
backbone models}\n \label{tab:WordAlignmentTune}\n\end{table*}", "caption": "Results of the different word alignment systems when we train with and without a data augmentation corpus and different backbone models", "label": "tab:WordAlignmentTune", "section_info": "10 Tuning the Word Alignment Systems\n\section{Tuning the Word Alignment Systems}\n\label{sec:alignmentTune}\nTo validate our results and further demonstrate the performance of T-Projection, we conduct a set of experiments that evaluate the performance of word-alignment systems under different settings. We first compare the annotation projection performance when using and not using 50,000 parallel sentences as data augmentation for training the word aligners. Note that in Section \ref{sec:Results} all the results we show correspond to using 50,000 extra parallel sentences for training the word-alignment systems. As Table \ref{tab:WordAlignmentTune} shows, using the augmented dataset achieves the best performance. SimAlign \cite{DBLP:conf/eacl/DouN21} and AWESOME \cite{DBLP:conf/eacl/DouN21} recommend using their systems with multilingual-bert-cased \cite{DBLP:conf/naacl/DevlinCLT19} as backbone. However, we also test the XLM-RoBERTa-xl \cite{xlmr} 3 billion parameter model with SimAlign and the XLM-RoBERTa-large (355M parameters) model with AWESOME (the released AWESOME code at the time of writing this paper doesn't support XLM-RoBERTa-xl). Using XLM-RoBERTa produces worse results than using mBERT. These experiments show that we are using the word-alignment systems in their best-performing settings. 
\n\n\n", "Descriptive_question1": "Which backbone model is used with SimAlign in table_5?", "Descriptive_question2": "What is the size of the data augmentation corpus mentioned in table_5?", "Reasoning_question1": "Why might the use of XLM-RoBERTa as a backbone model result in worse performance compared to multilingual-bert-cased in table_5?", "Reasoning_question2": "How does the inclusion of a data augmentation corpus impact the performance of word alignment systems as shown in table_5?", "Descriptive_answer1": "mBERT", "Descriptive_answer2": "50000", "Reasoning_answer1": "The worse performance of XLM-RoBERTa compared to multilingual-bert-cased (mBERT) as a backbone model in table_5 could be due to several factors. First, while XLM-RoBERTa models (both large and xl versions) are generally larger and potentially more powerful, they might not be as well-suited or optimized for the specific task of word alignment as mBERT. The table shows that SimAlign with XLM-RoBERTa-xl scores lower on average (83.1) compared to SimAlign with mBERT (85.3). Similarly, AWESOME with XLM-RoBERTa-large scores an average of 75.8, which is lower than AWESOME with mBERT (77.8 or 78.0). This suggests that mBERT might capture cross-lingual alignments better due to its training objectives or architecture being more aligned with the needs of word alignment systems. Additionally, the context in section 10 indicates that SimAlign and AWESOME recommend mBERT as their backbone, implying that these systems were likely fine-tuned or designed with mBERT in mind, potentially leading to compatibility issues or suboptimal performance with XLM-RoBERTa variants.", "Reasoning_answer2": "The inclusion of a data augmentation corpus of 50,000 parallel sentences generally has a positive impact on the performance of word alignment systems, as observed in table_5. Comparing the results, systems like Giza++ show an improvement in average score from 76.8 without augmentation to 77.7 with augmentation. 
FastAlign also improves from 72.4 to 77.4, and AWESOME sees a slight increase from 77.8 to 78.0. This trend suggests that the additional training data helps the models better learn alignment patterns across languages by providing more examples to generalize from. However, the improvement is not universal or drastic in all metrics; for instance, some specific language scores (like RU for Giza++) slightly decrease. Overall, the augmentation corpus enriches the training process, likely reducing overfitting and enhancing the model’s ability to handle diverse linguistic structures, as supported by the context in section 10 which states that using the augmented dataset achieves the best performance." }, { "paper_id": "2212.10548.json", "table_id": "table_6", "table_content": "\\begin{table*}[htb]\n \\centering\n \\adjustbox{max width=0.95\\linewidth}{\n\\begin{tabular}{@{}lnnnkkkeq@{}}\n\\toprule\n & \\multicolumn{3}{c}{OTE} & \\multicolumn{3}{c}{NER} & \\multicolumn{1}{c}{AM} & \\multicolumn{1}{c}{Average} \\\\ \\midrule\n Candidate Scorer & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{FR} & \\multicolumn{1}{c}{RU} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{DE} & \\multicolumn{1}{c}{IT} & \\multicolumn{1}{c}{ES} & \\multicolumn{1}{c}{} \\\\ \\midrule\nPrism-745M & 91.4 & 86.8 & 94.3 & \\textbf{93.8} & 93.4 & 85.4 & \\textbf{96.3} & 92.7 \\\\\nM2M100-12B & 95.1 & \\textbf{92.3} & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & \\textbf{93.9} \\\\\nNLLB200-3B & \\textbf{96.6} & 90.5 & \\textbf{95.6} & 91.0 & \\textbf{94.3} & \\textbf{87.7} & 93.9 & 93.0 \\\\ \\hdashline[3pt/6pt]\nLASER 2.0 & 89.0 & 80.6 & 91.3 & 91.2 & 91.6 & 86.5 & 70.4 & 82.4 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Results of T-Projection when selecting candidates using translation probability scores with different MT systems vs using the cosine similarity of the multilingual vector representations of the candidates computed using LASER 2.0}\n\\label{tab:laser}\n\\end{table*}", "caption": "Results of 
T-Projection when selecting candidates using translation probability scores with different MT systems vs using the cosine similarity of the multilingual vector representations of the candidates computed using LASER 2.0", "label": "tab:laser", "section_info": "11 MT models vs Laser\n\section{MT models vs Laser}\n\n\begin{table*}[htb]\n \centering\n \adjustbox{max width=0.95\linewidth}{\n\begin{tabular}{@{}lnnnkkkeq@{}}\n\toprule\n & \multicolumn{3}{c}{OTE} & \multicolumn{3}{c}{NER} & \multicolumn{1}{c}{AM} & \multicolumn{1}{c}{Average} \\ \midrule\n Candidate Scorer & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{FR} & \multicolumn{1}{c}{RU} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{DE} & \multicolumn{1}{c}{IT} & \multicolumn{1}{c}{ES} & \multicolumn{1}{c}{} \\ \midrule\nPrism-745M & 91.4 & 86.8 & 94.3 & \textbf{93.8} & 93.4 & 85.4 & \textbf{96.3} & 92.7 \\\nM2M100-12B & 95.1 & \textbf{92.3} & 95.0 & 93.6 & 94.0 & 87.2 & 96.0 & \textbf{93.9} \\\nNLLB200-3B & \textbf{96.6} & 90.5 & \textbf{95.6} & 91.0 & \textbf{94.3} & \textbf{87.7} & 93.9 & 93.0 \\ \hdashline[3pt/6pt]\nLASER 2.0 & 89.0 & 80.6 & 91.3 & 91.2 & 91.6 & 86.5 & 70.4 & 82.4 \\\n\bottomrule\n\end{tabular}\n}\n\caption{Results of T-Projection when selecting candidates using translation probability scores with different MT systems vs using the cosine similarity of the multilingual vector representations of the candidates computed using LASER 2.0}\n\label{tab:laser}\n\end{table*}\n\n\n\n\n\nWe conducted experiments using M2M100-12B \cite{DBLP:journals/jmlr/FanBSMEGBCWCGBL21}, NLLB200-3B \cite{DBLP:journals/corr/abs-2207-04672} and Prism\n\cite{DBLP:conf/emnlp/ThompsonP20} as models for computing translation probabilities. We also experiment with using LASER 2.0 \cite{DBLP:journals/tacl/ArtetxeS19} sentence representations instead of the translation probabilities of NMTscore. 
We encode the source span as well as all the projection candidates using the LASER encoder. We then rank them using cosine similarity. Table \ref{tab:laser} shows the results. LASER 2.0 is competitive when dealing with the short labeled sequences in the OTE and NER tasks, but the performance decreases when dealing with large sequences in the AM task. M2M100, NLLB200, and Prism exhibit comparable performance, with some of them achieving the best results in specific languages, but overall, their average performance is very similar.\n\n\n", "Descriptive_question1": "What is the highest OTE score for FR in table_6?", "Descriptive_question2": "Which model scored 70.4 in AM for ES in table_6?", "Reasoning_question1": "Why might LASER 2.0 have a significantly lower average score compared to other models in table_6?", "Reasoning_question2": "How does the performance of M2M100-12B compare to NLLB200-3B across different tasks and languages in table_6?", "Descriptive_answer1": "92.3", "Descriptive_answer2": "LASER 2.0", "Reasoning_answer1": "LASER 2.0 likely has a significantly lower average score of 82.4 compared to other models (which average around 93.0 or higher) because it struggles with larger sequences, particularly in the AM task where it scored only 70.4 for ES. This poor performance in AM drags down its overall average. In contrast, it remains competitive in tasks with shorter sequences like OTE and NER, with scores closer to the other models (e.g., 89.0 for OTE ES and 91.2 for NER ES). The table context suggests that LASER 2.0 uses cosine similarity of multilingual vector representations, which may not capture nuanced translation probabilities as effectively as the MT systems used by other models, especially for complex or longer text in AM.", "Reasoning_answer2": "Comparing M2M100-12B and NLLB200-3B across tasks and languages in table_6 reveals nuanced differences. For OTE, NLLB200-3B outperforms M2M100-12B in ES (96.6 vs. 95.1) and RU (95.6 vs. 
95.0), while M2M100-12B takes the lead in FR (92.3 vs. 90.5). In NER, NLLB200-3B is slightly better in DE (94.3 vs. 94.0) and IT (87.7 vs. 87.2), but M2M100-12B edges out in ES (93.6 vs. 91.0). For AM in ES, M2M100-12B scores 96.0, slightly higher than NLLB200-3B's 93.9. Overall, M2M100-12B has a marginally better average score of 93.9 compared to NLLB200-3B's 93.0. This suggests that while both models perform comparably, M2M100-12B has a slight edge overall, though NLLB200-3B excels in specific languages and tasks." }, { "paper_id": "2212.10548.json", "table_id": "table_7", "table_content": "\\begin{table}[H]\n \\centering\n \\adjustbox{max width=\\linewidth}{\n\\begin{tabular}{@{}lcq@{}}\n\\toprule\nTask & Split & \\multicolumn{1}{c}{Sentence No} \\\\ \\midrule\n\\multicolumn{3}{c}{ABSA} \\\\ \\midrule\nABSA \\cite{pontiki-etal-2016-semeval} & Train & 2000 \\\\\nABSA \\cite{pontiki-etal-2016-semeval} & Test & 676 \\\\ \\midrule\n\\multicolumn{3}{c}{NER} \\\\ \\midrule\nEuroparl \\cite{agerri-etal-2018-building} & Test & 799 \\\\\nCoNLL03 \\cite{tjong-kim-sang-de-meulder-2003-introduction} & Train & 14987 \\\\\nCoNLL03 \\cite{tjong-kim-sang-de-meulder-2003-introduction} & Dev & 3466 \\\\\nCoNLL03 \\cite{tjong-kim-sang-de-meulder-2003-introduction} & Test & 3684 \\\\ \\hdashline[3pt/6pt]\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (hau) & 1632 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (ibo) & 2180 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (sna) & 1772 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (swa) & 1882 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (xho) & 1632 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (yor) & 1963 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (nya) & 1784 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (zul) & 1669 \\\\ \\midrule\n\\multicolumn{3}{c}{AM} \\\\ \\midrule\nAbsRCT Neoplasm 
\\cite{DBLP:conf/ecai/0002CV20} & Train & 4404 \\\\\nAbsRCT Neoplasm \\cite{DBLP:conf/ecai/0002CV20} & Dev & 679 \\\\\nAbsRCT Neoplasm \\cite{DBLP:conf/ecai/0002CV20} & Test & 1251 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\n \\caption{Size (Number of sentences) of the dataset we use to train and evaluate our systems.}\n \\label{tab:DatasetLen}\n\\end{table}", "caption": "Size (Number of sentences) of the dataset we use to train and evaluate our systems.", "label": "tab:DatasetLen", "section_info": "13 Dataset details\n\\section{Dataset details}\n\\label{sec:DatasetDetails}\nWe list the size (number of sentences) of the dataset we use in Table \\ref{tab:DatasetLen}. Note that all the datasets we use are parallel in all the languages, and the number of sentences is the same for all the languages. \n\n\n\\begin{table}[H]\n \\centering\n \\adjustbox{max width=\\linewidth}{\n\\begin{tabular}{@{}lcq@{}}\n\\toprule\nTask & Split & \\multicolumn{1}{c}{Sentence No} \\\\ \\midrule\n\\multicolumn{3}{c}{ABSA} \\\\ \\midrule\nABSA \\cite{pontiki-etal-2016-semeval} & Train & 2000 \\\\\nABSA \\cite{pontiki-etal-2016-semeval} & Test & 676 \\\\ \\midrule\n\\multicolumn{3}{c}{NER} \\\\ \\midrule\nEuroparl \\cite{agerri-etal-2018-building} & Test & 799 \\\\\nCoNLL03 \\cite{tjong-kim-sang-de-meulder-2003-introduction} & Train & 14987 \\\\\nCoNLL03 \\cite{tjong-kim-sang-de-meulder-2003-introduction} & Dev & 3466 \\\\\nCoNLL03 \\cite{tjong-kim-sang-de-meulder-2003-introduction} & Test & 3684 \\\\ \\hdashline[3pt/6pt]\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (hau) & 1632 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (ibo) & 2180 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (sna) & 1772 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (swa) & 1882 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (xho) & 1632 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (yor) & 1963 \\\\\nMasakhaNER2.0 
\\cite{adelani-etal-2022-masakhaner} & Test (nya) & 1784 \\\\\nMasakhaNER2.0 \\cite{adelani-etal-2022-masakhaner} & Test (zul) & 1669 \\\\ \\midrule\n\\multicolumn{3}{c}{AM} \\\\ \\midrule\nAbsRCT Neoplasm \\cite{DBLP:conf/ecai/0002CV20} & Train & 4404 \\\\\nAbsRCT Neoplasm \\cite{DBLP:conf/ecai/0002CV20} & Dev & 679 \\\\\nAbsRCT Neoplasm \\cite{DBLP:conf/ecai/0002CV20} & Test & 1251 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\n \\caption{Size (Number of sentences) of the dataset we use to train and evaluate our systems.}\n \\label{tab:DatasetLen}\n\\end{table}\n\nFor OTE, we use the SemEval-2016 Task 5 Aspect Based Sentiment Analysis (ABSA) dataset \\cite{pontiki-etal-2016-semeval}. We train T-Projection with the concatenation of the English train and test splits. We evaluate all the systems by projecting the training split. \nFor NER, we use the English, Spanish, German, and Italian europarl parallel dataset from \\cite{agerri-etal-2018-building}. We train T-Projection with the concatenation of the English europarl dataset and the train, dev and test splits of the English CoNLL 2003 dataset \\cite{tjong-kim-sang-de-meulder-2003-introduction}. We evaluate the systems by projecting the English NER europarl test splits.\nFor Argument Mining, we use the AbstRCT Neoplasm English dataset \\cite{DBLP:conf/ecai/0002CV20} and the Spanish AbstRCT corpus generated by machine translating the English AbstRCT corpus with DeepL and manually projecting the labels. We train T-Projection with the concatenation of the English Neoplasm train, dev and test splits. We evaluate the systems by projecting the English Neoplasm train split. 
\n\n\n", "Descriptive_question1": "What is the number of sentences in the Train split for ABSA in table_7?", "Descriptive_question2": "How many sentences are in the Test split for Europarl in table_7?", "Reasoning_question1": "Which dataset in table_7 has the largest number of sentences in its Train split, and what might this suggest about the complexity of training for that task?", "Reasoning_question2": "Comparing the Test splits for MasakhaNER2.0 across different languages in table_7, which language has the highest number of sentences, and what could be a possible reason for this variation?", "Descriptive_answer1": "2000", "Descriptive_answer2": "799", "Reasoning_answer1": "The dataset with the largest number of sentences in its Train split is CoNLL03 for the NER task, with 14,987 sentences. This significantly higher number compared to other datasets suggests that the NER task, particularly with CoNLL03, may require more extensive training data to capture the complexity and variability of named entity recognition across different contexts and domains. A larger training set could imply that the task involves a broader range of entity types or more diverse linguistic patterns, necessitating more examples for the model to generalize effectively. Additionally, the size of the dataset might reflect the availability of annotated data for this well-established benchmark, allowing for more comprehensive training.", "Reasoning_answer2": "Among the Test splits for MasakhaNER2.0, the language with the highest number of sentences is 'ibo' with 2,180 sentences. This variation in the number of sentences across different languages could be due to several factors. One possible reason is the availability of annotated data for each language, as some languages like 'ibo' might have more resources or larger annotated corpora due to greater research focus or community efforts. 
Another reason could be related to the linguistic complexity or diversity of the language, where languages with more speakers or more diverse dialects might have larger datasets to ensure representativeness. Additionally, the variation might reflect the specific goals of the MasakhaNER2.0 project, prioritizing certain languages based on their underrepresentation in NLP research or the need for robust evaluation in specific linguistic contexts." }, { "paper_id": "2212.10509.json", "table_id": "table_1", "table_content": "\\begin{table}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{2.0pt}\n \\begin{tabular}{ccccc}\\toprule\n Model & HpQA\\textsuperscript{Br} & HpQA & 2WikiMQA & MQ\\textsuperscript{2H} \\\\\n \\midrule\n InterAug & $-$ | $-$ & 30.3 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ \\\\\n RECITE & $-$ | $-$ & 37.1 | 48.4 & $-$ | $-$ & $-$ | $-$ \\\\\n ReAct & $-$ | $-$ & 35.1 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ \\\\\n SelfAsk & $-$ | $-$ & $-$ | $-$ & 40.1 | $-$\\p{xx} & 15.2 | $-$\\p{xx} \\\\\n DecomP & \\p{x..}$-$ | 50.0 & $-$ | $-$ & \\p{x..}$-$ | 59.3 & $-$ | $-$ \\\\\n \\midrule\n \\sys QA & \\textbf{45.8 | 58.5} & \\bf{49.3 | 60.7} & \\bf{57.7 | 68.0} & \\bf{34.2 | 43.8} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Comparison with other LLM-based ODQA systems on EM and F1 scores. `$-$': score is unavailable. HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. \\iconsys QA with GPT3 (ours) outperforms other systems by a large margin. Note: Comparisons aren't head-to-head as discussed in the text. App.~\\S\\ref{sec:sota-differences} reports updated SOTA numbers, including contemporaneous and newer works.\n \\label{table:extrinsic-comparison}\n }\n\\end{table}", "caption": "Comparison with other LLM-based ODQA systems on EM and F1 scores. `$-$': score is unavailable. HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. 
\\iconsys QA with GPT3 (ours) outperforms other systems by a large margin. Note: Comparisons aren't head-to-head as discussed in the text. App.~\\S\\ref{sec:sota-differences} reports updated SOTA numbers, including contemporaneous and newer works.\n \\label{table:extrinsic-comparison}\n ", "label": "table:extrinsic-comparison", "section_info": "5 Results\n\\section{Results}\n\\label{sec:exp-results}\n\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/ood_retrieval_results.pdf}\n\\caption{Retrieval recall for OneR and IRCoT using Flan-T5-XXL (Left) and GPT3 (Right) in out-of-distribution (OOD) setting. HQ (HotpotQA), 2W (2WikiMultihopQA), MQ (MuSiQue). The result X$\\rightarrow$Y indicates prompt demonstrations are from dataset X and evaluation is on dataset Y. \\iconsys outperforms OneR in such an OOD setting.}\n\\label{fig:ood-retrieval-results}\n\\end{figure*}\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/ood_qa_results.pdf}\n\\caption{Answer F1 for NoR QA, OneR QA and IRCoT QA using Flan-T5-XXL (Left) and GPT3 (Right) in out-of-distribution (OOD) setting. HQ (HotpotQA), 2W (2WikiMultihopQA), MQ (MuSiQue). The result X$\\rightarrow$Y indicates prompt demonstrations are from dataset X and evaluation is on dataset Y. \\iconsys QA outperforms OneR QA and NoR QA in such OOD setting.}\n\\label{fig:ood-qa-results}\n\\end{figure*}\n\n\n\n\\paragraph{\\iconsys retrieval is better than one-step. }\n\nFig.~\\ref{fig:main-retrieval-results} compares OneR with \\iconsys retrievers made from \\texttt{Flan-T5-XXL} and \\texttt{GPT3} LMs. For both models, \\iconsys significantly outperforms one-step retrieval across all datasets. For \\texttt{Flan-T5-XXL}, \\iconsys improves our recall metric relative to one-step retrieval, on HotpotQA by 7.9, on 2WikiMultihopQA by 14.3, on MuSiQue by 3.5, and on IIRC by 10.2 points. 
For \\texttt{GPT3}, this improvement is by 11.3, 22.6, 12.5, and 21.2 points, respectively.\n\n\n\\paragraph{\\iconsys QA outperforms NoR and OneR QA.}\n\nFig.~\\ref{fig:main-qa-results} compares ODQA performance using NoR, OneR and \\iconsys retriever made from \\texttt{Flan-T5-XXL} and \\texttt{GPT3} LMs. For \\texttt{Flan-T5-XXL}, \\iconsys QA outperforms OneR QA on HotpotQA by 9.4, on 2WikiMultihopQA by 15.3, on MuSiQue by 5.0 and IIRC by 2.5 F1 points. For \\texttt{GPT3}, the corresponding numbers (except for IIRC) are 7.1, 13.2, and 7.1 F1 points. For \\texttt{GPT3}, \\iconsys doesn't improve the QA score on IIRC, despite significantly improved retrieval (21 points as shown in Fig.~\\ref{fig:main-retrieval-results}). This is likely because IIRC-relevant knowledge may already be present in GPT3, as also evidenced by its NoR QA score being similar. For other datasets and model combinations, NoR QA is much worse than \\iconsys QA, indicating the limits of the models' parametric knowledge.\n\n\n\\begin{figure}[ht]\n\\centering\n\\includegraphics[width=0.475\\textwidth]{images/factual_errors.pdf}\n\\caption{Number of questions, out of 40, where CoT generated by GPT3 using different methods has at least 1 factual error. Factual errors: \\iconsys $<$ OneR $<$ NoR.}\n\\label{fig:cot-factual-errors}\n\\end{figure}\n\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/model_scale_retrieval_results.pdf}\n\\caption{Retrieval recall for OneR (bottom) and \\iconsys (top) for LMs of increasing sizes: Flan-T5 \\{base (0.2B), large (0.7B), XL (3B), XXL (11B)\\} and GPT3 (175B) on HotpotQA, 2WikiMultihopQA, MuSiQue. \\iconsys outperforms OneR for all model sizes, including the 0.2B model, and the difference roughly grows with model size. 
Note: OneR doesn't use an LM in its retrieval and so has a fixed score.}\n\\label{fig:model-scale-retrieval-results}\n\\end{figure*}\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/model_scale_qa_results.pdf}\n\\caption{Answer F1 for ODQA models made using OneR (bottom) and \\iconsys (top) for LMs of increasing sizes: Flan-T5 \\{base (0.2B), large (0.7B), XL (3B), XXL (11B)\\} and GPT3 (175B) on HotpotQA, 2WikiMultihopQA and MuSiQue. \\iconsys QA outperforms OneR QA for all model sizes except for the smallest, 0.2B. \\iconsys with a 3B model even outperforms OneR with a 58X larger GPT3 model, showing the value of improved retrieval.}\n\\label{fig:model-scale-qa-results}\n\\end{figure*}\n\n\n\n\\paragraph{\\iconsys is effective in OOD setting. }\n\nSince CoT may not always be easy to write for new datasets, we evaluate NoR, OneR, and IRCoT on generalization to new datasets, i.e., the OOD setting. To do so, we use prompt demonstrations from one dataset to evaluate on another dataset.\\footnote{We use the evaluation dataset's corpus for retrieval.} For all pairs of the datasets\\footnote{We skip IIRC in this exploration as the task is structured a bit differently and requires special handling (see App.~\\ref{sec:apndx-iirc-special-handling}).} and for both \\texttt{Flan-T5-XXL} and \\texttt{GPT3}, we find the same trend as in the IID setting: \\iconsys retrieval outperforms OneR (Fig.~\\ref{fig:ood-retrieval-results}), and IRCoT QA outperforms both OneR QA and NoR QA (Fig.~\\ref{fig:ood-qa-results}).\n\n\n\n\\paragraph{\\iconsys generates CoT with fewer factual errors.}\n\nTo assess whether our approach also improves the factuality of generated CoTs, we manually annotated CoTs generated by NoR QA, OneR QA, and IRCoT QA using GPT3 for 40 randomly sampled questions from each of the four datasets. 
We considered CoT to have a factual error if at least one of the facts\\footnote{all sentences before the final ``answer is:'' sentence.} is not true.\\footnote{Note that factual error doesn't necessarily mean the predicted answer is incorrect and vice-versa. This is because the model can generate a wrong answer despite all correct facts, and vice-versa. We also account for the possibility of answer annotation errors in the original datasets.} As Fig.~\\ref{fig:cot-factual-errors} shows, NoR makes the most factual errors, OneR makes fewer, and \\iconsys the least. In particular, \\iconsys reduces the factual errors over OneR by 50\\% on HotpotQA and 40\\% on 2WikiMultihopQA.\n\n\nTable~\\ref{table:nor-oner-cot-examples} illustrates how the CoT predictions for different methods vary qualitatively. Since NoR relies completely on parametric knowledge, it often makes a factual error in the first sentence, which derails the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. IRCoT, on the other hand, is often able to prevent such errors in each step.\n\n\n\\paragraph{\\iconsys is also effective for smaller models.}\n\nTo see how effective \\iconsys is at different LM sizes, we show the scaling plots in Fig.~\\ref{fig:model-scale-retrieval-results}.\\footnote{We skip IIRC here as the smaller models are not good at identifying Wikipedia titles from a paragraph and a question which is necessary for IIRC (see App.~\\ref{sec:apndx-iirc-special-handling}).} We compare the recall for OneR and \\iconsys using \\texttt{Flan-T5} \\{base (0.2B), large (0.7B), XL (3B), XXL (11B)\\}, and GPT3 \\texttt{code-davinci-002} (175B). \\iconsys with even the smallest model (0.2B) is better than OneR, and the performance roughly improves with the model size. This shows the CoT generation capabilities of even small models can be leveraged for improving retrieval. 
Furthermore, we show the effect of model size on the QA score in Fig.~\\ref{fig:model-scale-qa-results}. For all sizes except the smallest (0.2B), we see \\iconsys QA is better than OneR QA. Moreover, \\iconsys with a 3B model even outperforms OneR and NoR with a 58X larger 175B GPT3 model in all datasets.\n\n\n\n\\paragraph{\\iconsys is SOTA for few-shot multistep ODQA.\\footnote{\\label{footnote:sota}App.~\\S\\ref{sec:sota-differences} reports updated SOTA numbers, including contemporaneous and newer works.}}\n\n\nWe compare \\iconsys QA with five recent approaches to using LLMs for ODQA: Internet-Augmented QA~\\cite{internet-augmented-qa}, RECITE~\\cite{recitationlm}, ReAct~\\cite{react}, SelfAsk~\\cite{selfask}, and DecomP~\\cite{old-decomp}. Although these are not head-to-head comparisons as different methods use different APIs, knowledge sources, and even LLMs (see App.~\\ref{sec:sota-differences} for details), it is still informative to explore, in a leaderboard-style fashion, how \\iconsys performs relative to the best numbers published for these recent systems.\n\n\\vspace{0.1cm}\n\\begin{table}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{2.0pt}\n \\begin{tabular}{ccccc}\\toprule\n Model & HpQA\\textsuperscript{Br} & HpQA & 2WikiMQA & MQ\\textsuperscript{2H} \\\\\n \\midrule\n InterAug & $-$ | $-$ & 30.3 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ \\\\\n RECITE & $-$ | $-$ & 37.1 | 48.4 & $-$ | $-$ & $-$ | $-$ \\\\\n ReAct & $-$ | $-$ & 35.1 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ \\\\\n SelfAsk & $-$ | $-$ & $-$ | $-$ & 40.1 | $-$\\p{xx} & 15.2 | $-$\\p{xx} \\\\\n DecomP & \\p{x..}$-$ | 50.0 & $-$ | $-$ & \\p{x..}$-$ | 59.3 & $-$ | $-$ \\\\\n \\midrule\n \\sys QA & \\textbf{45.8 | 58.5} & \\bf{49.3 | 60.7} & \\bf{57.7 | 68.0} & \\bf{34.2 | 43.8} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Comparison with other LLM-based ODQA systems on EM and F1 scores. `$-$': score is unavailable. 
HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. \\iconsys QA with GPT3 (ours) outperforms other systems by a large margin. Note: Comparisons aren't head-to-head as discussed in the text. App.~\\S\\ref{sec:sota-differences} reports updated SOTA numbers, including contemporaneous and newer works.\n \\label{table:extrinsic-comparison}\n }\n\\end{table}\n\n\nAs shown in Table~\\ref{table:extrinsic-comparison}, \\iconsys QA significantly outperforms all of these recent systems by a large margin, setting a new state of the art in terms of what's achievable via retrieval-augmented LLMs (without supervised training).\n\n\n\n\\begin{table*}[t]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{4.0pt}\n \\begin{tabular}{llllll}\\toprule\n & \\textbf{Question} & \\textbf{Gold Facts} & \\textbf{NoR QA} & \\textbf{OneR QA} & \\textbf{IRCoT QA} \\\\\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering HotpotQA}}}\n & \\begin{minipage}[t]{2cm} The tap dancing soloist in My Feet Keep Dancing also starred in what 1935 MGM musical? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. \\\\ $\\Rightarrow$ Fayard Nicholas starred in the 1935 MGM musical \\textcolor{blue}{An All-Colored Vaudeville Show}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{red}{Eleanor Powell}. Eleanor Powell also starred in the 1935 MGM musical \\textcolor{red}{Broadway Melody} of 1936. So the answer is: Broadway Melody of 1936. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. Fayard Nicholas also starred in the 1935 MGM musical \\textcolor{red}{Top Hat}. So the answer is: Top Hat. 
\\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. Fayard Nicholas starred in the 1935 MGM musical \\textcolor{blue}{An All-Colored Vaudeville Show}. So the answer is: An All-Colored Vaudeville Show. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{1}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{4.0cm}{\\centering 2WikiMQA}}}\n & \\begin{minipage}[t]{2cm} Who is the father-in-law of Queen Hyojeong? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. \\\\ $\\Rightarrow$ King Heonjong of Joseon is the son of \\textcolor{blue}{Crown Prince Hyomyeong}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong has a husband named \\textcolor{red}{King Gojong}. King Gojong has a father named \\textcolor{red}{Heungseon Daewongun}. Thus, Queen Hyojeong has a father-in-law named Heungseon Daewongun. So the answer is: Heungseon Daewongun. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. King Heonjong of Joseon is the son of \\textcolor{red}{King Sejo of Joseon}. Thus, King Sejo of Joseon is the father-in-law of Queen Hyojeong. So the answer is: King Sejo of Joseon. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. King Heonjong of Joseon is the son of \\textcolor{blue}{Crown Prince Hyomyeong}. Thus, Crown Prince Hyomyeong is the father-in-law of Queen Hyojeong. So the answer is: Crown Prince Hyomyeong. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{1}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering MuSiQue}}}\n & \\begin{minipage}[t]{2cm} What is the name of the castle in the city where the performer of A Collection 1984–1989 was born? 
\\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. \\\\ $\\Rightarrow$ Jane Siberry was born in \\textcolor{blue}{Toronto}. \\\\ $\\Rightarrow$ The castle in Toronto is the \\textcolor{blue}{Casa Loma}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The performer of A Collection 1984–1989 is \\textcolor{red}{The The}. The The was born in \\textcolor{red}{London}. The name of the castle in London is the \\textcolor{red}{Tower of London}. So the answer is: the Tower of London. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. Jane Siberry was born in \\textcolor{blue}{Toronto}. The castle in Toronto is \\textcolor{red}{Peqin Castle}. So the answer is: Peqin Castle. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. Jane Siberry was born in \\textcolor{blue}{Toronto}. The castle in Toronto is the \\textcolor{blue}{Casa Loma}. So the answer is: Casa Loma. \\end{minipage} \\\\\n\n \\bottomrule\n\n \\end{tabular}\n \\caption{Example CoTs generated by GPT3 with different methods. Since NoR relies on parametric knowledge, it often makes a factual error in the first sentence derailing the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. As \\iconsys performs retrieval after each step, it is often able to prevent such errors in each step. 
More examples are in App.~\\ref{sec:apdx-nor-oner-cot-examples}.}\n \\label{table:nor-oner-cot-examples}\n\\end{table*}\n\n\n\n\n\n\n\n 9 Comparison with Previous Systems for ODQA with LLMs\n\\section{Comparison with Previous Systems for ODQA with LLMs}\n\\label{sec:sota-differences}\n\nWe showed a leaderboard-style comparison with previous approaches to using large language models for open-domain QA in \\S~\\ref{sec:exp-results}. We noted though that the comparison is not head-to-head given various differences. We briefly describe each method and the differences in API, LLM, retrieval corpus, and other choices here.\n\nInternet-Augmented QA~\\cite{internet-augmented-qa} does (one-step) Google Search retrieval, performs additional LLM-based filtering on it, and then prompts an LLM to answer the question using the resulting context. It uses the Gopher 280B language model. RECITE~\\cite{recitationlm} bypasses the retrieval and instead prompts an LLM to first generate (recite) one or several relevant passages from its own memory, and generate the answer conditioned on this generation. They experiment with many LLMs, the highest performing of which is \\texttt{code-davinci-002} which we report here. ReAct~\\cite{react} prompts LLMs to produce reasoning and action traces where actions are calls to a Wikipedia API to return the summary for a given Wikipedia page title. It uses the PALM 540B model. SelfAsk~\\cite{selfask} prompts LLMs to decompose a question into subquestions and answers these subquestions by issuing separate calls to the Google Search API. It uses the GPT3 (\\texttt{text-davinci-002}) model. Finally, DecomP~\\cite{decomp} is a general framework that decomposes a task and delegates sub-tasks to appropriate sub-models. Similar to our system, it uses BM25 Search and the GPT3 (\\texttt{code-davinci-002}) model. 
Finally, DSP~\\cite{dsp} provides a way to programmatically define interactions between LLM and retrieval for ODQA (e.g., via question decomposition), bootstrap demonstrations for such a program, and use them to make the answer prediction. It uses the GPT3.5 LLM with ColBERT-based retrieval. Since most of these methods use different knowledge sources or APIs and are built using different LLMs and retrieval models, it's difficult to make a fair scientific comparison across these systems. Additionally, the evaluations in the respective papers are on different random subsets (from the same distribution) of test instances. \n\nDespite these differences, it is still informative to explore, in a leaderboard-style fashion, how \\iconsys performs relative to the best numbers published for these recent systems. Table~\\ref{table:extended-extrinsic-comparison} shows results from different systems, including contemporaneous and newer numbers. The two new systems in this table (relative to Table~\\ref{table:extrinsic-comparison}) are DecomP (newer version) and DSP. While \\iconsys remains SOTA on MuSiQue, DSP outperforms it on HotpotQA by 2.0 points, and the newer version of DecomP outperforms \\iconsys on 2WikiMultihopQA by 2.8 points. We speculate DecomP performs well on 2WikiMultihopQA because that dataset has only a few easy-to-predict decomposition patterns, which DecomP's question decomposition can leverage. The lack of such patterns in HotpotQA and MuSiQue causes it to underperform compared to \\iconsys. Lastly, it will be useful to assess whether DSP, which is hardcoded for 2-hop questions like those of HotpotQA, will work well for a dataset with a varied number of hops, like MuSiQue. 
We leave this further investigation to future work.\n\n\n\n", "Descriptive_question1": "What is the F1 score for 'sys QA on HpQA in table_1?", "Descriptive_question2": "Which model has an F1 score of 68.0 on 2WikiMQA in table_1?", "Reasoning_question1": "Why might 'sys QA outperform other systems by a large margin on multiple datasets in table_1?", "Reasoning_question2": "How does the performance of 'sys QA on MQ^2H compare to other models in table_1, and what might explain this difference?", "Descriptive_answer1": "60.7", "Descriptive_answer2": "sys QA", "Reasoning_answer1": "The outperformance of 'sys QA' across multiple datasets in table_1 can be attributed to several potential factors. First, considering the scores, 'sys QA' consistently achieves the highest EM and F1 scores across all listed datasets, such as 60.7 on HpQA and 68.0 on 2WikiMQA, compared to other models like RECITE (48.4 on HpQA) and DecomP (59.3 on 2WikiMQA). This suggests a superior model architecture or training methodology. Second, the table caption indicates that 'sys QA' uses GPT3, which might provide advanced language understanding and reasoning capabilities not present in the models used by other systems, as discussed in the context of different LLMs in section 9. Third, the associated text mentions that 'sys QA' is retrieval-augmented, potentially allowing it to access and integrate external knowledge more effectively than others, which could explain the large margin of improvement. Finally, the context of non-head-to-head comparisons implies possible differences in implementation details or data usage, but the consistent high scores suggest that 'sys QA' has a robust generalization ability across varied question types and datasets.", "Reasoning_answer2": "In table_1, 'sys QA' achieves an F1 score of 43.8 on MQ^2H (MuSiQue 2-hop questions), which is significantly higher than SelfAsk's score of 15.2, with all other models reporting unavailable scores ('-'). 
This performance difference might be explained by several factors. First, the high score of 'sys QA' indicates it is particularly effective at handling multi-hop questions, which require reasoning over multiple pieces of information, as noted in the table caption about MuSiQue 2-hop questions. Second, the text in section 5 suggests that 'sys QA' benefits from improved retrieval mechanisms, as seen in the discussion of outperforming OneR and NoR in various settings, potentially aiding in accurately gathering and synthesizing information across hops. Third, comparing with SelfAsk, which also likely involves question decomposition (as per section 9), 'sys QA' might have a more refined approach to integrating retrieved information with reasoning, leading to better performance. Lastly, the lack of scores from other models makes it unclear if they were tested on this dataset or failed to produce competitive results, further highlighting 'sys QA's unique strength in this specific context." }, { "paper_id": "2212.10509.json", "table_id": "table_2", "table_content": "\\begin{table*}[t]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{4.0pt}\n \\begin{tabular}{llllll}\\toprule\n & \\textbf{Question} & \\textbf{Gold Facts} & \\textbf{NoR QA} & \\textbf{OneR QA} & \\textbf{IRCoT QA} \\\\\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering HotpotQA}}}\n & \\begin{minipage}[t]{2cm} The tap dancing soloist in My Feet Keep Dancing also starred in what 1935 MGM musical? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. \\\\ $\\Rightarrow$ Fayard Nicholas starred in the 1935 MGM musical \\textcolor{blue}{An All-Colored Vaudeville Show}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{red}{Eleanor Powell}. 
Eleanor Powell also starred in the 1935 MGM musical \\textcolor{red}{Broadway Melody} of 1936. So the answer is: Broadway Melody of 1936. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. Fayard Nicholas also starred in the 1935 MGM musical \\textcolor{red}{Top Hat}. So the answer is: Top Hat. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. Fayard Nicholas starred in the 1935 MGM musical \\textcolor{blue}{An All-Colored Vaudeville Show}. So the answer is: An All-Colored Vaudeville Show. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{1}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{4.0cm}{\\centering 2WikiMQA}}}\n & \\begin{minipage}[t]{2cm} Who is the father-in-law of Queen Hyojeong? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. \\\\ $\\Rightarrow$ King Heonjong of Joseon is the son of \\textcolor{blue}{Crown Prince Hyomyeong}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong has a husband named \\textcolor{red}{King Gojong}. King Gojong has a father named \\textcolor{red}{Heungseon Daewongun}. Thus, Queen Hyojeong has a father-in-law named Heungseon Daewongun. So the answer is: Heungseon Daewongun. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. King Heonjong of Joseon is the son of \\textcolor{red}{King Sejo of Joseon}. Thus, King Sejo of Joseon is the father-in-law of Queen Hyojeong. So the answer is: King Sejo of Joseon. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. King Heonjong of Joseon is the son of \\textcolor{blue}{Crown Prince Hyomyeong}. Thus, Crown Prince Hyomyeong is the father-in-law of Queen Hyojeong. 
So the answer is: Crown Prince Hyomyeong. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{1}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering MuSiQue}}}\n & \\begin{minipage}[t]{2cm} What is the name of the castle in the city where the performer of A Collection 1984–1989 was born? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. \\\\ $\\Rightarrow$ Jane Siberry was born in \\textcolor{blue}{Toronto}. \\\\ $\\Rightarrow$ The castle in Toronto is the \\textcolor{blue}{Casa Loma}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The performer of A Collection 1984–1989 is \\textcolor{red}{The The}. The The was born in \\textcolor{red}{London}. The name of the castle in London is the \\textcolor{red}{Tower of London}. So the answer is: the Tower of London. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. Jane Siberry was born in \\textcolor{blue}{Toronto}. The castle in Toronto is \\textcolor{red}{Peqin Castle}. So the answer is: Peqin Castle. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. Jane Siberry was born in \\textcolor{blue}{Toronto}. The castle in Toronto is the \\textcolor{blue}{Casa Loma}. So the answer is: Casa Loma. \\end{minipage} \\\\\n\n \\bottomrule\n\n \\end{tabular}\n \\caption{Example CoTs generated by GPT3 with different methods. Since NoR relies on parametric knowledge, it often makes a factual error in the first sentence derailing the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. As \\iconsys performs retrieval after each step, it is often able to prevent such errors in each step. 
More examples are in App.~\\ref{sec:apdx-nor-oner-cot-examples}.}\n \\label{table:nor-oner-cot-examples}\n\\end{table*}", "caption": "Example CoTs generated by GPT3 with different methods. Since NoR relies on parametric knowledge, it often makes a factual error in the first sentence derailing the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. As \\iconsys performs retrieval after each step, it is often able to prevent such errors in each step. More examples are in App.~\\ref{sec:apdx-nor-oner-cot-examples}.", "label": "table:nor-oner-cot-examples", "section_info": "5 Results\n\\section{Results}\n\\label{sec:exp-results}\n\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/ood_retrieval_results.pdf}\n\\caption{Retrieval recall for OneR and IRCoT using Flan-T5-XXL (Left) and GPT3 (Right) in out-of-distribution (OOD) setting. HQ (HotpotQA), 2W (2WikiMultihopQA), MQ (MuSiQue). The result X$\\rightarrow$Y indicates prompt demonstrations are from dataset X and evaluation is on dataset Y. \\iconsys outperforms OneR in such an OOD setting.}\n\\label{fig:ood-retrieval-results}\n\\end{figure*}\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/ood_qa_results.pdf}\n\\caption{Answer F1 for NoR QA, OneR QA and IRCoT QA using Flan-T5-XXL (Left) and GPT3 (Right) in out-of-distribution (OOD) setting. HQ (HotpotQA), 2W (2WikiMultihopQA), MQ (MuSiQue). The result X$\\rightarrow$Y indicates prompt demonstrations are from dataset X and evaluation is on dataset Y. \\iconsys QA outperforms OneR QA and NoR QA in such OOD setting.}\n\\label{fig:ood-qa-results}\n\\end{figure*}\n\n\n\n\\paragraph{\\iconsys retrieval is better than one-step. }\n\nFig.~\\ref{fig:main-retrieval-results} compares OneR with \\iconsys retrievers made from \\texttt{Flan-T5-XXL} and \\texttt{GPT3} LMs. 
For both models, \\iconsys significantly outperforms one-step retrieval across all datasets. For \\texttt{Flan-T5-XXL}, \\iconsys improves our recall metric relative to one-step retrieval on HotpotQA by 7.9, on 2WikiMultihopQA by 14.3, on MuSiQue by 3.5, and on IIRC by 10.2 points. For \\texttt{GPT3}, this improvement is by 11.3, 22.6, 12.5, and 21.2 points, respectively.\n\n\n\\paragraph{\\iconsys QA outperforms NoR and OneR QA.}\n\nFig.~\\ref{fig:main-qa-results} compares ODQA performance using NoR, OneR and \\iconsys retrievers made from \\texttt{Flan-T5-XXL} and \\texttt{GPT3} LMs. For \\texttt{Flan-T5-XXL}, \\iconsys QA outperforms OneR QA on HotpotQA by 9.4, on 2WikiMultihopQA by 15.3, on MuSiQue by 5.0, and on IIRC by 2.5 F1 points. For \\texttt{GPT3}, the corresponding numbers (except for IIRC) are 7.1, 13.2, and 7.1 F1 points. For \\texttt{GPT3}, \\iconsys doesn't improve the QA score on IIRC, despite significantly improved retrieval (21 points, as shown in Fig.~\\ref{fig:main-retrieval-results}). This is likely because IIRC-relevant knowledge may already be present in GPT3, as also evidenced by its NoR QA score being similar. For other datasets and model combinations, NoR QA is much worse than \\iconsys QA, indicating the limits of the models' parametric knowledge.\n\n\n\\begin{figure}[ht]\n\\centering\n\\includegraphics[width=0.475\\textwidth]{images/factual_errors.pdf}\n\\caption{Number of questions, out of 40, where the CoT generated by GPT3 using different methods has at least 1 factual error. Factual errors: \\iconsys $<$ OneR $<$ NoR.}\n\\label{fig:cot-factual-errors}\n\\end{figure}\n\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/model_scale_retrieval_results.pdf}\n\\caption{Retrieval recall for OneR (bottom) and \\iconsys (top) for LMs of increasing sizes: Flan-T5 \\{base (0.2B), large (0.7B), XL (3B), XXL (11B)\\} and GPT3 (175B) on HotpotQA, 2WikiMultihopQA, MuSiQue. 
\\iconsys outperforms OneR for all model sizes, including the 0.2B model, and the difference roughly grows with model size. Note: OneR doesn't use an LM in its retrieval and so has a fixed score.}\n\\label{fig:model-scale-retrieval-results}\n\\end{figure*}\n\n\\begin{figure*}[ht]\n\\centering\n\\includegraphics[width=0.95\\textwidth]{images/model_scale_qa_results.pdf}\n\\caption{Answer F1 for ODQA models made using OneR (bottom) and \\iconsys (top) for LMs of increasing sizes: Flan-T5 \\{base (0.2B), large (0.7B), XL (3B), XXL (11B)\\} and GPT3 (175B) on HotpotQA, 2WikiMultihopQA and MuSiQue. \\iconsys QA outperforms OneR QA for all model sizes except for the smallest, 0.2B. \\iconsys with a 3B model even outperforms OneR with a 58X larger GPT3 model, showing the value of improved retrieval.}\n\\label{fig:model-scale-qa-results}\n\\end{figure*}\n\n\n\n\\paragraph{\\iconsys is effective in the OOD setting.}\n\nSince CoT may not always be easy to write for new datasets, we evaluate NoR, OneR, and IRCoT on generalization to new datasets, i.e., the OOD setting. To do so, we use prompt demonstrations from one dataset to evaluate on another dataset.\\footnote{We use the evaluation dataset's corpus for retrieval.} For all pairs of the datasets\\footnote{We skip IIRC in this exploration as the task is structured a bit differently and requires special handling (see App.~\\ref{sec:apndx-iirc-special-handling}).} and for both \\texttt{Flan-T5-XXL} and \\texttt{GPT3}, we find the same trend as in the IID setting: \\iconsys retrieval outperforms OneR (Fig.~\\ref{fig:ood-retrieval-results}), and IRCoT QA outperforms both OneR QA and NoR QA (Fig.~\\ref{fig:ood-qa-results}).\n\n\n\n\\paragraph{\\iconsys generates CoT with fewer factual errors.}\n\nTo assess whether our approach also improves the factuality of generated CoTs, we manually annotated CoTs generated by NoR QA, OneR QA, and IRCoT QA using GPT3 for 40 randomly sampled questions from each of the four datasets. 
We considered CoT to have a factual error if at least one of the facts\\footnote{all sentences before the final ``answer is:'' sentence.} is not true.\\footnote{Note that factual error doesn't necessarily mean the predicted answer is incorrect and vice-versa. This is because the model can generate a wrong answer despite all correct facts, and vice-versa. We also account for the possibility of answer annotation errors in the original datasets.} As Fig.~\\ref{fig:cot-factual-errors} shows, NoR makes the most factual errors, OneR makes fewer, and \\iconsys the least. In particular, \\iconsys reduces the factual errors over OneR by 50\\% on HotpotQA and 40\\% on 2WikiMultihopQA.\n\n\nTable~\\ref{table:nor-oner-cot-examples} illustrates how the CoT predictions for different methods vary qualitatively. Since NoR relies completely on parametric knowledge, it often makes a factual error in the first sentence, which derails the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. IRCoT, on the other hand, is often able to prevent such errors in each step.\n\n\n\\paragraph{\\iconsys is also effective for smaller models.}\n\nTo see how effective \\iconsys is at different LM sizes, we show the scaling plots in Fig.~\\ref{fig:model-scale-retrieval-results}.\\footnote{We skip IIRC here as the smaller models are not good at identifying Wikipedia titles from a paragraph and a question which is necessary for IIRC (see App.~\\ref{sec:apndx-iirc-special-handling}).} We compare the recall for OneR and \\iconsys using \\texttt{Flan-T5} \\{base (0.2B), large (0.7B), XL (3B), XXL (11B)\\}, and GPT3 \\texttt{code-davinci-002} (175B). \\iconsys with even the smallest model (0.2B) is better than OneR, and the performance roughly improves with the model size. This shows the CoT generation capabilities of even small models can be leveraged for improving retrieval. 
Furthermore, we show the effect of model size on the QA score in Fig.~\\ref{fig:model-scale-qa-results}. For all sizes except the smallest (0.2B), we see \\iconsys QA is better than OneR QA. Moreover, \\iconsys with a 3B model even outperforms OneR and NoR with a 58X larger 175B GPT3 model in all datasets.\n\n\n\n\\paragraph{\\iconsys is SOTA for few-shot multistep ODQA.\\footnote{\\label{footnote:sota}App.~\\S\\ref{sec:sota-differences} reports updated SOTA numbers, including contemporaneous and newer works.}}\n\n\nWe compare \\iconsys QA with five recent approaches to using LLMs for ODQA: Internet-Augmented QA~\\cite{internet-augmented-qa}, RECITE~\\cite{recitationlm}, ReAct~\\cite{react}, SelfAsk~\\cite{selfask}, and DecomP~\\cite{old-decomp}. Although these are not head-to-head comparisons, as different methods use different APIs, knowledge sources, and even LLMs (see App.~\\ref{sec:sota-differences} for details), it is still informative to explore, in a leaderboard-style fashion, how \\iconsys performs relative to the best numbers published for these recent systems.\n\n\\vspace{0.1cm}\n\\begin{table}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{2.0pt}\n \\begin{tabular}{ccccc}\\toprule\n Model & HpQA\\textsuperscript{Br} & HpQA & 2WikiMQA & MQ\\textsuperscript{2H} \\\\\n \\midrule\n InterAug & $-$ | $-$ & 30.3 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ \\\\\n RECITE & $-$ | $-$ & 37.1 | 48.4 & $-$ | $-$ & $-$ | $-$ \\\\\n ReAct & $-$ | $-$ & 35.1 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ \\\\\n SelfAsk & $-$ | $-$ & $-$ | $-$ & 40.1 | $-$\\p{xx} & 15.2 | $-$\\p{xx} \\\\\n DecomP & \\p{x..}$-$ | 50.0 & $-$ | $-$ & \\p{x..}$-$ | 59.3 & $-$ | $-$ \\\\\n \\midrule\n \\sys QA & \\textbf{45.8 | 58.5} & \\bf{49.3 | 60.7} & \\bf{57.7 | 68.0} & \\bf{34.2 | 43.8} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Comparison with other LLM-based ODQA systems on EM and F1 scores. `$-$': score is unavailable. 
HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. \\iconsys QA with GPT3 (ours) outperforms other systems by a large margin. Note: Comparisons aren't head-to-head as discussed in the text. App.~\\S\\ref{sec:sota-differences} reports updated SOTA numbers, including contemporaneous and newer works.\n \\label{table:extrinsic-comparison}\n }\n\\end{table}\n\n\nAs shown in Table~\\ref{table:extrinsic-comparison}, \\iconsys QA significantly outperforms all of these recent systems by a large margin, setting a new state of the art in terms of what's achievable via retrieval-augmented LLMs (without supervised training).\n\n\n\n\\begin{table*}[t]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{4.0pt}\n \\begin{tabular}{llllll}\\toprule\n & \\textbf{Question} & \\textbf{Gold Facts} & \\textbf{NoR QA} & \\textbf{OneR QA} & \\textbf{IRCoT QA} \\\\\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering HotpotQA}}}\n & \\begin{minipage}[t]{2cm} The tap dancing soloist in My Feet Keep Dancing also starred in what 1935 MGM musical? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. \\\\ $\\Rightarrow$ Fayard Nicholas starred in the 1935 MGM musical \\textcolor{blue}{An All-Colored Vaudeville Show}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{red}{Eleanor Powell}. Eleanor Powell also starred in the 1935 MGM musical \\textcolor{red}{Broadway Melody} of 1936. So the answer is: Broadway Melody of 1936. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. Fayard Nicholas also starred in the 1935 MGM musical \\textcolor{red}{Top Hat}. So the answer is: Top Hat. 
\\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The tap dancing soloist in My Feet Keep Dancing is \\textcolor{blue}{Fayard Nicholas}. Fayard Nicholas starred in the 1935 MGM musical \\textcolor{blue}{An All-Colored Vaudeville Show}. So the answer is: An All-Colored Vaudeville Show. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{1}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{4.0cm}{\\centering 2WikiMQA}}}\n & \\begin{minipage}[t]{2cm} Who is the father-in-law of Queen Hyojeong? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. \\\\ $\\Rightarrow$ King Heonjong of Joseon is the son of \\textcolor{blue}{Crown Prince Hyomyeong}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong has a husband named \\textcolor{red}{King Gojong}. King Gojong has a father named \\textcolor{red}{Heungseon Daewongun}. Thus, Queen Hyojeong has a father-in-law named Heungseon Daewongun. So the answer is: Heungseon Daewongun. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. King Heonjong of Joseon is the son of \\textcolor{red}{King Sejo of Joseon}. Thus, King Sejo of Joseon is the father-in-law of Queen Hyojeong. So the answer is: King Sejo of Joseon. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Queen Hyojeong is the wife of \\textcolor{blue}{King Heonjong of Joseon}. King Heonjong of Joseon is the son of \\textcolor{blue}{Crown Prince Hyomyeong}. Thus, Crown Prince Hyomyeong is the father-in-law of Queen Hyojeong. So the answer is: Crown Prince Hyomyeong. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{1}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering MuSiQue}}}\n & \\begin{minipage}[t]{2cm} What is the name of the castle in the city where the performer of A Collection 1984–1989 was born? 
\\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. \\\\ $\\Rightarrow$ Jane Siberry was born in \\textcolor{blue}{Toronto}. \\\\ $\\Rightarrow$ The castle in Toronto is the \\textcolor{blue}{Casa Loma}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The performer of A Collection 1984–1989 is \\textcolor{red}{The The}. The The was born in \\textcolor{red}{London}. The name of the castle in London is the \\textcolor{red}{Tower of London}. So the answer is: the Tower of London. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. Jane Siberry was born in \\textcolor{blue}{Toronto}. The castle in Toronto is \\textcolor{red}{Peqin Castle}. So the answer is: Peqin Castle. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} A Collection 1984–1989 was performed by \\textcolor{blue}{Jane Siberry}. Jane Siberry was born in \\textcolor{blue}{Toronto}. The castle in Toronto is the \\textcolor{blue}{Casa Loma}. So the answer is: Casa Loma. \\end{minipage} \\\\\n\n \\bottomrule\n\n \\end{tabular}\n \\caption{Example CoTs generated by GPT3 with different methods. Since NoR relies on parametric knowledge, it often makes a factual error in the first sentence derailing the full CoT. OneR can retrieve relevant information closest to the question and is less likely to make such errors early on, but it still makes errors later in the CoT. As \\iconsys performs retrieval after each step, it is often able to prevent such errors in each step. 
More examples are in App.~\\ref{sec:apdx-nor-oner-cot-examples}.}\n \\label{table:nor-oner-cot-examples}\n\\end{table*}\n\n\n\n\n\n\n\n 8 Special Handling of Models for IIRC\n\\section{Special Handling of Models for IIRC}\n\\label{sec:apndx-iirc-special-handling}\n\nIIRC is slightly different from the other datasets, in that the question is grounded in the main passage and other supporting paragraphs come from the Wikipedia pages of entities mentioned in this passage. We modify the retrievers and readers to account for this difference: (i) We always keep the main passage as part of the input to the model regardless of the retrieval strategy used. (ii) For all the retrieval methods, we first prompt the model to generate a list of Wikipedia page titles using the main passage and the question. We map these generated titles to the nearest Wikipedia page titles in the corpus (found using BM25), and then the rest of the paragraph retrieval queries are scoped within only those Wikipedia pages.\n\nTo prompt the model to generate Wikipedia page titles using the main passage and the question for IIRC, we use the following template.\n\n\\begin{small}\n\\begin{verbatim}\nWikipedia Title:
\n
\n\nQ: The question is: ''. Generate titles \nof Wikipedia pages that have relevant\ninformation to answer this question.\nA: [\"\", \"\", ...]\n\\end{verbatim}\n\\end{small}\n\nFor ``training'', i.e., for demonstrations, N ($\\le 3$) is the number of supporting Wikipedia page titles for the question. At test time, since the number of supporting page titles is unknown, we use a fixed value of 3. We found this trick of prompting the model to generate more titles at the test time improves its recall over letting the model decide by itself how many titles to generate.\n\n\n\\begin{table*}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{10.0pt}\n \\begin{tabular}{lccccc}\\toprule\n Model & HpQA\\textsuperscript{Br} & HpQA & 2WikiMQA & MQ\\textsuperscript{2H} & MQ \\\\\n \\midrule\n InterAug~\\cite{internet-augmented-qa} & $-$ | $-$ & 30.3 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n RECITE~\\cite{recitationlm} & $-$ | $-$ & 37.1 | 48.4 & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n ReAct~\\cite{react} & $-$ | $-$ & 35.1 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n SelfAsk~\\cite{selfask} & $-$ | $-$ & $-$ | $-$ & 40.1 | $-$\\p{xx} & 15.2 | $-$\\p{xx} & $-$ | $-$ \\\\\n DecomP~\\cite{old-decomp} & \\p{x..}$-$ | 50.0 & $-$ | $-$ & \\p{x..}$-$ | 59.3 & $-$ | $-$ & $-$ | $-$ \\\\\n \\midrule\n DecomP~\\cite{decomp} * & $-$ | $-$ & \\p{x..}$-$ | 53.5 & \\p{x..}$-$ | \\textbf{70.8} & $-$ | $-$ & \\p{xx}$-$ | 30.9 \\\\\n DSP~\\cite{dsp} * & $-$ | $-$ & \\bf{51.4 | 62.9} & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n \\midrule\n \\sys QA (ours) & \\textbf{45.8 | 58.5} & 49.3 | 60.7 & 57.7 | 68.0 & \\bf{34.2 | 43.8} & \\textbf{26.5 | 36.5} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Extended comparison with published LLM-based ODQA systems (as of May 25, 2023) on EM and F1 scores (with new numbers marked with *). `$-$': score is unavailable. HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. 
MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. IRCoT remains SOTA for MuSiQue and is close to SOTA for HotpotQA and 2WikiMultihopQA. Note the comparisons here are not head-to-head as discussed in the text.}\n \\label{table:extended-extrinsic-comparison}\n\\end{table*}\n\n\n\\begin{table*}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{3.5pt}\n \\begin{tabular}{cccccccccccc}\\toprule\n & &\\p{0}& \\multicolumn{4}{c}{Flan-T5-XXL} & \\p{0}& \\multicolumn{4}{c}{GPT3} \\\\\n \\cmidrule{4-7} \\cmidrule{9-12}\n & Model &\\p{0}& HotpotQA & 2WikiMQA & MuSiQue & IIRC & \\p{0}& HotpotQA & 2WikiMQA & MuSiQue & IIRC\\\\\n \\midrule\n \\multirow{2}{*}{ZeroR QA}\n & Direct &\\p{0}& \\bf{25.3}\\std{0.3} & \\bf{32.7}\\std{0.3} & \\bf{13.7}\\std{0.3} & \\bf{28.9}\\std{0.3} & \\p{0}& \\nf{41.0}\\std{1.1} & \\nf{38.5}\\std{1.1} & \\nf{19.0}\\std{1.2} & \\nf{40.9}\\std{0.7} \\\\\n & CoT &\\p{0}& \\nf{22.9}\\std{0.1} & \\nf{31.7}\\std{1.5} & \\nf{10.3}\\std{0.5} & \\nf{24.4}\\std{0.1} & \\p{0}& \\bf{47.5}\\std{0.4} & \\bf{41.2}\\std{1.0} & \\bf{25.2}\\std{1.2} & \\bf{52.1}\\std{0.1} \\\\\n \\midrule\n \\multirow{2}{*}{OneR QA}\n & Direct &\\p{0}& \\bf{49.7}\\std{0.5} & \\bf{51.2}\\std{0.3} & \\bf{25.8}\\std{0.6} & \\bf{40.0}\\std{1.3} & \\p{0}& \\nf{50.7}\\std{0.1} & \\nf{46.4}\\std{2.9} & \\nf{20.4}\\std{0.3} & \\nf{40.1}\\std{0.9} \\\\\n & CoT &\\p{0}& \\nf{43.1}\\std{0.7} & \\nf{47.8}\\std{0.9} & \\nf{17.6}\\std{0.2} & \\nf{34.5}\\std{1.5} & \\p{0}& \\bf{53.6}\\std{0.7} & \\bf{54.8}\\std{2.1} & \\bf{29.4}\\std{0.8} & \\bf{49.8}\\std{2.3} \\\\\n \\midrule\n \\multirow{2}{*}{\\fixedicon\\sys QA}\n & Direct &\\p{0}& \\bf{59.1}\\std{0.9} & \\bf{66.5}\\std{1.4} & \\bf{30.8}\\std{0.2} & \\bf{42.5}\\std{2.1} & \\p{0}& \\nf{60.6}\\std{1.0} & \\nf{63.5}\\std{2.7} & \\nf{36.0}\\std{0.5} & \\nf{47.9}\\std{2.3} \\\\\n & CoT &\\p{0}& \\nf{52.0}\\std{0.6} & \\nf{55.1}\\std{1.0} & \\nf{24.9}\\std{1.0} & \\nf{36.5}\\std{1.3} & \\p{0}& \\bf{60.7}\\std{1.1} & \\bf{68.0}\\std{1.5} & 
\\bf{36.5}\\std{1.2} & \\bf{49.9}\\std{1.1} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Answer F1 for different ODQA models made from ZeroR, OneR, and \\iconsys retrievals, and Direct and CoT prompting readers. For \\texttt{Flan-T5-XXL}, Direct prompting is a better choice for the reader, and for \\texttt{GPT3}, CoT prompting is a better choice for the reader. Hence, we make different reader choices for Flan-T5 and GPT3 for the experiments in the main paper. Note that \\iconsys QA $>$ OneR QA $>$ ZeroR QA holds up regardless of this choice.}\n \\label{table:direct-cot-reader-results}\n\\end{table*}\n\n\n\n\\begin{table*}[htb]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{4.0pt}\n \\begin{tabular}{llllll}\\toprule\n & \\textbf{Question} & \\textbf{Gold Facts} & \\textbf{NoR QA} & \\textbf{OneR QA} & \\textbf{IRCoT QA} \\\\\n\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{2.5cm}{\\centering HotpotQA}}}\n & \\begin{minipage}[t]{2cm} What was the island, on which Marinelli Glacier is located, formerly known as?\\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Marinelli Glacier is located on the island of \\textcolor{blue}{Tierra del Fuego}. \\\\ $\\Rightarrow$ The island of Tierra del Fuego was formerly known as \\textcolor{blue}{Isla de Xativa}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Marinelli Glacier is located on the island of \\textcolor{red}{Graham Land}. Graham Land was formerly known as \\textcolor{red}{Graham's Land}. So the answer is: Graham's Land. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Marinelli Glacier is located on \\textcolor{blue}{Isla Grande de Tierra del Fuego}. Isla Grande de Tierra del Fuego was formerly known as \\textcolor{red}{Tierra del Fuego}. So the answer is: Tierra del Fuego. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Marinelli Glacier is located on the island of \\textcolor{blue}{Tierra del Fuego}. 
The island of Tierra del Fuego was formerly known as \\textcolor{blue}{Isla de Xativa}. So the answer is: Isla de Xativa. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering 2WikiMQA}}}\n & \\begin{minipage}[t]{2cm} Who is the child of the director of film Mukhyamantri (1996 Film)? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ The director of Mukhyamantri (1996 film) is \\textcolor{blue}{Anjan Choudhury}. \\\\ $\\Rightarrow$ Anjan Choudhury has a child named \\textcolor{blue}{Sandip Choudhury}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The film Mukhyamantri was directed by \\textcolor{red}{S. V. Rajendra Singh Babu}. S. V. Rajendra Singh Babu has a child named \\textcolor{red}{S. V. Rajendra Singh Babu}. So the answer is: S. V. Rajendra Singh Babu. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Mukhyamantri (1996 film) was directed by \\textcolor{blue}{Anjan Choudhury}. Anjan Choudhury has a child named \\textcolor{red}{Aniruddha Choudhury}. So the answer is: Aniruddha Choudhury. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The director of Mukhyamantri (1996 film) is \\textcolor{blue}{Anjan Choudhury}. Anjan Choudhury has a child named \\textcolor{blue}{Sandip Choudhury}. So the answer is: Sandip Choudhury. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{2.7cm}{\\centering MuSiQue}}}\n & \\begin{minipage}[t]{2cm} When did the Governorship end of the city where the author of Sacerdotii Nostri Primordia died? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Sacerdotii Nostri Primordia was written by \\textcolor{blue}{Pope John XXIII}. \\\\ $\\Rightarrow$ Pope John XXIII died in \\textcolor{blue}{Vatican City}. \\\\ $\\Rightarrow$ The Governorship of Vatican City ended in \\textcolor{blue}{1952}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The author of Sacerdotii Nostri Primordia is \\textcolor{red}{Pope Pius IX}. 
Pope Pius IX died in the city of \\textcolor{red}{Rome}. The Governorship of Rome ended in \\textcolor{red}{1870}. So the answer is: 1870. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Sacerdotii Nostri Primordia was written by \\textcolor{blue}{Pope John XXIII}. Pope John XXIII died in the city of \\textcolor{red}{Rome}. The Governorship of Rome ended in \\textcolor{red}{1870}. So the answer is: 1870. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Sacerdotii Nostri Primordia was written by \\textcolor{blue}{Pope John XXIII}. Pope John XXIII died in \\textcolor{blue}{Vatican City}. The Governorship of Vatican City ended in \\textcolor{blue}{1952}. So the answer is: 1952. \\end{minipage} \\\\\n\n \\bottomrule\n\n \\end{tabular}\n \\caption{Additional CoTs generated by GPT3 with different methods. NoR is most prone to factual errors. OneR often fixes some of the factual information that is closest to the question, but doesn't always fix it all the way. Since IRCoT retrieves after each step, it can also fix the errors at each step. More examples are in Table~\\ref{table:nor-oner-cot-examples}.}\n \\label{table:apdx-nor-oner-cot-examples}\n\\end{table*}\n\n\n10 Additional CoT Generation Examples\n\\section{Additional CoT Generation Examples}\n\\label{sec:apdx-nor-oner-cot-examples}\n\nTable~\\ref{table:apdx-nor-oner-cot-examples} provides illustrations, in addition to the ones provided in Table~\\ref{table:nor-oner-cot-examples}, for how the CoT generations for NoR QA, OneR QA, and IRCoT QA methods vary. This gives an insight into how IRCoT improves QA performance. Since NoR relies completely on parametric knowledge, it often makes a factual error in the first sentence, which derails the full reasoning chain. Some of this factual information can be fixed by OneR, especially information closest to the question (i.e., information that can be retrieved using the question). This is, however, insufficient for fixing all the mistakes. 
Since IRCoT involves retrieval after each step, it can fix errors at each step.\n\n\n\n\n\n", "Descriptive_question1": "In table_2, what is the name of the tap dancing soloist in My Feet Keep Dancing according to IRCoT QA?", "Descriptive_question2": "In table_2, what is the castle in Toronto mentioned in the MuSiQue example for IRCoT QA?", "Reasoning_question1": "In table_2, why does NoR QA frequently provide incorrect answers compared to IRCoT QA across the examples shown?", "Reasoning_question2": "In table_2, how does the accuracy of factual information in OneR QA compare to IRCoT QA in the reasoning steps for the 2WikiMQA question about Queen Hyojeong?", "Descriptive_answer1": "Fayard Nicholas", "Descriptive_answer2": "Casa Loma", "Reasoning_answer1": "NoR QA frequently provides incorrect answers compared to IRCoT QA because it relies solely on parametric knowledge, which often leads to factual errors in the initial steps of reasoning. As noted in the table caption, NoR tends to make a factual error in the first sentence, derailing the entire chain of thought (CoT). For example, in the HotpotQA question about the tap dancing soloist, NoR incorrectly identifies the soloist as Eleanor Powell instead of Fayard Nicholas, leading to a wrong conclusion. In contrast, IRCoT QA performs retrieval after each step, allowing it to correct potential errors at every stage and align with the gold facts, as seen across all examples where IRCoT consistently provides the correct answers matching the gold facts.", "Reasoning_answer2": "For the 2WikiMQA question about Queen Hyojeong's father-in-law, OneR QA shows partial accuracy in its reasoning steps compared to IRCoT QA. OneR correctly identifies Queen Hyojeong's husband as King Heonjong of Joseon, matching the gold fact in the first step. However, it fails in the subsequent step by incorrectly stating King Heonjong's father as King Sejo of Joseon instead of Crown Prince Hyomyeong, leading to an incorrect final answer. 
On the other hand, IRCoT QA maintains accuracy throughout all steps, correctly identifying both King Heonjong as the husband and Crown Prince Hyomyeong as the father, thus providing the correct answer. This demonstrates that while OneR can retrieve relevant initial information, it still makes errors later in the reasoning process, whereas IRCoT's step-by-step retrieval ensures factual accuracy at each stage." }, { "paper_id": "2212.10509.json", "table_id": "table_3", "table_content": "\\begin{table*}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{10.0pt}\n \\begin{tabular}{lccccc}\\toprule\n Model & HpQA\\textsuperscript{Br} & HpQA & 2WikiMQA & MQ\\textsuperscript{2H} & MQ \\\\\n \\midrule\n InterAug~\\cite{internet-augmented-qa} & $-$ | $-$ & 30.3 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n RECITE~\\cite{recitationlm} & $-$ | $-$ & 37.1 | 48.4 & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n ReAct~\\cite{react} & $-$ | $-$ & 35.1 | $-$\\p{xx} & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n SelfAsk~\\cite{selfask} & $-$ | $-$ & $-$ | $-$ & 40.1 | $-$\\p{xx} & 15.2 | $-$\\p{xx} & $-$ | $-$ \\\\\n DecomP~\\cite{old-decomp} & \\p{x..}$-$ | 50.0 & $-$ | $-$ & \\p{x..}$-$ | 59.3 & $-$ | $-$ & $-$ | $-$ \\\\\n \\midrule\n DecomP~\\cite{decomp} * & $-$ | $-$ & \\p{x..}$-$ | 53.5 & \\p{x..}$-$ | \\textbf{70.8} & $-$ | $-$ & \\p{xx}$-$ | 30.9 \\\\\n DSP~\\cite{dsp} * & $-$ | $-$ & \\bf{51.4 | 62.9} & $-$ | $-$ & $-$ | $-$ & $-$ | $-$ \\\\\n \\midrule\n \\sys QA (ours) & \\textbf{45.8 | 58.5} & 49.3 | 60.7 & 57.7 | 68.0 & \\bf{34.2 | 43.8} & \\textbf{26.5 | 36.5} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Extended comparison with published LLM-based ODQA systems (as of May 25, 2023) on EM and F1 scores (with new numbers marked with *). `$-$': score is unavailable. HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. 
IRCoT remains SOTA for MuSiQue and is close to SOTA for HotpotQA and 2WikiMultihopQA. Note the comparisons here are not head-to-head as discussed in the text.}\n \\label{table:extended-extrinsic-comparison}\n\\end{table*}", "caption": "Extended comparison with published LLM-based ODQA systems (as of May 25, 2023) on EM and F1 scores (with new numbers marked with *). `$-$': score is unavailable. HpQA\\textsuperscript{Br}: Bridge questions subset of HotpotQA. MQ\\textsuperscript{2H}: MuSiQue 2-hop questions. IRCoT remains SOTA for MuSiQue and is close to SOTA for HotpotQA and 2WikiMultihopQA. Note the comparisons here are not head-to-head as discussed in the text.", "label": "table:extended-extrinsic-comparison", "section_info": "9 Comparison with Previous Systems for ODQA with LLMs\n\\section{Comparison with Previous Systems for ODQA with LLMs}\n\\label{sec:sota-differences}\n\nWe showed a leaderboard-style comparison with previous approaches to using large language models for open-domain QA in \\S~\\ref{sec:exp-results}. We noted though that the comparison is not head-to-head given various differences. We briefly describe each method and the differences in API, LLM, retrieval corpus, and other choices here.\n\nInternet-Augmented QA~\\cite{internet-augmented-qa} does (one-step) Google Search retrieval, performs additional LLM-based filtering on it, and then prompts an LLM to answer the question using the resulting context. It uses the Gopher 280B language model. RECITE~\\cite{recitationlm} bypasses the retrieval and instead prompts an LLM to first generate (recite) one or several relevant passages from its own memory, and generate the answer conditioned on this generation. They experiment with many LLMs, the highest performing of which is \\texttt{code-davinci-002} which we report here. ReAct~\\cite{react} prompts LLMs to produce reasoning and action traces where actions are calls to a Wikipedia API to return the summary for a given Wikipedia page title. 
It uses the PaLM 540B model. SelfAsk~\\cite{selfask} prompts LLMs to decompose a question into subquestions and answers these subquestions by issuing separate calls to the Google Search API. It uses the GPT3 (\\texttt{text-davinci-002}) model. Finally, DecomP~\\cite{decomp} is a general framework that decomposes a task and delegates sub-tasks to appropriate sub-models. Similar to our system, it uses BM25 Search and the GPT3 (\\texttt{code-davinci-002}) model. And lastly, DSP~\\cite{dsp} provides a way to programmatically define interactions between LLM and retrieval for ODQA (e.g., via question decomposition), bootstrap demonstrations for such a program, and use them to make the answer prediction. It uses GPT3.5 LLM with ColBERT-based retrieval. Since most of these methods use different knowledge sources or APIs and are built using different LLMs and retrieval models, it's difficult to make a fair scientific comparison across these systems. Additionally, the evaluations in the respective papers are on different random subsets (from the same distribution) of test instances. \n\nDespite these differences, it is still informative to explore, in a leaderboard-style fashion, how \\iconsys performs relative to the best numbers published for these recent systems. Table~\\ref{table:extended-extrinsic-comparison} shows results from different systems, including contemporaneous and newer numbers. The two new systems in this table (relative to Table~\\ref{table:extrinsic-comparison}) are DecomP (newer version) and DSP. While \\iconsys remains SOTA on MuSiQue, DSP outperforms it on HotpotQA by 2.0 points and the newer version of DecomP outperforms \\iconsys on 2WikiMultihopQA by 2.8 points. We speculate DecomP performs well on 2WikiMultihopQA because it has only a few easy-to-predict decomposition patterns, which DecomP's question decomposition can leverage. The lack of such patterns in HotpotQA and MuSiQue causes it to underperform compared to \\iconsys. 
Lastly, it will be useful to assess whether DSP, which is hardcoded for 2-hop questions like that of HotpotQA, will work well for a dataset with a varied number of hops like that of MuSiQue. We leave this further investigation to future work.\n\n\n\n", "Descriptive_question1": "What is the EM score for IRCoT on MuSiQue as shown in table_3?", "Descriptive_question2": "Which system has the highest F1 score on HotpotQA according to table_3?", "Reasoning_question1": "Why might DecomP perform better on 2WikiMultihopQA compared to IRCoT as discussed in table_3?", "Reasoning_question2": "How does the performance of DSP on HotpotQA compare to IRCoT, and what potential limitation might DSP face on MuSiQue according to the context of table_3?", "Descriptive_answer1": "26.5", "Descriptive_answer2": "DSP", "Reasoning_answer1": "DecomP might perform better on 2WikiMultihopQA compared to IRCoT because, as noted in the context, 2WikiMultihopQA has a few easy-to-predict decomposition patterns. DecomP's question decomposition framework can leverage these patterns effectively, allowing it to outperform IRCoT by 2.8 points on this dataset. In contrast, datasets like HotpotQA and MuSiQue lack such predictable patterns, which may explain why DecomP underperforms relative to IRCoT on those datasets.", "Reasoning_answer2": "DSP outperforms IRCoT on HotpotQA by 2.0 points, as DSP achieves an F1 score of 62.9 compared to IRCoT's 60.7. This indicates DSP's approach, which involves programmatically defining interactions between LLM and retrieval, may be particularly effective for HotpotQA's structure. However, a potential limitation for DSP on MuSiQue could be its design focus on 2-hop questions, as it is hardcoded for such patterns. MuSiQue, with a varied number of hops, might pose challenges for DSP's adaptability, and the context suggests that further investigation is needed to assess whether DSP can perform well on datasets with diverse hop structures." 
}, { "paper_id": "2212.10509.json", "table_id": "table_4", "table_content": "\\begin{table*}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{3.5pt}\n \\begin{tabular}{cccccccccccc}\\toprule\n & &\\p{0}& \\multicolumn{4}{c}{Flan-T5-XXL} & \\p{0}& \\multicolumn{4}{c}{GPT3} \\\\\n \\cmidrule{4-7} \\cmidrule{9-12}\n & Model &\\p{0}& HotpotQA & 2WikiMQA & MuSiQue & IIRC & \\p{0}& HotpotQA & 2WikiMQA & MuSiQue & IIRC\\\\\n \\midrule\n \\multirow{2}{*}{ZeroR QA}\n & Direct &\\p{0}& \\bf{25.3}\\std{0.3} & \\bf{32.7}\\std{0.3} & \\bf{13.7}\\std{0.3} & \\bf{28.9}\\std{0.3} & \\p{0}& \\nf{41.0}\\std{1.1} & \\nf{38.5}\\std{1.1} & \\nf{19.0}\\std{1.2} & \\nf{40.9}\\std{0.7} \\\\\n & CoT &\\p{0}& \\nf{22.9}\\std{0.1} & \\nf{31.7}\\std{1.5} & \\nf{10.3}\\std{0.5} & \\nf{24.4}\\std{0.1} & \\p{0}& \\bf{47.5}\\std{0.4} & \\bf{41.2}\\std{1.0} & \\bf{25.2}\\std{1.2} & \\bf{52.1}\\std{0.1} \\\\\n \\midrule\n \\multirow{2}{*}{OneR QA}\n & Direct &\\p{0}& \\bf{49.7}\\std{0.5} & \\bf{51.2}\\std{0.3} & \\bf{25.8}\\std{0.6} & \\bf{40.0}\\std{1.3} & \\p{0}& \\nf{50.7}\\std{0.1} & \\nf{46.4}\\std{2.9} & \\nf{20.4}\\std{0.3} & \\nf{40.1}\\std{0.9} \\\\\n & CoT &\\p{0}& \\nf{43.1}\\std{0.7} & \\nf{47.8}\\std{0.9} & \\nf{17.6}\\std{0.2} & \\nf{34.5}\\std{1.5} & \\p{0}& \\bf{53.6}\\std{0.7} & \\bf{54.8}\\std{2.1} & \\bf{29.4}\\std{0.8} & \\bf{49.8}\\std{2.3} \\\\\n \\midrule\n \\multirow{2}{*}{\\fixedicon\\sys QA}\n & Direct &\\p{0}& \\bf{59.1}\\std{0.9} & \\bf{66.5}\\std{1.4} & \\bf{30.8}\\std{0.2} & \\bf{42.5}\\std{2.1} & \\p{0}& \\nf{60.6}\\std{1.0} & \\nf{63.5}\\std{2.7} & \\nf{36.0}\\std{0.5} & \\nf{47.9}\\std{2.3} \\\\\n & CoT &\\p{0}& \\nf{52.0}\\std{0.6} & \\nf{55.1}\\std{1.0} & \\nf{24.9}\\std{1.0} & \\nf{36.5}\\std{1.3} & \\p{0}& \\bf{60.7}\\std{1.1} & \\bf{68.0}\\std{1.5} & \\bf{36.5}\\std{1.2} & \\bf{49.9}\\std{1.1} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Answer F1 for different ODQA models made from NoR, One and \\iconsys retrievals, and Direct and CoT 
prompting readers. For \\texttt{Flan-T5-XXL}, Direct prompting is a better choice for the reader, and for \\texttt{GPT3}, CoT prompting is a better choice for the reader. Hence, we make different reader choices for Flan-T5 and GPT3 for the experiments in the main paper. Note that \\iconsys QA $>$ OneR QA $>$ ZeroR QA holds up regardless of this choice.}\n \\label{table:direct-cot-reader-results}\n\\end{table*}", "caption": "Answer F1 for different ODQA models made from NoR, OneR and \\iconsys retrievals, and Direct and CoT prompting readers. For \\texttt{Flan-T5-XXL}, Direct prompting is a better choice for the reader, and for \\texttt{GPT3}, CoT prompting is a better choice for the reader. Hence, we make different reader choices for Flan-T5 and GPT3 for the experiments in the main paper. Note that \\iconsys QA $>$ OneR QA $>$ ZeroR QA holds up regardless of this choice.", "label": "table:direct-cot-reader-results", "section_info": "11 Direct vs CoT Prompting Readers\n\\section{Direct vs CoT Prompting Readers}\n\\label{sec:apndx-readers-results}\n\nTable~\\ref{table:direct-cot-reader-results} compares reader choice (Direct vs CoT Prompting) for Flan-T5-XXL and GPT3. We find that Flan-T5-XXL works better with Direct Prompting as a reader and GPT3 works better with CoT Prompting as a reader. Therefore, for the experiments in the main paper, we go with this choice. 
Note though that the trends discussed in \\S~\\ref{sec:exp-results} (\\iconsys QA $>$ OneR QA $>$ ZeroR QA) hold regardless of the choice of the reader.\n\n\n\n", "Descriptive_question1": "What is the prompting style used for Flan-T5-XXL in table_4?", "Descriptive_question2": "Which model uses CoT prompting in table_4?", "Reasoning_question1": "Why was Direct prompting chosen for Flan-T5-XXL over CoT prompting in the experiments discussed in table_4?", "Reasoning_question2": "How does the performance trend of \\iconsys QA compare to OneR QA and ZeroR QA according to table_4, and does the choice of reader affect this trend?", "Descriptive_answer1": "Direct", "Descriptive_answer2": "GPT3", "Reasoning_answer1": "Direct prompting was chosen for Flan-T5-XXL over CoT prompting because the table data shows consistently higher F1 scores for Direct prompting across all datasets and retrieval types. For instance, in ZeroR QA, Direct prompting achieves scores like 25.3 on HotpotQA compared to 22.9 with CoT. Similarly, in OneR QA, Direct scores 49.7 on HotpotQA versus 43.1 with CoT, and in \\fixedicon\\sys QA, it scores 59.1 versus 52.0. This pattern of higher performance with Direct prompting is evident across all tested scenarios, indicating it is the better choice for Flan-T5-XXL as a reader, as also explicitly stated in the table caption and context.", "Reasoning_answer2": "According to the table and its caption, the performance trend shows that \\fixedicon\\sys QA consistently outperforms OneR QA, which in turn outperforms ZeroR QA. For example, with Flan-T5-XXL using Direct prompting, the F1 scores for HotpotQA are 59.1 for \\fixedicon\\sys QA, 49.7 for OneR QA, and 25.3 for ZeroR QA. This trend holds similarly for GPT3 with CoT prompting, with scores of 60.7, 53.6, and 47.5 respectively for HotpotQA. 
The caption and context explicitly state that this trend (\\fixedicon\\sys QA > OneR QA > ZeroR QA) remains consistent regardless of the choice of reader, whether Direct for Flan-T5-XXL or CoT for GPT3. Therefore, the choice of reader does not affect this performance hierarchy." }, { "paper_id": "2212.10509.json", "table_id": "table_5", "table_content": "\\begin{table*}[htb]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{4.0pt}\n \\begin{tabular}{llllll}\\toprule\n & \\textbf{Question} & \\textbf{Gold Facts} & \\textbf{NoR QA} & \\textbf{OneR QA} & \\textbf{IRCoT QA} \\\\\n\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{2.5cm}{\\centering HotpotQA}}}\n & \\begin{minipage}[t]{2cm} What was the island, on which Marinelli Glacier is located, formerly known as?\\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Marinelli Glacier is located on the island of \\textcolor{blue}{Tierra del Fuego}. \\\\ $\\Rightarrow$ The island of Tierra del Fuego was formerly known as \\textcolor{blue}{Isla de Xativa}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Marinelli Glacier is located on the island of \\textcolor{red}{Graham Land}. Graham Land was formerly known as \\textcolor{red}{Graham's Land}. So the answer is: Graham's Land. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Marinelli Glacier is located on \\textcolor{blue}{Isla Grande de Tierra del Fuego}. Isla Grande de Tierra del Fuego was formerly known as \\textcolor{red}{Tierra del Fuego}. So the answer is: Tierra del Fuego. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Marinelli Glacier is located on the island of \\textcolor{blue}{Tierra del Fuego}. The island of Tierra del Fuego was formerly known as \\textcolor{blue}{Isla de Xativa}. So the answer is: Isla de Xativa. 
\\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{3.0cm}{\\centering 2WikiMQA}}}\n & \\begin{minipage}[t]{2cm} Who is the child of the director of film Mukhyamantri (1996 Film)? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ The director of Mukhyamantri (1996 film) is \\textcolor{blue}{Anjan Choudhury}. \\\\ $\\Rightarrow$ Anjan Choudhury has a child named \\textcolor{blue}{Sandip Choudhury}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The film Mukhyamantri was directed by \\textcolor{red}{S. V. Rajendra Singh Babu}. S. V. Rajendra Singh Babu has a child named \\textcolor{red}{S. V. Rajendra Singh Babu}. So the answer is: S. V. Rajendra Singh Babu. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Mukhyamantri (1996 film) was directed by \\textcolor{blue}{Anjan Choudhury}. Anjan Choudhury has a child named \\textcolor{red}{Aniruddha Choudhury}. So the answer is: Aniruddha Choudhury. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The director of Mukhyamantri (1996 film) is \\textcolor{blue}{Anjan Choudhury}. Anjan Choudhury has a child named \\textcolor{blue}{Sandip Choudhury}. So the answer is: Sandip Choudhury. \\end{minipage} \\\\\n\n \\midrule\n\n\n \\multirow{2}{*}{\\rotatebox[origin=c]{90}{\\parbox[c]{2.7cm}{\\centering MuSiQue}}}\n & \\begin{minipage}[t]{2cm} When did the Governorship end of the city where the author of Sacerdotii Nostri Primordia died? \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} $\\Rightarrow$ Sacerdotii Nostri Primordia was written by \\textcolor{blue}{Pope John XXIII}. \\\\ $\\Rightarrow$ Pope John XXIII died in \\textcolor{blue}{Vatican City}. \\\\ $\\Rightarrow$ The Governorship of Vatican City ended in \\textcolor{blue}{1952}. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} The author of Sacerdotii Nostri Primordia is \\textcolor{red}{Pope Pius IX}. Pope Pius IX died in the city of \\textcolor{red}{Rome}. The Governorship of Rome ended in \\textcolor{red}{1870}. 
So the answer is: 1870. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Sacerdotii Nostri Primordia was written by \\textcolor{blue}{Pope John XXIII}. Pope John XXIII died in the city of \\textcolor{red}{Rome}. The Governorship of Rome ended in \\textcolor{red}{1870}. So the answer is: 1870. \\end{minipage}\n & \\begin{minipage}[t]{3.0cm} Sacerdotii Nostri Primordia was written by \\textcolor{blue}{Pope John XXIII}. Pope John XXIII died in \\textcolor{blue}{Vatican City}. The Governorship of Vatican City ended in \\textcolor{blue}{1952}. So the answer is: 1952. \\end{minipage} \\\\\n\n \\bottomrule\n\n \\end{tabular}\n \\caption{Additional CoTs generated by GPT3 with different methods. ZeroR is most prone to factual errors. OneR often fixes some of the factual information which is closest to the question but doesn't always fix it all the way. Since IRCoT retrieves after each step, it can also fix the errors at each step. More examples are in Table~\\ref{table:nor-oner-cot-examples}.}\n \\label{table:apdx-nor-oner-cot-examples}\n\\end{table*}", "caption": "Additional CoTs generated by GPT3 with different methods. ZeroR is most prone to factual errors. OneR often fixes some of the factual information which is closest to the question but doesn't always fix it all the way. Since IRCoT retrieves after each step, it can also fix the errors at each step. More examples are in Table~\\ref{table:nor-oner-cot-examples}.", "label": "table:apdx-nor-oner-cot-examples", "section_info": "10 Additional CoT Generation Examples\n\\section{Additional CoT Generation Examples}\n\\label{sec:apdx-nor-oner-cot-examples}\n\nTable~\\ref{table:apdx-nor-oner-cot-examples} provides illustrations, in addition to the ones provided in Table~\\ref{table:nor-oner-cot-examples}, for how the CoT generations for NoR QA, OneR QA, and IRCoT QA methods vary. This gives an insight into how IRCoT improves QA performance. 
Since NoR relies completely on parametric knowledge, it often makes a factual error in the first sentence, which derails the full reasoning chain. Some of this factual information can be fixed by OneR, especially information closest to the question (i.e., can be retrieved using the question). This is insufficient for fixing all the mistakes. Since IRCoT involves retrieval after each step, it can fix errors at each step.\n\n\n\n\n\n", "Descriptive_question1": "Which method is most prone to factual errors according to table_5?", "Descriptive_question2": "What does IRCoT do after each step as mentioned in table_5?", "Reasoning_question1": "How does the performance of OneR compare to NoR in terms of fixing factual errors in table_5, and why might this be the case?", "Reasoning_question2": "Based on the information in table_5, what can be inferred about the effectiveness of IRCoT compared to OneR and NoR in improving QA performance?", "Descriptive_answer1": "ZeroR", "Descriptive_answer2": "Retrieves", "Reasoning_answer1": "OneR performs better than NoR in fixing factual errors, as it can correct some factual information, particularly information closest to the question through retrieval. NoR, on the other hand, relies solely on parametric knowledge, often leading to factual errors in the initial step that derail the entire reasoning chain. This difference arises because OneR has access to some external information via retrieval, which provides a chance to correct errors that NoR cannot address due to its complete dependence on internal knowledge.", "Reasoning_answer2": "IRCoT appears to be the most effective method compared to OneR and NoR in improving QA performance. This can be inferred because IRCoT retrieves information after each step, allowing it to correct errors at every stage of the reasoning process. 
In contrast, OneR only partially fixes errors, often limited to information directly related to the question, and NoR frequently fails due to initial factual inaccuracies. The step-by-step retrieval mechanism of IRCoT ensures a more accurate reasoning chain, as evidenced by the correct answers it provides across different datasets in the table, such as identifying 'Isla de Xativa' and 'Sandip Choudhury' correctly." }, { "paper_id": "2212.10509.json", "table_id": "table_6", "table_content": "\\begin{table}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{1.0pt}\n \\begin{tabular}{p{1.5em}ccccc}\\toprule\n & Model & HotpotQA & 2WikiMQA & MuSiQue & IIRC \\\\\n \\midrule{\\multirow{2}{*}{\\rotatebox[origin=c]{90}{Flan}}}\n & \\fixedicon\\sys QA & \\bf{59.1}\\std{0.9} & \\bf{66.5}\\std{1.4} & \\bf{30.8}\\std{0.2} & \\bf{42.5}\\std{2.1} \\\\\n & w/o reader & \\nf{52.6}\\std{0.3} & \\nf{60.9}\\std{0.6} & \\nf{24.9}\\std{0.2} & \\nf{40.3}\\std{0.2} \\\\\n\n \\midrule{\\multirow{2}{*}{\\rotatebox[origin=c]{90}{GPT3}}}\n & \\fixedicon\\sys QA & \\nf{60.7}\\std{1.1} & \\nf{68.0}\\std{1.5} & \\bf{36.5}\\std{1.2} & \\bf{49.9}\\std{1.1} \\\\\n & w/o reader & \\bf{61.0}\\std{0.7} & \\bf{70.4}\\std{1.5} & \\nf{31.5}\\std{0.6} & \\nf{48.4}\\std{1.0} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Answer F1 of \\iconsys QA with and without a separate reader for \\texttt{Flan-T5-XXL} (top two rows) and \\texttt{GPT3} (bottom two rows). When the reader is not used, the answer is extracted from the CoT generated by \\sys while doing the retrieval. Ablating the reader usually hurts the performance.}\n \\label{table:qa-reader-ablation}\n\\end{table}", "caption": "Answer F1 of \\iconsys QA with and without a separate reader for \\texttt{Flan-T5-XXL} (top two rows) and \\texttt{GPT3} (bottom two rows). When the reader is not used, the answer is extracted from the CoT generated by \\sys while doing the retrieval. 
Ablating the reader usually hurts the performance.", "label": "table:qa-reader-ablation", "section_info": "12 Separate Reader in \\iconsys QA\n\\section{Separate Reader in \\iconsys QA}\n\\label{sec:qa-reader-ablation}\n\n\\begin{table}[ht]\n \\centering\n \\footnotesize\n \\setlength{\\tabcolsep}{1.0pt}\n \\begin{tabular}{p{1.5em}ccccc}\\toprule\n & Model & HotpotQA & 2WikiMQA & MuSiQue & IIRC \\\\\n \\midrule{\\multirow{2}{*}{\\rotatebox[origin=c]{90}{Flan}}}\n & \\fixedicon\\sys QA & \\bf{59.1}\\std{0.9} & \\bf{66.5}\\std{1.4} & \\bf{30.8}\\std{0.2} & \\bf{42.5}\\std{2.1} \\\\\n & w/o reader & \\nf{52.6}\\std{0.3} & \\nf{60.9}\\std{0.6} & \\nf{24.9}\\std{0.2} & \\nf{40.3}\\std{0.2} \\\\\n\n \\midrule{\\multirow{2}{*}{\\rotatebox[origin=c]{90}{GPT3}}}\n & \\fixedicon\\sys QA & \\nf{60.7}\\std{1.1} & \\nf{68.0}\\std{1.5} & \\bf{36.5}\\std{1.2} & \\bf{49.9}\\std{1.1} \\\\\n & w/o reader & \\bf{61.0}\\std{0.7} & \\bf{70.4}\\std{1.5} & \\nf{31.5}\\std{0.6} & \\nf{48.4}\\std{1.0} \\\\\n \\bottomrule\n \\end{tabular}\n \\caption{Answer F1 of \\iconsys QA with and without a separate reader for \\texttt{Flan-T5-XXL} (top two rows) and \\texttt{GPT3} (bottom two rows). When the reader is not used, the answer is extracted from the CoT generated by \\sys while doing the retrieval. Ablating the reader usually hurts the performance.}\n \\label{table:qa-reader-ablation}\n\\end{table}\n\n\\iconsys, by construction, produces a CoT as a part of its retrieval process. So, instead of having a separate post-hoc reader, one can also just extract the answer from the CoT generated during retrieval. Table~\\ref{table:qa-reader-ablation} shows the effect of such an ablation.\nFor \\texttt{Flan-T5-XXL}, having a separate reader is significantly better. For GPT3, this is not always true, but a model with a separate reader is always better than, or close to, the one without. 
So overall we go with the choice of using the reader for the experiments in this paper.\n\n\n\n\n", "Descriptive_question1": "What is the Answer F1 score for Flan-T5-XXL with a separate reader on HotpotQA in table_6?", "Descriptive_question2": "Which model shows an Answer F1 score of 70.4 on 2WikiMQA in table_6?", "Reasoning_question1": "Why might the performance of Flan-T5-XXL generally decrease without a separate reader across datasets in table_6?", "Reasoning_question2": "How does the impact of ablating the reader differ between Flan-T5-XXL and GPT3 models in table_6?", "Descriptive_answer1": "59.1", "Descriptive_answer2": "GPT3", "Reasoning_answer1": "The performance of Flan-T5-XXL generally decreases without a separate reader across datasets because, as observed in the table, the Answer F1 scores drop consistently when the reader is ablated. For instance, on HotpotQA, the score falls from 59.1 to 52.6, on 2WikiMQA from 66.5 to 60.9, on MuSiQue from 30.8 to 24.9, and on IIRC from 42.5 to 40.3. This suggests that the separate reader plays a critical role in enhancing the model's ability to extract accurate answers. Without it, the model relies solely on the chain of thought (CoT) generated during retrieval, which may not be as effective in isolating precise answers, leading to a consistent reduction in performance across all datasets.", "Reasoning_answer2": "The impact of ablating the reader differs significantly between Flan-T5-XXL and GPT3 models. For Flan-T5-XXL, removing the reader consistently hurts performance across all datasets, with notable declines in Answer F1 scores, such as a drop from 59.1 to 52.6 on HotpotQA and from 66.5 to 60.9 on 2WikiMQA. This indicates a strong dependency on the separate reader for accurate answer extraction. 
In contrast, for GPT3, the effect is less uniform; performance decreases in some cases (e.g., MuSiQue from 36.5 to 31.5 and IIRC from 49.9 to 48.4) but actually improves slightly in others (e.g., HotpotQA from 60.7 to 61.0 and 2WikiMQA from 68.0 to 70.4). This suggests that GPT3 is less reliant on a separate reader and can sometimes extract answers effectively from the CoT during retrieval. Overall, while Flan-T5-XXL benefits significantly from the reader, GPT3 shows mixed results, indicating varying levels of dependency on the reader component." }, { "paper_id": "2212.10467.json", "table_id": "table_1", "table_content": "\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context $\\mathcal{C}$}: \\textcolor{blue}{Tim’s tooth was hurting like crazy. His dentist} \\\\ \\textcolor{blue}{took a look around in his mouth. One of his teeth was rotten.} \\\\ \\textcolor{blue}{Once the tooth was pulled, Tim felt fine.}\\\\ \n\\midrule\n\\textbf{Additional Sentence 1 ($\\mathcal{AS}_{before}$)}: \\textcolor{teal}{Tim always met his } \\\\\n\\textcolor{teal}{dentist regularly.}\\\\\n\\midrule\n\\textbf{Event 1 ($\\mathcal{E}_1$)}: \\textcolor{orange}{Tim scheduled an appointment with his dentist.} \\\\\n\\textbf{Event 2 ($\\mathcal{E}_2$)}: \\textcolor{orange}{Tim's tooth started to hurt like crazy.} \\\\\n\\midrule\n\\textbf{Explanation ($Exp$)}: \\textcolor{teal}{Some people maintain regular visits to} \\\\ \\textcolor{teal}{a dentist. 
Tim is one of these individuals and may have} \\\\ \\textcolor{teal}{ already scheduled a regular appointment with his dentist }\\\\\n\\textcolor{teal}{before his tooth started to hurt.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:example} An example of temporal differential analysis, where $\\mathcal{AS}$ shifts the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ to be more ``before''. See \\S \\ref{sec:dataset} for more details.\n}\n\\end{table}", "caption": "\n\t\\label{tb:example} An example of temporal differential analysis, where $\\mathcal{AS}$ shifts the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ to be more ``before''. See \\S \\ref{sec:dataset} for more details.\n", "label": "tb:example", "section_info": "3 Dataset\n\\section{Dataset}\n\\label{sec:dataset}\n\n\n\nIn this section, we introduce the evaluation framework and collection process of \\datasetname{}.\n\n\\subsection{Task overview}\nThe \\datasetname{} dataset and its overall framework are designed to evaluate systems' ability to make temporal predictions with plausible reasons. Existing datasets, including \\matres, \\textsc{Torque}, and \\tracie, only annotate common event pairs that align with human common sense. In other words, if an event pair does not strongly imply a temporal relation (e.g. over 80\\% confidence), it will not be annotated and tested on systems. This allows pre-trained language models with millions of parameters to exploit annotation artifacts and priors that do not necessarily hold in certain contexts. For example, we know ``lunch'' is usually before ``dinner'', but this also depends on if they are performed by the same subject, at the same location, and/or on the same day. Unfortunately, current models often memorize such relations as immutable facts, leading to prediction errors in instances that are less common in real life. 
This intuition inspires us to build a framework to evaluate how much spurious information and priors current models are using.\n\n\\vpara{Temporal Explanations.}\nAn ideal method to evaluate whether models are making predictions in the right way is to let them explain why a certain prediction is made and evaluate the faithfulness and plausibility of the explanations. However, such an evaluation framework is almost impossible to achieve with current progress in natural language processing, where the two main challenges are: 1) it is extremely difficult to collect gold explanations that are sufficient to cover any possible sets of explanations; and 2) it is impossible to evaluate system generations using existing summarization metrics automatically.\n\n\\vpara{Temporal Differential Analysis.}\nBecause of the aforementioned challenges in directly evaluating system explanations, we propose an alternative that is a close proxy to the ideal form, namely temporal differential analysis. The core of the temporal differential analysis is to check if models can correctly identify how a subtle change to the context may affect the temporal relations of a given event pair. The intuition behind this choice is two-fold: 1) it is much easier for both annotators and models to produce an explanation if they know which dimension to focus on; and 2) this provides a binary evaluation measure that is deterministic and trustworthy in terms of reflecting how much spurious information models are using. \n\nSpecifically, our differential analysis process is defined below. 
Given an original context $\\mathcal{C}$, event 1 $\\mathcal{E}_1$ and event 2 $\\mathcal{E}_2$,\nwe assume a gold distribution $\\mathbb{D}=\\{P_{before}, P_{after}, P_{same}\\}$ on the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ concerning $\\mathcal{C}$, where $P_{before}, P_{after}, P_{same}$ are the probabilities of the temporal relation being before, after and simultaneous respectively, and the probabilities altogether sum to 1. We then annotate two additional sentences $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$, where the temporal relation distribution between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ with respect to $\\mathcal{AS}_{before}+\\mathcal{C}$ results in an increased $P_{before}$, while similarly the distribution using $\\mathcal{AS}_{after}+\\mathcal{C}$ as the context has a higher $P_{after}$.\n\nTable~\\ref{tb:example} shows an example instance of temporal differential analysis, where an additional sentence $\\mathcal{AS}_{before}$ has an effect on the temporal relation between the two events and shifts the label distribution towards ``before''. We conducted a human pilot study for this formulation and found that it is easier to annotate and achieve substantial improvement over the explanation quality than to directly ask annotators to provide custom explanations for an event pair. We therefore adopt the former formulation and create our evaluation dataset \\datasetname{} through a multi-stage annotation process as described below.\n\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context $\\mathcal{C}$}: \\textcolor{blue}{Tim’s tooth was hurting like crazy. His dentist} \\\\ \\textcolor{blue}{took a look around in his mouth. 
One of his teeth was rotten.} \\\\ \\textcolor{blue}{Once the tooth was pulled, Tim felt fine.}\\\\ \n\\midrule\n\\textbf{Additional Sentence 1 ($\\mathcal{AS}_{before}$)}: \\textcolor{teal}{Tim always met his } \\\\\n\\textcolor{teal}{dentist regularly.}\\\\\n\\midrule\n\\textbf{Event 1 ($\\mathcal{E}_1$)}: \\textcolor{orange}{Tim scheduled an appointment with his dentist.} \\\\\n\\textbf{Event 2 ($\\mathcal{E}_2$)}: \\textcolor{orange}{Tim's tooth started to hurt like crazy.} \\\\\n\\midrule\n\\textbf{Explanation ($Exp$)}: \\textcolor{teal}{Some people maintain regular visits to} \\\\ \\textcolor{teal}{a dentist. Tim is one of these individuals and may have} \\\\ \\textcolor{teal}{ already scheduled a regular appointment with his dentist }\\\\\n\\textcolor{teal}{before his tooth started to hurt.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:example} An example of temporal differential analysis, where $\\mathcal{AS}$ shifts the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ to be more ``before''. See \\S \\ref{sec:dataset} for more details.\n}\n\\end{table}\n\n\n\\subsection{Dataset Construction}\nFollowing the definition of the temporal differential analysis framework above, we collect a dataset to carry out the actual evaluation. Each instance in \\datasetname{} contains a context $\\mathcal{C}$, an event pair $\\mathcal{E}_1$, $\\mathcal{E}_2$, and an additional sentence of either $\\mathcal{AS}_{before}$ or $\\mathcal{AS}_{after}$. In addition, we also annotate a human explanation $Exp$ regarding why the additional sentence affects the temporal relation between the two events. \\datasetname{} is constructed in three steps: 1) event pair generation, 2) additional sentence and explanation annotation, and 3) annotation verification and cleaning. We detail this pipeline below. 
\n\n\\vpara{Generating $\\mathcal{C}$ and $\\mathcal{E}$.}\nWe randomly sample short stories from the ROCStories dataset~\\cite{mostafazadeh-etal-2016-corpus} as the context $\\mathcal{C}$. For each story, we use GPT-3.5 \\footnote{We use GPT-3.5 text-davinci-002 for data generation throughout the work.} to generate an implicit event phrase based on an explicit event phrase selected by GPT-3.5 at the same time. An implicit event is an event that is not explicitly mentioned by the given context but is still inferable and relevant, e.g. Event 1 in Table~\\ref{tb:example}. A sample prompt can be referred to in Appendix Table~\\ref{tb:prompt2} to construct an event pair. We do this for two main reasons: 1) events that are not explicitly mentioned by the context provide more uncertainty so that the event pair does not come with a deterministic temporal relation decided by the context; 2) this is closer to the format of \\tracie{}, which we aim to compare system performance changes with. \n\n\\vpara{Crowdsourcing $\\mathcal{AS}$ and $Exp$.}\nAfter generating $\\mathcal{C}$ and $\\mathcal{E}$'s, we use Mechanical Turk to ask crowdsourcing annotators to write potential $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$ with respect to the provided information. The guideline asks annotators to write additional sentences that can be added to the beginning of the context to prevent models from using text positional information. The annotator is also asked to explain why they wrote $\\mathcal{AS}$ and why it affects the temporal relation distribution. We use this as $Exp$. We design an annotation interface that is intuitive and filled with examples, and at the same time, we require annotators to pass a rigorous qualification test to demonstrate a proper understanding. 
We list our interfaces and tests in Fig.~\\ref{fig:mturk} and Table~\\ref{tb:qual}.\n\n\\vpara{Annotation Verification.}\nWe employ an additional verification stage for the human-written instances from the previous step. We provide annotators with the formatted textual entailment instance and ask whether the entailment label changes in the expected direction. We collect two independent verifications per instance, and only instances accepted by all annotators appear in the test set.\n\n\n\\subsection{Statistics}\nWe collect 1,000 instances agreed upon by all annotators as the evaluation set and construct a silver training set from the remaining 1,241 instances that lack unanimous annotator agreement. \n\\subsection{Task overview}\nThe \\datasetname{} dataset and its overall framework are designed to evaluate systems' ability to make temporal predictions with plausible reasons. Existing datasets, including \\matres, \\textsc{Torque}, and \\tracie, only annotate common event pairs that align with human common sense. In other words, if an event pair does not strongly imply a temporal relation (e.g., with over 80\\% confidence), it is not annotated or used to test systems. This allows pre-trained language models with millions of parameters to exploit annotation artifacts and priors that do not necessarily hold in certain contexts. For example, we know ``lunch'' is usually before ``dinner'', but this also depends on whether the two are performed by the same subject, at the same location, and/or on the same day. Unfortunately, current models often memorize such relations as immutable facts, leading to prediction errors in instances that are less common in real life. 
This intuition inspires us to build a framework that evaluates how much current models rely on spurious information and priors.\n\n\\vpara{Temporal Explanations.}\nAn ideal way to evaluate whether models make predictions for the right reasons is to let them explain why a certain prediction is made and then evaluate the faithfulness and plausibility of those explanations. However, such an evaluation framework is almost impossible to achieve with the current state of natural language processing, for two main reasons: 1) it is extremely difficult to collect gold explanations that cover all possible valid explanations; and 2) existing automatic metrics, such as those for summarization, cannot reliably evaluate system-generated explanations.\n\n\\vpara{Temporal Differential Analysis.}\nBecause of these challenges in directly evaluating system explanations, we propose a close proxy to the ideal form, namely temporal differential analysis. The core of temporal differential analysis is to check whether models can correctly identify how a subtle change to the context affects the temporal relation of a given event pair. The intuition behind this choice is two-fold: 1) it is much easier for both annotators and models to produce an explanation if they know which dimension to focus on; and 2) it provides a binary evaluation measure that is deterministic and trustworthy in reflecting how much spurious information models use. \n\nSpecifically, our differential analysis process is defined below. 
Given an original context $\\mathcal{C}$, event 1 $\\mathcal{E}_1$ and event 2 $\\mathcal{E}_2$,\nwe assume a gold distribution $\\mathbb{D}=\\{P_{before}, P_{after}, P_{same}\\}$ over the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ given $\\mathcal{C}$, where $P_{before}, P_{after}, P_{same}$ are the probabilities of the temporal relation being before, after, and simultaneous, respectively, and these probabilities sum to 1. We then annotate two additional sentences $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$, such that the temporal relation distribution between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ with respect to $\\mathcal{AS}_{before}+\\mathcal{C}$ has an increased $P_{before}$, while the distribution using $\\mathcal{AS}_{after}+\\mathcal{C}$ as the context has a higher $P_{after}$.\n\nTable~\\ref{tb:example} shows an example instance of temporal differential analysis, where an additional sentence $\\mathcal{AS}_{before}$ affects the temporal relation between the two events and shifts the label distribution towards ``before''. We conducted a human pilot study and found that this formulation is easier to annotate and yields substantially better explanation quality than directly asking annotators to provide free-form explanations for an event pair. We therefore adopt the former formulation and create our evaluation dataset \\datasetname{} through a multi-stage annotation process as described below.\n\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context $\\mathcal{C}$}: \\textcolor{blue}{Tim’s tooth was hurting like crazy. His dentist} \\\\ \\textcolor{blue}{took a look around in his mouth. 
One of his teeth was rotten.} \\\\ \\textcolor{blue}{Once the tooth was pulled, Tim felt fine.}\\\\ \n\\midrule\n\\textbf{Additional Sentence 1 ($\\mathcal{AS}_{before}$)}: \\textcolor{teal}{Tim always met his } \\\\\n\\textcolor{teal}{dentist regularly.}\\\\\n\\midrule\n\\textbf{Event 1 ($\\mathcal{E}_1$)}: \\textcolor{orange}{Tim scheduled an appointment with his dentist.} \\\\\n\\textbf{Event 2 ($\\mathcal{E}_2$)}: \\textcolor{orange}{Tim's tooth started to hurt like crazy.} \\\\\n\\midrule\n\\textbf{Explanation ($Exp$)}: \\textcolor{teal}{Some people maintain regular visits to} \\\\ \\textcolor{teal}{a dentist. Tim is one of these individuals and may have} \\\\ \\textcolor{teal}{ already scheduled a regular appointment with his dentist }\\\\\n\\textcolor{teal}{before his tooth started to hurt.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:example} An example of temporal differential analysis, where $\\mathcal{AS}$ shifts the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ to be more ``before''. See \\S \\ref{sec:dataset} for more details.\n}\n\\end{table}\n\n\n", "Descriptive_question1": "What is the context label used in table_1?", "Descriptive_question2": "What color is used for the context text in table_1?", "Reasoning_question1": "How does the additional sentence in table_1 influence the temporal relation between the two events?", "Reasoning_question2": "Why might the explanation provided in table_1 suggest that Tim scheduled an appointment before his tooth started hurting?", "Descriptive_answer1": "Context C", "Descriptive_answer2": "Blue", "Reasoning_answer1": "The additional sentence, labeled as AS_before, influences the temporal relation between the two events by shifting the probability distribution towards 'before'. Specifically, the sentence 'Tim always met his dentist regularly' suggests a pattern of behavior that precedes the specific events described. This implies that Tim likely scheduled an appointment (Event 1) as part of his regular dental visits before experiencing the pain (Event 2). By adding this context, the likelihood that Event 1 occurred before Event 2 increases, as it aligns with Tim's established routine of regular dental checkups which would naturally precede an emergency situation like severe tooth pain.", "Reasoning_answer2": "The explanation provided in table_1 suggests that Tim scheduled an appointment before his tooth started hurting because it highlights his habit of maintaining regular dentist visits. 
It reasons that since 'some people maintain regular visits to a dentist' and 'Tim is one of these individuals', he may have 'already scheduled a regular appointment with his dentist before his tooth started to hurt'. This chain of thought indicates that Tim's proactive behavior in dental care, as part of a regular schedule, likely led to an appointment being set prior to the onset of pain, rather than as a reaction to it. The emphasis on regularity implies a preemptive action, supporting the temporal precedence of Event 1 over Event 2." }, { "paper_id": "2212.10467.json", "table_id": "table_2", "table_content": "\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}", "caption": "Statistics of the three datasets.", "label": "tab:datanum", "section_info": "6 Experiment\n\\section{Experiment}\n\\label{sec:experiment}\nIn this section, we conduct a series of experiments to show that 1) existing systems do not truly understand temporal relations, 2) \\datasetname{} and incidental supervision signals partially address this issue, and 3) \\datasetname{} motivates future work towards generic temporal reasoning. \n\n\\subsection{Datasets, Metrics, and Settings}\nWe use our proposed dataset \\datasetname{} as the main benchmark, as well as transferability results from two other temporal reasoning benchmarks \\tracie{}~\\cite{zhou-etal-2021-temporal} and \\matres{}~\\cite{ning-etal-2018-multi} to show that existing models fail to perform generic temporal reasoning while our proposal makes significant improvements. 
\nFollowing \\citet{zhou-etal-2021-temporal}, all three datasets are processed as binary classification tasks by keeping instances originally annotated as either ``before'' or ``after''. As a result, we use binary accuracy as the metric. For \\matres{}, we use only 1.5k (10\\%) of the training instances to match the size of the other two datasets. Table~\\ref{tab:datanum} summarizes data statistics.\nWe use $\\epsilon=0.1$ in equation~\\ref{eq:marginrankingloss} and $\\alpha=10$ in equation~\\ref{eq:loss}. All model training follows a standard textual entailment setup, uses default parameters, runs for the same number of steps, and results are averaged over three random seeds. All training can be done on a single GPU with 48GB of memory within 5 hours.\n\n\\label{sec:datasetstats}\n\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}\n\n\n\\subsection{Baselines and Systems}\nWe report baseline performances of the state-of-the-art system PatternTime~\\cite{zhou-etal-2021-temporal}, as well as GPT-3.5~\\cite{brown2020language,ouyang2022training}. To show that \\datasetname{} and other incidental supervision signals contribute to generic temporal reasoning, we use the T5-large model implemented by~\\citet{wolf-etal-2020-transformers} as the base model and experiment with different supervision settings. 
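The hyperparameters above ($\epsilon=0.1$, $\alpha=10$) can be illustrated with a small sketch. The referenced equations are not reproduced in this excerpt, so the formulation below, a margin ranking term over entailment scores combined with a cross-entropy term weighted by $\alpha$, is an assumption consistent with the stated values, not the paper's exact objective:

```python
import math

EPSILON = 0.1  # margin in the assumed ranking loss
ALPHA = 10.0   # assumed weight on the ranking term in the combined loss

def margin_ranking_loss(score_with_as: float, score_without_as: float,
                        epsilon: float = EPSILON) -> float:
    # Penalize the model when adding AS fails to raise the target-relation
    # score by at least the margin epsilon.
    return max(0.0, epsilon - (score_with_as - score_without_as))

def cross_entropy(p_correct: float) -> float:
    # Standard negative log-likelihood of the correct hard label.
    return -math.log(p_correct)

def combined_loss(score_with_as: float, score_without_as: float,
                  p_correct: float, alpha: float = ALPHA) -> float:
    # Assumed combination: CE on hard-labeled data plus alpha-weighted
    # margin ranking on relative-labeled (differential) data.
    return cross_entropy(p_correct) + alpha * margin_ranking_loss(
        score_with_as, score_without_as)
```

With a clear shift (e.g., scores 0.8 vs. 0.5) the margin is satisfied and only the cross-entropy term contributes.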
We collect 5,000 GPT-3.5 generated instances in total, and 1,475 instances remain after our proposed verification models.\n\n\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) & Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. 
\\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged over \\tracie{}, \\matres{} and \\datasetname{} accuracies. \\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}\n\n\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 performs at chance level on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to chance level on \\datasetname{}. This suggests that biased supervision signals may help on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as the prior that \\textit{drink coffee} is almost always before \\textit{dinner}.} but not generic temporal reasoning. To further support this point, we observe that T5 (M+T), jointly trained on \\tracie{} and \\matres{}, does not improve much over T5 trained only on the corresponding in-domain supervision (+0.4\\% average accuracy), suggesting that previous temporal annotation styles do not encourage joint learning or generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nIn contrast, we see that by simply adding \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{} and 0.5\\% on \\matres{}. When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{} and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. 
Moreover, \\datasetname{} and the incidental instances enable more effective joint learning, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy, a new state-of-the-art for binary temporal relation classification across multiple datasets and almost 10\\% better than using T5 with in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which achieves 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 significantly improves over GPT-3.5 on many natural-language benchmarks, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with fewer than a billion parameters. This shows that advances in large language models alone are insufficient to solve \\datasetname{}, and that more rigorous and controllable reasoning models are desirable for future work.\n\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning, as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with gold explanations. We therefore augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. 
Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$, and then use the explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal, and the performance on \\datasetname{} remains far below that with gold explanations, which motivates future work on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. 
$\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM-generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered by three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting, with all three verifiers, achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision contribute positively to selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance model performance, suggesting a higher potential for using LLMs to generate supervision signals that empower smaller models. It also motivates future research on the trade-off between model scaling and data scaling in temporal reasoning. 
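The verifier-based filtering and candidate selection described above can be sketched as follows. The verifier internals are stand-ins (the paper's verifiers are trained classifiers), and the all-verifiers-must-accept rule reflects the "Ours" ablation setting:

```python
from typing import Callable, Dict, List, Tuple

Verifier = Callable[[Dict[str, str]], float]  # returns a positive probability

def keep_instance(instance: Dict[str, str],
                  verifiers: List[Verifier],
                  threshold: float = 0.5) -> bool:
    # A GPT-3.5-generated instance survives only if every verifier
    # (explanation, additional-sentence, and general) accepts it.
    return all(v(instance) >= threshold for v in verifiers)

def select_candidate(candidates: List[Tuple[Dict[str, str], float]]) -> Dict[str, str]:
    # At inference, keep the (AS, Exp) candidate with the highest
    # positive probability out of the two relation directions.
    return max(candidates, key=lambda c: c[1])[0]
```

The threshold and the probability interface are illustrative assumptions; only the conjunctive filtering and argmax selection follow the text.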
\n\nWe also conduct a human analysis of the quality of the explanation sentences used in \\datasetname{} and the subsequent incidental supervision instances. We adopt the commonly used criteria for explanations~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (whether an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention} and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct human evaluation of these properties. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we evaluate faithfulness of the additional sentence by asking whether it shifts the temporal relation “more” toward the target relation, and plausibility of the explanation sentence by asking whether it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples from each dataset for our human evaluation. For both faithfulness and plausibility, we collect two human evaluations per sample. Only samples judged correct by both annotators are counted as positive, and we report the total number of positive samples as the final score. We restrict each annotator to at most 10 samples; there are 92 distinct annotators.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, which is expected given our rigorous agreement requirements. 
Our verification system improves both metrics for GPT-3.5 generated incidental supervision, which further demonstrates the effectiveness of the proposed verification models.\n\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}\n\n\n", "Descriptive_question1": "What is the number of training instances for the Matres dataset in table_2?", "Descriptive_question2": "Which dataset in table_2 has a checkmark under Relative-Label?", "Reasoning_question1": "Why might the Matres dataset in table_2 have a higher number of training instances compared to Tracie, and what could this imply about data selection for balancing dataset sizes?", "Reasoning_question2": "Based on the statistics presented in table_2, how might the differences in test set sizes across the datasets impact the reliability of model performance evaluations?", "Descriptive_answer1": "1,500", "Descriptive_answer2": "Today", "Reasoning_answer1": "The Matres dataset has a higher number of training instances (1,500) compared to Tracie (860) likely because the original dataset for Matres is larger, and even after reducing it to 10% of its original size as mentioned in the context, it still surpasses Tracie in training instance count. This suggests a deliberate selection strategy to balance dataset sizes for experimental comparability, as the context indicates that Matres was capped at 1.5k training instances to match the scale of other datasets like Today and Tracie. 
This implies that the researchers prioritized a controlled experiment setup, ensuring that differences in model performance are not overly skewed by vast disparities in training data volume, though some variation still exists.", "Reasoning_answer2": "The differences in test set sizes across the datasets—Today (1,000), Tracie (1,924), and Matres (1,322)—could impact the reliability of model performance evaluations in several ways. First, a larger test set, like Tracie's, provides a more comprehensive evaluation of a model's generalization ability due to the greater number of instances to test against, potentially leading to more statistically significant results. Conversely, a smaller test set, like Today's, might result in higher variance in performance metrics, as the outcome could be more sensitive to specific test instances. This disparity could mean that performance scores on Tracie are more reliable compared to Today, where results might be less stable. Additionally, comparing model performance across datasets with varying test set sizes might introduce bias, as the difficulty or diversity of test instances could differ, affecting the perceived robustness of the model. Researchers would need to account for these differences, perhaps by normalizing results or analyzing confidence intervals, to ensure fair comparisons." }, { "paper_id": "2212.10467.json", "table_id": "table_3", "table_content": "\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) 
& Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. \\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. 
\\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}", "caption": "System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. \\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. \\textit{all} is equivalent to \\textit{M+T+O+G}.", "label": "tab:maintable", "section_info": "6 Experiment\n\\section{Experiment}\n\\label{sec:experiment}\nIn this section, we conduct a series of experiments to show that 1) existing systems do not truly understand temporal relations, 2) \\datasetname{} and incidental supervision signals partially address this issue, and 3) \\datasetname{} motivates future work towards generic temporal reasoning. \n\n\\subsection{Datasets, Metrics, and Settings}\nWe use our proposed dataset \\datasetname{} as the main benchmark, as well as transferability results from two other temporal reasoning benchmarks \\tracie{}~\\cite{zhou-etal-2021-temporal} and \\matres{}~\\cite{ning-etal-2018-multi} to show that existing models fail to perform generic temporal reasoning while our proposal makes significant improvements. \nFollowing \\citet{zhou-etal-2021-temporal}, all three datasets are processed as binary classification tasks by keeping instances that are originally annotated as either ``before'' or ``after''. As a result, we use binary accuracy as the metric. For \\matres{}, we use only 1.5k (10\\%) of the training instances to match the size of the other two datasets. Table~\\ref{tab:datanum} summarizes data statistics.\nWe use $\\epsilon=0.1$ in equation~\\ref{eq:marginrankingloss} and $\\alpha=10$ in equation~\\ref{eq:loss}. 
All model training follows a standard textual entailment setup, uses default parameters, has the same number of steps, and averages from three random seeds. All training can be done with a single 48G-memory GPU within 5 hours.\n\n\\label{sec:datasetstats}\n\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}\n\n\n\\subsection{Baselines and Systems}\nWe report baseline performances of a state-of-the-art baseline PatternTime~\\cite{zhou-etal-2021-temporal}, as well as GPT-3.5~\\cite{brown2020language,ouyang2022training}. To show that \\datasetname{} and other incidental supervision signals contribute to generic temporal reasoning, we use the T5-large model implemented by~\\citet{wolf-etal-2020-transformers} as the base model and experiment with different supervision settings. We collect 5,000 GPT-3.5 generated instances in total, and 1,475 instances remain after our proposed verification models.\n\n\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) 
& Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. \\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. 
\\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}\n\n\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 is doing random guessing on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to random guessing on \\datasetname{}. This suggests that biased supervision signals may improve on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as \\textit{drink coffee} is always before \\textit{dinner}.} but not generic temporal reasoning. To further prove this point, we observe that T5 (M+T) jointly trained on \\tracie{} and \\matres{} does not improve much over T5 trained only on corresponding in-domain supervision (+0.4\\% averaged accuracy), suggesting that previous temporal annotation styles do not motivate joint-learning nor generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nOn the contrary, we see that by simply using \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{}, and 0.5\\% on \\matres{}. When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{}, and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. 
Moreover, \\datasetname{} and incidental instances bring better joint learning efficiency and possibility, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy which is the new state-of-the-art result of binary temporal relation classification across multiple datasets, and almost 10\\% better than using T5 and in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which gets 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 is shown to significantly improve on many natural-language benchmarks over GPT-3.5, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with less than a billion parameters. This shows that the advancement in large language models alone is insufficient to solve \\datasetname{}, and more rigorous and controllable reasoning models are desirable for future works.\n\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with the gold explanations. We, therefore, augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. 
Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal and the performance on \\datasetname{} is far from when using gold explanations, which motivates future works on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. 
$\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting with all three verifiers achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision are making positive contributions in selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance the model performance, suggesting a higher potential for using LLMs to generate supervision signals to empower smaller models. It also directs us to research the trade-off between model scaling and data scaling in temporal reasoning. 
\n\nWe also conduct human analysis on the quality of the explanation sentences used in \\datasetname{} and subsequent incidental supervision instances. We adopt the commonly used criteria for explanation~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (if an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention}, and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct human evaluation of the properties mentioned above. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we analyze faithfulness for the additional sentence by asking if it makes the temporal relation “more” toward the target relation and plausibility for the explanation sentence by asking if it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples for each dataset for our human evaluation. For either faithfulness or plausibility, we collect two human evaluations for each sample. Only the sample that is valued as correct by both human annotators will be counted as a positive sample and we denote the total number of positive samples as the final score. We restrict each annotator to take 10 samples at most and there are 92 distinct annotators.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, which is expected from our rigorous agreement requirements. 
Our verification system improves both metrics for GPT-3.5 generated incidental supervision, which further demonstrates the effectiveness of the proposed verification models.\n\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}\n 6.3 Main Results\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 is doing random guessing on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to random guessing on \\datasetname{}. This suggests that biased supervision signals may improve on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as \\textit{drink coffee} is always before \\textit{dinner}.} but not generic temporal reasoning. 
To further prove this point, we observe that T5 (M+T) jointly trained on \\tracie{} and \\matres{} does not improve much over T5 trained only on corresponding in-domain supervision (+0.4\\% averaged accuracy), suggesting that previous temporal annotation styles do not motivate joint-learning nor generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nOn the contrary, we see that by simply using \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{}, and 0.5\\% on \\matres{}. When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{}, and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. Moreover, \\datasetname{} and incidental instances bring better joint learning efficiency and possibility, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. 
If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy which is the new state-of-the-art result of binary temporal relation classification across multiple datasets, and almost 10\\% better than using T5 and in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which gets 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 is shown to significantly improve on many natural-language benchmarks over GPT-3.5, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with less than a billion parameters. This shows that the advancement in large language models alone is insufficient to solve \\datasetname{}, and more rigorous and controllable reasoning models are desirable for future works.\n\n6.4 Experiments with Generated Explanation\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with the gold explanations. We, therefore, augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. 
We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal and the performance on \\datasetname{} is far from when using gold explanations, which motivates future works on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. $\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. 
The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. $\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. 
\\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n", "Descriptive_question1": "What is the highest average accuracy reported in table_3 for any model?", "Descriptive_question2": "Which model achieved the highest accuracy on the Matres dataset in table_3?", "Reasoning_question1": "Why might the PatternTime (all) model have outperformed other models in terms of average accuracy in table_3?", "Reasoning_question2": "What can be inferred from the performance difference between T5 (M+O) and T5 (M+O+G) in table_3 regarding the impact of GPT-3.5-generated supervision data?", "Descriptive_answer1": "76.4", "Descriptive_answer2": "PatternTime (all)", "Reasoning_answer1": "The PatternTime (all) model's outperformance in average accuracy (76.4%) in table_3 can likely be attributed to its comprehensive training approach. First, it incorporates all available supervision data, denoted as 'all' which equates to M+T+O+G, meaning it leverages training data from Tracie, Matres, Datasetname, and GPT-3.5-generated incidental supervision. This extensive data mix likely enhances its ability to generalize across different temporal benchmarks. Second, the model uses a combination of CE (Cross-Entropy) and MR (Margin Ranking) loss functions, which might optimize its learning process by balancing classification accuracy with ranking precision. Additionally, compared to other models like T5 variants or earlier PatternTime configurations, the integration of diverse data and dual loss functions could address biases inherent in individual datasets, as noted in the context that previous temporal annotations do not motivate joint learning. 
Therefore, the holistic approach in data usage and loss strategy seems to provide PatternTime (all) with a robust framework for temporal reasoning, leading to superior average performance across the benchmarks.", "Reasoning_answer2": "Analyzing the performance difference between T5 (M+O) and T5 (M+O+G) in table_3 reveals the impact of incorporating GPT-3.5-generated supervision data. T5 (M+O) achieves an average accuracy of 63.5%, while T5 (M+O+G) improves to 64.8%, a gain of 1.3%. Breaking it down by dataset, Tracie accuracy decreases slightly from 51.5 to 49.9, but Matres improves from 81.7 to 82.9, Datasetname from 57.4 to 61.4, and Datasetname (gold exp.) only marginally from 82.7 to 82.9. This suggests that the additional GPT-3.5-generated data (denoted by 'G') provides valuable incidental supervision that enhances the model's understanding, especially on Datasetname, aligning with the paper's assertion that incidental supervision signals contribute to generic temporal reasoning. The improvement is not uniform across all datasets, indicating that the generated data might be more compatible or relevant to certain benchmarks like Datasetname. Overall, this performance gap implies that GPT-3.5-generated supervision data can positively influence model generalization and joint learning efficiency when added to structured training sets, supporting the claim in the context that such data contributes to better outcomes."
}, { "paper_id": "2212.10467.json", "table_id": "table_4", "table_content": "\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. $\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}", "caption": "Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. $\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.", "label": "tab:generate_exp", "section_info": "6 Experiment\n\\section{Experiment}\n\\label{sec:experiment}\nIn this section, we conduct a series of experiments to show that 1) existing systems do not truly understand temporal relations, 2) \\datasetname{} and incidental supervision signals partially address this issue, and 3) \\datasetname{} motivates future work towards generic temporal reasoning. \n\n\\subsection{Datasets, Metrics, and Settings}\nWe use our proposed dataset \\datasetname{} as the main benchmark, as well as transferability results from two other temporal reasoning benchmarks \\tracie{}~\\cite{zhou-etal-2021-temporal} and \\matres{}~\\cite{ning-etal-2018-multi} to show that existing models fail to perform generic temporal reasoning while our proposal makes significant improvements. 
\nFollowing \\citet{zhou-etal-2021-temporal}, all three datasets are processed as binary classification tasks by keeping instances that are originally annotated as either ``before'' or ``after''. As a result, we use binary accuracy as the metric. For \\matres{}, we use only 1.5k (10\\%) of the training instances to match the size of the other two datasets. Table~\\ref{tab:datanum} summarizes data statistics.\nWe use $\\epsilon=0.1$ in equation~\\ref{eq:marginrankingloss} and $\\alpha=10$ in equation~\\ref{eq:loss}. All model training follows a standard textual entailment setup, uses default parameters, has the same number of steps, and averages from three random seeds. All training can be done with a single 48G-memory GPU within 5 hours.\n\n\\label{sec:datasetstats}\n\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}\n\n\n\\subsection{Baselines and Systems}\nWe report baseline performances of a state-of-the-art baseline PatternTime~\\cite{zhou-etal-2021-temporal}, as well as GPT-3.5~\\cite{brown2020language,ouyang2022training}. To show that \\datasetname{} and other incidental supervision signals contribute to generic temporal reasoning, we use the T5-large model implemented by~\\citet{wolf-etal-2020-transformers} as the base model and experiment with different supervision settings. 
We collect 5,000 GPT-3.5 generated instances in total, and 1,475 instances remain after our proposed verification models.\n\n\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) & Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. 
\\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. \\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}\n\n\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 performs no better than random guessing on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to random guessing on \\datasetname{}. This suggests that biased supervision signals may improve performance on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as \\textit{drink coffee} always occurring before \\textit{dinner}.} but not generic temporal reasoning. To further support this point, we observe that T5 (M+T) jointly trained on \\tracie{} and \\matres{} barely improves over T5 trained only on the corresponding in-domain supervision (+0.4\\% average accuracy), suggesting that previous temporal annotation styles encourage neither joint learning nor generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nIn contrast, we see that simply by using \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{} and 0.5\\% on \\matres{}. When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{} and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. 
Moreover, \\datasetname{} and incidental instances enable more efficient and effective joint learning, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy, which is a new state-of-the-art result for binary temporal relation classification across multiple datasets and almost 10\\% better than using T5 and in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which achieves 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 significantly improves over GPT-3.5 on many natural-language benchmarks, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with fewer than a billion parameters. This shows that advances in large language models alone are insufficient to solve \\datasetname{}, and more rigorous and controllable reasoning models are desirable for future work.\n\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning, as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with gold explanations. We therefore augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. 
Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal and the performance on \\datasetname{} is far from when using gold explanations, which motivates future works on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. 
$\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting with all three verifiers achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision are making positive contributions in selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance the model performance, suggesting a higher potential for using LLMs to generate supervision signals to empower smaller models. It also directs us to research the trade-off between model scaling and data scaling in temporal reasoning. 
\n\nWe also conduct a human analysis of the quality of the explanation sentences used in \\datasetname{} and the subsequent incidental supervision instances. We adopt the commonly used criteria for explanations~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (whether an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention} and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct the human evaluation of these properties. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we assess faithfulness for the additional sentence by asking whether it shifts the temporal relation ``more'' toward the target relation, and plausibility for the explanation sentence by asking whether it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples from each dataset for the human evaluation. For both faithfulness and plausibility, we collect two human evaluations per sample. Only samples judged correct by both annotators are counted as positive, and we report the total number of positive samples as the final score. We restrict each annotator to at most 10 samples; there are 92 distinct annotators in total.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, as expected given our rigorous agreement requirements. 
Our verification system improves both metrics for GPT-3.5 generated incidental supervision, which further demonstrates the effectiveness of the proposed verification models.\n\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}\n 6.4 Experiments with Generated Explanation\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with the gold explanations. We, therefore, augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. 
We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal and the performance on \\datasetname{} is far from when using gold explanations, which motivates future works on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. $\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. 
The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. $\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. 
\\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n", "Descriptive_question1": "What is the average performance of PatternTime (all) in table_4?", "Descriptive_question2": "What is the performance improvement ($\bigtriangleup$) for T5 (all) in table_4?", "Reasoning_question1": "Why might PatternTime (all) show a higher average performance compared to T5 (all) in table_4?", "Reasoning_question2": "What can be inferred about the effectiveness of generated explanations on /datasetname{} performance in table_4 compared to gold explanations?", "Descriptive_answer1": "76.9", "Descriptive_answer2": "1.0", "Reasoning_answer1": "PatternTime (all) likely shows a higher average performance of 76.9 compared to T5 (all)'s 74.5 in Table 4 due to its optimized design or training approach for temporal reasoning tasks. Looking at the data, PatternTime consistently outperforms T5 across all individual datasets: 80.5 vs. 76.1 on T (Tracie), 86.8 vs. 84.4 on M (Matres), and 63.4 vs. 63.1 on the datasetname. This suggests that PatternTime may leverage better pre-trained weights or architectural advantages specifically tailored for temporal relations, as noted in the context where PatternTime achieves state-of-the-art results. Additionally, even though the performance improvement (Δ) for PatternTime is slightly lower at 0.5 compared to T5’s 1.0, the absolute scores indicate a stronger baseline or enhanced capability with generated explanations.", "Reasoning_answer2": "The effectiveness of generated explanations on datasetname performance in Table 4 appears limited compared to gold explanations. Observing the numbers, T5 (all) achieves 63.1 and PatternTime (all) achieves 63.4 on datasetname with generated explanations, which reflects only a marginal improvement as indicated by the Δ values of 1.0 and 0.5 respectively when compared to Table 3 (maintable). 
However, the context from the document highlights that with gold explanations, performance on datasetname improves significantly (e.g., PatternTime(all) improves almost 20% as mentioned in section 6.4). This substantial gap suggests that generated explanations do not capture the depth or accuracy of gold explanations, failing to provide the nuanced understanding needed for complex temporal reasoning tasks. Therefore, it can be inferred that while generated explanations offer slight benefits, they are far less effective than gold explanations for enhancing performance on datasetname." }, { "paper_id": "2212.10467.json", "table_id": "table_5", "table_content": "\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}", "caption": "\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. $\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n", "label": "tb:tracie", "section_info": "6 Experiment\n\\section{Experiment}\n\\label{sec:experiment}\nIn this section, we conduct a series of experiments to show that 1) existing systems do not truly understand temporal relations, 2) \\datasetname{} and incidental supervision signals partially address this issue, and 3) \\datasetname{} motivates future work towards generic temporal reasoning. \n\n\\subsection{Datasets, Metrics, and Settings}\nWe use our proposed dataset \\datasetname{} as the main benchmark, as well as transferability results from two other temporal reasoning benchmarks \\tracie{}~\\cite{zhou-etal-2021-temporal} and \\matres{}~\\cite{ning-etal-2018-multi} to show that existing models fail to perform generic temporal reasoning while our proposal makes significant improvements. \nFollowing \\citet{zhou-etal-2021-temporal}, all three datasets are processed as binary classification tasks by keeping instances that are originally annotated as either ``before'' or ``after''. As a result, we use binary accuracy as the metric. For \\matres{}, we use only 1.5k (10\\%) of the training instances to match the size of the other two datasets. Table~\\ref{tab:datanum} summarizes data statistics.\nWe use $\\epsilon=0.1$ in equation~\\ref{eq:marginrankingloss} and $\\alpha=10$ in equation~\\ref{eq:loss}. All model training follows a standard textual entailment setup, uses default parameters, has the same number of steps, and averages from three random seeds. 
All training can be done with a single 48G-memory GPU within 5 hours.\n\n\\label{sec:datasetstats}\n\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}\n\n\n\\subsection{Baselines and Systems}\nWe report baseline performances of a state-of-the-art baseline PatternTime~\\cite{zhou-etal-2021-temporal}, as well as GPT-3.5~\\cite{brown2020language,ouyang2022training}. To show that \\datasetname{} and other incidental supervision signals contribute to generic temporal reasoning, we use the T5-large model implemented by~\\citet{wolf-etal-2020-transformers} as the base model and experiment with different supervision settings. We collect 5,000 GPT-3.5 generated instances in total, and 1,475 instances remain after our proposed verification models.\n\n\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) 
& Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. \\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. 
\\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}\n\n\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 is doing random guessing on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to random guessing on \\datasetname{}. This suggests that biased supervision signals may improve on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as \\textit{drink coffee} is always before \\textit{dinner}.} but not generic temporal reasoning. To further prove this point, we observe that T5 (M+T) jointly trained on \\tracie{} and \\matres{} does not improve much over T5 trained only on corresponding in-domain supervision (+0.4\\% averaged accuracy), suggesting that previous temporal annotation styles do not motivate joint-learning nor generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nOn the contrary, we see that by simply using \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{}, and 0.5\\% on \\matres{}. When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{}, and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. 
Moreover, \\datasetname{} and incidental instances bring better joint learning efficiency and possibility, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy which is the new state-of-the-art result of binary temporal relation classification across multiple datasets, and almost 10\\% better than using T5 and in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which gets 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 is shown to significantly improve on many natural-language benchmarks over GPT-3.5, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with less than a billion parameters. This shows that the advancement in large language models alone is insufficient to solve \\datasetname{}, and more rigorous and controllable reasoning models are desirable for future works.\n\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with the gold explanations. We, therefore, augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. 
Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal and the performance on \\datasetname{} is far from when using gold explanations, which motivates future works on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. 
$\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting with all three verifiers achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision are making positive contributions in selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance the model performance, suggesting a higher potential for using LLMs to generate supervision signals to empower smaller models. It also directs us to research the trade-off between model scaling and data scaling in temporal reasoning. 
\n\nWe also conduct human analysis on the quality of the explanation sentences used in \\datasetname{} and subsequent incidental supervision instances. We adopt the commonly used criteria for explanation~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (if an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention}, and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct human evaluation of the properties mentioned above. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we analyze faithfulness for the additional sentence by asking if it makes the temporal relation “more” toward the target relation, and plausibility for the explanation sentence by asking if it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples for each dataset for our human evaluation. For either faithfulness or plausibility, we collect two human evaluations for each sample. Only a sample judged as correct by both human annotators is counted as positive, and we report the total number of positive samples as the final score. We restrict each annotator to at most 10 samples, and there are 92 distinct annotators in total.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, which is expected from our rigorous agreement requirements. 
Our verification system improves both metrics for GPT-3.5 generated incidental supervision, which further demonstrates the effectiveness of the proposed verification models.\n\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}\n", "Descriptive_question1": "What is the context event described in table_5 involving Jill?", "Descriptive_question2": "What temporal relation is highlighted in table_5 regarding Jill's teacher?", "Reasoning_question1": "How does the additional sentence in table_5 contribute to the understanding of the trust between Jill and her teacher?", "Reasoning_question2": "Why might the explanation provided in table_5 suggest that Jill was allowed to take the test at home?", "Descriptive_answer1": "math test", "Descriptive_answer2": "starts before", "Reasoning_answer1": "The additional sentence in table_5, which states that 'Jill's teacher had always been impressed by her dedication to her studies,' contributes to understanding the trust between Jill and her teacher by providing background on the teacher's perception of Jill. This sentence suggests that the teacher has observed Jill's consistent hard work and commitment over time, which likely forms the basis for the trust. This prior positive impression would logically lead the teacher to believe in Jill's integrity and responsibility, fostering a trusting relationship that influences decisions like allowing her to take the test at home.", "Reasoning_answer2": "The explanation provided in table_5 suggests that Jill was allowed to take the test at home because it explicitly states that the teacher 'allowed her to take the test at home because she trusted her and was impressed by her dedication.' This implies a causal link between the teacher's trust and admiration for Jill's dedication and the decision to grant her the accommodation. The trust likely stems from Jill's demonstrated commitment to her studies, as noted in the additional sentence, which reassures the teacher of Jill's honesty and ability to complete the test responsibly outside the usual setting. 
This reasoning highlights how the teacher's positive perception of Jill directly influenced the exceptional permission given." }, { "paper_id": "2212.10467.json", "table_id": "table_6", "table_content": "\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}", "caption": "Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).", "label": "tab:ablation", "section_info": "6 Experiment\n\\section{Experiment}\n\\label{sec:experiment}\nIn this section, we conduct a series of experiments to show that 1) existing systems do not truly understand temporal relations, 2) \\datasetname{} and incidental supervision signals partially address this issue, and 3) \\datasetname{} motivates future work towards generic temporal reasoning. 
\n\n\\subsection{Datasets, Metrics, and Settings}\nWe use our proposed dataset \\datasetname{} as the main benchmark, as well as transferability results from two other temporal reasoning benchmarks \\tracie{}~\\cite{zhou-etal-2021-temporal} and \\matres{}~\\cite{ning-etal-2018-multi} to show that existing models fail to perform generic temporal reasoning while our proposal makes significant improvements. \nFollowing \\citet{zhou-etal-2021-temporal}, all three datasets are processed as binary classification tasks by keeping instances that are originally annotated as either ``before'' or ``after''. As a result, we use binary accuracy as the metric. For \\matres{}, we use only 1.5k (10\\%) of the training instances to match the size of the other two datasets. Table~\\ref{tab:datanum} summarizes data statistics.\nWe use $\\epsilon=0.1$ in equation~\\ref{eq:marginrankingloss} and $\\alpha=10$ in equation~\\ref{eq:loss}. All model training follows a standard textual entailment setup, uses default parameters, has the same number of steps, and averages from three random seeds. All training can be done with a single 48G-memory GPU within 5 hours.\n\n\\label{sec:datasetstats}\n\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}\n\n\n\\subsection{Baselines and Systems}\nWe report baseline performances of a state-of-the-art baseline PatternTime~\\cite{zhou-etal-2021-temporal}, as well as GPT-3.5~\\cite{brown2020language,ouyang2022training}. 
To show that \\datasetname{} and other incidental supervision signals contribute to generic temporal reasoning, we use the T5-large model implemented by~\\citet{wolf-etal-2020-transformers} as the base model and experiment with different supervision settings. We collect 5,000 GPT-3.5 generated instances in total, and 1,475 instances remain after our proposed verification models.\n\n\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) & Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision 
data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. \\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. \\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}\n\n\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 is doing random guessing on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to random guessing on \\datasetname{}. This suggests that biased supervision signals may improve on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as \\textit{drink coffee} is always before \\textit{dinner}.} but not generic temporal reasoning. To further prove this point, we observe that T5 (M+T) jointly trained on \\tracie{} and \\matres{} does not improve much over T5 trained only on corresponding in-domain supervision (+0.4\\% averaged accuracy), suggesting that previous temporal annotation styles do not motivate joint-learning nor generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nOn the contrary, we see that by simply using \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{}, and 0.5\\% on \\matres{}. 
When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{}, and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. Moreover, \\datasetname{} and incidental instances bring better joint-learning efficiency and feasibility, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy, which is a new state-of-the-art result for binary temporal relation classification across multiple datasets, and almost 10\\% better than using T5 and in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which gets 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 significantly improves on many natural-language benchmarks over GPT-3.5, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with fewer than a billion parameters. This shows that advances in large language models alone are insufficient to solve \\datasetname{}, and more rigorous and controllable reasoning models are desirable for future work.\n\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning, as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with the gold explanations. 
We therefore augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal, and the performance on \\datasetname{} remains far below that with gold explanations, which motivates future work on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. 
$\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting with all three verifiers achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision are making positive contributions in selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance the model performance, suggesting a higher potential for using LLMs to generate supervision signals to empower smaller models. It also directs us to research the trade-off between model scaling and data scaling in temporal reasoning. 
\n\nWe also conduct human analysis on the quality of the explanation sentences used in \\datasetname{} and subsequent incidental supervision instances. We adopt the commonly used criteria for explanation~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (if an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention}, and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct human evaluation of the properties mentioned above. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we analyze faithfulness for the additional sentence by asking if it makes the temporal relation “more” toward the target relation, and plausibility for the explanation sentence by asking if it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples for each dataset for our human evaluation. For either faithfulness or plausibility, we collect two human evaluations for each sample. Only a sample judged as correct by both human annotators is counted as positive, and we report the total number of positive samples as the final score. We restrict each annotator to at most 10 samples, and there are 92 distinct annotators in total.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, which is expected from our rigorous agreement requirements. 
Our verification system improves both metrics for GPT-3.5 generated incidental supervision, which further demonstrates the effectiveness of the proposed verification models.\n\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}\n 6.5 Ablation Studies and Human Analysis\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting with all three verifiers achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision are making positive contributions in selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance the model performance, suggesting a higher potential for using LLMs to generate supervision signals to empower smaller models. It also directs us to research the trade-off between model scaling and data scaling in temporal reasoning. \n\nWe also conduct human analysis on the quality of the explanation sentences used in \\datasetname{} and subsequent incidental supervision instances. 
We adopt the commonly used criteria for explanation~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (if an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention}, and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct human evaluation of the properties mentioned above. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we analyze faithfulness for the additional sentence by asking if it makes the temporal relation “more” toward the target relation and plausibility for the explanation sentence by asking if it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples for each dataset for our human evaluation. For either faithfulness or plausibility, we collect two human evaluations for each sample. Only the sample that is valued as correct by both human annotators will be counted as a positive sample and we denote the total number of positive samples as the final score. We restrict each annotator to take 10 samples at most and there are 92 distinct annotators.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, which is expected from our rigorous agreement requirements. 
", "Descriptive_question1": "What is the average performance value for the 'Ours' setting in table_6?", "Descriptive_question2": "How many GPT instances are used in the 'No Addition' setting in table_6?", "Reasoning_question1": "Why might the 'Ours' setting in table_6 achieve a higher average performance compared to the 'No Exp' setting despite using fewer GPT instances?", "Reasoning_question2": "What can be inferred from the performance trend in table_6 when comparing the use of more verifier-filtered supervision instances in the 'More #GPT' setting to other settings?", "Descriptive_answer1": "73.5", "Descriptive_answer2": "2,529", "Reasoning_answer1": "The 'Ours' setting likely achieves a higher average performance of 73.5 compared to 'No Exp' at 72.8, despite using fewer GPT instances (1,475 vs. 1,867), because it incorporates all three verifier models (explanation, additional sentence, and general verifiers). 
These verifiers, as mentioned in the table caption and context, filter out lower quality supervision instances generated by GPT-3.5, ensuring that only high-quality data is used for training. In contrast, 'No Exp' omits the explanation sentence verifier, which may allow less relevant or less accurate instances to be included, slightly reducing performance. This suggests that the quality of supervision data, rather than the quantity, plays a more critical role in achieving better results.", "Reasoning_answer2": "Comparing the 'More #GPT' setting to other settings in table_6, a performance trend emerges where using more verifier-filtered supervision instances (2,483 instances) leads to a higher average performance of 73.9, surpassing 'Ours' (73.5), 'No Exp' (72.8), 'No Addition' (70.4), and 'No General' (70.8). This indicates that increasing the number of high-quality, verifier-filtered instances can enhance model performance, as the filtering process ensures the data's relevance and accuracy. Additionally, as noted in the context, this trend suggests a potential for leveraging LLMs to generate supervision signals for smaller models, pointing towards a trade-off between data scaling and model scaling in temporal reasoning tasks. The improvement highlights the importance of balancing quantity with quality through effective verification." }, { "paper_id": "2212.10467.json", "table_id": "table_7", "table_content": "\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. 
GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}", "caption": "Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. ", "label": "tab:human", "section_info": "6 Experiment\n\\section{Experiment}\n\\label{sec:experiment}\nIn this section, we conduct a series of experiments to show that 1) existing systems do not truly understand temporal relations, 2) \\datasetname{} and incidental supervision signals partially address this issue, and 3) \\datasetname{} motivates future work towards generic temporal reasoning. \n\n\\subsection{Datasets, Metrics, and Settings}\nWe use our proposed dataset \\datasetname{} as the main benchmark, as well as transferability results from two other temporal reasoning benchmarks \\tracie{}~\\cite{zhou-etal-2021-temporal} and \\matres{}~\\cite{ning-etal-2018-multi} to show that existing models fail to perform generic temporal reasoning while our proposal makes significant improvements. \nFollowing \\citet{zhou-etal-2021-temporal}, all three datasets are processed as binary classification tasks by keeping instances that are originally annotated as either ``before'' or ``after''. As a result, we use binary accuracy as the metric. For \\matres{}, we use only 1.5k (10\\%) of the training instances to match the size of the other two datasets. Table~\\ref{tab:datanum} summarizes data statistics.\nWe use $\\epsilon=0.1$ in equation~\\ref{eq:marginrankingloss} and $\\alpha=10$ in equation~\\ref{eq:loss}. 
All model training follows a standard textual entailment setup, uses default parameters, has the same number of steps, and averages from three random seeds. All training can be done with a single 48G-memory GPU within 5 hours.\n\n\\label{sec:datasetstats}\n\\begin{table}[ht]\n\\centering\n\\small{\n\\scalebox{0.94}{\n\\begin{tabular}{lccccccc}\n\\toprule\nData &\\#Train& \\#Test & Relative-Label & Hard-Label\\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\n\\textsc{Today}&1,241&1,000&\\checkmark&\\\\\n\\textsc{Tracie}&860&1,924&&\\checkmark\\\\\n\\textsc{Matres}&1,500&1,322&&\\checkmark\\\\\n\\bottomrule\n\\end{tabular}}\n}\n\\caption{Statistics of the three datasets.} \n\\label{tab:datanum}\n\\end{table}\n\n\n\\subsection{Baselines and Systems}\nWe report baseline performances of a state-of-the-art baseline PatternTime~\\cite{zhou-etal-2021-temporal}, as well as GPT-3.5~\\cite{brown2020language,ouyang2022training}. To show that \\datasetname{} and other incidental supervision signals contribute to generic temporal reasoning, we use the T5-large model implemented by~\\citet{wolf-etal-2020-transformers} as the base model and experiment with different supervision settings. We collect 5,000 GPT-3.5 generated instances in total, and 1,475 instances remain after our proposed verification models.\n\n\\begin{table*}[t]\n\\centering\n\\small\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Train Data) & Loss & \\tracie{} & \\matres{} & \\datasetname{} & \\datasetname{} (gold exp.) 
& Average \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5}\\cmidrule(lr){6-6}\\cmidrule(lr){7-7}\\cmidrule(lr){8-8}\nGPT-3.5 text-davinci-002 & FewShot&56.1&49.0&57.9&68.7&54.3 \\\\\nGPT-3.5 text-davinci-003 & FewShot&52.3&50.1&59.0&70.0&53.8 \\\\\nT5 (in-domain) & CE / MR & 66.2 & 81.2 & 52.9 & 55.7 & 66.8 \\\\\nPatternTime & Distant & 77.0&73.0 &54.1&67.7&68.0\\\\\n\\cmidrule(lr){1-8}\nT5 (O) & MR &50.6&49.8&52.9&55.7&51.1\\\\\nT5 (O+G) & MR&55.4&52.3&55.0&66.5&54.2 \\\\\n\\cmidrule(lr){1-8}\nT5 (M) & CE & 52.7 & 81.2 & 52.5& 57.5 & 62.1\\\\\nT5 (M+O) & CE + MR & 51.5&81.7 &57.4&82.7&63.5\\\\\nT5 (M+O+G) & CE + MR &49.9& 82.9&61.4&\\textbf{82.9}& 64.8 \\\\\n\\cmidrule(lr){1-8}\nT5 (T) & CE & 66.2 & 63.2 & 52.3&56.0 & 60.7\\\\\nT5 (T+O) & CE + MR & 72.9 & 69.4 &59.9& 81.6 & 67.4\\\\\nT5 (T+O+G) & CE + MR &73.5& 68.8& 62.1&82.0&68.1\\\\\n\\cmidrule(lr){1-8}\nT5 (M+T) & CE & 66.2&82.0&52.5&58.5&66.9 \\\\\nT5 (M+T+O) & CE + MR & 73.0 & 83.5 & 57.9& 77.8& 71.5\\\\\nT5 (M+T+O+G) & CE + MR & 73.3&83.9&\\textbf{63.2}&81.6 & 73.5\\\\\n\\cmidrule(lr){1-8}\nPatternTime (M+T) & CE & 79.7 & 85.0 & 56.3 & 66.5 & 73.7 \\\\\nPatternTime (M+T+O) & CE + MR & 79.8 & 85.8 & 60.9 & 82.2 & 75.5 \\\\\nPatternTime (all) & CE + MR &\\textbf{79.9}& \\textbf{86.3}&62.9&82.3&\\textbf{76.4}\\\\\n\\bottomrule\n\\end{tabular}\n\\caption{System performances under different supervision data and loss function settings across three binary temporal benchmarks. For simplicity, we use T to denote \\tracie{} training data, and similarly M for \\matres{}, O for \\datasetname{} (ours), and G for GPT-3.5-generated incidental supervision. \\datasetname{} (gold exp.) uses gold explanations during evaluation. \\textit{Average} is averaged from \\tracie{}, \\matres{} and \\datasetname{} accuracies. 
\\textit{all} is equivalent to \\textit{M+T+O+G}.}\n\\label{tab:maintable}\n\\end{table*}\n\n\n\\subsection{Main Results}\nTable~\\ref{tab:maintable} shows system performances under different supervision data and loss function settings across three binary temporal benchmarks, without generated explanations. \n\n\n\\vpara{Existing Work is Insufficient.}\nWe observe that GPT-3.5 is doing random guessing on all three benchmarks, suggesting that language model objectives alone are insufficient for temporal reasoning. On the other hand, PatternTime achieves mid-70s accuracy on \\tracie{} and \\matres{} but drops to random guessing on \\datasetname{}. This suggests that biased supervision signals may improve on biased datasets,\\footnote{Here, ``biased'' refers to datasets that align with natural distributions, such as \\textit{drink coffee} is always before \\textit{dinner}.} but not generic temporal reasoning. To further prove this point, we observe that T5 (M+T) jointly trained on \\tracie{} and \\matres{} does not improve much over T5 trained only on corresponding in-domain supervision (+0.4\\% averaged accuracy), suggesting that previous temporal annotation styles do not motivate joint-learning nor generic temporal reasoning.\n\n\\vpara{Our Work Generalizes Better.}\nOn the contrary, we see that by simply using \\datasetname{}'s moderate-sized 1k training instances, T5 (in-domain+O) improves 6.7\\% on \\tracie{}, and 0.5\\% on \\matres{}. When we add the incidental supervision instances from GPT-3.5 (filtered by \\datasetname{}-supervised models in \\S\\ref{sec:incidental}, denoted as T5(in-domain+O+G) in Table~\\ref{tab:maintable}), there is a 7.3\\% improvement on \\tracie{}, and 1.7\\% on \\matres{}. This is, on average, 4.5\\% better than using \\matres{} or \\tracie{} as the supervision source. 
Moreover, \\datasetname{} and incidental instances bring better joint learning efficiency and possibility, as we see a 6.7\\% average accuracy improvement from T5(M+T+O+G) compared to T5's in-domain bests. If we use PatternTime\\footnote{PatternTime also uses T5-large as the base model, and it does not use any in-domain annotation.} as the base model, we achieve a 76.4\\% average accuracy which is the new state-of-the-art result of binary temporal relation classification across multiple datasets, and almost 10\\% better than using T5 and in-domain supervision alone.\n\n\\vpara{Scaling and Improving LLMs is Inadequate.} We test the latest GPT-4 model \\cite{OpenAI2023GPT4TR} on \\datasetname{}, which gets 64.0\\% accuracy, and 78.0\\% with gold explanations.\\footnote{We use the gpt-4-0314 checkpoint and chat API.} Even though GPT-4 is shown to significantly improve on many natural-language benchmarks over GPT-3.5, its improvement on \\datasetname{} is relatively moderate, and it is only comparable with (if not worse than) our proposed model with less than a billion parameters. This shows that the advancement in large language models alone is insufficient to solve \\datasetname{}, and more rigorous and controllable reasoning models are desirable for future works.\n\n\\subsection{Experiments with Generated Explanation}\n\\label{sec:inference} \nIn Table~\\ref{tab:maintable}, we see that explanations play an important role in generic temporal reasoning as \\textit{PatternTime(all)} improves almost 20\\% on \\datasetname{} with the gold explanations. We, therefore, augment test instances with generated explanations on all three datasets. To utilize the existing explanation verification models proposed in \\S\\ref{sec:incidental}, we generate an additional sentence together with an explanation sentence. 
Specifically, for each possible relation direction of the event pair, we generate an additional sentence $\\mathcal{AS}$ and an explanation sentence $Exp$ and then use explanation verifier models to select the $\\mathcal{AS}$ and $Exp$ with the highest positive probability out of the two candidates. We use the same models and prompts described in \\S\\ref{sec:incidental}, and we show a sample of generated explanations in Table~\\ref{tb:tracie}.\\footnote{We use the given $\\mathcal{AS}$ for \\datasetname{}. We achieve this with the same prompt but only ask GPT-3.5 to generate an explanation sentence.}\n\nTable~\\ref{tab:generate_exp} shows model performances when augmented with generated explanations. There are improvements on all three datasets compared to the numbers in Table~\\ref{tab:maintable}, with an average improvement of 1.0\\% using T5 and 0.5\\% using PatternTime. However, the overall performance is still suboptimal and the performance on \\datasetname{} is far from when using gold explanations, which motivates future works on generating better explanations.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nModel (Data) & T & M & \\datasetname{} & Avg & $\\bigtriangleup$ \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nT5 (all) & 76.1& 84.4 & 63.1 & 74.5 & 1.0\\\\\nPatternTime (all) & \\textbf{80.5} & \\textbf{86.8} & \\textbf{63.4} & \\textbf{76.9} & 0.5 \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Model performances when augmented with generated explanations described in \\S\\ref{sec:inference}. T refers to \\tracie{}, M refers to \\matres{}, and Avg refers to Average. 
$\\bigtriangleup$ shows the differences compared with Table \\ref{tab:maintable}.} \n\\label{tab:generate_exp}\n\\end{table}\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context}: \\textcolor{blue}{Jill studied all week for her math test. She stayed} \\\\ \\textcolor{blue}{up studying the cold night before too. The morning of the} \\\\ \\textcolor{blue}{ test, she woke up sick. But she went to school anyway. Jill's}\\\\ \\textcolor{blue}{teacher allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{Relation}: \\textcolor{orange}{Jill's teacher trusted Jill \\textbf{starts before} Jill's teacher} \\\\ \\textcolor{orange}{allowed her to take the test at home.} \\\\\n\\midrule\n\\textbf{$\\mathcal{AS}$}: \\textcolor{teal}{Jill's teacher had always been impressed by her } \\\\\n\\textcolor{teal}{dedication to her studies.}\\\\\n\\midrule\n\\textbf{$Exp$}: \\textcolor{teal}{The additional sentence implies jill's teacher allowed} \\\\ \\textcolor{teal}{her to take the test at home because she trusted her and was}\\\\\n\\textcolor{teal}{impressed by her dedication.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:tracie}An example of \\tracie{} with generated explanations in \\S\\ref{sec:inference}. 
$\\mathcal{AS}$ and $Exp$ are generated by GPT-3.5 and selected by our verification models described in \\S\\ref{sec:incidental}.\n}\n\\end{table}\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nAblation &\\#GPT& T & M & \\datasetname{} & Avg \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\\cmidrule(lr){4-4}\\cmidrule(lr){5-5} \\cmidrule(lr){6-6}\nOurs&1,475&73.3&83.9&63.2&73.5\\\\\nNo Exp&1,867&73.7&83.5&61.2&72.8\\\\\nNo Addition&2,529&70.2&81.4&59.5&70.4\\\\\nNo General&2,079&71.0&81.8&59.5&70.8\\\\\nMore \\#GPT&2,483&74.6&84.0&63.2&73.9\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Ablation study for LLM generated supervision. \\textit{No Exp} does not use the explanation sentence verifier in \\S\\ref{sec:gev}, \\textit{No Addition} does not use the additional sentence verifier, and \\textit{No General} does not use the general verifier. \\textit{More \\#GPT} uses more verifier-filtered supervision instances (filtered\nby three verifiers).} \n\\label{tab:ablation}\n\\end{table}\n\n\n\n\n\\subsection{Ablation Studies and Human Analysis}\nAs shown in Table~\\ref{tab:ablation}, we conduct ablation studies to better understand our incidental supervision signals. We see that the most rigorous setting with all three verifiers achieves the best performance with the fewest remaining instances. This suggests that all of our verifier models trained with \\datasetname{} supervision are making positive contributions in selecting high-quality instances from GPT-3.5 generations.\n\nWe also see that using more incidental supervision instances verified by the verification models described in \\S\\ref{sec:incidental} can further enhance the model performance, suggesting a higher potential for using LLMs to generate supervision signals to empower smaller models. It also directs us to research the trade-off between model scaling and data scaling in temporal reasoning. 
\n\nWe also conduct human analysis on the quality of the explanation sentences used in \\datasetname{} and subsequent incidental supervision instances. We adopt the commonly used criteria for explanation~\\cite{wiegreffe-marasovic-2021-review}, namely faithfulness (if an explanation implies the predicted label)~\\cite{wiegreffe-pinter-2019-attention}, and plausibility (how well an explanation supports a predicted label)~\\cite{deyoung-etal-2020-eraser}. We use Mechanical Turk to conduct human evaluation of the properties mentioned above. Given a differential analysis sample with an additional sentence and an explanation sentence towards a target temporal relation direction, we analyze faithfulness for the additional sentence by asking if it makes the temporal relation “more” toward the target relation and plausibility for the explanation sentence by asking if it explains why adding the differential content shifts the distribution toward the target relation. \n\nWe show the experiment interfaces in Appendix Fig.~\\ref{fig:eval} and present the results in Table~\\ref{tab:human}. \nWe randomly select 100 samples for each dataset for our human evaluation. For either faithfulness or plausibility, we collect two human evaluations for each sample. Only the sample that is valued as correct by both human annotators will be counted as a positive sample and we denote the total number of positive samples as the final score. We restrict each annotator to take 10 samples at most and there are 92 distinct annotators.\nWe see that \\datasetname{}'s test set contains high-quality explanation annotations, which is expected from our rigorous agreement requirements. 
Our verification system improves both metrics for GPT-3.5 generated incidental supervision, which further demonstrates the effectiveness of the proposed verification models.\n\n\\begin{table}[t]\n\\centering\n\\small{\n\\begin{tabular}{lccccccc}\n\\toprule\nData & Faithfulness& Plausibility \\\\\n\\cmidrule(lr){1-1}\\cmidrule(lr){2-2}\\cmidrule(lr){3-3}\n\\datasetname{} test&91&88\\\\\n\\datasetname{} train&79&68\\\\\nGPT-3.5 distilled&80&67\\\\\nGPT-3.5 random&57&55\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{Human evaluation for faithfulness and plausibility of temporal differential analysis.\nFaithfulness and Plausibility denote binary human evaluation results of the corresponding task. GPT-3.5 distilled refers to verifier-filtered GPT-3.5 data (filtered by three verifiers), and GPT-3.5 random refers to randomly sampled raw GPT-3.5 generated data. } \n\\label{tab:human}\n\\end{table}\n ", "Descriptive_question1": "What is the Faithfulness score for the datasetname test set in table_7?", "Descriptive_question2": "What is the Plausibility score for GPT-3.5 random in table_7?", "Reasoning_question1": "Why might the Faithfulness and Plausibility scores be higher for the datasetname test set compared to GPT-3.5 random in table_7?", "Reasoning_question2": "How does the use of verifier-filtered data impact the Faithfulness and Plausibility scores of GPT-3.5 distilled compared to GPT-3.5 random in table_7?", "Descriptive_answer1": "91", "Descriptive_answer2": "55", "Reasoning_answer1": "The higher Faithfulness and Plausibility scores for the datasetname test set compared to GPT-3.5 random can likely be attributed to the quality and rigor of the annotation process. First, the datasetname test set is noted to have high-quality explanation annotations due to rigorous agreement requirements among annotators, as mentioned in the text. 
This ensures that the data is carefully curated and verified, leading to more reliable and accurate evaluations, reflected in the high scores of 91 for Faithfulness and 88 for Plausibility. In contrast, GPT-3.5 random represents randomly sampled raw data generated by the model without any filtering or verification, resulting in lower quality explanations and additional sentences. This lack of quality control likely contributes to the significantly lower scores of 57 for Faithfulness and 55 for Plausibility, as the raw outputs may contain errors or inconsistencies that human evaluators deem less faithful or plausible.", "Reasoning_answer2": "The use of verifier-filtered data significantly improves the Faithfulness and Plausibility scores for GPT-3.5 distilled compared to GPT-3.5 random. Let's break this down: GPT-3.5 random, with scores of 57 for Faithfulness and 55 for Plausibility, consists of unfiltered, raw generated data, which likely includes a higher proportion of low-quality or irrelevant content as there is no mechanism to ensure accuracy or relevance. On the other hand, GPT-3.5 distilled, with scores of 80 for Faithfulness and 67 for Plausibility, is filtered by three verifiers as described in the table caption and context. This filtering process, supported by the verification models trained with datasetname supervision, selects higher-quality instances from the GPT-3.5 generations, as noted in the human analysis section. The verifiers help eliminate incorrect or poorly constructed outputs, thereby enhancing the overall quality of the data. Consequently, the improvement in scores for GPT-3.5 distilled demonstrates the effectiveness of the verification system in refining the raw data to achieve better alignment with human evaluation criteria for temporal differential analysis." 
}, { "paper_id": "2212.10467.json", "table_id": "table_8", "table_content": "\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nLet’s add a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why.\\\\ \nContext: \\textcolor{blue}{\nTara always wanted jewelry. Her birthday was coming up. Test went to the store. He gave her a really nice necklace.}\\\\\n \\textcolor{blue}{She adored him for the gift.\n}\\\\\nHypothesis: \\textcolor{orange}{Test was being a good friend \\textbf{starts after} he give her a really nice necklace}\\\\\n\nAdd a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \\\\\n\\textcolor{teal}{Test had a secret crush on a girl named Tara in the lower grade.} \\\\\n\\textcolor{teal}{Explanation: the fact that Test and Tara were in different grades implies that their relationship may not have been particularly close.}\\\\\n\\textcolor{teal}{However, Test's secret crush on Tara suggests that he paid close attention to her. By giving her the necklace, Test aimed to establish}\\\\\n\\textcolor{teal}{a stronger connection with Tara. } \\\\\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{\nTara always wanted jewelry. Her birthday was coming up. Test went to the store. He gave her a really nice necklace.}\\\\\n \\textcolor{blue}{She adored him for the gift.\n}\\\\\nHypothesis: \\textcolor{orange}{Test was being a good friend \\textbf{starts before} he give her a really nice necklace}\\\\\n\nAdd a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. 
\\\\\n\\textcolor{teal}{Test and Tara always hung out together.} \\\\\n\\textcolor{teal}{Explanation: normally people who hang out frequently are friends, and friends will send each other gifts on their birthdays.} \\\\\n\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{I have always been attracted to Hispanic men. That said, my first huge crush was on a Mexican. I was in love with}\\\\\n\\textcolor{blue}{him for two years. After two years, I realized I was wasting my time and idolizing him. Without any real sense of closure, I} \\\\\n\\textcolor{blue}{decided to pull my heart away.} \\\\\nHypothesis: \\textcolor{orange}{I felt lonely \\textbf{starts before} I decided to pull my heart away} \\\\\nAdd a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \n \\\\\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:prompt} A sample prompt with an instance for two hypothetical changes to make the event pair's temporal relation \"more before\" or \"more after\".\n}\n\\end{table*}", "caption": "\n\t\\label{tb:prompt} A sample prompt with an instance for two hypothetical changes to make the event pair's temporal relation \"more before\" or \"more after\".\n", "label": "tb:prompt", "section_info": "5 LLM Incidental Supervision\n\\section{LLM Incidental Supervision}\n\\label{sec:incidental}\n\nAs we hypothesize and later show in \\S\\ref{sec:experiment}, human-annotated explanations greatly benefit generic temporal reasoning models, as they encourage models to learn to use the correct signals. However, it is extremely difficult and expensive to crowdsource such explanations for training purposes since collecting one instance costs \\$1 on average. On the other hand, large language models (LLMs) can produce a large amount of generated explanations at a much cheaper cost. Unfortunately, these generated explanations are mostly unusable as they are simply model guesses based on \ntextual correlations. 
\n\nIn this section, we introduce a knowledge distillation method that combines the benefits of both human annotations and LLM generations by training verification models based on our seed annotation, which is then used to select generations more likely to be plausible. Compared to previous work \\cite{wiegreffe-etal-2022-reframing}, we propose a verification system composed of multiple models that individually verify different aspects of automatically-generated explanations. We detail our pipeline below.\n\n\\subsection{Temporal Explanations from GPT-3.5}\nWe adopt the same event pair generation and context selection process as detailed in \\S\\ref{sec:dataset}. We design prompts as shown in Appendix Table~\\ref{tb:prompt} and Table~\\ref{tb:prompt1} that provide GPT-3.5 with contexts, event pairs, and temporal relations, and ask GPT-3.5 to generate additional sentences, how these sentences will change the temporal relations, and why. The prompt contains a few examples, which makes this setting few-shot. \n\n\n\n\\subsection{Verification System}\n\\label{sec:gev}\n\n\\vpara{Similarity-based Filtering.}\nWe filter GPT-3.5 instances that use exact same sentences from the context as the additional sentence or repeat the event pairs and temporal relations as explanations. We use S-BERT~\\cite{reimers-gurevych-2019-sentence} with a $0.95$ threshold to perform this filtering.\n\n\\vpara{General Explanation Verifier.}\nWe use the generic temporal relation model as proposed in \\S\\ref{sec:model} trained on \\datasetname{} and an additional temporal relation dataset\\footnote{Depending on the target task, this additional temporal relation dataset is different. 
We use \matres{} / \tracie{} / \matres{} + \tracie{} as the additional temporal relation dataset when evaluated on \matres{} / \tracie{} / All, respectively.} to verify if the generated additional sentence $\mathcal{AS}$ together with the explanation sentence $Exp$ shifts the temporal relation to the direction that it is supposed to.\n\n\vpara{Additional Sentence Verifier.}\nThe general explanation verifier cannot sufficiently identify partial correctness of GPT-3.5 generations. For example, a generated instance may have a sub-optimal $\mathcal{AS}$ but convincing $Exp$, which could be deceptive. To address this, we train a separate $\mathcal{AS}$ verification model with \datasetname{} that does not use $Exp$ as input. We follow the same training scheme as \S\ref{sec:model}, and similarly, verify if the $\mathcal{AS}$ shifts the temporal relation as expected, which serves as our filtering criterion.\n\n\vpara{Explanation Sentence Verifier.}\nWe also train a binary classification model to check the plausibility of $Exp$ individually. To generate negative $Exp$ instances, for each instance in the \datasetname{} training set with a given $\mathcal{AS}$, we ask GPT-3.5 to generate three possible explanation sentences. We use the one that is the least similar to the human-annotated $Exp$ according to S-BERT as the negative instance, which we denote as $Exp_{neg}$. We finetune the base seq-to-seq model with the positive and negative explanations and optimize the loss function as the negative log-likelihood of the positive explanation:\n\beqn{\n\ell^{E} &= -\log\frac{e^{p_{pos}}}{e^{p_{pos}}+e^{p_{neg}}}\\\np_{pos} &= p(ent|(\mathcal{AS}+\mathcal{C},Exp_{human}),r_g) \\\np_{neg} &= p(ent|(\mathcal{AS}+\mathcal{C},Exp_{neg}),r_g)\n}\nWe filter all GPT-3.5 generated instances whose explanation is deemed negative by this binary classification model. 
\n", "Descriptive_question1": "What is the main focus of table_8 in the appendix?", "Descriptive_question2": "Which model is referenced in table_8 for generating temporal explanations?", "Reasoning_question1": "How does the structure of the prompt in table_8 contribute to the generation of temporal relation changes?", "Reasoning_question2": "Why might the examples provided in table_8 be significant for the few-shot learning approach of GPT-3.5?", "Descriptive_answer1": "temporal relations", "Descriptive_answer2": "GPT-3.5", "Reasoning_answer1": "The structure of the prompt in table_8 is designed to guide GPT-3.5 in generating temporal relation changes by providing specific contexts, event pairs, and desired temporal relations ('more before' or 'more after'). This structured input helps by framing the task clearly, ensuring the model understands the objective of adding sentences to shift temporal perceptions. By including both the context and hypothesis, the prompt allows the model to focus on generating relevant additional sentences and explanations that align with the intended temporal shift, thus improving the accuracy and relevance of the output.", "Reasoning_answer2": "The examples provided in table_8 are significant for the few-shot learning approach of GPT-3.5 because they serve as templates or demonstrations of the desired output format and logic. 
Few-shot learning relies on a small set of examples to adapt the model to a specific task without extensive retraining. In this case, the examples help GPT-3.5 understand how to construct additional sentences and explanations that effectively alter temporal relations. By observing these instances, the model can infer patterns and apply similar reasoning to new contexts, enhancing its ability to generate plausible and contextually appropriate responses with minimal training data." }, { "paper_id": "2212.10467.json", "table_id": "table_9", "table_content": "\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nLet’s add a sentence as the first sentence of the paragraph to let the statement more likely to hold true and explain why. \\\\ \nParagraph: \\textcolor{blue}{Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of }\\\\\n\\textcolor{blue}{his teeth was rotten. Once the tooth was pulled, Tim felt fine.} \\\\\nStatement: \\textcolor{orange}{Tim scheduled an appointment with his dentist \\textbf{starts after} his tooth started hurting like crazy} \\\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\\\n\\textcolor{teal}{Tim's tooth was usually perfect, so he did not often go to see the dentist.} \\\\\n\\textcolor{teal}{This makes the statement true because it implies that Tim did not have regular appointments with his dentist and the reason why he} \\\\\n\\textcolor{teal}{scheduled an appointment with his dentist was that his tooth was hurting like crazy.}\\\\\n\\#\\#\\# \\\\\nParagraph: \\textcolor{blue}{Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. 
One of }\\\n\textcolor{blue}{his teeth was rotten. Once the tooth was pulled, Tim felt fine.} \\\nStatement: \textcolor{orange}{Tim scheduled an appointment with his dentist \textbf{starts before} his tooth started hurting like crazy} \\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\n\textcolor{teal}{Tim always met his dentist regularly.} \\\n\textcolor{teal}{This makes the statement true because it implies that Tim may have already scheduled regular appointments with his dentist before} \\\n\textcolor{teal}{his tooth started hurting like crazy.} \\\n\#\#\# \\\nParagraph: \textcolor{blue}{ Chuck was hanging out with some friends at a bar. They mentioned that they were moving soon. Chuck offered}\\\n \textcolor{blue}{to help them move their things. The team worked together and got the move done quickly. They were so grateful that they\n}\\\n \textcolor{blue}{invited him to stay for dinner.\n }\\\nStatement: \textcolor{orange}{ Chuck wanted to be helpful \textbf{starts before} Chuck offered to help them move their things}\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\n\textcolor{teal}{Chuck is the kind of person that always wants to help out.} \\\n\textcolor{teal}{This makes the statement true because it implies Chuck wanted to help his friends move their things because he is naturally} \\\n\textcolor{teal}{helpful.} \\\n\#\#\# \\\nParagraph: \textcolor{blue}{ Chuck was hanging out with some friends at a bar. They mentioned that they were moving soon. Chuck offered}\\\n \textcolor{blue}{to help them move their things. The team worked together and got the move done quickly. 
They were so grateful that they\n}\\\n \textcolor{blue}{invited him to stay for dinner.\n }\\\nStatement: \textcolor{orange}{ Chuck wanted to be helpful \textbf{starts after} Chuck offered to help them move their things}\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\n\textcolor{teal}{Chuck often found himself reluctant to do things, but grateful afterward that he did.} \\\n\textcolor{teal}{This makes the statement true because if Chuck was reluctant, he might not have truly felt like being helpful until after he} \\\n\textcolor{teal}{offered to help and was grateful afterward.} \\\n\#\#\# \\\nParagraph: \textcolor{blue}{ I have always been attracted to Hispanic men. That said, my first huge crush was a Mexican. I was in love with}\\\n \textcolor{blue}{him for two years. After two years, I realized I was wasting my time and over-idolizing him. Without any real sense of closure, I\n}\\\n \textcolor{blue}{decided to pull my heart away.\n }\\\nStatement: \textcolor{orange}{I felt lonely \textbf{starts before} I decided to pull my heart away}\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true?\n \\\\bottomrule\n\end{tabular}\n}\n\caption{\n\t\label{tb:prompt1} A sample prompt with two instances for two hypothetical changes to make the event pair's temporal relation \"more before\" or \"more after\".\n}\n\end{table*}", "caption": "\n\t\label{tb:prompt1} A sample prompt with two instances for two hypothetical changes to make the event pair's temporal relation \"more before\" or \"more after\".\n", "label": "tb:prompt1", "section_info": "5 LLM Incidental Supervision\n\section{LLM Incidental Supervision}\n\label{sec:incidental}\n\nAs we hypothesize and later show in \S\ref{sec:experiment}, human-annotated explanations greatly benefit generic temporal reasoning models, as they encourage models to learn 
to use the correct signals. However, it is extremely difficult and expensive to crowdsource such explanations for training purposes since collecting one instance costs \$1 on average. On the other hand, large language models (LLMs) can produce a large amount of generated explanations at a much cheaper cost. Unfortunately, these generated explanations are mostly unusable as they are simply model guesses based on \ntextual correlations. \n\nIn this section, we introduce a knowledge distillation method that combines the benefits of both human annotations and LLM generations by training verification models based on our seed annotation, which is then used to select generations more likely to be plausible. Compared to previous work \cite{wiegreffe-etal-2022-reframing}, we propose a verification system composed of multiple models that individually verify different aspects of automatically-generated explanations. We detail our pipeline below.\n\n\subsection{Temporal Explanations from GPT-3.5}\nWe adopt the same event pair generation and context selection process as detailed in \S\ref{sec:dataset}. We design prompts as shown in Appendix Table~\ref{tb:prompt} and Table~\ref{tb:prompt1} that provide GPT-3.5 with contexts, event pairs, and temporal relations, and ask GPT-3.5 to generate additional sentences, how these sentences will change the temporal relations, and why. The prompt contains a few examples, which makes this setting few-shot. \n\n\n\n\subsection{Verification System}\n\label{sec:gev}\n\n\vpara{Similarity-based Filtering.}\nWe filter GPT-3.5 instances that use the exact same sentences from the context as the additional sentence or repeat the event pairs and temporal relations as explanations. 
We use S-BERT~\cite{reimers-gurevych-2019-sentence} with a $0.95$ threshold to perform this filtering.\n\n\vpara{General Explanation Verifier.}\nWe use the generic temporal relation model as proposed in \S\ref{sec:model} trained on \datasetname{} and an additional temporal relation dataset\footnote{Depending on the target task, this additional temporal relation dataset is different. We use \matres{} / \tracie{} / \matres{} + \tracie{} as the additional temporal relation dataset when evaluated on \matres{} / \tracie{} / All, respectively.} to verify if the generated additional sentence $\mathcal{AS}$ together with the explanation sentence $Exp$ shifts the temporal relation to the direction that it is supposed to.\n\n\vpara{Additional Sentence Verifier.}\nThe general explanation verifier cannot sufficiently identify partial correctness of GPT-3.5 generations. For example, a generated instance may have a sub-optimal $\mathcal{AS}$ but convincing $Exp$, which could be deceptive. To address this, we train a separate $\mathcal{AS}$ verification model with \datasetname{} that does not use $Exp$ as input. We follow the same training scheme as \S\ref{sec:model}, and similarly, verify if the $\mathcal{AS}$ shifts the temporal relation as expected, which serves as our filtering criterion.\n\n\vpara{Explanation Sentence Verifier.}\nWe also train a binary classification model to check the plausibility of $Exp$ individually. To generate negative $Exp$ instances, for each instance in the \datasetname{} training set with a given $\mathcal{AS}$, we ask GPT-3.5 to generate three possible explanation sentences. We use the one that is the least similar to the human-annotated $Exp$ according to S-BERT as the negative instance, which we denote as $Exp_{neg}$. 
We finetune the base seq-to-seq model with the positive and negative explanations and optimize the loss function as the negative log-likelihood of the positive explanation:\n\beqn{\n\ell^{E} &= -\log\frac{e^{p_{pos}}}{e^{p_{pos}}+e^{p_{neg}}}\\\np_{pos} &= p(ent|(\mathcal{AS}+\mathcal{C},Exp_{human}),r_g) \\\np_{neg} &= p(ent|(\mathcal{AS}+\mathcal{C},Exp_{neg}),r_g)\n}\nWe filter all GPT-3.5 generated instances whose explanation is deemed negative by this binary classification model. \n", "Descriptive_question1": "What is the purpose of table_9 in the context of the study?", "Descriptive_question2": "How many instances are presented in table_9?", "Reasoning_question1": "How might the hypothetical changes in table_9 influence the temporal relation analysis performed by GPT-3.5?", "Reasoning_question2": "Why do you think the authors chose to present two specific instances in table_9 for demonstrating temporal relation shifts?", "Descriptive_answer1": "Sample prompt", "Descriptive_answer2": "Two instances", "Reasoning_answer1": "The hypothetical changes in table_9, such as adding sentences to alter the context of events, can significantly influence the temporal relation analysis performed by GPT-3.5. By introducing new sentences, the model is prompted to reassess the sequence or causality between events, which may shift the perceived temporal relations to 'more before' or 'more after.' 
This allows the model to generate explanations that reflect how context modifications impact event ordering, thereby testing its ability to adapt to nuanced changes in narrative structure. Such changes help in understanding whether GPT-3.5 can accurately detect and explain temporal shifts based on subtle contextual cues, enhancing the robustness of temporal reasoning models through incidental supervision.", "Reasoning_answer2": "The authors likely chose to present two specific instances in table_9 to provide a balanced demonstration of how temporal relations can be manipulated in both directions—'more before' and 'more after.' This dual representation serves as a clear example of the flexibility and range of temporal relation shifts, offering a comparative perspective on how different contextual additions can alter event sequencing. By showing two contrasting cases for each scenario, the authors can effectively illustrate the concept of temporal modification to readers and highlight the potential of GPT-3.5 to generate varied explanations. This approach also underscores the methodology of using few-shot learning prompts, ensuring the model's capability is tested across diverse temporal dynamics." }, { "paper_id": "2212.10467.json", "table_id": "table_10", "table_content": "\begin{table*}[ht]\n\newcolumntype{?}{!{\vrule width 1pt}}\n\newcolumntype{C}{>{\centering\arraybackslash}p{40em}}\n\n\centering \n\renewcommand\arraystretch{1.0}\n\small\n\scalebox{0.95}{\n\begin{tabular}{@{}l@{}}\n\toprule\nLet's find out an event that is unmentioned but can be inferred from the context and the temporal relation between the two events\\\n is not deterministic. The new event should not be longer than ten words and include only one verb. \\ \nContext: \textcolor{blue}{\nTara always wanted jewelry. Her birthday was coming up. Test went to the store. 
He gave her a really nice necklace.}\\\n \textcolor{blue}{She adored him for the gift.\n}\\\nWhat is an event that is unmentioned but has some role and can be inferred from the context? \\\n\textcolor{teal}{Test was being a good friend} \\\n\textcolor{teal}{It can be inferred from She adored him for the gift.} \\\n\#\#\# \\\nContext: \textcolor{blue}{Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of }\\\n\textcolor{blue}{his teeth was rotten. Once the tooth was pulled, Tim felt fine.}\\\nWhat is an event that is unmentioned but has some role and can be inferred from the context? \\\n\textcolor{teal}{Tim scheduled an appointment with his dentist} \\\n\textcolor{teal}{It can be inferred from Tim's tooth was hurting like crazy.} \\\n\#\#\# \\\nContext: \textcolor{blue}{Lily went to a nice restaurant. She ordered a steak. To her dismay the steak was rare. Lily was rather upset. She had }\\\n\textcolor{blue}{to send it back.}\\\nWhat is an event that is unmentioned but has some role and can be inferred from the context?\n \\\\bottomrule\n\end{tabular}\n}\n\caption{\n\t\label{tb:prompt2} A sample prompt to generate an implicit event given the context.\n}\n\end{table*}", "caption": "\n\t\label{tb:prompt2} A sample prompt to generate an implicit event given the context.\n", "label": "tb:prompt2", "section_info": "3 Dataset\n\section{Dataset}\n\label{sec:dataset}\n\n\n\nIn this section, we introduce the evaluation framework and collection process of \datasetname{}.\n\n\subsection{Task overview}\nThe \datasetname{} dataset and its overall framework are designed to evaluate systems' ability to make temporal predictions with plausible reasons. Existing datasets, including \matres, \textsc{Torque}, and \tracie, only annotate common event pairs that align with human common sense. 
In other words, if an event pair does not strongly imply a temporal relation (e.g. over 80\\% confidence), it will not be annotated and tested on systems. This allows pre-trained language models with millions of parameters to exploit annotation artifacts and priors that do not necessarily hold in certain contexts. For example, we know ``lunch'' is usually before ``dinner'', but this also depends on if they are performed by the same subject, at the same location, and/or on the same day. Unfortunately, current models often memorize such relations as immutable facts, leading to prediction errors in instances that are less common in real life. This intuition inspires us to build a framework to evaluate how much spurious information and priors current models are using.\n\n\\vpara{Temporal Explanations.}\nAn ideal method to evaluate whether models are making predictions in the right way is to let them explain why a certain prediction is made and evaluate the faithfulness and plausibility of the explanations. However, such an evaluation framework is almost impossible to achieve with current progress in natural language processing, where the two main challenges are: 1) it is extremely difficult to collect gold explanations that are sufficient to cover any possible sets of explanations; and 2) it is impossible to evaluate system generations using existing summarization metrics automatically.\n\n\\vpara{Temporal Differential Analysis.}\nBecause of the aforementioned challenges in directly evaluating system explanations, we propose an alternative that is a close proxy to the ideal form, namely temporal differential analysis. The core of the temporal differential analysis is to check if models can correctly identify how a subtle change to the context may affect the temporal relations of a given event pair. 
The intuition behind this choice is two-fold: 1) it is much easier for both annotators and models to produce an explanation if they know which dimension to focus on; and 2) this provides a binary evaluation measure that is deterministic and trustworthy in terms of reflecting how much spurious information models are using. \n\nSpecifically, our differential analysis process is defined below. Given an original context $\\mathcal{C}$, event 1 $\\mathcal{E}_1$ and event 2 $\\mathcal{E}_2$,\nwe assume a gold distribution $\\mathbb{D}=\\{P_{before}, P_{after}, P_{same}\\}$ on the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ concerning $\\mathcal{C}$, where $P_{before}, P_{after}, P_{same}$ are the probabilities of the temporal relation being before, after and simultaneous respectively, and the probabilities altogether sum to 1. We then annotate two additional sentences $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$, where the temporal relation distribution between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ with respect to $\\mathcal{AS}_{before}+\\mathcal{C}$ results in an increased $P_{before}$, while similarly the distribution using $\\mathcal{AS}_{after}+\\mathcal{C}$ as the context has a higher $P_{after}$.\n\nTable~\\ref{tb:example} shows an example instance of temporal differential analysis, where an additional sentence $\\mathcal{AS}_{before}$ has an effect on the temporal relation between the two events and shifts the label distribution towards ``before''. We conducted a human pilot study for this formulation and found that it is easier to annotate and achieve substantial improvement over the explanation quality than to directly ask annotators to provide custom explanations for an event pair. 
We therefore adopt the former formulation and create our evaluation dataset \\datasetname{} through a multi-stage annotation process as described below.\n\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context $\\mathcal{C}$}: \\textcolor{blue}{Tim’s tooth was hurting like crazy. His dentist} \\\\ \\textcolor{blue}{took a look around in his mouth. One of his teeth was rotten.} \\\\ \\textcolor{blue}{Once the tooth was pulled, Tim felt fine.}\\\\ \n\\midrule\n\\textbf{Additional Sentence 1 ($\\mathcal{AS}_{before}$)}: \\textcolor{teal}{Tim always met his } \\\\\n\\textcolor{teal}{dentist regularly.}\\\\\n\\midrule\n\\textbf{Event 1 ($\\mathcal{E}_1$)}: \\textcolor{orange}{Tim scheduled an appointment with his dentist.} \\\\\n\\textbf{Event 2 ($\\mathcal{E}_2$)}: \\textcolor{orange}{Tim's tooth started to hurt like crazy.} \\\\\n\\midrule\n\\textbf{Explanation ($Exp$)}: \\textcolor{teal}{Some people maintain regular visits to} \\\\ \\textcolor{teal}{a dentist. Tim is one of these individuals and may have} \\\\ \\textcolor{teal}{ already scheduled a regular appointment with his dentist }\\\\\n\\textcolor{teal}{before his tooth started to hurt.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:example} An example of temporal differential analysis, where $\\mathcal{AS}$ shifts the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ to be more ``before''. See \\S \\ref{sec:dataset} for more details.\n}\n\\end{table}\n\n\n\\subsection{Dataset Construction}\nFollowing the definition of the temporal differential analysis framework above, we collect a dataset to carry out the actual evaluation. 
Each instance in \\datasetname{} contains a context $\\mathcal{C}$, an event pair $\\mathcal{E}_1$, $\\mathcal{E}_2$, and an additional sentence of either $\\mathcal{AS}_{before}$ or $\\mathcal{AS}_{after}$. In addition, we also annotate a human explanation $Exp$ regarding why the additional sentence affects the temporal relation between the two events. \\datasetname{} is constructed in three steps: 1) event pair generation, 2) additional sentence and explanation annotation, and 3) annotation verification and cleaning. We detail this pipeline below. \n\n\\vpara{Generating $\\mathcal{C}$ and $\\mathcal{E}$.}\nWe randomly sample short stories from the ROCStories dataset~\\cite{mostafazadeh-etal-2016-corpus} as the context $\\mathcal{C}$. For each story, we use GPT-3.5 \\footnote{We use GPT-3.5 text-davinci-002 for data generation throughout the work.} to generate an implicit event phrase based on an explicit event phrase selected by GPT-3.5 at the same time. An implicit event is an event that is not explicitly mentioned by the given context but is still inferable and relevant, e.g. Event 1 in Table~\\ref{tb:example}. A sample prompt can be referred to in Appendix Table~\\ref{tb:prompt2} to construct an event pair. We do this for two main reasons: 1) events that are not explicitly mentioned by the context provide more uncertainty so that the event pair does not come with a deterministic temporal relation decided by the context; 2) this is closer to the format of \\tracie{}, which we aim to compare system performance changes with. \n\n\\vpara{Crowdsourcing $\\mathcal{AS}$ and $Exp$.}\nAfter generating $\\mathcal{C}$ and $\\mathcal{E}$'s, we use Mechanical Turk to ask crowdsourcing annotators to write potential $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$ with respect to the provided information. The guideline asks annotators to write additional sentences that can be added to the beginning of the context to prevent models from using text positional information. 
The annotator is also asked to explain why they wrote $\mathcal{AS}$ and why it affects the temporal relation distribution. We use this as $Exp$. We design an annotation interface that is intuitive and filled with examples, and at the same time, we require annotators to pass a rigorous qualification test to demonstrate a proper understanding. We list our interfaces and tests in Fig.~\ref{fig:mturk} and Table~\ref{tb:qual}.\n\n\vpara{Annotation Verification.}\nWe employ an additional verification stage for the human-written instances from the previous step. We provide annotators with the formatted textual entailment instance and ask if the entailment label changes in the expected direction. We collect two individual verifications per instance, and the instances accepted by all annotators appear in the test set.\n\n\n\subsection{Statistics}\nWe collect 1,000 instances agreed upon by all annotators as the evaluation set and construct a silver training set with the remaining 1,241 instances that do not have unanimous annotator agreements. \n\n\n", "Descriptive_question1": "What is the purpose of the sample prompt shown in table_10?", "Descriptive_question2": "Where is the sample prompt for generating an implicit event located in table_10?", "Reasoning_question1": "Why might the sample prompt in table_10 be crucial for constructing event pairs in the context of temporal differential analysis?", "Reasoning_question2": "How does the sample prompt in table_10 contribute to evaluating a system's ability to make temporal predictions with plausible reasons?", "Descriptive_answer1": "Generate implicit event", "Descriptive_answer2": "Appendix Table", "Reasoning_answer1": "The sample prompt in table_10 is crucial for constructing event pairs in temporal differential analysis because it guides the generation of implicit events that are not explicitly mentioned in the context but are inferable. This process begins with understanding the context provided, such as short stories from datasets like ROCStories. The prompt ensures that the generated events introduce uncertainty in temporal relations, as they are not deterministically tied to the context. This uncertainty is key to testing a model's ability to handle subtle contextual changes, as it prevents the model from relying on memorized or overly deterministic patterns. 
By creating such event pairs, the prompt helps in evaluating whether models can correctly predict temporal relations when additional sentences shift the probability distribution of those relations, thus supporting the core objective of temporal differential analysis.", "Reasoning_answer2": "The sample prompt in table_10 contributes to evaluating a system's ability to make temporal predictions with plausible reasons by providing a structured approach to generating implicit events that test a model's reasoning capabilities. First, it sets the foundation for creating contexts and event pairs with inherent temporal ambiguity, as seen in the examples provided in the table. This ambiguity challenges the system to go beyond simple pattern recognition and consider nuanced contextual clues. Additionally, the prompt's focus on implicit events aligns with the dataset's goal of identifying whether models rely on spurious correlations or priors, rather than genuine temporal understanding. By using such prompts, the framework can assess if systems can adapt their predictions when additional context (like AS_before or AS_after) is introduced, and if they can provide plausible reasoning for changes in temporal relations, mirroring the human-like reasoning the dataset aims to evaluate." }, { "paper_id": "2212.10467.json", "table_id": "table_11", "table_content": "\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nPlease read the paragraph below and the two following statements that use the paragraph for context.\\\\\n Use your imagination and add a sentence in the front of the paragraph so that the statement will be more likely to hold. \\\\ \nThe sentence you add CANNOT directly include the implicit event: Tim scheduled an appointment with his dentist. 
\\\\\n \\midrule \n\\textbf{Paragraph}: Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of \\\\\nhis teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\textbf{Statement 1}: Tim scheduled an appointment with his dentist \\textbf{starts after} his tooth was hurting like crazy.\\\\\n\\\\\n\\textbf{Question 1.1}: Which modified paragraph do you think is the most suitable to make statement 1 more likely to hold?\\\\\n$\\circ$ \\textbf{Tim ate a lot of spicy food.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in \\\\\nhis mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n$\\circ$ \\textbf{Tim didn't schedule an appointment with his dentist.} Tim's tooth was hurting like crazy. He could barely eat or drink. His\\\\\ndentist took a look around in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n$\\bullet$ \\textbf{Tim's tooth was usually perfect, so he did not often go to see the dentist.} Tim's tooth was hurting like crazy. He could barely\\\\\neat or drink. His dentist took a look around in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\midrule \n\\textbf{Paragraph}: Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of \\\\\nhis teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\textbf{Statement 2}: Tim scheduled an appointment with his dentist \\textbf{starts before} his tooth was hurting like crazy. \\\\\n\\\\\n\\textbf{Question 1.2}: Which modified paragraph do you think is the most suitable to make statement 2 more likely to hold? \\\\\n$\\circ$ \\textbf{Tim scheduled an appointment with his dentist.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist\\\\\ntook a look around in his mouth. 
One of his teeth was rotten. Once the tooth was pulled, Tim felt fine.\\\\\n$\\circ$ \\textbf{Tim was looking for a dentist.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around\\\\\nin his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n$\\bullet$ \\textbf{Tim always met his dentist regularly.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look\\\\\naround in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\midrule\n\\textbf{Question 2}: Do you understand that the additional sentence and the explanation you write down must make the statement more \\\\\nlikely to hold true and irrelevant explanation answers like \"good\" or merely copying any part of the paragraph will not be paid? \\\\\n$\\bullet$ Yes \\\\\n$\\circ$ No \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:qual}Qualification test of differential analysis annotation. Participants can take the qualification test 3 times and only those who answer each question correctly can be allowed for annotation and evaluation tasks. \n}\n\\end{table*}", "caption": "\n\t\\label{tb:qual}Qualification test of differential analysis annotation. Participants can take the qualification test 3 times and only those who answer each question correctly can be allowed for annotation and evaluation tasks. \n", "label": "tb:qual", "section_info": "3 Dataset\n\\section{Dataset}\n\\label{sec:dataset}\n\n\n\nIn this section, we introduce the evaluation framework and collection process of \\datasetname{}.\n\n\\subsection{Task overview}\nThe \\datasetname{} dataset and its overall framework are designed to evaluate systems' ability to make temporal predictions with plausible reasons. Existing datasets, including \\matres, \\textsc{Torque}, and \\tracie, only annotate common event pairs that align with human common sense. 
In other words, if an event pair does not strongly imply a temporal relation (e.g. over 80\\% confidence), it will not be annotated and tested on systems. This allows pre-trained language models with millions of parameters to exploit annotation artifacts and priors that do not necessarily hold in certain contexts. For example, we know ``lunch'' is usually before ``dinner'', but this also depends on if they are performed by the same subject, at the same location, and/or on the same day. Unfortunately, current models often memorize such relations as immutable facts, leading to prediction errors in instances that are less common in real life. This intuition inspires us to build a framework to evaluate how much spurious information and priors current models are using.\n\n\\vpara{Temporal Explanations.}\nAn ideal method to evaluate whether models are making predictions in the right way is to let them explain why a certain prediction is made and evaluate the faithfulness and plausibility of the explanations. However, such an evaluation framework is almost impossible to achieve with current progress in natural language processing, where the two main challenges are: 1) it is extremely difficult to collect gold explanations that are sufficient to cover any possible sets of explanations; and 2) it is impossible to evaluate system generations using existing summarization metrics automatically.\n\n\\vpara{Temporal Differential Analysis.}\nBecause of the aforementioned challenges in directly evaluating system explanations, we propose an alternative that is a close proxy to the ideal form, namely temporal differential analysis. The core of the temporal differential analysis is to check if models can correctly identify how a subtle change to the context may affect the temporal relations of a given event pair. 
The intuition behind this choice is two-fold: 1) it is much easier for both annotators and models to produce an explanation if they know which dimension to focus on; and 2) this provides a binary evaluation measure that is deterministic and trustworthy in terms of reflecting how much spurious information models are using. \n\nSpecifically, our differential analysis process is defined below. Given an original context $\\mathcal{C}$, event 1 $\\mathcal{E}_1$ and event 2 $\\mathcal{E}_2$,\nwe assume a gold distribution $\\mathbb{D}=\\{P_{before}, P_{after}, P_{same}\\}$ on the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ concerning $\\mathcal{C}$, where $P_{before}, P_{after}, P_{same}$ are the probabilities of the temporal relation being before, after and simultaneous respectively, and the probabilities altogether sum to 1. We then annotate two additional sentences $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$, where the temporal relation distribution between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ with respect to $\\mathcal{AS}_{before}+\\mathcal{C}$ results in an increased $P_{before}$, while similarly the distribution using $\\mathcal{AS}_{after}+\\mathcal{C}$ as the context has a higher $P_{after}$.\n\nTable~\\ref{tb:example} shows an example instance of temporal differential analysis, where an additional sentence $\\mathcal{AS}_{before}$ has an effect on the temporal relation between the two events and shifts the label distribution towards ``before''. We conducted a human pilot study for this formulation and found that it is easier to annotate and achieve substantial improvement over the explanation quality than to directly ask annotators to provide custom explanations for an event pair. 
We therefore adopt the former formulation and create our evaluation dataset \\datasetname{} through a multi-stage annotation process as described below.\n\n\n\\begin{table}[t]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Example} \\\\ \\midrule\n\\textbf{Context $\\mathcal{C}$}: \\textcolor{blue}{Tim’s tooth was hurting like crazy. His dentist} \\\\ \\textcolor{blue}{took a look around in his mouth. One of his teeth was rotten.} \\\\ \\textcolor{blue}{Once the tooth was pulled, Tim felt fine.}\\\\ \n\\midrule\n\\textbf{Additional Sentence 1 ($\\mathcal{AS}_{before}$)}: \\textcolor{teal}{Tim always met his } \\\\\n\\textcolor{teal}{dentist regularly.}\\\\\n\\midrule\n\\textbf{Event 1 ($\\mathcal{E}_1$)}: \\textcolor{orange}{Tim scheduled an appointment with his dentist.} \\\\\n\\textbf{Event 2 ($\\mathcal{E}_2$)}: \\textcolor{orange}{Tim's tooth started to hurt like crazy.} \\\\\n\\midrule\n\\textbf{Explanation ($Exp$)}: \\textcolor{teal}{Some people maintain regular visits to} \\\\ \\textcolor{teal}{a dentist. Tim is one of these individuals and may have} \\\\ \\textcolor{teal}{ already scheduled a regular appointment with his dentist }\\\\\n\\textcolor{teal}{before his tooth started to hurt.}\\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:example} An example of temporal differential analysis, where $\\mathcal{AS}$ shifts the temporal relation between $\\mathcal{E}_1$ and $\\mathcal{E}_2$ to be more ``before''. See \\S \\ref{sec:dataset} for more details.\n}\n\\end{table}\n\n\n\\subsection{Dataset Construction}\nFollowing the definition of the temporal differential analysis framework above, we collect a dataset to carry out the actual evaluation. 
Each instance in \\datasetname{} contains a context $\\mathcal{C}$, an event pair $\\mathcal{E}_1$, $\\mathcal{E}_2$, and an additional sentence of either $\\mathcal{AS}_{before}$ or $\\mathcal{AS}_{after}$. In addition, we also annotate a human explanation $Exp$ regarding why the additional sentence affects the temporal relation between the two events. \\datasetname{} is constructed in three steps: 1) event pair generation, 2) additional sentence and explanation annotation, and 3) annotation verification and cleaning. We detail this pipeline below. \n\n\\vpara{Generating $\\mathcal{C}$ and $\\mathcal{E}$.}\nWe randomly sample short stories from the ROCStories dataset~\\cite{mostafazadeh-etal-2016-corpus} as the context $\\mathcal{C}$. For each story, we use GPT-3.5 \\footnote{We use GPT-3.5 text-davinci-002 for data generation throughout the work.} to generate an implicit event phrase based on an explicit event phrase selected by GPT-3.5 at the same time. An implicit event is an event that is not explicitly mentioned by the given context but is still inferable and relevant, e.g. Event 1 in Table~\\ref{tb:example}. A sample prompt can be referred to in Appendix Table~\\ref{tb:prompt2} to construct an event pair. We do this for two main reasons: 1) events that are not explicitly mentioned by the context provide more uncertainty so that the event pair does not come with a deterministic temporal relation decided by the context; 2) this is closer to the format of \\tracie{}, which we aim to compare system performance changes with. \n\n\\vpara{Crowdsourcing $\\mathcal{AS}$ and $Exp$.}\nAfter generating $\\mathcal{C}$ and $\\mathcal{E}$'s, we use Mechanical Turk to ask crowdsourcing annotators to write potential $\\mathcal{AS}_{before}$ and $\\mathcal{AS}_{after}$ with respect to the provided information. The guideline asks annotators to write additional sentences that can be added to the beginning of the context to prevent models from using text positional information. 
The annotator is also asked to explain why they wrote $\\mathcal{AS}$ and why it affects the temporal relation distribution. We use this as $Exp$. We design an annotation interface that is intuitive and filled with examples, and at the same time, we require annotators to pass a rigorous qualification test to demonstrate a proper understanding. We list our interfaces and tests in Fig.~\\ref{fig:mturk} and Table~\\ref{tb:qual}.\n\n\\vpara{Annotation Verification.}\nWe employ an additional verification stage for the human-written instances from the previous step. We provide annotators with the formatted textual entailment instance and ask if the entailment label changes in the expected direction. We collect two individual verifications per instance, and the instances accepted by all annotators appear in the test set.\n\n\n\\subsection{Statistics}\nWe collect 1,000 instances agreed upon by all annotators as the evaluation set and construct a silver training set with the remaining 1,241 instances that do not have unanimous annotator agreements. 
8 Appendix\n\\section{Appendix}\n\\begin{figure*}[b]\n\\centering\n\\scalebox{0.8}{\n\t\n\t\\includegraphics[width=1\\textwidth]{mturk.png}}\n\t\\caption{\\label{fig:mturk}The interface for differential analysis annotation. We only allow participants who have 90\\% or more HITs acceptance rate, are located in the US, and pass our qualification task in Table \\ref{tb:qual}. We also require annotators to spend at least 1.5 minutes for each instance (the hourly\nsalary is ~\\$15). }\n\\end{figure*}\n\n\n\\begin{figure*}[ht]\n\\centering\n\\scalebox{0.8}{\n\t\n\t\\includegraphics[width=1\\textwidth]{human.png}}\n\t\\caption{\\label{fig:eval}The interface for human evaluation. We only allow participants who have 98\\% or more HITs acceptance rate, are located in the US, and pass our qualification task in Table \\ref{tb:qual}. We also require annotators to spend at least 1 minute for each instance (the hourly\nsalary is ~\\$15).}\n\\end{figure*}\n\n\n\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nLet’s add a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why.\\\\ \nContext: \\textcolor{blue}{\nTara always wanted jewelry. Her birthday was coming up. Test went to the store. 
He gave her a really nice necklace.}\\\\\n \\textcolor{blue}{She adored him for the gift.\n}\\\\\nHypothesis: \\textcolor{orange}{Test was being a good friend \\textbf{starts after} he give her a really nice necklace}\\\\\n\nAdd a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \\\\\n\\textcolor{teal}{Test had a secret crush on a girl named Tara in the lower grade.} \\\\\n\\textcolor{teal}{Explanation: the fact that Test and Tara were in different grades implies that their relationship may not have been particularly close.}\\\\\n\\textcolor{teal}{However, Test's secret crush on Tara suggests that he paid close attention to her. By giving her the necklace, Test aimed to establish}\\\\\n\\textcolor{teal}{a stronger connection with Tara. } \\\\\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{\nTara always wanted jewelry. Her birthday was coming up. Test went to the store. He gave her a really nice necklace.}\\\\\n \\textcolor{blue}{She adored him for the gift.\n}\\\\\nHypothesis: \\textcolor{orange}{Test was being a good friend \\textbf{starts before} he give her a really nice necklace}\\\\\n\nAdd a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \\\\\n\\textcolor{teal}{Test and Tara always hung out together.} \\\\\n\\textcolor{teal}{Explanation: normally people who hang out frequently are friends, and friends will send each other gifts on their birthdays.} \\\\\n\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{I have always been attracted to Hispanic men. That said, my first huge crush was on a Mexican. I was in love with}\\\\\n\\textcolor{blue}{him for two years. After two years, I realized I was wasting my time and idolizing him. 
Without any real sense of closure, I} \\\\\n\\textcolor{blue}{decided to pull my heart away.} \\\\\nHypothesis: \\textcolor{orange}{I felt lonely \\textbf{starts before} I decided to pull my heart away} \\\\\nAdd a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \n \\\\\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:prompt} A sample prompt with an instance for two hypothetical changes to make the event pair's temporal relation \"more before\" or \"more after\".\n}\n\\end{table*}\n\n\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nLet’s add a sentence as the first sentence of the paragraph to let the statement more likely to hold true and explain why. \\\\ \nParagraph: \\textcolor{blue}{Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of }\\\\\n\\textcolor{blue}{his teeth was rotten. Once the tooth was pulled, Tim felt fine.} \\\\\nStatement: \\textcolor{orange}{Tim scheduled an appointment with his dentist \\textbf{starts after} his tooth started hurting like crazy} \\\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\\\n\\textcolor{teal}{Tim's tooth was usually perfect, so he did not often go to see the dentist.} \\\\\n\\textcolor{teal}{This makes the statement true because it implies that Tim did not have regular appointments with his dentist and the reason why he} \\\\\n\\textcolor{teal}{scheduled an appointment with his dentist was that his tooth was hurting like crazy.}\\\\\n\\#\\#\\# \\\\\nParagraph: \\textcolor{blue}{Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. 
One of }\\\\\n\\textcolor{blue}{his teeth was rotten. Once the tooth was pulled, Tim felt fine.} \\\\\nStatement: \\textcolor{orange}{Tim scheduled an appointment with his dentist \\textbf{starts before} his tooth started hurting like crazy} \\\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\\\n\\textcolor{teal}{Tim always met his dentist regularly.} \\\\\n\\textcolor{teal}{This makes the statement true because it implies that Tim may have already scheduled regular appointments with his dentist before} \\\\\n\\textcolor{teal}{his tooth started hurting like crazy.} \\\\\n\\#\\#\\# \\\\\nParagraph: \\textcolor{blue}{ Chuck was hanging out with some friends at a bar. They mentioned that they were moving soon. Chuck offered}\\\\\n \\textcolor{blue}{to help them move their things. The team worked together and got the move done quickly. They were so grateful that they.\n}\\\\\n \\textcolor{blue}{invited him to stay for dinner.\n }\\\\\nStatement: \\textcolor{orange}{ Chuck wanted to be helpful \\textbf{starts before} Chuck offered to help them move their things}\\\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\\\n\\textcolor{teal}{Chuck is the kind of person that always wants to help out.} \\\\\n\\textcolor{teal}{This makes the statement true because it implies Chuck's wanted to help his friends move their things was because he is naturally} \\\\\n\\textcolor{teal}{helpful.} \\\\\n\\#\\#\\# \\\\\nParagraph: \\textcolor{blue}{ Chuck was hanging out with some friends at a bar. They mentioned that they were moving soon. Chuck offered}\\\\\n \\textcolor{blue}{to help them move their things. The team worked together and got the move done quickly. 
They were so grateful that they.\n}\\\\\n \\textcolor{blue}{invited him to stay for dinner.\n }\\\\\nStatement: \\textcolor{orange}{ Chuck wanted to be helpful \\textbf{starts after} Chuck offered to help them move their things}\\\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true? \\\\\n\\textcolor{teal}{Chuck often found himself reluctant to do thing, but grateful afterward that he did.} \\\\\n\\textcolor{teal}{This makes the statement true because if Chuck was reluctant, he might not have truly felt like being helpful until after he} \\\\\n\\textcolor{teal}{offered to help and was grateful afterward.} \\\\\n\\#\\#\\# \\\\\nParagraph: \\textcolor{blue}{ I have always been attracted to Hispanic men. That said, my first huge crush was a Mexican. I was in love with}\\\\\n \\textcolor{blue}{him for two years. After two years, I realized I was wasting my time and over-idolizing him. Without any real sense of closure, I\n}\\\\\n \\textcolor{blue}{decided to pull my heart away.\n }\\\\\nStatement: \\textcolor{orange}{I felt lonely \\textbf{starts before} I decided to pull my heart away}\\\\\nAdd what sentence as the first sentence of the paragraph and why is the statement more likely to hold true?\n \\\\\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:prompt1} A sample prompt with two instances for two hypothetical changes to make the event pair's temporal relation \"more before\" or \"more after\".\n}\n\\end{table*}\n\n\n\n\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nLet's find out an event that is unmentioned but can be inferred from the context and the temporal relation between the two events\\\\\n are not deterministic. The new event should not be longer than ten words and include only one verb. 
\\\\ \nContext: \\textcolor{blue}{\nTara always wanted jewelry. Her birthday was coming up. Test went to the store. He gave her a really nice necklace}\\\\\n \\textcolor{blue}{She adored him for the gift.\n}\\\\\nWhat is an event that is unmentioned but has some role and can be inferred from the context? \\\\\n\\textcolor{teal}{Test was being a good friend} \\\\\n\\textcolor{teal}{It can be inferred from She adored him for the gift.} \\\\\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of }\\\\\n\\textcolor{blue}{his teeth was rotten. Once the tooth was pulled, Tim felt fine.}\\\\\nWhat is an event that is unmentioned but has some role and can be inferred from the context? \\\\\n\\textcolor{teal}{Tim scheduled an appointment with his dentist} \\\\\n\\textcolor{teal}{It can be inferred from Tim's tooth was hurting like crazy.} \\\\\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{Lily went to a nice restaurant. She ordered a steak. To her dismay the steak was rare. Lily was rather upset. She had }\\\\\n\\textcolor{blue}{to send it back.}\\\\\nWhat is an event that is unmentioned but has some role and can be inferred from the context?\n \\\\\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:prompt2} A sample prompt to generate an implicit event given the context.\n}\n\\end{table*}\n\n\\begin{table*}[ht]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\nPlease read the paragraph below and the two following statements that use the paragraph for context.\\\\\n Use your imagination and add a sentence in the front of the paragraph so that the statement will be more likely to hold. 
\\\\ \nThe sentence you add CANNOT directly include the implicit event: Tim scheduled an appointment with his dentist. \\\\\n \\midrule \n\\textbf{Paragraph}: Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of \\\\\nhis teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\textbf{Statement 1}: Tim scheduled an appointment with his dentist \\textbf{starts after} his tooth was hurting like crazy.\\\\\n\\\\\n\\textbf{Question 1.1}: Which modified paragraph do you think is the most suitable to make statement 1 more likely to hold?\\\\\n$\\circ$ \\textbf{Tim ate a lot of spicy food.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in \\\\\nhis mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n$\\circ$ \\textbf{Tim didn't schedule an appointment with his dentist.} Tim's tooth was hurting like crazy. He could barely eat or drink. His\\\\\ndentist took a look around in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n$\\bullet$ \\textbf{Tim's tooth was usually perfect, so he did not often go to see the dentist.} Tim's tooth was hurting like crazy. He could barely\\\\\neat or drink. His dentist took a look around in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\midrule \n\\textbf{Paragraph}: Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around in his mouth. One of \\\\\nhis teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\textbf{Statement 2}: Tim scheduled an appointment with his dentist \\textbf{starts before} his tooth was hurting like crazy. \\\\\n\\\\\n\\textbf{Question 1.2}: Which modified paragraph do you think is the most suitable to make statement 2 more likely to hold? 
\\\\\n$\\circ$ \\textbf{Tim scheduled an appointment with his dentist.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist\\\\\ntook a look around in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine.\\\\\n$\\circ$ \\textbf{Tim was looking for a dentist.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look around\\\\\nin his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n$\\bullet$ \\textbf{Tim always met his dentist regularly.} Tim's tooth was hurting like crazy. He could barely eat or drink. His dentist took a look\\\\\naround in his mouth. One of his teeth was rotten. Once the tooth was pulled, Tim felt fine. \\\\\n\\midrule\n\\textbf{Question 2}: Do you understand that the additional sentence and the explanation you write down must make the statement more \\\\\nlikely to hold true and irrelevant explanation answers like \"good\" or merely copying any part of the paragraph will not be paid? \\\\\n$\\bullet$ Yes \\\\\n$\\circ$ No \\\\\n\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:qual}Qualification test of differential analysis annotation. Participants can take the qualification test 3 times and only those who answer each question correctly can be allowed for annotation and evaluation tasks. \n}\n\\end{table*}\n\n\\begin{table*}[htb]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Gold answer} \\\\\n\\midrule\nLet’s explain classification decisions.\\\\\n\\textcolor{blue}{A young boy wearing a tank-top is climbing a tree.}\\\\\nquestion: \\textcolor{orange}{A boy was showing off for a girl.}\\\\\ntrue, false, or neither? \\textcolor{teal}{neither}\\\\\nwhy? 
\\textcolor{teal}{A boy might climb a tree to show off for a girl, but he also might do it for fun or for other reasons.} \\\\\n\\midrule\n\\textbf{GPT-3.5 generated differential analysis explanation} \\\\\n\\midrule\nLet’s add a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \\\\ \nContext: \\textcolor{blue}{\nA young boy wearing a tank-top is climbing a tree.}\\\\\nHypothesis: \\textcolor{orange}{A boy \\textbf{was} showing off for a girl.}\\\\\nAdd what sentence as the first sentence of the context and why is the hypothesis more likely to hold true? \\\\\n\\sethlcolor{pink}\\hl{A boy was climbing a tree and he saw a girl watching him.} \\\\\n\\sethlcolor{pink}\\hl{Because the paragraph is about a boy climbing a tree, the statement is more likely to be true if the boy is climbing the tree to}\\\\\n\\sethlcolor{pink}\\hl{show off for the girl.} \\\\\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{\nA young boy wearing a tank-top is climbing a tree.}\\\\\nHypothesis: \\textcolor{orange}{A boy \\textbf{was not} showing off for a girl.}\\\\\nAdd what sentence as the first sentence of the context and why is the hypothesis more likely to hold true? \\\\\n\\sethlcolor{pink}\\hl{The boy was not showing off for a girl, he was merely climbing a tree.} \\\\\n\\sethlcolor{pink}\\hl{The statement is more likely to hold true because if the boy was showing off for a girl, it would be implied that he was doing.}\\\\\n\\sethlcolor{pink}\\hl{something to impress her. 
However, since the sentence states that he was merely climbing a tree, it is more likely that he was not.}\\\\\n\\sethlcolor{pink}\\hl{doing it to show off for anyone.}\n \\\\\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:snli} A training sample from SNLI with its gold annotated explanations from \\cite{wiegreffe-etal-2022-reframing} compared with the GPT-3.5 generated explanations (highlighted in pink) under our differential analysis formulation. \n}\n\\end{table*}\n \n\n\\end{document}\n", "Descriptive_question1": "What is the maximum number of times participants can take the qualification test in table_11?", "Descriptive_question2": "What is the purpose of the qualification test in table_11?", "Reasoning_question1": "Why might the qualification test in table_11 be limited to three attempts, and how could this impact the quality of annotations?", "Reasoning_question2": "How does the structure of the questions in table_11 ensure that annotators understand the concept of making a statement more likely to hold true?", "Descriptive_answer1": "Three times", "Descriptive_answer2": "Annotation qualification", "Reasoning_answer1": "Limiting the qualification test to three attempts in table_11 likely serves to ensure that only annotators with a strong initial understanding and ability to grasp the task quickly are selected. This restriction may prevent individuals from repeatedly guessing or learning through trial and error without truly comprehending the underlying concepts. By capping attempts, the process filters for competence and commitment, which could enhance the quality of annotations by reducing the inclusion of less capable or less serious annotators. However, this might also exclude potentially capable individuals who need more time or attempts to understand the nuanced requirements, possibly leading to a smaller but more skilled pool of annotators. 
Overall, this limit likely prioritizes quality over quantity in the annotation process.", "Reasoning_answer2": "The structure of the questions in table_11 is designed to ensure annotators understand the concept of making a statement more likely to hold true by presenting specific scenarios with modified paragraphs and requiring annotators to select the most suitable option that aligns with a given temporal relationship. For instance, in questions 1.1 and 1.2, annotators must choose a sentence that logically shifts the likelihood of an event occurring 'after' or 'before' another event, respectively. This forces annotators to think critically about how subtle contextual changes influence temporal relations, rather than merely accepting surface-level details. Additionally, question 2 directly tests comprehension of the task's requirements by asking if annotators understand that their explanations must substantiate the likelihood shift and avoid irrelevant or copied responses. This multi-layered approach—combining scenario-based choices with a meta-understanding check—reinforces the need for logical reasoning and relevance in their annotations, ensuring a deeper grasp of the differential analysis concept." }, { "paper_id": "2212.10467.json", "table_id": "table_12", "table_content": "\\begin{table*}[htb]\n\\newcolumntype{?}{!{\\vrule width 1pt}}\n\\newcolumntype{C}{>{\\centering\\arraybackslash}p{40em}}\n\n\\centering \n\\renewcommand\\arraystretch{1.0}\n\\small\n\\scalebox{0.95}{\n\\begin{tabular}{@{}l@{}}\n\\toprule\n\\textbf{Gold answer} \\\\\n\\midrule\nLet’s explain classification decisions.\\\\\n\\textcolor{blue}{A young boy wearing a tank-top is climbing a tree.}\\\\\nquestion: \\textcolor{orange}{A boy was showing off for a girl.}\\\\\ntrue, false, or neither? \\textcolor{teal}{neither}\\\\\nwhy? 
\\textcolor{teal}{A boy might climb a tree to show off for a girl, but he also might do it for fun or for other reasons.} \\\\\n\\midrule\n\\textbf{GPT-3.5 generated differential analysis explanation} \\\\\n\\midrule\nLet’s add a sentence to the first sentence of the context such that the hypothesis is more likely to hold true and explain why. \\\\ \nContext: \\textcolor{blue}{\nA young boy wearing a tank-top is climbing a tree.}\\\\\nHypothesis: \\textcolor{orange}{A boy \\textbf{was} showing off for a girl.}\\\\\nAdd what sentence as the first sentence of the context and why is the hypothesis more likely to hold true? \\\\\n\\sethlcolor{pink}\\hl{A boy was climbing a tree and he saw a girl watching him.} \\\\\n\\sethlcolor{pink}\\hl{Because the paragraph is about a boy climbing a tree, the statement is more likely to be true if the boy is climbing the tree to}\\\\\n\\sethlcolor{pink}\\hl{show off for the girl.} \\\\\n\\#\\#\\# \\\\\nContext: \\textcolor{blue}{\nA young boy wearing a tank-top is climbing a tree.}\\\\\nHypothesis: \\textcolor{orange}{A boy \\textbf{was not} showing off for a girl.}\\\\\nAdd what sentence as the first sentence of the context and why is the hypothesis more likely to hold true? \\\\\n\\sethlcolor{pink}\\hl{The boy was not showing off for a girl, he was merely climbing a tree.} \\\\\n\\sethlcolor{pink}\\hl{The statement is more likely to hold true because if the boy was showing off for a girl, it would be implied that he was doing.}\\\\\n\\sethlcolor{pink}\\hl{something to impress her. 
However, since the sentence states that he was merely climbing a tree, it is more likely that he was not.}\\\\\n\\sethlcolor{pink}\\hl{doing it to show off for anyone.}\n \\\\\\bottomrule\n\\end{tabular}\n}\n\\caption{\n\t\\label{tb:snli} A training sample from SNLI with its gold annotated explanations from \\cite{wiegreffe-etal-2022-reframing} compared with the GPT-3.5 generated explanations (highlighted in pink) under our differential analysis formulation. \n}\n\\end{table*}", "caption": "\n\t\\label{tb:snli} A training sample from SNLI with its gold annotated explanations from \\cite{wiegreffe-etal-2022-reframing} compared with the GPT-3.5 generated explanations (highlighted in pink) under our differential analysis formulation. \n", "label": "tb:snli", "section_info": "7 Conclusion\n\\section{Conclusion}\nWe introduce a novel differential analysis framework and dataset called \\datasetname{} that interprets and evaluates if a temporal model can make correct predictions without using spurious information and biases. We show that existing temporal models' performances drop to random guessing on \\datasetname{} due to model limitations and supervision biases. To address this issue, we propose to jointly train with \\datasetname{} and its explanation annotations, resulting in improved performances on multiple temporal reasoning benchmarks, namely \\tracie{} (+7\\%), \\matres{} (+3\\%), and \\datasetname{} (+10\\%). We also demonstrate that \\datasetname{} can be used to distill GPT-3.5 and automatically generate and filter incidental supervision instances with high-quality explanations, which further improves performances. Despite these advances, the gap in performance on \\datasetname{} still motivates future work toward generic temporal reasoning. 
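The differential-analysis formulation quoted above uses a fixed prompt template (prepend a sentence to the context, then justify why the hypothesis becomes more likely). As a minimal illustrative sketch only — the function and its names are ours, not part of the paper's released code — the template can be assembled as:

```python
# Illustrative sketch of the differential-analysis prompt template quoted
# above; the function name and signature are our own, not the paper's code.
def build_differential_prompt(context: str, hypothesis: str) -> str:
    """Format one context/hypothesis pair using the template from the
    table: ask for a sentence, prepended to the context, that makes the
    hypothesis more likely to hold true, plus an explanation of why."""
    return (
        "Let's add a sentence to the first sentence of the context such "
        "that the hypothesis is more likely to hold true and explain why.\n"
        f"Context: {context}\n"
        f"Hypothesis: {hypothesis}\n"
        "Add what sentence as the first sentence of the context and why is "
        "the hypothesis more likely to hold true?"
    )

prompt = build_differential_prompt(
    "A young boy wearing a tank-top is climbing a tree.",
    "A boy was showing off for a girl.",
)
print(prompt)
```

Paired positive and negative hypotheses ("was" / "was not" showing off) are then concatenated with the `###` few-shot separator, exactly as in the table above.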
\\section*{Limitations}\nThis work initially builds on human annotations, which are relatively expensive compared to simple model generations.\nFor these cost reasons, we do not include neutral contextual changes, which are hard to annotate, and do not investigate the potential harms of annotated/generated language, e.g. harmful social biases. Throughout this work, we only use ROCStories as the source data; more diverse sources are a reasonable direction for future work. We use T5 and GPT-3 architectures; however, there are more powerful architectures that could potentially improve our results.\n\nLastly, this work only focuses on generalizing temporal reasoning, which is a challenging yet relatively narrow task for large language models. Through pilot experiments, we find that similar task formulations, annotation schemes, and model structures can be applied to other tasks, such as natural language inference (NLI) and question answering (QA). A sample from the SNLI training set~\\cite{bowman-etal-2015-large} using our formulation for explanation is shown in Table~\\ref{tb:snli} in the Appendix.\n\n\\section*{Acknowledgements}\nWe thank the anonymous reviewers for their valuable feedback on this paper, as well as many others who provided constructive comments on the preprint. This work was supported by Contract FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. \n\n\\begin{thebibliography}{37}\n\\expandafter\\ifx\\csname natexlab\\endcsname\\relax\\def\\natexlab{#1}{#1}\\fi\n\n\\bibitem[{Aggarwal et~al.(2021)Aggarwal, Mandowara, Agrawal, Khandelwal,\n Singla, and Garg}]{aggarwal-etal-2021-explanations}\nShourya Aggarwal, Divyanshu Mandowara, Vishwajeet Agrawal, Dinesh Khandelwal,\n Parag Singla, and Dinesh Garg.
2021.\n\\newblock \\href{https://doi.org/10.18653/v1/2021.acl-long.238}{{E}xplanations\n for {C}ommonsense{QA}: {N}ew {D}ataset and {M}odels}.\n\\newblock In \\emph{Proceedings of the 59th Annual Meeting of the Association\n for Computational Linguistics and the 11th International Joint Conference on\n Natural Language Processing (Volume 1: Long Papers)}, pages 3050--3065,\n Online. Association for Computational Linguistics.\n\n\\bibitem[{Bowman et~al.(2015)Bowman, Angeli, Potts, and\n Manning}]{bowman-etal-2015-large}\nSamuel~R. Bowman, Gabor Angeli, Christopher Potts, and Christopher~D. Manning.\n 2015.\n\\newblock \\href{https://doi.org/10.18653/v1/D15-1075}{A large annotated\n corpus for learning natural language inference}.\n\\newblock In \\emph{Proceedings of the 2015 Conference on Empirical Methods in\n Natural Language Processing}, pages 632--642, Lisbon, Portugal. Association\n for Computational Linguistics.\n\n\\bibitem[{Brown et~al.(2020)Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal,\n Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan,\n Child, Ramesh, Ziegler, Wu, Winter, Hesse, Chen, Sigler, Litwin, Gray, Chess,\n Clark, Berner, McCandlish, Radford, Sutskever, and\n Amodei}]{brown2020language}\nTom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared~D Kaplan, Prafulla\n Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,\n Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon\n Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris\n Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,\n Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever,\n and Dario Amodei. 2020.\n\\newblock \\href{https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf}{Language models are few-shot learners}.\n\\newblock In \\emph{Advances in Neural Information Processing Systems},\n volume~33, pages 1877--1901. 
Curran Associates, Inc.\n\n\\bibitem[{Camburu et~al.(2018)Camburu, Rockt{\\\"a}schel, Lukasiewicz, and\n Blunsom}]{Camburu2018eSNLINL}\nOana-Maria Camburu, Tim Rockt{\\\"a}schel, Thomas Lukasiewicz, and Phil Blunsom.\n 2018.\n\\newblock \\href{https://dl.acm.org/doi/pdf/10.5555/3327546.3327624}{e-snli:\n Natural language inference with natural language explanations}.\n\\newblock In \\emph{Proceedings of the 32nd International Conference on Neural\n Information Processing Systems}, page 9560–9572.\n\n\\bibitem[{Cassidy et~al.(2014)Cassidy, McDowell, Chambers, and\n Bethard}]{cassidy-etal-2014-annotation}\nTaylor Cassidy, Bill McDowell, Nathanael Chambers, and Steven Bethard. 2014.\n\\newblock \\href{https://doi.org/10.3115/v1/P14-2082}{An annotation framework\n for dense event ordering}.\n\\newblock In \\emph{Proceedings of the 52nd Annual Meeting of the Association\n for Computational Linguistics (Volume 2: Short Papers)}, pages 501--506,\n Baltimore, Maryland. Association for Computational Linguistics.\n\n\\bibitem[{Chambers et~al.(2014)Chambers, Cassidy, McDowell, and\n Bethard}]{chambers-etal-2014-dense}\nNathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014.\n\\newblock \\href{https://doi.org/10.1162/tacl_a_00182}{Dense event ordering\n with a multi-pass architecture}.\n\\newblock \\emph{Transactions of the Association for Computational Linguistics},\n 2:273--284.\n\n\\bibitem[{DeYoung et~al.(2020)DeYoung, Jain, Rajani, Lehman, Xiong, Socher, and\n Wallace}]{deyoung-etal-2020-eraser}\nJay DeYoung, Sarthak Jain, Nazneen~Fatema Rajani, Eric Lehman, Caiming Xiong,\n Richard Socher, and Byron~C. Wallace. 2020.\n\\newblock \\href{https://doi.org/10.18653/v1/2020.acl-main.408}{{ERASER}: {A}\n benchmark to evaluate rationalized {NLP} models}.\n\\newblock In \\emph{Proceedings of the 58th Annual Meeting of the Association\n for Computational Linguistics}, pages 4443--4458, Online. 
Association for\n Computational Linguistics.\n\n\\bibitem[{Han et~al.(2019)Han, Ning, and Peng}]{han-etal-2019-joint}\nRujun Han, Qiang Ning, and Nanyun Peng. 2019.\n\\newblock \\href{https://doi.org/10.18653/v1/D19-1041}{Joint event and\n temporal relation extraction with shared representations and structured\n prediction}.\n\\newblock In \\emph{Proceedings of the 2019 Conference on Empirical Methods in\n Natural Language Processing and the 9th International Joint Conference on\n Natural Language Processing (EMNLP-IJCNLP)}, pages 434--444, Hong Kong,\n China. Association for Computational Linguistics.\n\n\\bibitem[{Kumar and Talukdar(2020)}]{kumar-talukdar-2020-nile}\nSawan Kumar and Partha Talukdar. 2020.\n\\newblock \\href{https://doi.org/10.18653/v1/2020.acl-main.771}{{NILE} :\n Natural language inference with faithful natural language explanations}.\n\\newblock In \\emph{Proceedings of the 58th Annual Meeting of the Association\n for Computational Linguistics}, pages 8730--8742, Online. Association for\n Computational Linguistics.\n\n\\bibitem[{Latcinnik and Berant(2020)}]{Latcinnik2020ExplainingQA}\nVeronica Latcinnik and Jonathan Berant. 2020.\n\\newblock \\href{https://arxiv.org/pdf/2004.05569.pdf}{Explaining question\n answering models through text generation}.\n\\newblock \\emph{ArXiv}, abs/2004.05569.\n\n\\bibitem[{Liu et~al.(2021)Liu, Xu, Chen, and Zhang}]{liu2021discourse}\nJian Liu, Jinan Xu, Yufeng Chen, and Yujie Zhang. 2021.\n\\newblock \\href{https://doi.org/10.24963/ijcai.2021/533}{Discourse-level\n event temporal ordering with uncertainty-guided graph completion.}\n\\newblock In \\emph{Proceedings of the Thirtieth International Joint Conference\n on Artificial Intelligence, {IJCAI-21}}, pages 3871--3877. International\n Joint Conferences on Artificial Intelligence Organization.\n\n\\bibitem[{Mani et~al.(2007)Mani, Wellner, Verhagen, and\n Pustejovsky}]{mani2007three}\nInderjeet Mani, Ben Wellner, Marc Verhagen, and James Pustejovsky. 
2007.\n\\newblock \\href{https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=29385d344a77cfa934723af3c3b66572f3832823}{Three approaches to learning tlinks in timeml}.\n\\newblock \\emph{Computer Science Department, Brandeis University}.\n\n\\bibitem[{Marasovi{\\'c} et~al.(2022)Marasovi{\\'c}, Beltagy, Downey, and\n Peters}]{marasovic-beltagy-et-al-2022-feb}\nAna Marasovi{\\'c}, Iz~Beltagy, Doug Downey, and Matthew~E. Peters. 2022.\n\\newblock \\href{https://arxiv.org/abs/2111.08284}{Few-shot\n self-rationalization with natural language prompts}.\n\\newblock In \\emph{Findings of the Association for Computational Linguistics:\n NAACL 2022}.\n\n\\bibitem[{Mathur et~al.(2021)Mathur, Jain, Dernoncourt, Morariu, Tran, and\n Manocha}]{mathur-etal-2021-timers}\nPuneet Mathur, Rajiv Jain, Franck Dernoncourt, Vlad Morariu, Quan~Hung Tran,\n and Dinesh Manocha. 2021.\n\\newblock \\href{https://doi.org/10.18653/v1/2021.acl-short.67}{{TIMERS}:\n Document-level temporal relation extraction}.\n\\newblock In \\emph{Proceedings of the 59th Annual Meeting of the Association\n for Computational Linguistics and the 11th International Joint Conference on\n Natural Language Processing (Volume 2: Short Papers)}, pages 524--533,\n Online. Association for Computational Linguistics.\n\n\\bibitem[{Mostafazadeh et~al.(2016)Mostafazadeh, Chambers, He, Parikh, Batra,\n Vanderwende, Kohli, and Allen}]{mostafazadeh-etal-2016-corpus}\nNasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra,\n Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016.\n\\newblock \\href{https://doi.org/10.18653/v1/N16-1098}{A corpus and cloze\n evaluation for deeper understanding of commonsense stories}.\n\\newblock In \\emph{Proceedings of the 2016 Conference of the North {A}merican\n Chapter of the Association for Computational Linguistics: Human Language\n Technologies}, pages 839--849, San Diego, California. 
Association for\n Computational Linguistics.\n\n\\bibitem[{Ning et~al.(2017)Ning, Feng, and Roth}]{ning-etal-2017-structured}\nQiang Ning, Zhili Feng, and Dan Roth. 2017.\n\\newblock \\href{https://doi.org/10.18653/v1/D17-1108}{A structured learning\n approach to temporal relation extraction}.\n\\newblock In \\emph{Proceedings of the 2017 Conference on Empirical Methods in\n Natural Language Processing}, pages 1027--1037, Copenhagen, Denmark.\n Association for Computational Linguistics.\n\n\\bibitem[{Ning et~al.(2020)Ning, Wu, Han, Peng, Gardner, and\n Roth}]{ning-etal-2020-torque}\nQiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth. 2020.\n\\newblock \\href{https://doi.org/10.18653/v1/2020.emnlp-main.88}{{TORQUE}: A\n reading comprehension dataset of temporal ordering questions}.\n\\newblock In \\emph{Proceedings of the 2020 Conference on Empirical Methods in\n Natural Language Processing (EMNLP)}, pages 1158--1172, Online. Association\n for Computational Linguistics.\n\n\\bibitem[{Ning et~al.(2018{\\natexlab{a}})Ning, Wu, and\n Roth}]{ning-etal-2018-multi}\nQiang Ning, Hao Wu, and Dan Roth. 2018{\\natexlab{a}}.\n\\newblock \\href{https://doi.org/10.18653/v1/P18-1122}{A multi-axis annotation\n scheme for event temporal relations}.\n\\newblock In \\emph{Proceedings of the 56th Annual Meeting of the Association\n for Computational Linguistics (Volume 1: Long Papers)}, pages 1318--1328,\n Melbourne, Australia. Association for Computational Linguistics.\n\n\\bibitem[{Ning et~al.(2018{\\natexlab{b}})Ning, Zhou, Feng, Peng, and\n Roth}]{ning-etal-2018-cogcomptime}\nQiang Ning, Ben Zhou, Zhili Feng, Haoruo Peng, and Dan Roth.\n 2018{\\natexlab{b}}.\n\\newblock \\href{https://doi.org/10.18653/v1/D18-2013}{{C}og{C}omp{T}ime: A\n tool for understanding time in natural language}.\n\\newblock In \\emph{Proceedings of the 2018 Conference on Empirical Methods in\n Natural Language Processing: System Demonstrations}, pages 72--77, Brussels,\n Belgium. 
Association for Computational Linguistics.\n\n\\bibitem[{OpenAI(2023)}]{OpenAI2023GPT4TR}\nOpenAI. 2023.\n\\newblock \\href{https://arxiv.org/pdf/2303.08774.pdf}{Gpt-4 technical\n report}.\n\\newblock \\emph{ArXiv}, abs/2303.08774.\n\n\\bibitem[{Ouyang et~al.(2022)Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin,\n Zhang, Agarwal, Slama, Gray, Schulman, Hilton, Kelton, Miller, Simens,\n Askell, Welinder, Christiano, Leike, and Lowe}]{ouyang2022training}\nLong Ouyang, Jeffrey Wu, Xu~Jiang, Diogo Almeida, Carroll Wainwright, Pamela\n Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John\n Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda\n Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022.\n\\newblock \\href{https://openreview.net/forum?id=TG8KACxEON}{Training language\n models to follow instructions with human feedback}.\n\\newblock In \\emph{Advances in Neural Information Processing Systems}.\n\n\\bibitem[{Pustejovsky et~al.(2003)Pustejovsky, Hanks, Sauri, See, Gaizauskas,\n Setzer, Radev, Sundheim, Day, Ferro et~al.}]{pustejovsky2003timebank}\nJames Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas,\n Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, et~al.\n 2003.\n\\newblock \\href{https://www.researchgate.net/publication/228559081_The_TimeBank_corpus}{The\n timebank corpus}.\n\\newblock In \\emph{Corpus linguistics}, volume 2003, page~40, Lancaster, UK.\n\n\\bibitem[{Rajani et~al.(2019)Rajani, McCann, Xiong, and\n Socher}]{rajani-etal-2019-explain}\nNazneen~Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 
2019.\n\\newblock \\href{https://doi.org/10.18653/v1/P19-1487}{Explain yourself!\n leveraging language models for commonsense reasoning}.\n\\newblock In \\emph{Proceedings of the 57th Annual Meeting of the Association\n for Computational Linguistics}, pages 4932--4942, Florence, Italy.\n Association for Computational Linguistics.\n\n\\bibitem[{Reimers and Gurevych(2019)}]{reimers-gurevych-2019-sentence}\nNils Reimers and Iryna Gurevych. 2019.\n\\newblock \\href{https://doi.org/10.18653/v1/D19-1410}{Sentence-{BERT}:\n Sentence embeddings using {S}iamese {BERT}-networks}.\n\\newblock In \\emph{Proceedings of the 2019 Conference on Empirical Methods in\n Natural Language Processing and the 9th International Joint Conference on\n Natural Language Processing (EMNLP-IJCNLP)}, pages 3982--3992, Hong Kong,\n China. Association for Computational Linguistics.\n\n\\bibitem[{Trong et~al.(2022)Trong, Trung, Van~Ngo, and\n Nguyen}]{trong2022selecting}\nHieu Man~Duc Trong, Nghia~Ngo Trung, Linh Van~Ngo, and Thien~Huu Nguyen. 2022.\n\\newblock \\href{https://www.aaai.org/AAAI22Papers/AAAI-3912.ManH.pdf}{Selecting optimal context sentences for event-event relation extraction}.\n\\newblock In \\emph{AAAI Conference on Artificial Intelligence},\n pages 11058--11066, Vancouver, Canada.\n\n\\bibitem[{UzZaman et~al.(2013)UzZaman, Llorens, Derczynski, Allen, Verhagen,\n and Pustejovsky}]{uzzaman-etal-2013-semeval}\nNaushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen,\n and James Pustejovsky. 2013.\n\\newblock \\href{https://aclanthology.org/S13-2001}{{S}em{E}val-2013 task 1:\n {T}emp{E}val-3: Evaluating time expressions, events, and temporal relations}.\n\\newblock In \\emph{Second Joint Conference on Lexical and Computational\n Semantics (*{SEM}), Volume 2: Proceedings of the Seventh International\n Workshop on Semantic Evaluation ({S}em{E}val 2013)}, pages 1--9, Atlanta,\n Georgia, USA. 
Association for Computational Linguistics.\n\n\\bibitem[{Verhagen et~al.(2007)Verhagen, Gaizauskas, Schilder, Hepple, Katz,\n and Pustejovsky}]{verhagen-etal-2007-semeval}\nMarc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and\n James Pustejovsky. 2007.\n\\newblock \\href{https://aclanthology.org/S07-1014}{{S}em{E}val-2007 task 15:\n {T}emp{E}val temporal relation identification}.\n\\newblock In \\emph{Proceedings of the Fourth International Workshop on Semantic\n Evaluations ({S}em{E}val-2007)}, pages 75--80, Prague, Czech Republic.\n Association for Computational Linguistics.\n\n\\bibitem[{Verhagen et~al.(2010)Verhagen, Saur{\\'\\i}, Caselli, and\n Pustejovsky}]{verhagen-etal-2010-semeval}\nMarc Verhagen, Roser Saur{\\'\\i}, Tommaso Caselli, and James Pustejovsky. 2010.\n\\newblock \\href{https://aclanthology.org/S10-1010}{{S}em{E}val-2010 task 13:\n {T}emp{E}val-2}.\n\\newblock In \\emph{Proceedings of the 5th International Workshop on Semantic\n Evaluation}, pages 57--62, Uppsala, Sweden. Association for Computational\n Linguistics.\n\n\\bibitem[{Wang et~al.(2022)Wang, Zhang, Deng, Gardner, Chen, and\n Roth}]{wang2022extracting}\nHaoyu Wang, Hongming Zhang, Yuqian Deng, Jacob~R Gardner, Muhao Chen, and Dan\n Roth. 
2022.\n\\newblock \\href{https://arxiv.org/pdf/2210.04992.pdf}{Extracting or guessing?\n improving faithfulness of event temporal relation extraction}.\n\\newblock \\emph{arXiv preprint arXiv:2210.04992}.\n\n\\bibitem[{Wiegreffe et~al.(2022)Wiegreffe, Hessel, Swayamdipta, Riedl, and\n Choi}]{wiegreffe-etal-2022-reframing}\nSarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi.\n 2022.\n\\newblock \\href{https://doi.org/10.18653/v1/2022.naacl-main.47}{Reframing\n human-{AI} collaboration for generating free-text explanations}.\n\\newblock In \\emph{Proceedings of the 2022 Conference of the North American\n Chapter of the Association for Computational Linguistics: Human Language\n Technologies}, pages 632--658, Seattle, United States. Association for\n Computational Linguistics.\n\n\\bibitem[{Wiegreffe and Marasovi\\'{c}(2021)}]{wiegreffe-marasovic-2021-review}\nSarah Wiegreffe and Ana Marasovi\\'{c}. 2021.\n\\newblock \\href{https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/698d51a19d8a121ce581499d7b701668-Paper-round1.pdf}{Teach me to explain: A review of datasets for explainable nlp}.\n\\newblock In \\emph{Proceedings of the Neural Information Processing Systems\n Track on Datasets and Benchmarks}.\n\n\\bibitem[{Wiegreffe and Pinter(2019)}]{wiegreffe-pinter-2019-attention}\nSarah Wiegreffe and Yuval Pinter. 
2019.\n\\newblock \\href{https://doi.org/10.18653/v1/D19-1002}{Attention is not not\n explanation}.\n\\newblock In \\emph{Proceedings of the 2019 Conference on Empirical Methods in\n Natural Language Processing and the 9th International Joint Conference on\n Natural Language Processing (EMNLP-IJCNLP)}, pages 11--20, Hong Kong, China.\n Association for Computational Linguistics.\n\n\\bibitem[{Wolf et~al.(2020)Wolf, Debut, Sanh, Chaumond, Delangue, Moi, Cistac,\n Rault, Louf, Funtowicz, Davison, Shleifer, von Platen, Ma, Jernite, Plu, Xu,\n Le~Scao, Gugger, Drame, Lhoest, and Rush}]{wolf-etal-2020-transformers}\nThomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,\n Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe\n Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien\n Plu, Canwen Xu, Teven Le~Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest,\n and Alexander Rush. 2020.\n\\newblock \\href{https://doi.org/10.18653/v1/2020.emnlp-demos.6}{Transformers:\n State-of-the-art natural language processing}.\n\\newblock In \\emph{Proceedings of the 2020 Conference on Empirical Methods in\n Natural Language Processing: System Demonstrations}, pages 38--45, Online.\n Association for Computational Linguistics.\n\n\\bibitem[{Yin et~al.(2022)Yin, Shi, Hsieh, and\n Chang}]{yin-etal-2022-sensitivity}\nFan Yin, Zhouxing Shi, Cho-Jui Hsieh, and Kai-Wei Chang. 2022.\n\\newblock \\href{https://doi.org/10.18653/v1/2022.acl-long.188}{On the\n sensitivity and stability of model interpretations in {NLP}}.\n\\newblock In \\emph{Proceedings of the 60th Annual Meeting of the Association\n for Computational Linguistics (Volume 1: Long Papers)}, pages 2631--2647,\n Dublin, Ireland. Association for Computational Linguistics.\n\n\\bibitem[{Zhou et~al.(2020)Zhou, Ning, Khashabi, and\n Roth}]{zhou-etal-2020-temporal}\nBen Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 
2020.\n\\newblock \\href{https://doi.org/10.18653/v1/2020.acl-main.678}{Temporal\n common sense acquisition with minimal supervision}.\n\\newblock In \\emph{Proceedings of the 58th Annual Meeting of the Association\n for Computational Linguistics}, pages 7579--7589, Online. Association for\n Computational Linguistics.\n\n\\bibitem[{Zhou et~al.(2021)Zhou, Richardson, Ning, Khot, Sabharwal, and\n Roth}]{zhou-etal-2021-temporal}\nBen Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, and Dan\n Roth. 2021.\n\\newblock \\href{https://doi.org/10.18653/v1/2021.naacl-main.107}{Temporal\n reasoning on implicit events from distant supervision}.\n\\newblock In \\emph{Proceedings of the 2021 Conference of the North American\n Chapter of the Association for Computational Linguistics: Human Language\n Technologies}, pages 1361--1371, Online. Association for Computational\n Linguistics.\n\n\\bibitem[{Zhou et~al.(2022)Zhou, Richardson, Yu, and Roth}]{ZRYR22}\nBen Zhou, Kyle Richardson, Xiaodong Yu, and Dan Roth. 2022.\n\\newblock \\href{https://preview.aclanthology.org/emnlp-22-ingestion/2022.emnlp-main.142.pdf}{Learning to decompose: Hypothetical question decomposition based on\n comparable texts}.\n\\newblock In \\emph{In Proceedings of the 2022 Conference on Empirical Methods\n in Natural Language Processing}, pages 2223--2235, Abu Dhabi, United Arab\n Emirates. 
Association for Computational Linguistics.\n\n\\end{thebibliography}\n \\bibliographystyle{acl_natbib}\n\n\n\\appendix\n", "Descriptive_question1": "What type of explanations are highlighted in pink in table_12?", "Descriptive_question2": "Which dataset is used as a training sample in table_12?", "Reasoning_question1": "How does the differential analysis formulation in table_12 help in comparing gold annotated explanations with GPT-3.5 generated ones?", "Reasoning_question2": "What insights can be drawn from the differences between gold annotated and GPT-3.5 generated explanations as presented in table_12?", "Descriptive_answer1": "GPT-3.5 generated", "Descriptive_answer2": "SNLI", "Reasoning_answer1": "The differential analysis formulation in table_12 provides a structured framework to evaluate explanations by presenting both gold annotated and GPT-3.5 generated explanations side by side. This formulation allows for a direct comparison by altering the context with additional sentences to test if the hypothesis becomes more likely to hold true. For instance, in the case of the hypothesis 'A boy was showing off for a girl,' the added sentence about the boy seeing a girl watching him offers a contextual clue that supports the hypothesis, which is then explained in the GPT-3.5 response. Similarly, for the negative hypothesis 'A boy was not showing off for a girl,' the added context clarifies intent, making the hypothesis more plausible. 
This method helps in assessing how well GPT-3.5 can mimic human-like reasoning compared to gold standards by examining the impact of contextual modifications on inference.", "Reasoning_answer2": "From table_12, we can infer that gold annotated explanations tend to be more concise and focus on the ambiguity or multiple possible motivations behind an action, as seen in the explanation 'A boy might climb a tree to show off for a girl, but he also might do it for fun or for other reasons,' which directly addresses the uncertainty leading to a 'neither' classification. In contrast, GPT-3.5 generated explanations are more verbose and attempt to justify the hypothesis by constructing a narrative with added context, such as suggesting the boy saw a girl watching him, which implies intent to show off. This suggests that GPT-3.5 may over-explain or inject speculative detail to force a logical connection, whereas gold annotations maintain a broader perspective on possible interpretations. This difference highlights a potential gap in nuanced reasoning, where GPT-3.5 might prioritize coherence over acknowledging ambiguity, pointing to areas for improvement in automated explanation generation." 
}, { "paper_id": "1704.00976.json", "table_id": "table_1", "table_content": "\\begin{table}[!b]\n\\caption{\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ }\n\\begin{ruledtabular}\n\\begin{tabular}{cccc}\n$\\kappa$ & $M$ & $\\kappa$ & $M$ \\\\ \\hline\n0.5 & 1.11914 & 1.8 & 0.05449 \\\\\n0.6 & 0.82503 & 2.0 & 0.03660 \\\\\n0.8 & 0.48127 & 2.2 & 0.02470 \\\\\n1.0 & 0.29709 & 2.4 & 0.01672 \\\\\n1.2 & 0.18960 & 2.6 & 0.01135 \\\\\n1.4 & 0.12357 & 2.8 & 0.00772 \\\\\n1.6 & 0.08167 & 3.0 & 0.00525 \\\\ \n\\end{tabular}\n\\end{ruledtabular}\n\\end{table}", "caption": "\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ ", "label": "TabM", "section_info": "3 Results\n\\section{Results}\n\n\\subsection{Weakly-coupled fluids}\n\n\\begin{figure}[!b]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R1.eps}\\\\\n \\caption{The excess energy $u_{\\rm ex}$ of 2D Yukawa weakly coupled fluids versus the screening parameter $\\kappa$ at a fixed coupling parameter $\\Gamma = 0.5$. The symbols correspond to the results of MD simulations, the solid curve is plotted using the analytical expression of Eq.~(\\ref{SVC}).\n }\n\\label{FigSC}\n\\end{figure}\n\nA simple and physically transparent approach to the thermodynamics of weakly coupled Yukawa systems for small deviations from the ideal gas behavior is to calculate the second virial coefficient. This has recently been shown to work well in 3D Yukawa systems.~\\cite{KhrapakPPCF2016} In the 2D geometry the excess free energy is expressed in this approximation as\n\\begin{equation}\\label{SVC}\nf_{\\rm ex}\\simeq \\pi n \\int\\left[1-e^{-\\varphi(r)/k_{\\rm B}T}\\right]r dr.\n\\end{equation}\nThe excess energy and pressure can be readily obtained from the excess free energy. 
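Eq.~(\ref{SVC}) can be evaluated by direct numerical quadrature. With distances measured in units of the 2D Wigner-Seitz radius $a$ (so that $\pi n a^2 = 1$) and $\varphi(r)/k_{\rm B}T = \Gamma (a/r) e^{-\kappa r/a}$, the integral becomes $f_{\rm ex} \simeq \int_0^\infty [1 - e^{-(\Gamma/x) e^{-\kappa x}}]\, x\, dx$. A minimal pure-Python sketch (the function name and the plain trapezoidal rule are our choices, not the paper's implementation):

```python
import math

def f_ex_virial(gamma: float, kappa: float,
                x_max: float = 40.0, n_steps: int = 20000) -> float:
    """Second-virial excess free energy of a 2D Yukawa fluid, Eq. (SVC),
    in units of the 2D Wigner-Seitz radius a (so pi * n * a**2 = 1):
        f_ex ~= integral_0^inf [1 - exp(-(Gamma/x) exp(-kappa x))] x dx.
    Plain trapezoidal quadrature; the integrand vanishes at x = 0 and
    decays like Gamma * exp(-kappa x) at large x, so a finite x_max works."""
    h = x_max / n_steps
    total = 0.0
    for i in range(1, n_steps + 1):
        x = i * h
        f = (1.0 - math.exp(-(gamma / x) * math.exp(-kappa * x))) * x
        total += 0.5 * f if i == n_steps else f
    return h * total

# At fixed Gamma = 0.5 (the weakly coupled case of Fig. FigSC) the
# effective interaction strength, and hence f_ex, weakens as kappa grows.
print([round(f_ex_virial(0.5, k), 4) for k in (1.0, 2.0, 3.0)])
```

Since $1 - e^{-y} \leq y$, the result is always bounded above by $\Gamma/\kappa$, which is a useful sanity check on the quadrature.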
We compare the values $u_{\\rm ex}$ at a fixed coupling parameter $\\Gamma=0.5$ obtained from Eq.~(\\ref{SVC}) and computed using MD simulations in Fig.~\\ref{FigSC}. The agreement is satisfactory: in the range of $\\kappa$ investigated the deviations are within several percent. The agreement naturally improves with increasing $\\kappa$, because at a fixed $\\Gamma$ the actual interaction strength weakens as $\\kappa$ increases.\n\n\n\\subsection{Strongly-coupled fluids}\n\nThe excess energy and pressure of the 2D Yukawa fluids have been determined using MD simulations in a wide range of coupling and screening parameters. The results are summarized in the Table \\ref{Table1} of the Appendix. Here we describe simple analytical\napproximations, which can be used to evaluate the energy and pressure for practical purposes.\n\nIn the strongly coupled fluid regime it is helpful to divide the thermodynamic quantities, such as energy and pressure, into static and thermal contributions. The static contribution corresponds to the value of internal energy when the particles are frozen in some regular configuration and the thermal corrections arise due to the deviations of the particles from these fixed positions (due to thermal motion). Of course, such a division is only meaningful when the regular structure is specified. For crystals, the obvious choice is a corresponding lattice sum (Madelung energy). 
For fluids this choice is also meaningful and we use it here (note that in 3D Yukawa systems a slightly different definition of the static fluid energy is traditionally employed~\\cite{KhrapakPPCF2016, KhrapakISM}).\n\n\\begin{table}[!b]\n\\caption{\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ }\n\\begin{ruledtabular}\n\\begin{tabular}{cccc}\n$\\kappa$ & $M$ & $\\kappa$ & $M$ \\\\ \\hline\n0.5 & 1.11914 & 1.8 & 0.05449 \\\\\n0.6 & 0.82503 & 2.0 & 0.03660 \\\\\n0.8 & 0.48127 & 2.2 & 0.02470 \\\\\n1.0 & 0.29709 & 2.4 & 0.01672 \\\\\n1.2 & 0.18960 & 2.6 & 0.01135 \\\\\n1.4 & 0.12357 & 2.8 & 0.00772 \\\\\n1.6 & 0.08167 & 3.0 & 0.00525 \\\\ \n\\end{tabular}\n\\end{ruledtabular}\n\\end{table}\n\nThe excess internal energy is thus a sum of the static and thermal contributions,\n\\begin{equation}\nu_{\\rm ex} = u_{\\rm st} + u_{\\rm th},\n\\end{equation}\nwhere $u_{\\rm st} = M\\Gamma$ and $M$ is the Madelung constant.\nThe values of the Madelung constant for 2D Yukawa systems in the regime of relatively weak screening, $0.5 \\leq \\kappa\\leq 3.0$, are tabulated in Table~\\ref{TabM}. The dependence $M(\\kappa)$ can be fitted using a functional form similar to that proposed by Totsuji~\\emph{et al.}\\cite{PhysRevE.70.016405}\n\\begin{equation}\n\\label{Eq6}\nM = -1.1061+0.5038\\kappa-0.11053\\kappa^2+0.00968\\kappa^3+1/\\kappa.\n\\end{equation}\nThe last term in (\\ref{Eq6}) accounts for the absence of a neutralizing background in our case (but present in Ref.~\\onlinecite{PhysRevE.70.016405}), the energy of this background being simply $-\\Gamma/\\kappa$. The fit is chosen in such a way that when $\\kappa\\rightarrow 0$ and the neutralizing background is introduced, the Madelung constant is reduced to the well-known value of the triangular lattice sum of the 2D one-component plasma (OCP) with Coulomb interactions, $M_{\\rm OCP}\\simeq -1.1061$. 
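As a quick numerical check, the fit of Eq.~(\ref{Eq6}) can be evaluated against the tabulated Madelung constants of Table~\ref{TabM}. A short Python sketch (coefficients copied from Eq.~(\ref{Eq6}); the function name is illustrative):

```python
def madelung_fit(kappa):
    """Fit of Eq. (6) for the Madelung constant of the 2D Yukawa
    triangular lattice (no neutralizing background)."""
    return (-1.1061 + 0.5038 * kappa - 0.11053 * kappa**2
            + 0.00968 * kappa**3 + 1.0 / kappa)

# Comparison with Table TabM: M(0.5) = 1.11914, M(1.0) = 0.29709.
for kappa, m_table in [(0.5, 1.11914), (1.0, 0.29709), (3.0, 0.00525)]:
    m = madelung_fit(kappa)
    print(kappa, m, abs(m - m_table) / m_table)
```

The relative error stays far below a percent at $\kappa\leq 1$ and grows to the percent level near $\kappa=3$, matching the accuracy quoted for the fit.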
This fit is accurate to within a tiny fraction of a percent for $\\kappa\\lesssim 1.0$ and to within $\\sim 1\\%$ when screening becomes stronger ($\\kappa\\sim 3$).\n\nThe thermal part of the excess energy is expected to exhibit a quasi-universal scaling with respect to the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. This is a general property of classical particle systems with sufficiently soft interactions, which was first pointed out by Rosenfeld and Tarazona (RT scaling) for 3D systems.~\\cite{RT1,RT2} In the context of 3D Yukawa systems, the RT scaling has been proven to be very useful in Refs.~\\onlinecite{1.4921223,KhrapakPPCF2016,KhrapakPRE2015,KhrapakPRE03_2015}. The emergence of an RT-scaling analogue for 2D systems has been discussed in the context of OCP with Coulomb and logarithmic interactions, Yukawa systems near the OCP limit, and inverse-power-law interactions.~\\cite{KhrapakCPP2016,KhrapakPoP08_2015} The dependence of $u_{\\rm th}$ on $\\Gamma/\\Gamma_{\\rm m}$ in the strongly coupled regime is displayed in Fig.~\\ref{FigR1}. The quasi-universality is well pronounced, although there is clearly some systematic tendency of $u_{\\rm th}$ to decrease with $\\kappa$ at the same value of $\\Gamma/\\Gamma_{\\rm m}$. This tendency is expected when the potential steepness increases (see e.g. Fig.~4 from Ref.~\\onlinecite{KhrapakPoP08_2015}). Overall, the data points corresponding to the dependence $u_{\\rm th}(\\Gamma/\\Gamma_{\\rm m})$ are confined to a relatively narrow range. The important point is that towards the side of soft interactions (sufficiently small $\\kappa$ in our case), the static component of the internal energy is dominant over the thermal one. For example, at $\\kappa=1$ the thermal component contributes only about $2\\%$ of the total excess energy near the fluid-solid phase transition. 
Therefore, even moderately accurate fits for $u_{\\rm th}$ allow one to obtain high accuracy with respect to the total excess energy $u_{\\rm ex}$.\n\nThree fits are shown in Fig.~\\ref{FigR1}. The upper (lower) curve corresponds to the data portion for $\\kappa=0.5$ ($\\kappa = 3.0$).\nThe intermediate curve has been obtained using the entire set of data points (corresponding to the parameter regime shown). It can be considered as representative of strongly coupled 2D Yukawa fluids in the vicinity of the freezing transition.\nThe functional form of the fit is the same as used previously~\\cite{KhrapakPoP08_2015}\n\\begin{equation} \\label{Fit1}\nu_{\\rm th} =A \\ln (1+B\\Gamma/\\Gamma_{\\rm m}).\n\\end{equation}\nThe use of the coefficients $A=0.257$ and $B=195.4$ determined here would somewhat improve previous approximations.\n\nThe excess free energy can be routinely calculated using the model for the excess energy formulated above and the second of Eqs.~(\\ref{pf}). The resulting expression is rather simple,\n\\begin{equation}\\label{fex}\nf_{\\rm ex}=M(\\kappa)\\Gamma - A{\\rm Li}_2(-B\\Gamma/\\Gamma_{\\rm m}),\n\\end{equation}\nwhere ${\\rm Li}_2(z)=\\int_z^0 dt \\ln(1-t)/t$ is the dilogarithm. Note that in deriving Eq.~(\\ref{fex}), the thermodynamic integration over the coupling parameter from 0 to $\\Gamma$ has been performed, while Eq.~(\\ref{Fit1}) is strictly speaking not applicable at $\\Gamma\\ll 1$.\nThe correct procedure would be to start thermodynamic integration from some small, but finite value $\\Gamma_0$, and then add the constant $f_{\\rm ex}(\\Gamma_0)$ evaluated using Eq.~(\\ref{SVC}). However, since the actual contribution from the weak-coupling regime is small, Eq.~(\\ref{fex}) remains rather accurate at strong coupling and we use it here.\n\nThe calculation of pressure from the excess free energy is straightforward, but rather cumbersome in the considered case. 
This is because the differentiation with respect to $\\kappa$ is involved, and the two fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are present. For this reason, the explicit expression for $p$ is not displayed. We verified that near freezing (at $\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) the derived expression yields pressures that deviate from the exact MD results by $\\sim 0.001\\%$ at $\\kappa=0.5$, $\\sim 0.1\\%$ at $\\kappa=1.0$, and $\\sim 1\\%$ at $\\kappa = 2.0-2.8$. The accuracy drops at the highest value $\\kappa=3.0$. This is not surprising, since the fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are only applicable for $\\kappa\\lesssim 3.0$ and, therefore, derivatives from these fits at $\\kappa=3.0$ can produce significant errors.\n\nWe also found that, if better accuracy is required, the data for the excess thermal energy can be fitted by the following slightly modified expression\n\\begin{equation}\n\\label{Eq7}\nu_{\\mathrm{ex}} = A(\\kappa)\\ln\\left[ 1 + B(\\kappa) \\Gamma^{s(\\kappa)} \\right],\n\\end{equation}\nwhere $A$ and $B$ are now assumed $\\kappa$-dependent and a $\\kappa$-dependent exponent $s$ is introduced. Based on all the data points obtained in the MD simulations, the following relations are identified:\n$A(\\kappa) = 0.35708 + 0.09397\\kappa$,\n$B(\\kappa)= 1.65491\\exp(- 0.76911\\kappa)$,\n$s(\\kappa) = 0.68838 - 0.05183\\kappa$.\nSome representative examples are shown in Fig.~\\ref{FigR2}.\nThe fit of Eq.~(\\ref{Eq7}) is clearly more accurate and can be used in\nthe regime of weaker coupling, compared to the simple form (\\ref{Fit1}). 
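The more accurate fit of Eq.~(\ref{Eq7}), with its $\kappa$-dependent coefficients, can be packaged as a small Python sketch (coefficient expressions copied from the text; the function name is illustrative):

```python
import math

def u_th_fit(gamma, kappa):
    """Fit of Eq. (7) for the excess thermal energy of 2D Yukawa fluids,
    with the kappa-dependent coefficients quoted in the text."""
    A = 0.35708 + 0.09397 * kappa
    B = 1.65491 * math.exp(-0.76911 * kappa)
    s = 0.68838 - 0.05183 * kappa
    return A * math.log(1.0 + B * gamma**s)

# The fitted thermal energy grows monotonically with the coupling Gamma.
print(u_th_fit(10.0, 1.0), u_th_fit(100.0, 1.0))
```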
However, it is also less practical for evaluating thermodynamic parameters other than the excess internal energy.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R2.eps}\\\\\n \\caption{\n Thermal component of the reduced excess energy, $u_{\\rm th}$, of 2D Yukawa fluids near the fluid-solid phase transition versus the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. Symbols correspond to MD simulations for different values of the screening parameter $\\kappa$. The curves are the analytical fits to these data using Eq.~(\\ref{Fit1}): The upper (lower) curve corresponds to fitting the MD results for $\\kappa=0.5$ ($\\kappa = 3.0$) and the intermediate (red) curve is obtained by fitting the entire set of data points.}\n\\label{FigR1}\n\\end{figure}\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R3.eps}\\\\\n \\caption{Dependence of the excess thermal energy $u_{\\rm th}$ on the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. All the data points from numerical simulations are plotted. Solid curves correspond to three representative fits using Eq.~\\eqref{Eq7}.}\n\\label{FigR2}\n\\end{figure}\n\n\n\n\\subsection{Relation between excess pressure and energy}\n\nIt is sometimes advantageous to operate with an equation of state written in the form of a relation between the pressure and the internal energy of the system. For soft, purely repulsive potentials the simplest formulation of this kind can be written as\n\\begin{equation}\\label{gamma_ex}\np_{\\rm ex}=\\gamma_{\\rm ex}u_{\\rm ex}.\n\\end{equation}\nHere the parameter $\\gamma_{\\rm ex}$ generally depends both on the temperature and density, that is both on $\\Gamma$ and $\\kappa$ for Yukawa systems. 
Note that the parameter $\\gamma_{\\rm ex}$ introduced in this way is not directly related to the conventional definitions of either the density scaling exponent or the Gr\\\"uneisen parameter.~\\cite{HummelPRB2015} Nevertheless, it may be helpful in characterizing the softness of the repulsive potential. We recall that for inverse-power-law (IPL) repulsive potentials of the form $\\varphi(r)\\propto r^{-\\alpha}$ the relation between the excess pressure and energy is particularly simple, $p_{\\rm ex}=\\tfrac{\\alpha}{2} u_{\\rm ex}$ in 2D. Thus, an ``effective IPL exponent'' may be associated with the quantity $2\\gamma_{\\rm ex}$.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{gamma.eps}\\\\\n \\caption{Ratio of the excess pressure to the excess energy, $\\gamma_{\\rm ex}=p_{\\rm ex}/u_{\\rm ex}$, on the plane ($\\kappa$, $\\Gamma/\\Gamma_{\\rm m}$).\n }\n\\label{gamma}\n\\end{figure}\n\nHaving approximations for both $p_{\\rm ex}$ and $u_{\\rm ex}$ for 2D Yukawa fluids we can easily estimate the value of $\\gamma_{\\rm ex}$. The corresponding plot of $\\gamma_{\\rm ex}$ as a function of the Yukawa state variables $\\kappa$ and $\\Gamma/\\Gamma_{\\rm m}$ is shown in Fig.~\\ref{gamma}. To produce this plot, Eq.~(\\ref{Fit1}) for the thermal component of the excess energy has been used. Figure~\\ref{gamma} shows that in the strongly coupled regime $\\gamma_{\\rm ex}$ is very weakly dependent on the coupling strength (temperature), but exhibits considerable dependence on $\\kappa$ (density). 
Using the exact MD results for $p_{\\rm ex}/u_{\\rm ex}$ in the vicinity of the fluid-solid phase transition ($\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) we have obtained a representative dependence $\\gamma_{\\rm ex}(\\kappa)$ in the strongly coupled regime:\n\\begin{equation}\n\\gamma_{\\rm ex}(\\kappa)=1+0.526\\kappa+0.13\\kappa^2-0.02\\kappa^3.\n\\end{equation}\nImportantly, $\\gamma_{\\rm ex}\\rightarrow 1$ as $\\kappa\\rightarrow 0$.\nThis seems counter-intuitive at first, because one would naturally expect $\\gamma_{\\rm ex}=\\tfrac{1}{2}$ in the OCP Coulomb interaction limit in 2D. The difference is attributed to the presence of the neutralizing background in the OCP model. In the limit of very soft interactions, the energy and pressure are dominated by their static contributions. As $\\kappa\\rightarrow 0$, the dominant contribution is the Madelung energy, so that $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M\\Gamma\\sim \\Gamma/\\kappa$ (without background). This implies $p_{\\rm ex}=\\tfrac{\\Gamma}{2}(\\partial f_{\\rm ex}/\\partial \\Gamma)-\\tfrac{\\kappa}{2}(\\partial f_{\\rm ex}/\\partial \\kappa)\\sim \\Gamma/\\kappa\\sim u_{\\rm ex}$. In the presence of a neutralizing background, the term $\\Gamma/\\kappa$ disappears and we have $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M_{\\rm OCP}\\Gamma$. This yields $p_{\\rm ex}\\sim \\tfrac{1}{2}M_{\\rm OCP}\\Gamma\\sim \\tfrac{1}{2}u_{\\rm ex}$. This consideration demonstrates that Yukawa systems in the limit $\\kappa\\rightarrow 0$ are not fully equivalent to Coulomb systems with a neutralizing background.\n\n\n\\subsection{Crystals}\n\nIn a series of MD simulations for 2D Yukawa crystals, in addition to evaluating the excess energy and pressure (which are summarized in Tables~\\ref{Table3} and \\ref{Table4} of the Appendix), the mean-squared displacements were calculated to find the anharmonic correction coefficient $\\beta$. 
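The near-freezing fit for $\gamma_{\rm ex}(\kappa)$ quoted above, together with the effective IPL exponent $2\gamma_{\rm ex}$, can be sketched as follows (coefficients copied from the equation; the function name is illustrative):

```python
def gamma_ex(kappa):
    """Near-freezing fit for the ratio p_ex/u_ex of 2D Yukawa fluids
    (coefficients from the text); gamma_ex -> 1 as kappa -> 0."""
    return 1.0 + 0.526 * kappa + 0.13 * kappa**2 - 0.02 * kappa**3

# Effective inverse-power-law exponent alpha_eff = 2 * gamma_ex.
for kappa in (0.0, 1.0, 3.0):
    print(kappa, gamma_ex(kappa), 2.0 * gamma_ex(kappa))
```

The fit reproduces the limit $\gamma_{\rm ex}\rightarrow 1$ as $\kappa\rightarrow 0$ exactly and increases monotonically over the studied range $0\leq\kappa\leq 3$.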
The resulting dependence $\\beta(\\kappa)$ is shown in Figure~\\ref{FigR3} (the corresponding values are also tabulated in Table~\\ref{Table2} of the Appendix for completeness).\nThe inset in Fig.~\\ref{FigR3} presents the radial (isotropic) pair correlation function, $g(r) \\propto \\int{d\\varphi\\; g(\\mathbf{r})}$,\nand demonstrates excellent representation of the short- and long-distance correlations. The obtained anharmonic correction coefficient $\\beta(\\kappa)$ allows one to calculate the pair correlation function analytically, and then the excess energy, pressure, and other thermodynamic parameters by thermodynamic integration with the help of the expressions given in Sec.~\\ref{Thermo}.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R5.eps}\\\\\n \\caption{ Dependence of the anharmonic correction coefficient $\\beta$ on the screening parameter $\\kappa$. The inset demonstrates a typical comparison between the radial distribution functions obtained in a direct MD simulation and computed using the shortest-graph method. For details see the text.}\n\\label{FigR3}\n\\end{figure}\n\nIt is worth pointing out the following observation:\nIn the limit $\\kappa \\rightarrow 0$, the Yukawa interaction tends to the unscreened Coulomb interaction $\\varphi \\propto r^{-1}$. According to our previous MD simulations,~\\cite{1.4926945}\nthe finite-temperature phononic spectra differ weakly from zero-temperature ones for IPL potentials, $\\varphi \\propto r^{-\\alpha}$. Therefore, in the OCP limit ($\\kappa=0$ and $\\alpha=1$) we should obtain the smallest values of $\\beta(\\kappa)$. This is indeed observed in Fig.~\\ref{FigR3}.\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R6.eps}\\\\\n \\caption{ Dependence of the reduced pressure on the reduced excess energy. Open (solid) symbols are the results of MD simulations for fluids and solids, respectively. 
The solid and dashed curves correspond to the shortest-graph method for solids and to the fit of Eq.~(\\ref{Eq7}) for fluids.}\n\\label{FigR4}\n\\end{figure}\n\nIn Fig.~\\ref{FigR4} we plot the reduced pressure versus the reduced excess energy of 2D Yukawa fluids and solids. Symbols are the MD results; the solid and dashed curves correspond to the shortest-graph method [with the anharmonic correction coefficient $\\beta(\\kappa)$ found above] for the crystalline phase and the proposed fit by Eq.\\eqref{Eq7} for the fluid phase, respectively. Excellent agreement is observed.\n\n\\subsection{Accuracy}\n\nThe relative difference between the excess energies calculated using the shortest-graph method and those evaluated using direct MD simulations in the solid phase amounts to $\\simeq5\\times 10^{-5}$, which is comparable to the values reported earlier.~\\cite{0953-8984-28-23-235401} The accurate fit of Eq.~\\eqref{Eq7}\nyields the relative error in the excess energy smaller than $5\\times10^{-4}$ and $2\\times10^{-3}$ for 72\\% and 95\\% of\nthe examined fluid data points, respectively. The maximal relative deviation, $5\\times 10^{-3}$, is observed near the melting line at large values of the screening parameter $\\kappa$. The simpler fit of Eq.~(\\ref{Fit1}) is applicable when relative deviations within $\\lesssim 1\\%$ are acceptable.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{Pressure_kappa05.eps}\\\\\n \\caption{Reduced pressure, $p$, as a function of the coupling parameter $\\Gamma$ for a Yukawa 2D fluid with the screening parameter $\\kappa=0.5$. 
The symbols are exact MD results, the solid (red) line corresponds to the fit of Eq.~(\\ref{Fit1}), the dashed (blue) line is the fit from Ref.~\\onlinecite{0022-3727-49-23-235203}.}\n\\label{FigPressure}\n\\end{figure}\n\nIn addition, we can compare our results with those recently reported in Refs.~\\onlinecite{0022-3727-49-23-235203,1.4962685}, where fits for the pressure of 2D Yukawa fluids in the $(\\kappa,\\Gamma)$ parameter space have been proposed. The case $\\kappa=0.5$ received special attention and a simple two-term fit has been proposed based on the results of an MD simulation,~\\cite{0022-3727-49-23-235203} $p=1.53\\Gamma+1.33$.\nWe plot our MD results along with the fit of Eq.~(\\ref{Fit1}) and the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} in Fig.~\\ref{FigPressure}. One can see that the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} overestimates the pressure systematically at high values of $\\Gamma$. At the strongest coupling in the fluid phase studied in this work, $\\Gamma=135.42$, the present MD simulation yields $p=199.434$, the fit by Eq.~(\\ref{Fit1}) yields $p=199.432$, while the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} yields $p=208.523$. On the other hand, the previous model for 2D Yukawa systems in the OCP (weak-screening) limit discussed in Refs.~\\onlinecite{KhrapakPoP08_2015,1.4935846}\nyields $p=199.445$, providing confidence in the accuracy of the present results. The reasons for the deviations in Ref.~\\onlinecite{0022-3727-49-23-235203} remain to be identified.\n\n\n", "Descriptive_question1": "What is the Madelung constant for κ = 0.5 in table_1?", "Descriptive_question2": "What is the value of κ when the Madelung constant is 0.03660 in table_1?", "Reasoning_question1": "How does the Madelung constant change as the screening parameter κ increases from 0.5 to 3.0 in table_1, and what might this indicate about the interaction strength in 2D Yukawa crystals?", "Reasoning_question2": "Why do the Madelung constants in table_1 decrease significantly with increasing κ, and how does this relate to the screening effect in the triangular lattice of 2D Yukawa crystals?", "Descriptive_answer1": "1.11914", "Descriptive_answer2": "2.0", "Reasoning_answer1": "As the screening parameter κ increases from 0.5 to 3.0 in table_1, the Madelung constant M decreases significantly from 1.11914 to 0.00525. Observing the data, each incremental increase in κ corresponds to a consistent reduction in M, indicating a clear trend of diminishing values. For instance, at κ=1.0, M is 0.29709, and by κ=2.0, it drops to 0.03660, showing a substantial decrease over the range. 
This trend suggests a weakening of the interaction strength in 2D Yukawa crystals as κ increases, likely due to enhanced screening effects that reduce the effective potential between particles in the triangular lattice. A higher κ implies stronger screening, which shortens the range of interaction, thus lowering the Madelung constant, which represents the static energy contribution of the lattice.", "Reasoning_answer2": "The significant decrease in Madelung constants with increasing κ in table_1, from 1.11914 at κ=0.5 to 0.00525 at κ=3.0, can be attributed to the screening effect inherent in Yukawa interactions. In the context of 2D Yukawa crystals with a triangular lattice, κ represents the screening parameter that controls the range of the repulsive potential between particles. As κ increases, the interaction potential decays more rapidly with distance due to stronger screening, effectively reducing the long-range interactions that contribute to the lattice energy. The Madelung constant, being a measure of the static energy per particle in the lattice, thus decreases because fewer neighboring particles contribute significantly to the potential energy at higher κ. This reflects how screening weakens the overall cohesive energy of the lattice structure, aligning with the physical expectation that increased screening diminishes the effective interaction strength in such systems." 
}, { "paper_id": "1704.00976.json", "table_id": "table_2", "table_content": "\\begin{table}[h]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\rm ex}$ and pressure $p$ of two-dimensional Yukawa fluids evaluated using MD simulations for various coupling ($\\Gamma$) and screening ($\\kappa$) parameters.}\n\t\\label{Table1}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c}\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.5$}\\\\ \\hline\n\t\t$\\Gamma$ & 135.420 & 86.7254 & 52.7787 & 32.1811 & 19.6073 & 11.9310 & 7.27175 & 4.43126 & 2.69848 & 1.64302 & 1.00136 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 152.944 & 98.3115 & 60.1901 & 37.0087 & 22.8180 & 14.1176 & 8.79838 & 5.51964 & 3.48587 & 2.21772 & 1.42021 & 0.76495\\\\\n\t\t$p$ & 199.434 & 128.303 & 78.6946 & 48.5651 & 30.1485 & 18.8835 & 12.0216 & 7.81631 & 5.22964 & 3.63556 & 2.64961 & 1.85883\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.6$}\\\\\\hline\n\t\t$\\Gamma$ & 140.131 & 89.5076 & 54.3171 & 32.9737 & 20.0017 & 12.1359 & 7.36665 & 4.47442 & 2.71053 & 1.64677 & 1.00106 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 116.984 & 75.1128 & 45.9415 & 28.2016 & 17.3768 & 10.7727 & 6.73045 & 4.24422 & 2.69421 & 1.72956 & 1.11776 & 0.61083\\\\\n\t\t$p$ \t\t\t\t\t& 160.369 & 103.050 & 63.1652 & 38.9451 & 24.1971 & 15.2284 & 9.76528 & 6.42899 & 4.37128 & 3.11015 & 2.32663 & 1.69701\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.8$}\\\\\\hline\n\t\t$\\Gamma$ & 152.277 & 96.5736 & 58.0604 & 34.9737 & 21.0334 & 12.6675 & 7.61503 & 4.58845 & 2.75830 & 1.66410 & 0.99914 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 74.6424 & 47.7340 & 29.0608 & 17.8181 & 10.9844 & 6.84185 & 4.30139 & 2.74217 & 1.76665 & 1.15293 & 0.75437 & 0.42469\\\\\n\t\t$p$ \t\t\t\t\t& 112.709 & 72.1411 & 44.0441 & 27.1658 & 16.9406 & 10.7731 & 7.01845 & 4.73986 & 3.33679 & 2.47393 & 1.92983 & 1.49910\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.0$}\\\\\\hline\n\t\t$\\Gamma$ & 169.071 & 105.975 & 63.1038 & 37.6027 & 22.4047 
& 13.3361 & 7.94729 & 4.73129 & 2.81940 & 1.68034 & 0.99956 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 51.5786 & 32.7335 & 19.8556 & 12.1451 & 7.50279 & 4.68984 & 2.97702 & 1.91799 & 1.25426 & 0.82932 & 0.55059 & 0.31770\\\\\n\t\t$p$ \t& 85.4036 & 54.2492 & 33.0215 & 20.3527 & 12.7618 & 8.19406 & 5.44279 & 3.76791 & 2.74103 & 2.10336 & 1.70075 & 1.38135\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.2$}\\\\\\hline\n\t\t$\\Gamma$ & 191.126 & 118.398 & 69.6429 & 40.9597 & 24.1083 & 14.1893 & 8.34919 & 4.90490 & 2.88868 & 1.70019 & 0.99984 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 37.5852 & 23.6918 & 14.3026 & 8.72609 & 5.39936 & 3.39637 & 2.17547 & 1.41736 & 0.93933 & 0.62908 & 0.42281 & 0.24960\\\\\n\t\t$p$ \t& 67.9344 & 42.8619 & 25.9838 & 16.0024 & 10.0874 & 6.56025 & 4.44041 & 3.15023 & 2.36021 & 1.86635 & 1.55301 & 1.30594\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.4$}\\\\\\hline\n\t\t$\\Gamma$ & 220.172 & 134.441 & 77.9949 & 45.2452 & 26.2578 & 15.2219 & 8.83634 & 5.12702 & 2.97137 & 1.72440 & 1.00140 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 28.5555 & 17.8503 & 10.7244 & 6.53392 & 4.05300 & 2.56405 & 1.65932 & 1.09552 & 0.73364 & 0.49726 & 0.33718 & 0.20253\\\\\n\t\t$p$ \t& 56.0915 & 35.0963 & 21.1892 & 13.0574 & 8.28303 & 5.45288 & 3.76392 & 2.73780 & 2.10241 & 1.70540 & 1.45171 & 1.25396\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.6$}\\\\\\hline\n\t\t$\\Gamma$ & 258.433 & 155.296 & 88.6297 & 50.6106 & 28.9099 & 16.4928 & 9.41249 & 5.37870 & 3.07317 & 1.75217 & 0.99889 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 22.4535 & 13.9136 & 8.31218 & 5.05719 & 3.14728 & 2.00498 & 1.30903 & 0.87473 & 0.59391 & 0.40446 & 0.27520 & 0.16486\\\\\n\t\t$p$ & 47.7294 & 29.6021 & 17.7849 & 10.9674 & 7.00739 & 4.67522 & 3.28559 & 2.44432 & 1.92230 & 1.58965 & 1.37647 & 1.15781\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.8$}\\\\\\hline\n\t\t$\\Gamma$ & 308.935 & 182.395 & 102.261 & 57.3435 & 32.1483 & 18.0355 & 10.1029 & 5.67241 & 3.17978 & 
1.78359 & 0.99997 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 18.1745 & 11.1626 & 6.63304 & 4.02868 & 2.51560 & 1.61389 & 1.06328 & 0.71747 & 0.49051 & 0.33739 & 0.23058 & 0.14359\\\\\n\t\t$p$ \t& 41.6428 & 25.5932 & 15.3055 & 9.44338 & 6.07675 & 4.10949 & 2.93845 & 2.22906 & 1.78546 & 1.50402 & 1.32125 & 1.18748\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.0$}\\\\\\hline\n\t\t$\\Gamma$ & 375.818 & 217.422 & 119.600 & 65.7745 & 36.1611 & 19.8980 & 10.9232 & 6.01199 & 3.30681 & 1.81767 & 1.00051 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 15.0964 & 9.17319 & 5.42177 & 3.29200 & 2.06139 & 1.33276 & 0.88426 & 0.60261 & 0.41513 & 0.28650 & 0.19651 & 0.12379\\\\\n\t\t$p$ \t\t\t\t\t& 37.1333 & 22.5775 & 13.4413 & 8.30684 & 5.38337 & 3.68921 & 2.67835 & 2.06727 & 1.68347 & 1.43752 & 1.27850 & 1.16494\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.2$}\\\\\\hline\n\t\t$\\Gamma$ & 463.975 & 262.948 & 141.568 & 76.2338 & 41.0173 & 22.0958 & 11.9035 & 6.41082 & 3.45056 & 1.85303 & 1.00113 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 12.7875 & 7.69994 & 4.52708 & 2.74830 & 1.72461 & 1.12217 & 0.75368 & 0.51642 & 0.35777 & 0.24734 & 0.17009 & 0.10850\\\\\n\t\t$p$ \t\t\t\t\t& 33.6575 & 20.2710 & 12.0118 & 7.43585 & 4.85060 & 3.36425 & 2.48426 & 1.94445 & 1.60450 & 1.38520 & 1.24473 & 1.14408\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.4$}\\\\\\hline\n\t\t$\\Gamma$ & 578.968 & 320.871 & 168.949 & 89.0382 & 46.8778 & 24.7092 & 12.9953 & 6.85634 & 3.60307 & 1.89919 & 0.99952 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 10.9709 & 6.56430 & 3.83850 & 2.33031 & 1.47100 & 0.96365 & 0.65089 & 0.44974 & 0.31141 & 0.21697 & 0.14862 & 0.09589\\\\\n\t\t$p$ & 30.8215 & 18.4175 & 10.8648 & 6.74135 & 4.43655 & 3.11369 & 2.32748 & 1.84722 & 1.53931 & 1.34446 & 1.21673 & 1.12942\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.6$}\\\\\\hline\n\t\t$\\Gamma$ & 723.656 & 392.384 & 202.051 & 104.080 & 53.5742 & 27.6270 & 14.2191 & 7.32182 & 3.76653 & 1.93971 & 1.00200 & 0.5 
\\\\\n\t\t$u_{\\mathrm{ex}}$ & 9.50055 & 5.63818 & 3.28596 & 1.99866 & 1.26783 & 0.83500 & 0.56905 & 0.39442 & 0.27600 & 0.19145 & 0.13130 & 0.08576\\\\\n\t\t$p$ \t& 28.3633 & 16.8096 & 9.89231 & 6.16190 & 4.09049 & 2.90245 & 2.19936 & 1.76426 & 1.48858 & 1.30961 & 1.19408 & 1.11954\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.8$}\\\\\\hline\n\t\t$\\Gamma$ & 893.746 & 474.549 & 239.143 & 120.685 & 60.8483 & 30.6642 & 15.4796 & 7.80951 & 3.93161 & 1.98042 & 1.00296 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 8.19448 & 4.82859 & 2.81518 & 1.71951 & 1.09985 & 0.73051 & 0.50093 & 0.35038 & 0.24489 & 0.17117 & 0.11700 & 0.07671\\\\\n\t\t$p$ \t& 25.9004 & 15.2521 & 8.98792 & 5.63831 & 3.78782 & 2.72194 & 2.08856 & 1.69631 & 1.44344 & 1.28133 & 1.17497 & 1.10201\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=3.0$}\\\\\\hline\n\t\t$\\Gamma$ & 1071.02 & 558.495 & 276.444 & 136.953 & 67.7922 & 33.5897 & 16.6383 & 8.22716 & 4.07874 & 2.02013 & 0.99949 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 6.93189 & 4.07091 & 2.38838 & 1.47193 & 0.95056 & 0.64023 & 0.44340 & 0.31146 & 0.21994 & 0.15395 & 0.10494 & 0.06958\\\\\n\t\t$p$ \t& 23.1181 & 13.5906 & 8.07317 & 5.12679 & 3.49444 & 2.55590 & 1.98879 & 1.63334 & 1.40554 & 1.25677 & 1.15868 & 1.09682\\\\\\hline\\hline\n\t\t\\end{tabular}\n\\end{table}", "caption": "Reduced excess energy $u_{\\rm ex}$ and pressure $p$ of two-dimensional Yukawa fluids evaluated using MD simulations for various coupling ($\\Gamma$) and screening ($\\kappa$) parameters.", "label": "Table1", "section_info": "2 Methods\n\\section{Methods}\n\n\n\\subsection{System description}\n\\label{SD}\n\nWe investigate a classical system of point-like particles in the 2D geometry interacting via the pairwise repulsive Yukawa potential of the form\n\\begin{equation*}\n\\varphi (r) = \\frac{\\varepsilon \\lambda}{r}\\exp\\left(-\\frac{r}{\\lambda}\\right),\n\\end{equation*}\nwhere $\\varepsilon$, and $\\lambda$ are the energy and (screening) length scales of the 
interaction. For charged particles immersed in a plasma-like screening environment, the energy scale is $\\varepsilon=Q^2/4\\pi\\epsilon_0\\lambda$ (in SI units), where $Q$ is the charge and $\\epsilon_0$ is the permittivity of free space. The properties of Yukawa systems are determined by two dimensionless parameters. The first is the coupling parameter, $\\Gamma = (Q^2/4\\pi \\epsilon_0 a k_{\\mathrm{B}}T)$, where $k_{\\mathrm{B}}$ is the Boltzmann constant, $T$ is the temperature, $a=(\\pi n)^{-1/2}$ is the 2D Wigner-Seitz radius, and $n=N/V$ is the areal density of $N$ particles occupying the 2D volume $V$. The second is the screening parameter, $\\kappa = a/\\lambda$. Note that the coupling parameter is roughly the ratio of the potential energy of interaction between two neighbouring particles to their kinetic energy. The system is usually said to be in the strongly coupled state when this ratio is large, that is $\\Gamma\\gtrsim 1$.\n\nWhen coupling increases, the system forms a strongly coupled fluid phase, which can crystallize upon further increase in $\\Gamma$. This fluid-solid transition can be characterized by the temperature and/or coupling parameter, $T_{\\rm m}$ and $\\Gamma_{\\rm m}$, where the subscript ``m'' refers to melting. Both $T_{\\rm m}$ and $\\Gamma_{\\rm m}$ are functions of the screening parameter $\\kappa$. The dependence $\\Gamma_{\\rm m}(\\kappa)$ has been approximated in Ref.~\\onlinecite{PhysRevE.72.026409} by the following fit:\n\\begin{equation}\\label{Melting2D}\n\\Gamma_{\\rm m}(\\kappa)\\simeq \\frac{131}{1-0.388\\kappa^2+0.138\\kappa^3-0.0138\\kappa^4}.\n\\end{equation}\nThis fit describes the melting points found from the bond angular correlation analysis (see Fig.~6 of Ref.~\\onlinecite{PhysRevE.72.026409}) relatively well up to $\\kappa = 3.0$; it should not be applied at larger $\\kappa$. In the limit $\\kappa = 0$ the system reduces to the 2D one-component-plasma (OCP) with the Coulomb interaction. 
In this case $\\Gamma_{\\rm m}\\simeq 131$ lies in the range predicted in earlier numerical simulations~\\cite{Gann1979} and obtained in experiments with a classical 2D sheet of electrons~\\cite{Grimes1979} (see also Ref.~\\onlinecite{KhrapakCPP2016} for a recent overview of OCP thermodynamics in 2D and 3D).\n\nFinally, it is worth commenting on the nature of the fluid-solid phase transition in 2D Yukawa systems. Recently, it has been demonstrated that the potential softness is a very important factor determining the melting scenario.~\\cite{KapferPRL2015}\nFor sufficiently steep repulsive interactions the hard-disk melting scenario holds: a first-order liquid-hexatic and a continuous\nhexatic-solid transition can be identified.~\\cite{PhysRevLett.107.155704, PhysRevE.87.042134} For softer interactions the liquid-hexatic transition is continuous, with correlations consistent with the Kosterlitz-Thouless-Halperin-Nelson-Young (KTHNY) scenario. (For example, in 2D colloidal systems, the hexatic phase was observed in the experiments of Zahn et al.~\\cite{PhysRevLett.82.2721}) For the Yukawa potential the transition between these two scenarios occurs at about $\\kappa\\simeq 6$.~\\cite{KapferPRL2015} Below we consider systems with $\\kappa$ in the range from $0.5$ to $3.0$ (this range is particularly relevant to 2D plasma crystals and fluids in laboratory experiments~\\cite{FortovUFN2004,FortovPR2005,CTPP:CTPP201400099}), thus belonging to the soft-interaction class. In this range of $\\kappa$, the hexatic phase occupies a rather narrow region of the phase diagram,~\\cite{KapferPRL2015} and the study of its properties is beyond the scope of the present investigation.\n\n\\subsection{Computational details}\n\\label{MDdetails}\n\nTo obtain the thermodynamic properties of 2D Yukawa systems across the coupling regimes, extensive MD simulations have been performed. 
The MD simulations were carried out in the $NVT$ ensemble at different temperatures using $N=64\\,000$ particles and the Langevin thermostat. The numerical time step was chosen as $\\Delta t_c=5\\times 10^{-4}\\sqrt{m\\lambda^2/\\epsilon}$ for the crystalline phase and as $\\Delta t_c \\sqrt{\\Gamma/\\Gamma_{\\rm m}}$ for the fluid phase. The cutoff radius of the Yukawa potential was set equal to $15n^{-1/2}$. The simulations were run for $1.5\\times 10^6$ time steps to equilibrate the system and obtain the equilibrium properties. In the simulation runs with $\\kappa = 0.5$, Ewald summation was employed.\n\nThe simulations have been performed for a number of screening parameters $\\kappa$ ranging from $0.5$ to $3.0$. This corresponds to sufficiently soft interactions, as discussed above. For each value of the screening parameter $\\kappa$, twelve simulation runs correspond to the fluid phase and nine runs to the crystalline phase. In the fluid phase the coupling parameter ranges from $\\Gamma=0.5$ to $\\simeq 0.95\\Gamma_{\\rm m}$. In the solid phase the values corresponding to $\\Gamma_{\\rm m}/\\Gamma=0.9,0.8,...,0.1$ are taken.\n\nThe main simulation results are summarized in Tables~\\ref{Table1}-\\ref{Table4} of the Appendix.\n\n\n\n\\subsection{Thermodynamic definitions and relations}\\label{Thermo}\n\nThe main thermodynamic quantities that will be required below are the internal energy $U$, Helmholtz free energy $F$, and pressure $P$ of the system. 
The standard thermodynamic definitions are~\\cite{LL}\n\\begin{eqnarray}\nU=-T^2\\left(\\frac{\\partial}{\\partial T}\\frac{F}{T}\\right)_V, \\\\\nP=-\\left(\\frac{\\partial F}{\\partial V}\\right)_T.\n\\end{eqnarray}\nIn addition, $U$ and $P$ can be calculated using the integral equations of state~\\cite{hansen-book, frenkel2001}\n\\begin{equation}\n\\begin{split}\n& U= N\\left(k_{\\rm B}T+ \\frac{n}{2}\\int{d\\mathbf{r}\\; \\varphi(r)g(\\mathbf{r})}\\right),\\\\\n& PV = N\\left(k_{\\rm B}T - \\frac{n}{4}\\int{d\\mathbf{r}\\; r\\varphi'(r)g(\\mathbf{r})} \\right),\n\\end{split}\n\\end{equation}\nwhere $g(\\mathbf{r})$ denotes the radial distribution function, which is isotropic in the gas and fluid phases and anisotropic in the crystalline phase.\n\nWe will use conventional reduced units: $u=U/Nk_{\\rm B}T$, $f=F/Nk_{\\rm B}T$, and $p=PV/Nk_{\\rm B}T$ and divide the thermodynamic quantities into the kinetic (ideal gas) and potential (excess) components, so that $u=1 + u_{\\rm ex}$ (in 2D), $f=f_{\\rm id}+f_{\\rm ex}$, and $p=1+p_{\\rm ex}$. Finally, it is convenient to work with the Yukawa state variables $\\Gamma$ and $\\kappa$. In these variables the thermodynamic identities for 2D Yukawa fluids are~\\cite{KhrapakPoP08_2015, 1.4935846}\n\\begin{equation}\\label{pf}\np=1+\\frac{\\Gamma}{2}\\frac{\\partial f_{\\rm ex}}{\\partial\\Gamma}-\\frac{\\kappa}{2}\\frac{\\partial f_{\\rm ex}}{\\partial\\kappa}, \\qquad\nf_{\\rm ex} = \\int_0^{\\Gamma}{d\\Gamma'\\; \\frac{u_{\\mathrm{ex}}(\\kappa, \\Gamma')}{\\Gamma'}}.\n\\end{equation}\n\n\\subsection{The shortest-graph method}\n\nTo describe the thermodynamics of 2D Yukawa crystals analytically,\nwe employ the shortest-graph method, proposed and developed in Refs.~\\onlinecite{1.4869863, 1.4926945, 0953-8984-28-23-235401}.\nFollowing these papers, the thermodynamic properties of classical crystals can be obtained very accurately from the following consideration. 
The anisotropic pair-correlation function $g(\\mathbf{r})$ of a crystal is written in the form\n\\begin{equation}\n\\label{Eq1}\ng(\\mathbf{r}) = \\frac{1}{n}\\sum_\\alpha{p_\\alpha(\\mathbf{r}-\\mathbf{r_\\alpha})},\n\\end{equation}\nwhere the summation is over all the nodes $\\alpha$, and\neach individual peak has the shape\n\\begin{equation}\n\\label{Eq2}\n\\begin{split}\n&p_\\alpha(\\mathbf{r}) \\propto\n \\exp\\left[-\\frac{\\varphi(\\mathbf{r}+\\mathbf{r_\\alpha})}{k_{\\rm B}T}-b_\\alpha(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})-\n\\right. \\\\\n& \\qquad\\qquad \\qquad \\qquad \\left.\n-\\frac{(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2}{2 a_{\\|\\alpha}^2}-\n\\frac{\\mathbf{r}^2-(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2}{2 a_{\\perp\\alpha}^2}\\right].\n\\end{split}\n\\end{equation}\nThe normalization constant as well as the parameters $a_{\\|,\\perp\\alpha}^2, b_\\alpha$ are defined by the following conditions\\cite{0953-8984-28-23-235401}\n\\begin{equation}\n\\label{Eq3}\n\\begin{split}\n& \\int{d\\mathbf{r}\\;p_\\alpha(\\mathbf{r})}=1, \\qquad \\int{d\\mathbf{r}\\;\\mathbf{r}p_\\alpha(\\mathbf{r})}=0, \\\\\n& \\int{d\\mathbf{r}\\;(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2 p_\\alpha(\\mathbf{r})}=\\sigma_{\\|\\alpha}^2,\\\\\n& \\int{d\\mathbf{r}\\;[\\mathbf{r}^2-(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2] p_\\alpha(\\mathbf{r})}=(D-1) \\sigma_{\\perp\\alpha}^2,\n\\end{split}\n\\end{equation}\nwhere $D=2$ is the spatial dimensionality, $\\mathbf{e_\\alpha}=\\mathbf{r_\\alpha}/r_\\alpha$ is the unit vector in the\ndirection of $\\mathbf{r_\\alpha}$, and\n$\\sigma_{\\|,\\perp}^2$ are the mean-squared displacements in the longitudinal and transverse directions, respectively, calculated using the finite-temperature phonon spectra,\ntaking anharmonic effects into account.\\cite{0953-8984-28-23-235401} Using the pair correlation function $g(\\mathbf{r})$, the excess energy and pressure can then be obtained.\nHowever, calculation of the finite-temperature phonon spectra is 
a difficult problem, which is beyond the scope of the present paper.\nTherefore, we propose here a simpler approach, which yields very accurate results and can be used for practical calculations.\n\nDue to the anharmonicity of phonon spectra at finite temperatures,\nthe second-order term becomes more significant in the temperature expansion of the mean-squared displacements $\\sigma^2$.\nTo account for this effect, we propose the anharmonic correction of the mean-squared displacements\n\\begin{equation}\n\\label{Eq5}\n\\sigma_{\\|,\\perp\\alpha}^2 = \\widetilde{\\sigma}_{\\|,\\perp\\alpha}^2 \\left[1+\\beta(\\kappa)N\\widetilde{\\sigma}_{1}^2/V\\right],\n\\end{equation}\nwhere the tildes denote the mean-squared displacements calculated using zero-temperature phonon spectra (see Ref.\\onlinecite{1.4926945}),\n$\\widetilde{\\sigma}_1^2$ is the total mean-squared displacement for the nearest neighbours, and we have introduced the anharmonic correction coefficient $\\beta(\\kappa)$, which does not depend on the temperature and is determined from MD simulations for different screening parameters.\nThe correction given by Eq.\\eqref{Eq5} conserves the ratio $\\sigma_\\|^2/\\sigma_\\perp^2$ between the mean-squared displacements in the longitudinal and transverse directions.\n\\emph{A posteriori} comparison with MD results confirms that this assumption yields excellent accuracy.\n\n3 Results\n\\section{Results}\n\n\\subsection{Weakly-coupled fluids}\n\n\\begin{figure}[!b]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R1.eps}\\\\\n \\caption{The excess energy $u_{\\rm ex}$ of weakly coupled 2D Yukawa fluids versus the screening parameter $\\kappa$ at a fixed coupling parameter $\\Gamma = 0.5$. The symbols correspond to the results of MD simulations; the solid curve is plotted using the analytical expression of Eq.~(\\ref{SVC}).\n }\n\\label{FigSC}\n\\end{figure}\n\nA simple and physically transparent approach to the thermodynamics of weakly coupled Yukawa systems for small deviations from the ideal-gas behavior is to calculate the second virial coefficient. This has recently been shown to work well in 3D Yukawa systems.~\\cite{KhrapakPPCF2016} In the 2D geometry the excess free energy is expressed in this approximation as\n\\begin{equation}\\label{SVC}\nf_{\\rm ex}\\simeq \\pi n \\int\\left[1-e^{-\\varphi(r)/k_{\\rm B}T}\\right]r dr.\n\\end{equation}\nThe excess energy and pressure can be readily obtained from the excess free energy. 
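The weak-coupling estimate of Eq.~(\ref{SVC}) is straightforward to evaluate numerically. A minimal sketch (Python with SciPy; an illustration under the assumption of reduced units, with distances in units of the Wigner-Seitz radius $a$, so that $\pi n a^2 = 1$, $\varphi/k_{\rm B}T = \Gamma e^{-\kappa x}/x$ with $x=r/a$, and $u_{\rm ex}=\Gamma\,\partial f_{\rm ex}/\partial\Gamma$ from Eq.~(\ref{pf})):

```python
# Second-virial (weak-coupling) estimate for a 2D Yukawa fluid, Eq. (SVC).
# Reduced units: x = r/a with a = (pi*n)^(-1/2), so pi*n*a^2 = 1 and
# phi(x)/kT = Gamma*exp(-kappa*x)/x.  Differentiating f_ex with respect
# to Gamma (u_ex = Gamma * d f_ex / d Gamma) gives the integrand below.
import math
from scipy.integrate import quad

def u_ex_virial(gamma, kappa):
    """Reduced excess energy from the second virial coefficient."""
    def integrand(x):
        if x == 0.0:  # integrand -> 0 as x -> 0+
            return 0.0
        e = math.exp(-kappa * x)
        return e * math.exp(-gamma * e / x)
    val, _ = quad(integrand, 0.0, math.inf, limit=200)
    return gamma * val
```

At $\Gamma=0.5$, $\kappa=3.0$ this gives a value a few percent below the MD result $u_{\rm ex}=0.06958$ of Table~\ref{Table1}, consistent with the accuracy quoted in the comparison below.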
We compare the values of $u_{\\rm ex}$ at a fixed coupling parameter $\\Gamma=0.5$ obtained from Eq.~(\\ref{SVC}) and computed using MD simulations in Fig.~\\ref{FigSC}. The agreement is satisfactory: in the range of $\\kappa$ investigated, the deviations are within several percent. The agreement naturally improves with increasing $\\kappa$, because at a fixed $\\Gamma$ the actual interaction strength weakens as $\\kappa$ increases.\n\n\n\\subsection{Strongly-coupled fluids}\n\nThe excess energy and pressure of the 2D Yukawa fluids have been determined using MD simulations in a wide range of coupling and screening parameters. The results are summarized in Table~\\ref{Table1} of the Appendix. Here we describe simple analytical\napproximations, which can be used to evaluate the energy and pressure for practical purposes.\n\nIn the strongly coupled fluid regime it is helpful to divide the thermodynamic quantities, such as energy and pressure, into static and thermal contributions. The static contribution corresponds to the value of the internal energy when the particles are frozen in some regular configuration, and the thermal corrections arise from the deviations of the particles from these fixed positions due to thermal motion. Of course, such a division is only meaningful when the regular structure is specified. For crystals, the obvious choice is the corresponding lattice sum (Madelung energy). 
For fluids this choice is also meaningful and we use it here. (Note that in 3D Yukawa systems a slightly different definition of the static fluid energy is traditionally employed.~\\cite{KhrapakPPCF2016, KhrapakISM})\n\n\\begin{table}[!b]\n\\caption{\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ }\n\\begin{ruledtabular}\n\\begin{tabular}{cccc}\n$\\kappa$ & $M$ & $\\kappa$ & $M$ \\\\ \\hline\n0.5 & 1.11914 & 1.8 & 0.05449 \\\\\n0.6 & 0.82503 & 2.0 & 0.03660 \\\\\n0.8 & 0.48127 & 2.2 & 0.02470 \\\\\n1.0 & 0.29709 & 2.4 & 0.01672 \\\\\n1.2 & 0.18960 & 2.6 & 0.01135 \\\\\n1.4 & 0.12357 & 2.8 & 0.00772 \\\\\n1.6 & 0.08167 & 3.0 & 0.00525 \\\\ \n\\end{tabular}\n\\end{ruledtabular}\n\\end{table}\n\nThe excess internal energy is thus a sum of the static and thermal contributions,\n\\begin{equation}\nu_{\\rm ex} = u_{\\rm st} + u_{\\rm th},\n\\end{equation}\nwhere $u_{\\rm st} = M\\Gamma$ and $M$ is the Madelung constant.\nThe values of the Madelung constant for 2D Yukawa systems in the regime of relatively weak screening, $0.5 \\leq \\kappa\\leq 3.0$, are tabulated in Table~\\ref{TabM}. The dependence $M(\\kappa)$ can be fitted using a functional form similar to that proposed by Totsuji~\\emph{et al.}\\cite{PhysRevE.70.016405}\n\\begin{equation}\n\\label{Eq6}\nM = -1.1061+0.5038\\kappa-0.11053\\kappa^2+0.00968\\kappa^3+1/\\kappa.\n\\end{equation}\nThe last term in (\\ref{Eq6}) accounts for the absence of a neutralizing background in our case (such a background is present in Ref.~\\onlinecite{PhysRevE.70.016405}); the energy of this background is simply $-\\Gamma/\\kappa$. The fit is chosen in such a way that when $\\kappa\\rightarrow 0$ and the neutralizing background is introduced, the Madelung constant reduces to the well-known value of the triangular lattice sum of the 2D OCP with Coulomb interactions, $M_{\\rm OCP}\\simeq -1.1061$. 
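As a quick cross-check, the fit of Eq.~(\ref{Eq6}) can be evaluated against the tabulated values of Table~\ref{TabM}; a minimal sketch:

```python
# Fit for the Madelung constant of the 2D Yukawa triangular lattice, Eq. (6);
# the 1/kappa term accounts for the absent neutralizing background.
def madelung(kappa):
    return (-1.1061 + 0.5038 * kappa - 0.11053 * kappa**2
            + 0.00968 * kappa**3 + 1.0 / kappa)
```

The fit reproduces the tabulated values to a tiny fraction of a percent at $\kappa \lesssim 1$ (e.g. $M(1.0) \approx 0.2970$ versus the tabulated $0.29709$) and to about a percent at $\kappa = 3$.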
This fit is accurate to within a tiny fraction of a percent for $\\kappa\\lesssim 1.0$ and to within $\\sim 1\\%$ when screening becomes stronger ($\\kappa\\sim 3$).\n\nThe thermal part of the excess energy is expected to exhibit a quasi-universal scaling with respect to the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. This is a general property of classical particle systems with sufficiently soft interactions, which was first pointed out by Rosenfeld and Tarazona (RT scaling) for 3D systems.~\\cite{RT1,RT2} In the context of 3D Yukawa systems, the RT scaling has proven very useful in Refs.~\\onlinecite{1.4921223,KhrapakPPCF2016,KhrapakPRE2015,KhrapakPRE03_2015} The emergence of an RT-scaling analogue for 2D systems has been discussed in the context of the OCP with Coulomb and logarithmic interactions, Yukawa systems near the OCP limit, and inverse-power-law interactions.~\\cite{KhrapakCPP2016,KhrapakPoP08_2015} The dependence of $u_{\\rm th}$ on $\\Gamma/\\Gamma_{\\rm m}$ in the strongly coupled regime is displayed in Fig.~\\ref{FigR1}. The quasi-universality is well pronounced, although there is clearly a systematic tendency for $u_{\\rm th}$ to decrease with $\\kappa$ at the same value of $\\Gamma/\\Gamma_{\\rm m}$. This tendency is expected when the potential steepness increases (see e.g. Fig.~4 of Ref.~\\onlinecite{KhrapakPoP08_2015}). Overall, the data points corresponding to the dependence $u_{\\rm th}(\\Gamma/\\Gamma_{\\rm m})$ are confined to a relatively narrow range. The important point is that towards the side of soft interactions (sufficiently small $\\kappa$ in our case), the static component of the internal energy dominates over the thermal one. For example, at $\\kappa=1$ the thermal component contributes only about $2\\%$ of the total excess energy near the fluid-solid phase transition. 
Therefore, even moderately accurate fits for $u_{\\rm th}$ allow high accuracy to be achieved for the total excess energy $u_{\\rm ex}$.\n\nThree fits are shown in Fig.~\\ref{FigR1}. The upper (lower) curve corresponds to the data portion for $\\kappa=0.5$ ($\\kappa = 3.0$).\nThe intermediate curve has been obtained using the entire set of data points (corresponding to the parameter regime shown). It can be considered representative of strongly coupled 2D Yukawa fluids in the vicinity of the freezing transition.\nThe functional form of the fit is the same as used previously~\\cite{KhrapakPoP08_2015}\n\\begin{equation} \\label{Fit1}\nu_{\\rm th} =A \\ln (1+B\\Gamma/\\Gamma_{\\rm m}).\n\\end{equation}\nThe use of the coefficients $A=0.257$ and $B=195.4$ determined here would somewhat improve previous approximations.\n\nThe excess free energy can be routinely calculated using the model for the excess energy formulated above and the second of Eqs.~(\\ref{pf}). The resulting expression is rather simple,\n\\begin{equation}\\label{fex}\nf_{\\rm ex}=M(\\kappa)\\Gamma - A{\\rm Li}_2(-B\\Gamma/\\Gamma_{\\rm m}),\n\\end{equation}\nwhere ${\\rm Li}_2(z)=\\int_z^0 dt \\ln(1-t)/t$ is the dilogarithm. Note that in deriving Eq.~(\\ref{fex}), the thermodynamic integration over the coupling parameter from 0 to $\\Gamma$ has been performed, while Eq.~(\\ref{Fit1}) is, strictly speaking, not applicable at $\\Gamma\\ll 1$.\nThe correct procedure would be to start the thermodynamic integration from some small but finite value $\\Gamma_0$, and then add the constant $f_{\\rm ex}(\\Gamma_0)$ evaluated using Eq.~(\\ref{SVC}). However, since the actual contribution from the weakly coupled regime is small, Eq.~(\\ref{fex}) remains rather accurate at strong coupling and we use it here.\n\nThe calculation of pressure from the excess free energy is straightforward, but rather cumbersome in the considered case. 
This is because the differentiation with respect to $\\kappa$ is involved, and both fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ enter. For this reason, the explicit expression for $p$ is not displayed. We verified that near freezing (at $\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) the derived expression yields pressures that deviate from the exact MD results by $\\sim 0.001\\%$ at $\\kappa=0.5$, $\\sim 0.1\\%$ at $\\kappa=1.0$, and $\\sim 1\\%$ at $\\kappa = 2.0-2.8$. The accuracy drops at the highest value $\\kappa=3.0$. This is not surprising, since the fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are only applicable for $\\kappa\\lesssim 3.0$ and, therefore, derivatives of these fits at $\\kappa=3.0$ can produce significant errors.\n\nWe also found that, if better accuracy is required, the data for the excess thermal energy can be fitted by the following slightly modified expression\n\\begin{equation}\n\\label{Eq7}\nu_{\\rm th} = A(\\kappa)\\ln\\left[ 1 + B(\\kappa) \\Gamma^{s(\\kappa)} \\right],\n\\end{equation}\nwhere $A$ and $B$ are now assumed $\\kappa$-dependent and a $\\kappa$-dependent exponent $s$ is introduced. Based on all the data points obtained in MD simulations, the following relations are identified:\n$A(\\kappa) = 0.35708 + 0.09397\\kappa$,\n$B(\\kappa)= 1.65491\\exp(- 0.76911\\kappa)$,\n$s(\\kappa) = 0.68838 - 0.05183\\kappa$.\nSome representative examples are shown in Fig.~\\ref{FigR2}.\nThe fit of Eq.~(\\ref{Eq7}) is clearly more accurate and can be used in\nthe regime of weaker coupling, compared to the simple form (\\ref{Fit1}). 
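With the $\kappa$-dependent coefficients above, Eq.~(\ref{Eq7}) for the thermal part (as in Fig.~\ref{FigR2}) combined with the Madelung fit of Eq.~(\ref{Eq6}) lets the total excess energy $u_{\rm ex}=M(\kappa)\Gamma+u_{\rm th}$ be reconstructed and checked directly against Table~\ref{Table1}; a minimal sketch:

```python
# Reconstruct u_ex = M(kappa)*Gamma + u_th using the kappa-dependent
# thermal fit of Eq. (7) and the Madelung fit of Eq. (6).
import math

def u_ex_fit(gamma, kappa):
    M = (-1.1061 + 0.5038 * kappa - 0.11053 * kappa**2
         + 0.00968 * kappa**3 + 1.0 / kappa)
    A = 0.35708 + 0.09397 * kappa
    B = 1.65491 * math.exp(-0.76911 * kappa)
    s = 0.68838 - 0.05183 * kappa
    return M * gamma + A * math.log(1.0 + B * gamma**s)
```

For example, at $\kappa=1.0$, $\Gamma=63.1038$ this reproduces the MD value $u_{\rm ex}=19.8556$ of Table~\ref{Table1} well within the relative accuracy quoted in the Accuracy subsection.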
However, it is also less practical for evaluating thermodynamic parameters other than the excess internal energy.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R2.eps}\\\\\n \\caption{\n Thermal component of the reduced excess energy, $u_{\\rm th}$, of 2D Yukawa fluids near the fluid-solid phase transition versus the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. Symbols correspond to MD simulations for different values of the screening parameter $\\kappa$. The curves are the analytical fits to these data using Eq.~(\\ref{Fit1}): The upper (lower) curve corresponds to fitting the MD results for $\\kappa=0.5$ ($\\kappa = 3.0$) and the intermediate (red) curve is obtained by fitting the entire set of data points.}\n\\label{FigR1}\n\\end{figure}\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R3.eps}\\\\\n \\caption{Dependence of the excess thermal energy $u_{\\rm th}$ on the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. All the data points from numerical simulations are plotted. Solid curves correspond to three representative fits using Eq.~\\eqref{Eq7}.}\n\\label{FigR2}\n\\end{figure}\n\n\n\n\\subsection{Relation between excess pressure and energy}\n\nIt is sometimes advantageous to operate with an equation of state written in the form of a relation between the pressure and internal energy of the system. For soft, purely repulsive potentials the simplest formulation of this kind can be written as\n\\begin{equation}\\label{gamma_ex}\np_{\\rm ex}=\\gamma_{\\rm ex}u_{\\rm ex}.\n\\end{equation}\nHere the parameter $\\gamma_{\\rm ex}$ generally depends on both temperature and density, that is, on both $\\Gamma$ and $\\kappa$ for Yukawa systems. 
Note that the parameter $\\gamma_{\\rm ex}$ introduced in this way is not directly related to the conventional definitions of either the density scaling exponent or the Gr\\\"uneisen parameter.~\\cite{HummelPRB2015} Nevertheless, it may be helpful in characterizing the softness of the repulsive potential. We recall that for inverse-power-law (IPL) repulsive potentials of the form $\\varphi(r)\\propto r^{-\\alpha}$ the relation between the excess pressure and energy is particularly simple, $p_{\\rm ex}=\\tfrac{\\alpha}{2} u_{\\rm ex}$ in 2D. Thus, an ``effective IPL exponent'' may be associated with the quantity $2\\gamma_{\\rm ex}$.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{gamma.eps}\\\\\n \\caption{Ratio of the excess pressure to the excess energy, $\\gamma_{\\rm ex}=p_{\\rm ex}/u_{\\rm ex}$, on the plane ($\\kappa$, $\\Gamma/\\Gamma_{\\rm m}$).\n }\n\\label{gamma}\n\\end{figure}\n\nHaving approximations for both $p_{\\rm ex}$ and $u_{\\rm ex}$ for 2D Yukawa fluids, we can easily estimate the value of $\\gamma_{\\rm ex}$. The corresponding plot of $\\gamma_{\\rm ex}$ as a function of the Yukawa state variables $\\kappa$ and $\\Gamma/\\Gamma_{\\rm m}$ is shown in Fig.~\\ref{gamma}. To produce this plot, Eq.~(\\ref{Fit1}) for the thermal component of the excess energy has been used. Figure~\\ref{gamma} shows that in the strongly coupled regime $\\gamma_{\\rm ex}$ depends very weakly on the coupling strength (temperature), but exhibits considerable dependence on $\\kappa$ (density). 
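The weak dependence on coupling and the pronounced dependence on $\kappa$ can also be read off directly from the MD data of Table~\ref{Table1}; a small sketch using the near-freezing state points ($\Gamma\simeq 0.95\Gamma_{\rm m}$, with $\Gamma_{\rm m}$ from Eq.~(\ref{Melting2D})):

```python
# gamma_ex = p_ex/u_ex from near-freezing MD entries of Table 1
# (p_ex = p - 1 in reduced units).
def gamma_m(kappa):
    """Melting-line fit, Eq. (Melting2D)."""
    return 131.0 / (1.0 - 0.388 * kappa**2 + 0.138 * kappa**3 - 0.0138 * kappa**4)

# (Gamma, u_ex, p) near freezing for kappa = 0.5 and 2.0, from Table 1
near_freezing = {0.5: (135.420, 152.944, 199.434),
                 2.0: (375.818, 15.0964, 37.1333)}

gamma_ex = {k: (p - 1.0) / u for k, (g, u, p) in near_freezing.items()}
```

These state points indeed sit at $\Gamma \simeq 0.95\Gamma_{\rm m}(\kappa)$, and $\gamma_{\rm ex}$ grows from about $1.30$ at $\kappa=0.5$ to about $2.39$ at $\kappa=2.0$.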
Using the exact MD results for $p_{\\rm ex}/u_{\\rm ex}$ in the vicinity of the fluid-solid phase transition ($\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$), we have obtained a representative dependence $\\gamma_{\\rm ex}(\\kappa)$ in the strongly coupled regime:\n\\begin{equation}\n\\gamma_{\\rm ex}(\\kappa)=1+0.526\\kappa+0.13\\kappa^2-0.02\\kappa^3.\n\\end{equation}\nImportantly, $\\gamma_{\\rm ex}\\rightarrow 1$ as $\\kappa\\rightarrow 0$.\nThis seems counter-intuitive at first, because one would naturally expect $\\gamma_{\\rm ex}=\\tfrac{1}{2}$ in the OCP Coulomb interaction limit in 2D. The difference is attributed to the presence of the neutralizing background in the OCP model. In the limit of very soft interaction, the energy and pressure are dominated by their static contributions. As $\\kappa\\rightarrow 0$, the dominant contribution is the Madelung energy, so that $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M\\Gamma\\sim \\Gamma/\\kappa$ (without background). This implies $p_{\\rm ex}=\\tfrac{\\Gamma}{2}(\\partial f_{\\rm ex}/\\partial \\Gamma)-\\tfrac{\\kappa}{2}(\\partial f_{\\rm ex}/\\partial \\kappa)\\sim \\Gamma/\\kappa\\sim u_{\\rm ex}$. In the presence of the neutralizing background the term $\\Gamma/\\kappa$ disappears and we have $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M_{\\rm OCP}\\Gamma$. This yields $p_{\\rm ex}\\sim \\tfrac{1}{2}M_{\\rm OCP}\\Gamma\\sim \\tfrac{1}{2}u_{\\rm ex}$. This consideration demonstrates that Yukawa systems in the limit $\\kappa\\rightarrow 0$ are not fully equivalent to Coulomb systems with a neutralizing background.\n\n\n\\subsection{Crystals}\n\nIn a series of MD simulations for 2D Yukawa crystals, in addition to evaluating the excess energy and pressure (which are summarized in Tables~\\ref{Table3} and \\ref{Table4} of the Appendix), the mean-squared displacements were calculated to find the anharmonic correction coefficient $\\beta$. 
The resulting dependence $\\beta(\\kappa)$ is shown in Fig.~\\ref{FigR3} (the corresponding values are also tabulated in Table~\\ref{Table2} of the Appendix for completeness).\nThe inset in Fig.~\\ref{FigR3} presents the radial (isotropic) pair correlation function, $g(r) \\propto \\int{d\\varphi\\; g(\\mathbf{r})}$,\nand demonstrates excellent representation of the short- and long-distance correlations. The obtained anharmonic correction coefficient $\\beta(\\kappa)$ allows the pair correlation function to be calculated analytically, and then the excess energy, pressure, and other thermodynamic parameters to be obtained by thermodynamic integration with the help of the expressions given in Sec.~\\ref{Thermo}.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R5.eps}\\\\\n \\caption{ Dependence of the anharmonic correction coefficient $\\beta$ on the screening parameter $\\kappa$. The inset demonstrates a typical comparison between the radial distribution functions obtained in a direct MD simulation and computed using the shortest-graph method. For details see the text.}\n\\label{FigR3}\n\\end{figure}\n\nIt is worth pointing out the following observation:\nIn the limit $\\kappa \\rightarrow 0$, the Yukawa interaction tends to the unscreened Coulomb interaction $\\varphi \\propto r^{-1}$. According to our previous MD simulations,~\\cite{1.4926945}\nthe finite-temperature phonon spectra differ only weakly from the zero-temperature ones for IPL potentials, $\\varphi \\propto r^{-\\alpha}$. Therefore, in the OCP limit ($\\kappa=0$ and $\\alpha=1$) we should obtain the smallest values of $\\beta(\\kappa)$. This is indeed observed in Fig.~\\ref{FigR3}.\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R6.eps}\\\\\n \\caption{ Dependence of the reduced pressure on the reduced excess energy. Open (solid) symbols are the results of MD simulations for fluids and solids, respectively. 
The solid and dashed curves correspond to the shortest-graph method for solids and to the fit of Eq.~(\ref{Eq7}) for fluids.}
\label{FigR4}
\end{figure}

In Fig.~\ref{FigR4} we plot the reduced pressure versus the reduced excess energy of 2D Yukawa fluids and solids. Symbols are the MD results; the solid and dashed curves correspond to the shortest-graph method [with the anharmonic correction coefficient $\beta(\kappa)$ determined above] for the crystalline phase and to the proposed fit of Eq.~\eqref{Eq7} for the fluid phase, respectively. Excellent agreement is observed.

\subsection{Accuracy}

The relative difference between the excess energies calculated using the shortest-graph method and those evaluated using direct MD simulations in the solid phase amounts to $\simeq5\times 10^{-5}$, which is comparable to the values reported earlier.~\cite{0953-8984-28-23-235401} The accurate fit of Eq.~\eqref{Eq7}
yields a relative error in the excess energy smaller than $5\times10^{-4}$ and $2\times10^{-3}$ for 72\% and 95\% of
the examined fluid data points, respectively. The maximal relative deviation, $5\times 10^{-3}$, is observed near the melting line at large values of the screening parameter $\kappa$. The simpler fit of Eq.~(\ref{Fit1}) is applicable when relative deviations within $\lesssim 1\%$ are acceptable.

\begin{figure}[!t]
 \centering
 \includegraphics[width=85mm]{Pressure_kappa05.eps}\\
 \caption{Reduced pressure, $p$, as a function of the coupling parameter $\Gamma$ for a 2D Yukawa fluid with the screening parameter $\kappa=0.5$.
The symbols are exact MD results, the solid (red) line corresponds to the fit of Eq.~(\ref{Fit1}), and the dashed (blue) line is the fit from Ref.~\onlinecite{0022-3727-49-23-235203}.}
\label{FigPressure}
\end{figure}

In addition, we can compare our results with those recently reported in Refs.~\onlinecite{0022-3727-49-23-235203,1.4962685}, where fits for the pressure of 2D Yukawa fluids in the $(\kappa,\Gamma)$ parameter space have been proposed. The case $\kappa=0.5$ received special attention, and a simple two-term fit, $p=1.53\Gamma+1.33$, was proposed based on the results of an MD simulation.~\cite{0022-3727-49-23-235203}
We plot our MD results along with the fit of Eq.~(\ref{Fit1}) and the fit from Ref.~\onlinecite{0022-3727-49-23-235203} in Fig.~\ref{FigPressure}. One can see that the fit from Ref.~\onlinecite{0022-3727-49-23-235203} systematically overestimates the pressure at high values of $\Gamma$. At the strongest coupling in the fluid phase studied in this work, $\Gamma=135.42$, the present MD simulation yields $p=199.434$, the fit of Eq.~(\ref{Fit1}) yields $p=199.432$, while the fit from Ref.~\onlinecite{0022-3727-49-23-235203} yields $p=208.523$. On the other hand, the previous model for 2D Yukawa systems in the OCP (weak screening) limit discussed in Refs.~\onlinecite{KhrapakPoP08_2015,1.4935846}
yields $p=199.445$, providing confidence in the accuracy of the present results. The reasons for the deviations in Ref.~\onlinecite{0022-3727-49-23-235203} remain to be identified.

\subsection{Strongly-coupled fluids}

The excess energy and pressure of the 2D Yukawa fluids have been determined using MD simulations in a wide range of coupling and screening parameters. The results are summarized in Table~\ref{Table1} of the Appendix.
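As an aside, the comparison quoted above is easy to reproduce numerically. The following minimal sketch (in Python; the fit coefficients and the MD value are exactly those quoted in the text) evaluates the two-term fit of Ref.~\onlinecite{0022-3727-49-23-235203} at the strongest fluid-phase coupling:

```python
# Two-term pressure fit for kappa = 0.5 proposed in the cited reference.
def p_two_term(gamma):
    return 1.53 * gamma + 1.33

gamma_max = 135.42   # strongest fluid-phase coupling studied in this work
p_md = 199.434       # present MD result quoted in the text

p_fit = p_two_term(gamma_max)
print(f"two-term fit: p = {p_fit:.3f}")                      # ~208.52
print(f"overestimate: {(p_fit - p_md) / p_md * 100:.1f}%")   # ~4.6%
```

The $\sim 5\%$ systematic overestimate at strong coupling is consistent with the deviation visible in Fig.~\ref{FigPressure}.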
Here we describe simple analytical
approximations, which can be used to evaluate the energy and pressure for practical purposes.

In the strongly coupled fluid regime it is helpful to divide the thermodynamic quantities, such as energy and pressure, into static and thermal contributions. The static contribution corresponds to the value of the internal energy when the particles are frozen in some regular configuration, and the thermal corrections arise from the deviations of the particles from these fixed positions due to thermal motion. Of course, such a division is only meaningful when the regular structure is specified. For crystals, the obvious choice is the corresponding lattice sum (Madelung energy). For fluids this choice is also meaningful, and we use it here. (Note that in 3D Yukawa systems a slightly different definition of the static fluid energy is traditionally employed.~\cite{KhrapakPPCF2016, KhrapakISM})

\begin{table}[!b]
\caption{\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \leq \kappa\leq 3.0$.}
\begin{ruledtabular}
\begin{tabular}{cccc}
$\kappa$ & $M$ & $\kappa$ & $M$ \\ \hline
0.5 & 1.11914 & 1.8 & 0.05449 \\
0.6 & 0.82503 & 2.0 & 0.03660 \\
0.8 & 0.48127 & 2.2 & 0.02470 \\
1.0 & 0.29709 & 2.4 & 0.01672 \\
1.2 & 0.18960 & 2.6 & 0.01135 \\
1.4 & 0.12357 & 2.8 & 0.00772 \\
1.6 & 0.08167 & 3.0 & 0.00525 \\ 
\end{tabular}
\end{ruledtabular}
\end{table}

The excess internal energy is thus a sum of the static and thermal contributions,
\begin{equation}
u_{\rm ex} = u_{\rm st} + u_{\rm th},
\end{equation}
where $u_{\rm st} = M\Gamma$ and $M$ is the Madelung constant.
The values of the Madelung constant for 2D Yukawa systems in the regime of relatively weak screening, $0.5 \leq \kappa\leq 3.0$, are tabulated in Table~\ref{TabM}.
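To illustrate the decomposition numerically, a tabulated Madelung constant can be combined with an MD data point from Table~\ref{Table1}; a minimal sketch in Python (all numbers are taken from the tables of this paper):

```python
# Static/thermal decomposition u_ex = u_st + u_th, with u_st = M * Gamma.
M = 0.29709          # Madelung constant at kappa = 1.0 (Table TabM)
gamma = 169.071      # strongest fluid coupling at kappa = 1.0 (Table 1)
u_ex_md = 51.5786    # corresponding MD excess energy (Table 1)

u_st = M * gamma          # static (Madelung) contribution
u_th = u_ex_md - u_st     # thermal contribution, obtained by subtraction
print(f"u_st = {u_st:.3f}, u_th = {u_th:.3f}")
print(f"thermal fraction = {u_th / u_ex_md * 100:.1f}%")  # only a few percent
```

Even at the strongest fluid coupling, the thermal part is only a small fraction of $u_{\rm ex}$ at this screening, so the accuracy of the fit for $u_{\rm th}$ is not critical for the total excess energy.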
The dependence $M(\\kappa)$ can be fitted using a functional form similar to that proposed by Totsuji~\\emph{et al.}\\cite{PhysRevE.70.016405}\n\\begin{equation}\n\\label{Eq6}\nM = -1.1061+0.5038\\kappa-0.11053\\kappa^2+0.00968\\kappa^3+1/\\kappa.\n\\end{equation}\nThe last term in (\\ref{Eq6}) accounts for the absence of neutralizing background in our case (but present in Ref.~\\onlinecite{PhysRevE.70.016405}), the energy of this background being simply $-\\Gamma/\\kappa$. The fit is chosen in such a way that when $\\kappa\\rightarrow 0$ and the neutralizing background is introduced, the Madelung constant is reduced to the well known value of the triangular lattice sum of the 2D one-component-plasma (OCP) with Coulomb interactions, $M_{\\rm OCP}\\simeq -1.1061$. This fit is accurate to within a tiny fraction of percent for $\\kappa\\lesssim 1.0$ and to within $\\sim 1\\%$ when screening becomes stronger ($\\kappa\\sim 3$).\n\nThe thermal part of the excess energy is expected to exhibit a quasi-universal scaling with respect to the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. This is a general property of classical particle systems with sufficiently soft interactions, which was first pointed out by Rosenfeld and Tarazona (RT scaling) for 3D systems.~\\cite{RT1,RT2} In the context of 3D Yukawa systems, the RT scaling has been proven to be very useful in Refs.~\\onlinecite{1.4921223,KhrapakPPCF2016,KhrapakPRE2015,KhrapakPRE03_2015} The emergence of RT scaling analogue for 2D systems has been discussed in the context of OCP with Coulomb and logarithmic interactions, Yukawa systems near the OCP limit, and inverse-power-law interactions.~\\cite{KhrapakCPP2016,KhrapakPoP08_2015} The dependence of $u_{\\rm th}$ on $\\Gamma/\\Gamma_{\\rm m}$ in the strongly coupled regime is displayed in Fig.~\\ref{FigR1}. 
The quasi-universality is well pronounced, although there is clearly a systematic tendency for $u_{\rm th}$ to decrease with $\kappa$ at the same value of $\Gamma/\Gamma_{\rm m}$. This tendency is expected when the potential steepness increases (see, e.g., Fig.~4 of Ref.~\onlinecite{KhrapakPoP08_2015}). Overall, the data points corresponding to the dependence $u_{\rm th}(\Gamma/\Gamma_{\rm m})$ are confined to a relatively narrow range. The important point is that toward the side of soft interactions (sufficiently small $\kappa$ in our case), the static component of the internal energy dominates over the thermal one. For example, at $\kappa=1$ the thermal component contributes only about $2\%$ of the total excess energy near the fluid-solid phase transition. Therefore, even moderately accurate fits for $u_{\rm th}$ allow one to obtain high accuracy with respect to the total excess energy $u_{\rm ex}$.

Three fits are shown in Fig.~\ref{FigR1}. The upper (lower) curve corresponds to the data portion for $\kappa=0.5$ ($\kappa = 3.0$).
The intermediate curve has been obtained using the entire set of data points (corresponding to the parameter regime shown). It can be considered as representative for strongly coupled 2D Yukawa fluids in the vicinity of the freezing transition.
The functional form of the fit is the same as used previously~\cite{KhrapakPoP08_2015}
\begin{equation} \label{Fit1}
u_{\rm th} =A \ln (1+B\Gamma/\Gamma_{\rm m}).
\end{equation}
The use of the coefficients $A=0.257$ and $B=195.4$ determined here somewhat improves upon previous approximations.

The excess free energy can be routinely calculated using the model for the excess energy formulated above and the second of Eqs.~(\ref{pf}).
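Before turning to the free energy, here is a minimal numerical sketch of the fit of Eq.~(\ref{Fit1}) with the coefficients just quoted (in Python; the order-of-magnitude comparison with the static term is an illustration, not a result of this paper):

```python
import math

A, B = 0.257, 195.4   # coefficients of Eq. (Fit1) determined in this work

def u_th_fit(x):
    """Thermal excess energy versus reduced coupling x = Gamma / Gamma_m."""
    return A * math.log(1.0 + B * x)

# Near freezing (x ~ 0.95) the thermal part is of order unity, while the
# static part M*Gamma is of order 10-100 at weak screening (cf. Table 1).
print(f"u_th at x = 0.95: {u_th_fit(0.95):.3f}")   # ~1.34
```

Note that $u_{\rm th}$ vanishes at $\Gamma\rightarrow 0$ by construction and grows only logarithmically toward freezing.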
The resulting expression is rather simple,
\begin{equation}\label{fex}
f_{\rm ex}=M(\kappa)\Gamma - A{\rm Li}_2(-B\Gamma/\Gamma_{\rm m}),
\end{equation}
where ${\rm Li}_2(z)=\int_z^0 dt \ln(1-t)/t$ is the dilogarithm. Note that in deriving Eq.~(\ref{fex}), the thermodynamic integration over the coupling parameter from 0 to $\Gamma$ has been performed, while Eq.~(\ref{Fit1}) is, strictly speaking, not applicable at $\Gamma\ll 1$.
The correct procedure would be to start the thermodynamic integration from some small but finite value $\Gamma_0$, and then add the constant $f_{\rm ex}(\Gamma_0)$ evaluated using Eq.~(\ref{SVC}). However, since the actual contribution from the weak-coupling regime is small, Eq.~(\ref{fex}) remains rather accurate at strong coupling and we use it here.

The calculation of pressure from the excess free energy is straightforward, but rather cumbersome in the considered case. This is because differentiation with respect to $\kappa$ is involved, and the two fits for $M(\kappa)$ and $\Gamma_{\rm m}(\kappa)$ enter the expression. For this reason, the explicit expression for $p$ is not displayed. We verified that near freezing (at $\Gamma/\Gamma_{\rm m}\simeq 0.95$) the derived expression yields pressures that deviate from the exact MD results by $\sim 0.001\%$ at $\kappa=0.5$, $\sim 0.1\%$ at $\kappa=1.0$, and $\sim 1\%$ at $\kappa = 2.0$--$2.8$. The accuracy drops at the highest value $\kappa=3.0$.
This is not surprising, since the fits for $M(\kappa)$ and $\Gamma_{\rm m}(\kappa)$ are only applicable for $\kappa\lesssim 3.0$ and, therefore, derivatives of these fits at $\kappa=3.0$ can produce significant errors.

We also found that, if better accuracy is required, the data for the excess thermal energy can be fitted by the following slightly modified expression
\begin{equation}
\label{Eq7}
u_{\rm th} = A(\kappa)\ln\left[ 1 + B(\kappa) \Gamma^{s(\kappa)} \right],
\end{equation}
where $A$ and $B$ are now assumed $\kappa$-dependent and a $\kappa$-dependent exponent $s$ is introduced. Based on all the data points obtained in the MD simulations, the following relations are identified:
$A(\kappa) = 0.35708 + 0.09397\kappa$,
$B(\kappa)= 1.65491\exp(- 0.76911\kappa)$,
$s(\kappa) = 0.68838 - 0.05183\kappa$.
Some representative examples are shown in Fig.~\ref{FigR2}.
The fit of Eq.~(\ref{Eq7}) is clearly more accurate and can be used in
the regime of weaker coupling, compared with the simpler form of Eq.~(\ref{Fit1}). However, it is also less practical for evaluating thermodynamic parameters other than the excess internal energy.

\begin{figure}[!t]
 \centering
 \includegraphics[width=85mm]{2DYukawa-R2.eps}\\
 \caption{
 Thermal component of the reduced excess energy, $u_{\rm th}$, of 2D Yukawa fluids near the fluid-solid phase transition versus the reduced coupling parameter $\Gamma/\Gamma_{\rm m}$. Symbols correspond to MD simulations for different values of the screening parameter $\kappa$.
The curves are the analytical fits to these data using Eq.~(\ref{Fit1}): the upper (lower) curve corresponds to fitting the MD results for $\kappa=0.5$ ($\kappa = 3.0$), and the intermediate (red) curve is obtained by fitting the entire set of data points.}
\label{FigR1}
\end{figure}


\begin{figure}[!t]
 \centering
 \includegraphics[width=85mm]{2DYukawa-R3.eps}\\
 \caption{Dependence of the excess thermal energy $u_{\rm th}$ on the reduced coupling parameter $\Gamma/\Gamma_{\rm m}$. All the data points from numerical simulations are plotted. Solid curves correspond to three representative fits using Eq.~\eqref{Eq7}.}
\label{FigR2}
\end{figure}


\section{MD results}
\label{Appendix}

In the Appendix, we summarize the main results from the MD simulations performed in this study. Table~\ref{Table1} reports the reduced excess energies and pressures at different state points in the fluid phase. Table~\ref{Table2} summarizes the values of the anharmonic correction coefficient $\beta$ evaluated using MD simulations of the crystalline phase.
Finally, Tables \\ref{Table3} and \\ref{Table4} report the excess energies and pressures in the crystalline phase.\n\n\\begin{table}[h]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\rm ex}$ and pressure $p$ of two-dimensional Yukawa fluids evaluated using MD simulations for various coupling ($\\Gamma$) and screening ($\\kappa$) parameters.}\n\t\\label{Table1}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c}\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.5$}\\\\ \\hline\n\t\t$\\Gamma$ & 135.420 & 86.7254 & 52.7787 & 32.1811 & 19.6073 & 11.9310 & 7.27175 & 4.43126 & 2.69848 & 1.64302 & 1.00136 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 152.944 & 98.3115 & 60.1901 & 37.0087 & 22.8180 & 14.1176 & 8.79838 & 5.51964 & 3.48587 & 2.21772 & 1.42021 & 0.76495\\\\\n\t\t$p$ & 199.434 & 128.303 & 78.6946 & 48.5651 & 30.1485 & 18.8835 & 12.0216 & 7.81631 & 5.22964 & 3.63556 & 2.64961 & 1.85883\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.6$}\\\\\\hline\n\t\t$\\Gamma$ & 140.131 & 89.5076 & 54.3171 & 32.9737 & 20.0017 & 12.1359 & 7.36665 & 4.47442 & 2.71053 & 1.64677 & 1.00106 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 116.984 & 75.1128 & 45.9415 & 28.2016 & 17.3768 & 10.7727 & 6.73045 & 4.24422 & 2.69421 & 1.72956 & 1.11776 & 0.61083\\\\\n\t\t$p$ \t\t\t\t\t& 160.369 & 103.050 & 63.1652 & 38.9451 & 24.1971 & 15.2284 & 9.76528 & 6.42899 & 4.37128 & 3.11015 & 2.32663 & 1.69701\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.8$}\\\\\\hline\n\t\t$\\Gamma$ & 152.277 & 96.5736 & 58.0604 & 34.9737 & 21.0334 & 12.6675 & 7.61503 & 4.58845 & 2.75830 & 1.66410 & 0.99914 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 74.6424 & 47.7340 & 29.0608 & 17.8181 & 10.9844 & 6.84185 & 4.30139 & 2.74217 & 1.76665 & 1.15293 & 0.75437 & 0.42469\\\\\n\t\t$p$ \t\t\t\t\t& 112.709 & 72.1411 & 44.0441 & 27.1658 & 16.9406 & 10.7731 & 7.01845 & 4.73986 & 3.33679 & 2.47393 & 1.92983 & 1.49910\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.0$}\\\\\\hline\n\t\t$\\Gamma$ & 
169.071 & 105.975 & 63.1038 & 37.6027 & 22.4047 & 13.3361 & 7.94729 & 4.73129 & 2.81940 & 1.68034 & 0.99956 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 51.5786 & 32.7335 & 19.8556 & 12.1451 & 7.50279 & 4.68984 & 2.97702 & 1.91799 & 1.25426 & 0.82932 & 0.55059 & 0.31770\\\\\n\t\t$p$ \t& 85.4036 & 54.2492 & 33.0215 & 20.3527 & 12.7618 & 8.19406 & 5.44279 & 3.76791 & 2.74103 & 2.10336 & 1.70075 & 1.38135\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.2$}\\\\\\hline\n\t\t$\\Gamma$ & 191.126 & 118.398 & 69.6429 & 40.9597 & 24.1083 & 14.1893 & 8.34919 & 4.90490 & 2.88868 & 1.70019 & 0.99984 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 37.5852 & 23.6918 & 14.3026 & 8.72609 & 5.39936 & 3.39637 & 2.17547 & 1.41736 & 0.93933 & 0.62908 & 0.42281 & 0.24960\\\\\n\t\t$p$ \t& 67.9344 & 42.8619 & 25.9838 & 16.0024 & 10.0874 & 6.56025 & 4.44041 & 3.15023 & 2.36021 & 1.86635 & 1.55301 & 1.30594\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.4$}\\\\\\hline\n\t\t$\\Gamma$ & 220.172 & 134.441 & 77.9949 & 45.2452 & 26.2578 & 15.2219 & 8.83634 & 5.12702 & 2.97137 & 1.72440 & 1.00140 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 28.5555 & 17.8503 & 10.7244 & 6.53392 & 4.05300 & 2.56405 & 1.65932 & 1.09552 & 0.73364 & 0.49726 & 0.33718 & 0.20253\\\\\n\t\t$p$ \t& 56.0915 & 35.0963 & 21.1892 & 13.0574 & 8.28303 & 5.45288 & 3.76392 & 2.73780 & 2.10241 & 1.70540 & 1.45171 & 1.25396\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.6$}\\\\\\hline\n\t\t$\\Gamma$ & 258.433 & 155.296 & 88.6297 & 50.6106 & 28.9099 & 16.4928 & 9.41249 & 5.37870 & 3.07317 & 1.75217 & 0.99889 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 22.4535 & 13.9136 & 8.31218 & 5.05719 & 3.14728 & 2.00498 & 1.30903 & 0.87473 & 0.59391 & 0.40446 & 0.27520 & 0.16486\\\\\n\t\t$p$ & 47.7294 & 29.6021 & 17.7849 & 10.9674 & 7.00739 & 4.67522 & 3.28559 & 2.44432 & 1.92230 & 1.58965 & 1.37647 & 1.15781\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.8$}\\\\\\hline\n\t\t$\\Gamma$ & 308.935 & 182.395 & 102.261 & 57.3435 & 
32.1483 & 18.0355 & 10.1029 & 5.67241 & 3.17978 & 1.78359 & 0.99997 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 18.1745 & 11.1626 & 6.63304 & 4.02868 & 2.51560 & 1.61389 & 1.06328 & 0.71747 & 0.49051 & 0.33739 & 0.23058 & 0.14359\\\\\n\t\t$p$ \t& 41.6428 & 25.5932 & 15.3055 & 9.44338 & 6.07675 & 4.10949 & 2.93845 & 2.22906 & 1.78546 & 1.50402 & 1.32125 & 1.18748\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.0$}\\\\\\hline\n\t\t$\\Gamma$ & 375.818 & 217.422 & 119.600 & 65.7745 & 36.1611 & 19.8980 & 10.9232 & 6.01199 & 3.30681 & 1.81767 & 1.00051 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 15.0964 & 9.17319 & 5.42177 & 3.29200 & 2.06139 & 1.33276 & 0.88426 & 0.60261 & 0.41513 & 0.28650 & 0.19651 & 0.12379\\\\\n\t\t$p$ \t\t\t\t\t& 37.1333 & 22.5775 & 13.4413 & 8.30684 & 5.38337 & 3.68921 & 2.67835 & 2.06727 & 1.68347 & 1.43752 & 1.27850 & 1.16494\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.2$}\\\\\\hline\n\t\t$\\Gamma$ & 463.975 & 262.948 & 141.568 & 76.2338 & 41.0173 & 22.0958 & 11.9035 & 6.41082 & 3.45056 & 1.85303 & 1.00113 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 12.7875 & 7.69994 & 4.52708 & 2.74830 & 1.72461 & 1.12217 & 0.75368 & 0.51642 & 0.35777 & 0.24734 & 0.17009 & 0.10850\\\\\n\t\t$p$ \t\t\t\t\t& 33.6575 & 20.2710 & 12.0118 & 7.43585 & 4.85060 & 3.36425 & 2.48426 & 1.94445 & 1.60450 & 1.38520 & 1.24473 & 1.14408\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.4$}\\\\\\hline\n\t\t$\\Gamma$ & 578.968 & 320.871 & 168.949 & 89.0382 & 46.8778 & 24.7092 & 12.9953 & 6.85634 & 3.60307 & 1.89919 & 0.99952 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 10.9709 & 6.56430 & 3.83850 & 2.33031 & 1.47100 & 0.96365 & 0.65089 & 0.44974 & 0.31141 & 0.21697 & 0.14862 & 0.09589\\\\\n\t\t$p$ & 30.8215 & 18.4175 & 10.8648 & 6.74135 & 4.43655 & 3.11369 & 2.32748 & 1.84722 & 1.53931 & 1.34446 & 1.21673 & 1.12942\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.6$}\\\\\\hline\n\t\t$\\Gamma$ & 723.656 & 392.384 & 202.051 & 104.080 & 53.5742 & 27.6270 & 14.2191 & 
7.32182 & 3.76653 & 1.93971 & 1.00200 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 9.50055 & 5.63818 & 3.28596 & 1.99866 & 1.26783 & 0.83500 & 0.56905 & 0.39442 & 0.27600 & 0.19145 & 0.13130 & 0.08576\\\\\n\t\t$p$ \t& 28.3633 & 16.8096 & 9.89231 & 6.16190 & 4.09049 & 2.90245 & 2.19936 & 1.76426 & 1.48858 & 1.30961 & 1.19408 & 1.11954\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.8$}\\\\\\hline\n\t\t$\\Gamma$ & 893.746 & 474.549 & 239.143 & 120.685 & 60.8483 & 30.6642 & 15.4796 & 7.80951 & 3.93161 & 1.98042 & 1.00296 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 8.19448 & 4.82859 & 2.81518 & 1.71951 & 1.09985 & 0.73051 & 0.50093 & 0.35038 & 0.24489 & 0.17117 & 0.11700 & 0.07671\\\\\n\t\t$p$ \t& 25.9004 & 15.2521 & 8.98792 & 5.63831 & 3.78782 & 2.72194 & 2.08856 & 1.69631 & 1.44344 & 1.28133 & 1.17497 & 1.10201\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=3.0$}\\\\\\hline\n\t\t$\\Gamma$ & 1071.02 & 558.495 & 276.444 & 136.953 & 67.7922 & 33.5897 & 16.6383 & 8.22716 & 4.07874 & 2.02013 & 0.99949 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 6.93189 & 4.07091 & 2.38838 & 1.47193 & 0.95056 & 0.64023 & 0.44340 & 0.31146 & 0.21994 & 0.15395 & 0.10494 & 0.06958\\\\\n\t\t$p$ \t& 23.1181 & 13.5906 & 8.07317 & 5.12679 & 3.49444 & 2.55590 & 1.98879 & 1.63334 & 1.40554 & 1.25677 & 1.15868 & 1.09682\\\\\\hline\\hline\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Values of the anharmonic correction coefficient $\\beta$ for different screening parameter $\\kappa$.}\n\t\\label{Table2}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c c c c c c}\n\t\t$\\kappa$ & 0.0 & 0.2 & 0.3 & 0.4 & 0.6 & 0.8 & 1.0 & 1.2 & 1.4 & 1.6 & 1.8 & 2.0 & 2.2 & 2.4 & 2.6 & 2.8 & 3.0 \\\\\\hline\n\t\t$\\beta(\\kappa)$\t& 3.01 & 9.23 & 12.38 & 14.30 & 10.53 & 9.71 & 9.35 & 9.28 & 9.14 & 9.08 & 8.97 & 8.855 & 8.68 & 8.71 & 8.46 & 8.47 & 8.51\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy 
$u_{\\mathrm{ex}}$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table3}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 1595.62 & 798.828 & 532.689 & 399.681 & 319.981 & 266.796 & 228.880 & 200.332 & 178.283 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1217.36 & 609.282 & 406.628 & 305.117 & 244.469 & 203.938 & 174.914 & 153.188 & 136.267 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 773.025 & 387.104 & 258.328 & 194.074 & 155.484 & 129.733 & 111.343 & 97.5607 & 86.8364 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 529.643 & 265.306 & 177.235 & 133.215 & 106.726 & 89.1490 & 76.5169 & 67.1314 & 59.7831 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 382.522 & 191.740 & 128.152 & 96.3972 & 77.2970 & 64.6022 & 55.5318 & 48.7317 & 43.4438 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 287.408 & 144.232 & 96.4804 & 72.5942 & 58.2862 & 48.7586 & 41.9386 & 36.8484 & 32.8838 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 223.185 & 112.096 & 75.0671 & 56.5515 & 45.4466 & 38.0606 & 32.7681 & 28.8120 & 25.7391 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 178.133 & 89.6228 & 60.0889 & 45.3116 & 36.4631 & 30.5563 & 26.3521 & 23.1896 & 20.7451 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 145.774 & 73.3800 & 49.2712 & 37.2003 & 29.9641 & 25.1447 & 21.7011 & 19.1314 & 17.1275 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 121.609 & 61.3067 & 41.2021 & 31.1620 & 25.1352 & 21.1177 & 18.2517 & 16.1113 & 14.4385 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 102.908 & 51.9465 & 34.9672 & 26.4819 & 21.3920 & 17.9999 & 15.5706 & 13.7650 & 12.3602 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 87.4157 & 44.2324 & 29.8212 & 22.6181 & 18.2990 & 15.4212 & 13.3710 & 11.8300 & 10.6351 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 73.5771 & 37.3025 
& 25.2028 & 19.1490 & 15.5271 & 13.1108 & 11.3865 & 10.0997 & 9.10597 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 60.2002 & 30.6118 & 20.7457 & 15.8118 & 12.8497 & 10.8840 & 9.47465 & 8.43053 & 7.65187 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced pressure (compressibility) $p$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table4}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 2080.63 & 1041.70 & 694.789 & 521.370 & 417.454 & 348.100 & 298.669 & 261.442 & 232.679 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1669.06 & 835.485 & 557.680 & 418.523 & 335.380 & 279.814 & 240.022 & 210.233 & 187.022 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 1168.03 & 585.024 & 390.480 & 293.406 & 235.104 & 196.197 & 168.410 & 147.583 & 131.370 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 878.208 & 440.005 & 294.000 & 221.023 & 177.106 & 147.964 & 127.016 & 111.450 & 99.2542 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 693.046 & 347.470 & 232.288 & 174.765 & 140.162 & 117.162 & 100.726 & 88.4011 & 78.8053 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 566.555 & 284.386 & 190.275 & 143.196 & 114.994 & 96.2113 & 82.7636 & 72.7234 & 64.8975 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 476.692 & 239.477 & 160.406 & 120.865 & 97.1465 & 81.3696 & 70.0608 & 61.6053 & 55.0288 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 410.580 & 206.621 & 138.561 & 104.505 & 84.1086 & 70.4915 & 60.7970 & 53.5005 & 47.8555 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 361.191 & 181.859 & 122.134 & 92.2267 & 74.2973 & 62.3524 & 53.8144 & 47.4405 & 42.4641 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 322.729 & 162.732 & 109.386 & 82.7430 & 66.7485 & 56.0825 & 48.4703 & 42.7821 & 
38.3327 \\
		\multicolumn{1}{l|}{2.4} & 291.498 & 147.173 & 99.0847 & 75.0489 & 60.6307 & 51.0175 & 44.1300 & 39.0087 & 35.0158 \\
		\multicolumn{1}{l|}{2.6} & 263.437 & 133.325 & 89.9002 & 68.1935 & 55.1747 & 46.4976 & 40.3128 & 35.6615 & 32.0486 \\
		\multicolumn{1}{l|}{2.8} & 235.188 & 119.260 & 80.5872 & 61.2342 & 49.6540 & 41.9257 & 36.4074 & 32.2829 & 29.0897 \\
		\multicolumn{1}{l|}{3.0} & 203.533 & 103.516 & 70.1601 & 53.4777 & 43.4588 & 36.8063 & 32.0351 & 28.4887 & 25.8081 \\
		\end{tabular}
\end{table}
\end{widetext}

\bibliography{Ref-2D-Yukawa} 

\end{document}
As Γ rises, the potential energy contribution to the total energy grows, which is reflected in the increasing values of u_ex, the excess energy beyond the ideal gas contribution. This behavior might suggest that the system is transitioning towards a more ordered state, possibly approaching a fluid-solid transition as coupling strengthens, since higher potential energy often correlates with more structured particle arrangements in strongly coupled systems.", "Reasoning_answer2": "To compare pressure p across different screening parameters κ at a coupling parameter Γ around 1.0, we look at the values from the table: for κ=0.5 and Γ=1.00136, p=2.64961; for κ=0.6 and Γ=1.00106, p=2.32663; for κ=0.8 and Γ=0.99914, p=1.92983; for κ=1.0 and Γ=0.99956, p=1.70075; for κ=1.2 and Γ=0.99984, p=1.55301; and continuing up to κ=3.0 and Γ=0.99949, p=1.15868. Observing these values, there is a clear decreasing trend in pressure p as κ increases. Let's break this down: κ represents the screening parameter, which is inversely related to the range of interaction between particles; a higher κ means a shorter interaction range due to stronger screening. As κ increases, the interaction potential becomes more short-ranged, reducing the overall potential energy contribution to pressure. Physically, this trend implies that with stronger screening, the system's pressure decreases because particles interact over shorter distances, leading to less overall force exerted across the system, which could influence phase behavior and compressibility in Yukawa fluids." 
}, { "paper_id": "1704.00976.json", "table_id": "table_3", "table_content": "\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Values of the anharmonic correction coefficient $\\beta$ for different screening parameter $\\kappa$.}\n\t\\label{Table2}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c c c c c c}\n\t\t$\\kappa$ & 0.0 & 0.2 & 0.3 & 0.4 & 0.6 & 0.8 & 1.0 & 1.2 & 1.4 & 1.6 & 1.8 & 2.0 & 2.2 & 2.4 & 2.6 & 2.8 & 3.0 \\\\\\hline\n\t\t$\\beta(\\kappa)$\t& 3.01 & 9.23 & 12.38 & 14.30 & 10.53 & 9.71 & 9.35 & 9.28 & 9.14 & 9.08 & 8.97 & 8.855 & 8.68 & 8.71 & 8.46 & 8.47 & 8.51\n\t\t\\end{tabular}\n\\end{table}", "caption": "Values of the anharmonic correction coefficient $\\beta$ for different screening parameter $\\kappa$.", "label": "Table2", "section_info": "3 Results\n\\section{Results}\n\n\\subsection{Weakly-coupled fluids}\n\n\\begin{figure}[!b]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R1.eps}\\\\\n \\caption{The excess energy $u_{\\rm ex}$ of 2D Yukawa weakly coupled fluids versus the screening parameter $\\kappa$ at a fixed coupling parameter $\\Gamma = 0.5$. The symbols correspond to the results of MD simulations, the solid curve is plotted using the analytical expression of Eq.~(\\ref{SVC}).\n }\n\\label{FigSC}\n\\end{figure}\n\nA simple and physically transparent approach to the thermodynamics of weakly coupled Yukawa systems for small deviations from the ideal gas behavior is to calculate the second virial coefficient. This has recently been shown to work well in 3D Yukawa systems.~\\cite{KhrapakPPCF2016} In the 2D geometry the excess free energy is expressed in this approximation as\n\\begin{equation}\\label{SVC}\nf_{\\rm ex}\\simeq \\pi n \\int\\left[1-e^{-\\varphi(r)/k_{\\rm B}T}\\right]r dr.\n\\end{equation}\nThe excess energy and pressure can be readily obtained from the excess free energy. 
We compare the values $u_{\\rm ex}$ at a fixed coupling parameter $\\Gamma=0.5$ obtained from Eq.~(\\ref{SVC}) and computed using MD simulations in Fig.~\\ref{FigSC}. The agreement is satisfactory: in the range of $\\kappa$ investigated the deviations are within several percent. The agreement naturally improves with increasing $\\kappa$, because at a fixed $\\Gamma$ the actual interaction strength weakens as $\\kappa$ increases.\n\n\n\\subsection{Strongly-coupled fluids}\n\nThe excess energy and pressure of the 2D Yukawa fluids have been determined using MD simulations in a wide range of coupling and screening parameters. The results are summarized in the Table \\ref{Table1} of the Appendix. Here we describe simple analytical\napproximations, which can be used to evaluate the energy and pressure for practical purposes.\n\nIn the strongly coupled fluid regime it is helpful to divide the thermodynamic quantities, such as energy and pressure, into static and thermal contributions. The static contribution corresponds to the value of internal energy when the particles are frozen in some regular configuration and the thermal corrections arise due to the deviations of the particles from these fixed positions (due to thermal motion). Of course, such a division is only meaningful when the regular structure is specified. For crystals, the obvious choice is a corresponding lattice sum (Madelung energy). 
For fluids this choice is also meaningful and we use it here (Note, that in 3D Yukawa system a slightly different definition of the static fluid energy is traditionally employed.~\\cite{KhrapakPPCF2016, KhrapakISM})\n\n\\begin{table}[!b]\n\\caption{\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ }\n\\begin{ruledtabular}\n\\begin{tabular}{cccc}\n$\\kappa$ & $M$ & $\\kappa$ & $M$ \\\\ \\hline\n0.5 & 1.11914 & 1.8 & 0.05449 \\\\\n0.6 & 0.82503 & 2.0 & 0.03660 \\\\\n0.8 & 0.48127 & 2.2 & 0.02470 \\\\\n1.0 & 0.29709 & 2.4 & 0.01672 \\\\\n1.2 & 0.18960 & 2.6 & 0.01135 \\\\\n1.4 & 0.12357 & 2.8 & 0.00772 \\\\\n1.6 & 0.08167 & 3.0 & 0.00525 \\\\ \n\\end{tabular}\n\\end{ruledtabular}\n\\end{table}\n\nThe excess internal energy is thus a sum of the static and thermal contributions,\n\\begin{equation}\nu_{\\rm ex} = u_{\\rm st} + u_{\\rm th},\n\\end{equation}\nwhere $u_{\\rm st} = M\\Gamma$ and $M$ is the Madelung constant.\nThe values of the Madelung constant for 2D Yukawa systems in the regime of relatively weak screening, $0.5 \\leq \\kappa\\leq 3.0$, are tabulated in Table~\\ref{TabM}. The dependence $M(\\kappa)$ can be fitted using a functional form similar to that proposed by Totsuji~\\emph{et al.}\\cite{PhysRevE.70.016405}\n\\begin{equation}\n\\label{Eq6}\nM = -1.1061+0.5038\\kappa-0.11053\\kappa^2+0.00968\\kappa^3+1/\\kappa.\n\\end{equation}\nThe last term in (\\ref{Eq6}) accounts for the absence of neutralizing background in our case (but present in Ref.~\\onlinecite{PhysRevE.70.016405}), the energy of this background being simply $-\\Gamma/\\kappa$. The fit is chosen in such a way that when $\\kappa\\rightarrow 0$ and the neutralizing background is introduced, the Madelung constant is reduced to the well known value of the triangular lattice sum of the 2D one-component-plasma (OCP) with Coulomb interactions, $M_{\\rm OCP}\\simeq -1.1061$. 
This fit is accurate to within a tiny fraction of a percent for $\\kappa\\lesssim 1.0$ and to within $\\sim 1\\%$ when screening becomes stronger ($\\kappa\\sim 3$).\n\nThe thermal part of the excess energy is expected to exhibit a quasi-universal scaling with respect to the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. This is a general property of classical particle systems with sufficiently soft interactions, which was first pointed out by Rosenfeld and Tarazona (RT scaling) for 3D systems.~\\cite{RT1,RT2} In the context of 3D Yukawa systems, the RT scaling has been proven to be very useful in Refs.~\\onlinecite{1.4921223,KhrapakPPCF2016,KhrapakPRE2015,KhrapakPRE03_2015} The emergence of an RT-scaling analogue for 2D systems has been discussed in the context of the OCP with Coulomb and logarithmic interactions, Yukawa systems near the OCP limit, and inverse-power-law interactions.~\\cite{KhrapakCPP2016,KhrapakPoP08_2015} The dependence of $u_{\\rm th}$ on $\\Gamma/\\Gamma_{\\rm m}$ in the strongly coupled regime is displayed in Fig.~\\ref{FigR1}. The quasi-universality is well pronounced, although there is clearly a systematic tendency for $u_{\\rm th}$ to decrease with $\\kappa$ at the same value of $\\Gamma/\\Gamma_{\\rm m}$. This tendency is expected when the potential steepness increases (see e.g. Fig.~4 from Ref.~\\onlinecite{KhrapakPoP08_2015}). Overall, the data points corresponding to the dependence $u_{\\rm th}(\\Gamma/\\Gamma_{\\rm m})$ are confined to a relatively narrow range. The important point is that towards the side of soft interactions (sufficiently small $\\kappa$ in our case), the static component of the internal energy is dominant over the thermal one. For example, at $\\kappa=1$ the thermal component contributes only about $2\\%$ of the total excess energy near the fluid-solid phase transition. 
Therefore, even moderately accurate fits for $u_{\\rm th}$ allow one to obtain high accuracy for the total excess energy $u_{\\rm ex}$.\n\nThree fits are shown in Fig.~\\ref{FigR1}. The upper (lower) curve corresponds to the data portion for $\\kappa=0.5$ ($\\kappa = 3.0$).\nThe intermediate curve has been obtained using the entire set of data points (corresponding to the parameter regime shown). It can be considered as representative for strongly coupled 2D Yukawa fluids in the vicinity of the freezing transition.\nThe functional form of the fit is the same as used previously~\\cite{KhrapakPoP08_2015}\n\\begin{equation} \\label{Fit1}\nu_{\\rm th} = A \\ln (1+B\\Gamma/\\Gamma_{\\rm m}).\n\\end{equation}\nThe coefficients $A=0.257$ and $B=195.4$ determined here somewhat improve upon previous approximations.\n\nThe excess free energy can be routinely calculated using the model for the excess energy formulated above and the second of Eqs.~(\\ref{pf}). The resulting expression is rather simple,\n\\begin{equation}\\label{fex}\nf_{\\rm ex}=M(\\kappa)\\Gamma - A{\\rm Li}_2(-B\\Gamma/\\Gamma_{\\rm m}),\n\\end{equation}\nwhere ${\\rm Li}_2(z)=\\int_z^0 dt \\ln(1-t)/t$ is the dilogarithm. Note that in deriving Eq.~(\\ref{fex}), the thermodynamic integration over the coupling parameter from 0 to $\\Gamma$ has been performed, while Eq.~(\\ref{Fit1}) is, strictly speaking, not applicable at $\\Gamma\\ll 1$.\nThe correct procedure would be to start the thermodynamic integration from some small but finite value $\\Gamma_0$, and then add the constant $f_{\\rm ex}(\\Gamma_0)$ evaluated using Eq.~(\\ref{SVC}). However, since the actual contribution from the weak-coupling regime is small, Eq.~(\\ref{fex}) remains rather accurate at strong coupling and we use it here.\n\nThe calculation of pressure from the excess free energy is straightforward, but rather cumbersome in the considered case. 
This is because the differentiation with respect to $\\kappa$ is involved, and the two fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are present. For this reason, the explicit expression for $p$ is not displayed. We verified that near freezing (at $\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) the derived expression yields pressures that deviate from the exact MD results by $\\sim 0.001\\%$ at $\\kappa=0.5$, $\\sim 0.1\\%$ at $\\kappa=1.0$, and $\\sim 1\\%$ at $\\kappa = 2.0-2.8$. The accuracy drops at the highest value $\\kappa=3.0$. This is not surprising, since the fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are only applicable for $\\kappa\\lesssim 3.0$ and, therefore, derivatives from these fits at $\\kappa=3.0$ can produce significant errors.\n\nWe also found that if better accuracy is required, the data for the excess thermal energy can be fitted by the following slightly modified expression\n\\begin{equation}\n\\label{Eq7}\nu_{\\mathrm{th}} = A(\\kappa)\\ln\\left[ 1 + B(\\kappa) \\Gamma^{s(\\kappa)} \\right],\n\\end{equation}\nwhere $A$ and $B$ are now assumed $\\kappa$-dependent and a $\\kappa$-dependent exponent $s$ is introduced. Based on all the data points obtained in MD simulations, the following relations are identified:\n$A(\\kappa) = 0.35708 + 0.09397\\kappa$,\n$B(\\kappa)= 1.65491\\exp(- 0.76911\\kappa)$,\n$s(\\kappa) = 0.68838 - 0.05183\\kappa$.\nSome representative examples are shown in Fig.~\\ref{FigR2}.\nThe fit of Eq.~(\\ref{Eq7}) is clearly more accurate and can be used in the regime of weaker coupling, compared to the simple form (\\ref{Fit1}). 
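To illustrate how the static (Madelung) part and the thermal fit combine into a practical estimate of the total excess energy, the following sketch reconstructs $u_{\rm ex}=M\Gamma+u_{\rm th}$ at one MD state point of Table~\ref{Table1}. This is an illustrative script of ours, not part of the paper's codebase; the Madelung constant is the tabulated value from Table~\ref{TabM}:

```python
import math

# Thermal-energy fit of Eq. (7) with the kappa-dependent coefficients
# quoted in the text (function name is ours, for illustration only).
def u_th_fit(gamma, kappa):
    A = 0.35708 + 0.09397 * kappa
    B = 1.65491 * math.exp(-0.76911 * kappa)
    s = 0.68838 - 0.05183 * kappa
    return A * math.log(1.0 + B * gamma ** s)

# Total excess energy = static (Madelung) part + thermal part.
# M = 0.29709 is the tabulated Madelung constant for kappa = 1.0.
kappa, gamma, M = 1.0, 169.071, 0.29709
u_ex = M * gamma + u_th_fit(gamma, kappa)
print(f"u_ex(fit) = {u_ex:.3f}")  # MD value at this state point: 51.5786
```

At this near-freezing state point the reconstruction reproduces the MD value to within a few parts in $10^4$, in line with the accuracy statistics reported in the Accuracy subsection.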
However, it is also less practical in evaluating thermodynamic parameters other than the excess internal energy.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R2.eps}\\\\\n \\caption{\n Thermal component of the reduced excess energy, $u_{\\rm th}$, of 2D Yukawa fluids near the fluid-solid phase transition versus the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. Symbols correspond to MD simulations for different values of the screening parameter $\\kappa$. The curves are the analytical fits to these data using Eq.~(\\ref{Fit1}): The upper (lower) curve corresponds to fitting the MD results for $\\kappa=0.5$ ($\\kappa = 3.0$) and the intermediate (red) curve is obtained by fitting the entire set of data points.}\n\\label{FigR1}\n\\end{figure}\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R3.eps}\\\\\n \\caption{Dependence of the excess thermal energy $u_{\\rm th}$ on the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. All the data points from numerical simulations are plotted. Solid curves correspond to three representative fits using Eq.~\\eqref{Eq7}.}\n\\label{FigR2}\n\\end{figure}\n\n\n\n\\subsection{Relation between excess pressure and energy}\n\nIt is sometimes advantageous to operate with an equation of state written in the form of a relation between the pressure and internal energy of the system. For soft purely repulsive potentials the simplest formulation of this kind can be written as\n\\begin{equation}\\label{gamma_ex}\np_{\\rm ex}=\\gamma_{\\rm ex}u_{\\rm ex}.\n\\end{equation}\nHere the parameter $\\gamma_{\\rm ex}$ generally depends both on the temperature and density, that is both on $\\Gamma$ and $\\kappa$ for Yukawa systems. 
Note that the parameter $\\gamma_{\\rm ex}$ introduced in this way is not directly related to the conventional definitions of either the density scaling exponent or the Gr\\\"uneisen parameter.~\\cite{HummelPRB2015} Nevertheless, it may be helpful in characterizing the softness of the repulsive potential. We recall that for inverse-power-law (IPL) repulsive potentials of the form $\\varphi(r)\\propto r^{-\\alpha}$ the relation between the excess pressure and energy is particularly simple, $p_{\\rm ex}=\\tfrac{\\alpha}{2} u_{\\rm ex}$ in 2D. Thus, an ``effective IPL exponent'' may be associated with the quantity $2\\gamma_{\\rm ex}$.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{gamma.eps}\\\\\n \\caption{Ratio of the excess pressure to the excess energy, $\\gamma_{\\rm ex}=p_{\\rm ex}/u_{\\rm ex}$, on the plane ($\\kappa$, $\\Gamma/\\Gamma_{\\rm m}$).\n }\n\\label{gamma}\n\\end{figure}\n\nHaving approximations for both $p_{\\rm ex}$ and $u_{\\rm ex}$ for 2D Yukawa fluids, we can easily estimate the value of $\\gamma_{\\rm ex}$. The corresponding plot of $\\gamma_{\\rm ex}$ as a function of the Yukawa state variables $\\kappa$ and $\\Gamma/\\Gamma_{\\rm m}$ is shown in Fig.~\\ref{gamma}. To produce this plot, Eq.~(\\ref{Fit1}) for the thermal component of the excess energy has been used. Figure~\\ref{gamma} shows that in the strongly coupled regime $\\gamma_{\\rm ex}$ is very weakly dependent on the coupling strength (temperature), but exhibits considerable dependence on $\\kappa$ (density). 
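The effective IPL exponent $2\gamma_{\rm ex}$ can be extracted directly from a single MD state point. A minimal sketch of ours, under the stated assumption that the tabulated reduced pressure $p$ of Table~\ref{Table1} includes the ideal-gas contribution, so that $p_{\rm ex}=p-1$ in these reduced units:

```python
# Effective IPL exponent 2*gamma_ex from one MD state point (a sketch;
# we ASSUME p_ex = p - 1, i.e. the tabulated reduced pressure contains
# the ideal-gas term; the function name is ours).
def effective_ipl_exponent(p, u_ex):
    gamma_ex = (p - 1.0) / u_ex
    return 2.0 * gamma_ex

# kappa = 1.0 near freezing: Gamma = 169.071, u_ex = 51.5786, p = 85.4036.
alpha_eff = effective_ipl_exponent(p=85.4036, u_ex=51.5786)
print(f"2*gamma_ex = {alpha_eff:.2f}")  # about 3.27
```

The resulting $\gamma_{\rm ex}\approx 1.64$ at $\kappa=1$ is consistent with the representative fit $\gamma_{\rm ex}(\kappa)$ reported in the text, which gives $1.636$ at this $\kappa$.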
Using the exact MD results for $p_{\\rm ex}/u_{\\rm ex}$ in the vicinity of the fluid-solid phase transition ($\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) we have obtained a representative dependence $\\gamma_{\\rm ex}(\\kappa)$ in the strongly coupled regime:\n\\begin{equation}\n\\gamma_{\\rm ex}(\\kappa)=1+0.526\\kappa+0.13\\kappa^2-0.02\\kappa^3.\n\\end{equation}\nImportantly, $\\gamma_{\\rm ex}\\rightarrow 1$ as $\\kappa\\rightarrow 0$.\nThis seems counter-intuitive at first, because one would naturally expect $\\gamma_{\\rm ex}=\\tfrac{1}{2}$ in the OCP Coulomb interaction limit in 2D. The difference is attributed to the presence of the neutralizing background in the OCP model. In the limit of very soft interaction, the energy and pressure are dominated by their static contributions. As $\\kappa\\rightarrow 0$, the dominant contribution is the Madelung energy, so that $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M\\Gamma\\sim \\Gamma/\\kappa$ (without background). This implies $p_{\\rm ex}=\\tfrac{\\Gamma}{2}(\\partial f_{\\rm ex}/\\partial \\Gamma)-\\tfrac{\\kappa}{2}(\\partial f_{\\rm ex}/\\partial \\kappa)\\sim \\Gamma/\\kappa\\sim u_{\\rm ex}$. In the presence of the neutralizing background the term $\\Gamma/\\kappa$ disappears and we have $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M_{\\rm OCP}\\Gamma$. This yields $p_{\\rm ex}\\sim \\tfrac{1}{2}M_{\\rm OCP}\\Gamma\\sim \\tfrac{1}{2}u_{\\rm ex}$. This consideration demonstrates that Yukawa systems in the limit $\\kappa\\rightarrow 0$ are not fully equivalent to Coulomb systems with a neutralizing background.\n\n\n\\subsection{Crystals}\n\nIn a series of MD simulations for 2D Yukawa crystals, in addition to evaluating the excess energy and pressure (which are summarized in Tables~\\ref{Table3} and \\ref{Table4} of the Appendix), the mean squared displacements were calculated to find the anharmonic correction coefficient $\\beta$. 
The resulting dependence $\\beta(\\kappa)$ is shown in Fig.~\\ref{FigR3} (the corresponding values are also tabulated in Table~\\ref{Table2} of the Appendix for completeness).\nThe inset in Fig.~\\ref{FigR3} presents the radial (isotropic) pair correlation function, $g(r) \\propto \\int{d\\varphi\\; g(\\mathbf{r})}$,\nand demonstrates excellent representation of the short- and long-distance correlations. The obtained anharmonic correction coefficient $\\beta(\\kappa)$ allows one to calculate analytically the pair correlation function, and then the excess energy, pressure, and other thermodynamic parameters by thermodynamic integration with the help of the expressions given in Sec.~\\ref{Thermo}.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R5.eps}\\\\\n \\caption{ Dependence of the anharmonic correction coefficient $\\beta$ on the screening parameter $\\kappa$. The inset demonstrates a typical comparison between the radial distribution functions obtained in a direct MD simulation and computed using the shortest-graph method. For details see the text.}\n\\label{FigR3}\n\\end{figure}\n\nIt is worth pointing out the following observation:\nIn the limit $\\kappa \\rightarrow 0$, the Yukawa interaction tends to the unscreened Coulomb interaction $\\varphi \\propto r^{-1}$. According to our previous MD simulations,~\\cite{1.4926945}\nthe finite-temperature phononic spectra differ weakly from zero-temperature ones for IPL potentials, $\\varphi \\propto r^{-\\alpha}$. Therefore, in the OCP limit ($\\kappa=0$ and $\\alpha=1$) we should obtain the smallest values of $\\beta(\\kappa)$. This is indeed observed in Fig.~\\ref{FigR3}.\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R6.eps}\\\\\n \\caption{ Dependence of the reduced pressure on the reduced excess energy. Open (solid) symbols are the results of MD simulations for fluids and solids, respectively. 
The solid and dashed curves correspond to the shortest-graph method for solids and to the fit of Eq.~(\\ref{Eq7}) for fluids.}\n\\label{FigR4}\n\\end{figure}\n\nIn Fig.~\\ref{FigR4} we plot the reduced pressure versus the reduced excess energy of 2D Yukawa fluids and solids. Symbols are the MD results; the solid and dashed curves correspond to the shortest-graph method [with the anharmonic correction coefficient $\\beta(\\kappa)$ found above] for the crystalline phase and the proposed fit of Eq.~\\eqref{Eq7} for the fluid phase, respectively. Excellent agreement is observed.\n\n\\subsection{Accuracy}\n\nThe relative difference between the excess energies calculated using the shortest-graph method and those evaluated using direct MD simulations in the solid phase amounts to $\\simeq5\\times 10^{-5}$, which is comparable to the values reported earlier.~\\cite{0953-8984-28-23-235401} The accurate fit of Eq.~\\eqref{Eq7} yields a relative error in the excess energy smaller than $5\\times10^{-4}$ and $2\\times10^{-3}$ for 72\\% and 95\\% of the examined fluid data points, respectively. The maximal relative deviation, $5\\times 10^{-3}$, is observed near the melting line at large values of the screening parameter $\\kappa$. The simpler fit of Eq.~(\\ref{Fit1}) is applicable when relative deviations of $\\lesssim 1\\%$ are acceptable.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{Pressure_kappa05.eps}\\\\\n \\caption{Reduced pressure, $p$, as a function of the coupling parameter $\\Gamma$ for a 2D Yukawa fluid with the screening parameter $\\kappa=0.5$. 
The symbols are exact MD results, the solid (red) line corresponds to the fit of Eq.~(\\ref{Fit1}), and the dashed (blue) line is the fit from Ref.~\\onlinecite{0022-3727-49-23-235203}.}\n\\label{FigPressure}\n\\end{figure}\n\nIn addition, we can compare our results with those recently reported in Refs.~\\onlinecite{0022-3727-49-23-235203,1.4962685}, where fits for the pressure of 2D Yukawa fluids in the $(\\kappa,\\Gamma)$ parameter space have been proposed. The case $\\kappa=0.5$ received special attention, and a simple two-term fit has been proposed based on the results of an MD simulation,~\\cite{0022-3727-49-23-235203} $p=1.53\\Gamma+1.33$.\nWe plot our MD results along with the fit of Eq.~(\\ref{Fit1}) and the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} in Fig.~\\ref{FigPressure}. One can see that the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} systematically overestimates the pressure at high values of $\\Gamma$. At the strongest coupling in the fluid phase studied in this work, $\\Gamma=135.42$, the present MD simulation yields $p=199.434$, the fit of Eq.~(\\ref{Fit1}) yields $p=199.432$, while the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} yields $p=208.523$. On the other hand, the previous model for 2D Yukawa systems in the OCP (weak screening) limit discussed in Refs.~\\onlinecite{KhrapakPoP08_2015,1.4935846} yields $p=199.445$, providing confidence in the accuracy of the present results. The reasons for the deviations in Ref.~\\onlinecite{0022-3727-49-23-235203} remain to be identified.\n\n5 MD results\n\\section{MD results}\n\\label{Appendix}\n\nIn the Appendix, we summarize the main results from the MD simulations performed in this study. Table \\ref{Table1} reports the reduced excess energies and pressures at different state points in the fluid phase. Table \\ref{Table2} summarizes the values of the anharmonic correction coefficient $\\beta$ evaluated using MD simulations of the crystalline phase. Finally, Tables \\ref{Table3} and \\ref{Table4} report the excess energies and pressures in the crystalline phase.\n\n\\begin{table}[h]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\rm ex}$ and pressure $p$ of two-dimensional Yukawa fluids evaluated using MD simulations for various coupling ($\\Gamma$) and screening ($\\kappa$) parameters.}\n\t\\label{Table1}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c}\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.5$}\\\\ \\hline\n\t\t$\\Gamma$ & 135.420 & 86.7254 & 52.7787 & 32.1811 & 19.6073 & 11.9310 & 7.27175 & 4.43126 & 2.69848 & 1.64302 & 1.00136 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 152.944 & 98.3115 & 60.1901 & 37.0087 & 22.8180 & 14.1176 & 8.79838 & 5.51964 & 3.48587 & 2.21772 & 1.42021 & 0.76495\\\\\n\t\t$p$ & 199.434 & 128.303 & 78.6946 & 48.5651 & 30.1485 & 18.8835 & 12.0216 & 7.81631 & 5.22964 & 3.63556 & 2.64961 & 1.85883\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.6$}\\\\\\hline\n\t\t$\\Gamma$ & 140.131 & 
89.5076 & 54.3171 & 32.9737 & 20.0017 & 12.1359 & 7.36665 & 4.47442 & 2.71053 & 1.64677 & 1.00106 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 116.984 & 75.1128 & 45.9415 & 28.2016 & 17.3768 & 10.7727 & 6.73045 & 4.24422 & 2.69421 & 1.72956 & 1.11776 & 0.61083\\\\\n\t\t$p$ \t\t\t\t\t& 160.369 & 103.050 & 63.1652 & 38.9451 & 24.1971 & 15.2284 & 9.76528 & 6.42899 & 4.37128 & 3.11015 & 2.32663 & 1.69701\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.8$}\\\\\\hline\n\t\t$\\Gamma$ & 152.277 & 96.5736 & 58.0604 & 34.9737 & 21.0334 & 12.6675 & 7.61503 & 4.58845 & 2.75830 & 1.66410 & 0.99914 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 74.6424 & 47.7340 & 29.0608 & 17.8181 & 10.9844 & 6.84185 & 4.30139 & 2.74217 & 1.76665 & 1.15293 & 0.75437 & 0.42469\\\\\n\t\t$p$ \t\t\t\t\t& 112.709 & 72.1411 & 44.0441 & 27.1658 & 16.9406 & 10.7731 & 7.01845 & 4.73986 & 3.33679 & 2.47393 & 1.92983 & 1.49910\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.0$}\\\\\\hline\n\t\t$\\Gamma$ & 169.071 & 105.975 & 63.1038 & 37.6027 & 22.4047 & 13.3361 & 7.94729 & 4.73129 & 2.81940 & 1.68034 & 0.99956 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 51.5786 & 32.7335 & 19.8556 & 12.1451 & 7.50279 & 4.68984 & 2.97702 & 1.91799 & 1.25426 & 0.82932 & 0.55059 & 0.31770\\\\\n\t\t$p$ \t& 85.4036 & 54.2492 & 33.0215 & 20.3527 & 12.7618 & 8.19406 & 5.44279 & 3.76791 & 2.74103 & 2.10336 & 1.70075 & 1.38135\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.2$}\\\\\\hline\n\t\t$\\Gamma$ & 191.126 & 118.398 & 69.6429 & 40.9597 & 24.1083 & 14.1893 & 8.34919 & 4.90490 & 2.88868 & 1.70019 & 0.99984 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 37.5852 & 23.6918 & 14.3026 & 8.72609 & 5.39936 & 3.39637 & 2.17547 & 1.41736 & 0.93933 & 0.62908 & 0.42281 & 0.24960\\\\\n\t\t$p$ \t& 67.9344 & 42.8619 & 25.9838 & 16.0024 & 10.0874 & 6.56025 & 4.44041 & 3.15023 & 2.36021 & 1.86635 & 1.55301 & 1.30594\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.4$}\\\\\\hline\n\t\t$\\Gamma$ & 220.172 & 134.441 & 77.9949 & 45.2452 
& 26.2578 & 15.2219 & 8.83634 & 5.12702 & 2.97137 & 1.72440 & 1.00140 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 28.5555 & 17.8503 & 10.7244 & 6.53392 & 4.05300 & 2.56405 & 1.65932 & 1.09552 & 0.73364 & 0.49726 & 0.33718 & 0.20253\\\\\n\t\t$p$ \t& 56.0915 & 35.0963 & 21.1892 & 13.0574 & 8.28303 & 5.45288 & 3.76392 & 2.73780 & 2.10241 & 1.70540 & 1.45171 & 1.25396\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.6$}\\\\\\hline\n\t\t$\\Gamma$ & 258.433 & 155.296 & 88.6297 & 50.6106 & 28.9099 & 16.4928 & 9.41249 & 5.37870 & 3.07317 & 1.75217 & 0.99889 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 22.4535 & 13.9136 & 8.31218 & 5.05719 & 3.14728 & 2.00498 & 1.30903 & 0.87473 & 0.59391 & 0.40446 & 0.27520 & 0.16486\\\\\n\t\t$p$ & 47.7294 & 29.6021 & 17.7849 & 10.9674 & 7.00739 & 4.67522 & 3.28559 & 2.44432 & 1.92230 & 1.58965 & 1.37647 & 1.15781\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.8$}\\\\\\hline\n\t\t$\\Gamma$ & 308.935 & 182.395 & 102.261 & 57.3435 & 32.1483 & 18.0355 & 10.1029 & 5.67241 & 3.17978 & 1.78359 & 0.99997 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 18.1745 & 11.1626 & 6.63304 & 4.02868 & 2.51560 & 1.61389 & 1.06328 & 0.71747 & 0.49051 & 0.33739 & 0.23058 & 0.14359\\\\\n\t\t$p$ \t& 41.6428 & 25.5932 & 15.3055 & 9.44338 & 6.07675 & 4.10949 & 2.93845 & 2.22906 & 1.78546 & 1.50402 & 1.32125 & 1.18748\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.0$}\\\\\\hline\n\t\t$\\Gamma$ & 375.818 & 217.422 & 119.600 & 65.7745 & 36.1611 & 19.8980 & 10.9232 & 6.01199 & 3.30681 & 1.81767 & 1.00051 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 15.0964 & 9.17319 & 5.42177 & 3.29200 & 2.06139 & 1.33276 & 0.88426 & 0.60261 & 0.41513 & 0.28650 & 0.19651 & 0.12379\\\\\n\t\t$p$ \t\t\t\t\t& 37.1333 & 22.5775 & 13.4413 & 8.30684 & 5.38337 & 3.68921 & 2.67835 & 2.06727 & 1.68347 & 1.43752 & 1.27850 & 1.16494\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.2$}\\\\\\hline\n\t\t$\\Gamma$ & 463.975 & 262.948 & 141.568 & 76.2338 & 41.0173 & 22.0958 & 11.9035 & 6.41082 
& 3.45056 & 1.85303 & 1.00113 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 12.7875 & 7.69994 & 4.52708 & 2.74830 & 1.72461 & 1.12217 & 0.75368 & 0.51642 & 0.35777 & 0.24734 & 0.17009 & 0.10850\\\\\n\t\t$p$ \t\t\t\t\t& 33.6575 & 20.2710 & 12.0118 & 7.43585 & 4.85060 & 3.36425 & 2.48426 & 1.94445 & 1.60450 & 1.38520 & 1.24473 & 1.14408\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.4$}\\\\\\hline\n\t\t$\\Gamma$ & 578.968 & 320.871 & 168.949 & 89.0382 & 46.8778 & 24.7092 & 12.9953 & 6.85634 & 3.60307 & 1.89919 & 0.99952 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 10.9709 & 6.56430 & 3.83850 & 2.33031 & 1.47100 & 0.96365 & 0.65089 & 0.44974 & 0.31141 & 0.21697 & 0.14862 & 0.09589\\\\\n\t\t$p$ & 30.8215 & 18.4175 & 10.8648 & 6.74135 & 4.43655 & 3.11369 & 2.32748 & 1.84722 & 1.53931 & 1.34446 & 1.21673 & 1.12942\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.6$}\\\\\\hline\n\t\t$\\Gamma$ & 723.656 & 392.384 & 202.051 & 104.080 & 53.5742 & 27.6270 & 14.2191 & 7.32182 & 3.76653 & 1.93971 & 1.00200 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 9.50055 & 5.63818 & 3.28596 & 1.99866 & 1.26783 & 0.83500 & 0.56905 & 0.39442 & 0.27600 & 0.19145 & 0.13130 & 0.08576\\\\\n\t\t$p$ \t& 28.3633 & 16.8096 & 9.89231 & 6.16190 & 4.09049 & 2.90245 & 2.19936 & 1.76426 & 1.48858 & 1.30961 & 1.19408 & 1.11954\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.8$}\\\\\\hline\n\t\t$\\Gamma$ & 893.746 & 474.549 & 239.143 & 120.685 & 60.8483 & 30.6642 & 15.4796 & 7.80951 & 3.93161 & 1.98042 & 1.00296 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 8.19448 & 4.82859 & 2.81518 & 1.71951 & 1.09985 & 0.73051 & 0.50093 & 0.35038 & 0.24489 & 0.17117 & 0.11700 & 0.07671\\\\\n\t\t$p$ \t& 25.9004 & 15.2521 & 8.98792 & 5.63831 & 3.78782 & 2.72194 & 2.08856 & 1.69631 & 1.44344 & 1.28133 & 1.17497 & 1.10201\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=3.0$}\\\\\\hline\n\t\t$\\Gamma$ & 1071.02 & 558.495 & 276.444 & 136.953 & 67.7922 & 33.5897 & 16.6383 & 8.22716 & 4.07874 & 2.02013 & 0.99949 & 0.5 
\\\\\n\t\t$u_{\\mathrm{ex}}$ & 6.93189 & 4.07091 & 2.38838 & 1.47193 & 0.95056 & 0.64023 & 0.44340 & 0.31146 & 0.21994 & 0.15395 & 0.10494 & 0.06958\\\\\n\t\t$p$ \t& 23.1181 & 13.5906 & 8.07317 & 5.12679 & 3.49444 & 2.55590 & 1.98879 & 1.63334 & 1.40554 & 1.25677 & 1.15868 & 1.09682\\\\\\hline\\hline\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Values of the anharmonic correction coefficient $\\beta$ for different screening parameter $\\kappa$.}\n\t\\label{Table2}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c c c c c c}\n\t\t$\\kappa$ & 0.0 & 0.2 & 0.3 & 0.4 & 0.6 & 0.8 & 1.0 & 1.2 & 1.4 & 1.6 & 1.8 & 2.0 & 2.2 & 2.4 & 2.6 & 2.8 & 3.0 \\\\\\hline\n\t\t$\\beta(\\kappa)$\t& 3.01 & 9.23 & 12.38 & 14.30 & 10.53 & 9.71 & 9.35 & 9.28 & 9.14 & 9.08 & 8.97 & 8.855 & 8.68 & 8.71 & 8.46 & 8.47 & 8.51\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\mathrm{ex}}$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table3}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 1595.62 & 798.828 & 532.689 & 399.681 & 319.981 & 266.796 & 228.880 & 200.332 & 178.283 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1217.36 & 609.282 & 406.628 & 305.117 & 244.469 & 203.938 & 174.914 & 153.188 & 136.267 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 773.025 & 387.104 & 258.328 & 194.074 & 155.484 & 129.733 & 111.343 & 97.5607 & 86.8364 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 529.643 & 265.306 & 177.235 & 133.215 & 106.726 & 89.1490 & 76.5169 & 67.1314 & 59.7831 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 382.522 & 191.740 & 128.152 & 96.3972 & 77.2970 & 
64.6022 & 55.5318 & 48.7317 & 43.4438 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 287.408 & 144.232 & 96.4804 & 72.5942 & 58.2862 & 48.7586 & 41.9386 & 36.8484 & 32.8838 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 223.185 & 112.096 & 75.0671 & 56.5515 & 45.4466 & 38.0606 & 32.7681 & 28.8120 & 25.7391 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 178.133 & 89.6228 & 60.0889 & 45.3116 & 36.4631 & 30.5563 & 26.3521 & 23.1896 & 20.7451 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 145.774 & 73.3800 & 49.2712 & 37.2003 & 29.9641 & 25.1447 & 21.7011 & 19.1314 & 17.1275 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 121.609 & 61.3067 & 41.2021 & 31.1620 & 25.1352 & 21.1177 & 18.2517 & 16.1113 & 14.4385 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 102.908 & 51.9465 & 34.9672 & 26.4819 & 21.3920 & 17.9999 & 15.5706 & 13.7650 & 12.3602 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 87.4157 & 44.2324 & 29.8212 & 22.6181 & 18.2990 & 15.4212 & 13.3710 & 11.8300 & 10.6351 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 73.5771 & 37.3025 & 25.2028 & 19.1490 & 15.5271 & 13.1108 & 11.3865 & 10.0997 & 9.10597 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 60.2002 & 30.6118 & 20.7457 & 15.8118 & 12.8497 & 10.8840 & 9.47465 & 8.43053 & 7.65187 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced pressure (compressibility) $p$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table4}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 2080.63 & 1041.70 & 694.789 & 521.370 & 417.454 & 348.100 & 298.669 & 261.442 & 232.679 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1669.06 & 835.485 & 557.680 & 418.523 & 335.380 & 279.814 & 240.022 & 210.233 & 187.022 
\\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 1168.03 & 585.024 & 390.480 & 293.406 & 235.104 & 196.197 & 168.410 & 147.583 & 131.370 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 878.208 & 440.005 & 294.000 & 221.023 & 177.106 & 147.964 & 127.016 & 111.450 & 99.2542 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 693.046 & 347.470 & 232.288 & 174.765 & 140.162 & 117.162 & 100.726 & 88.4011 & 78.8053 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 566.555 & 284.386 & 190.275 & 143.196 & 114.994 & 96.2113 & 82.7636 & 72.7234 & 64.8975 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 476.692 & 239.477 & 160.406 & 120.865 & 97.1465 & 81.3696 & 70.0608 & 61.6053 & 55.0288 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 410.580 & 206.621 & 138.561 & 104.505 & 84.1086 & 70.4915 & 60.7970 & 53.5005 & 47.8555 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 361.191 & 181.859 & 122.134 & 92.2267 & 74.2973 & 62.3524 & 53.8144 & 47.4405 & 42.4641 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 322.729 & 162.732 & 109.386 & 82.7430 & 66.7485 & 56.0825 & 48.4703 & 42.7821 & 38.3327 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 291.498 & 147.173 & 99.0847 & 75.0489 & 60.6307 & 51.0175 & 44.1300 & 39.0087 & 35.0158 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 263.437 & 133.325 & 89.9002 & 68.1935 & 55.1747 & 46.4976 & 40.3128 & 35.6615 & 32.0486 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 235.188 & 119.260 & 80.5872 & 61.2342 & 49.6540 & 41.9257 & 36.4074 & 32.2829 & 29.0897 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 203.533 & 103.516 & 70.1601 & 53.4777 & 43.4588 & 36.8063 & 32.0351 & 28.4887 & 25.8081 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\\end{widetext}\n\n\\bibliography{Ref-2D-Yukawa} \n\n\\end{document}\n", "Descriptive_question1": "What is the value of the anharmonic correction coefficient β for κ = 0.0 in table_3?", "Descriptive_question2": "What is the value of β for κ = 3.0 in table_3?", "Reasoning_question1": "How does the anharmonic correction coefficient β change as the screening parameter κ increases from 0.0 to 3.0 in table_3, and what might explain this trend?", 
"Reasoning_question2": "What can be inferred about the relationship between the screening parameter κ and the anharmonic correction coefficient β in table_3, particularly around κ = 0.4 to 0.6?", "Descriptive_answer1": "3.01", "Descriptive_answer2": "8.51", "Reasoning_answer1": "As the screening parameter κ increases from 0.0 to 3.0, the anharmonic correction coefficient β initially rises sharply from 3.01 at κ=0.0 to a peak of 14.30 at κ=0.4. After this peak, β decreases to 10.53 at κ=0.6, and continues to decline more gradually to 9.71 at κ=0.8, and further to 8.51 at κ=3.0, with slight fluctuations (e.g., a small increase from 8.46 at κ=2.6 to 8.51 at κ=3.0). This trend suggests a complex interaction between screening and anharmonic effects. Initially, as κ increases from 0.0, the screening of the Yukawa interaction may enhance anharmonic contributions, possibly due to altered inter-particle potentials leading to greater deviation from harmonic behavior. However, beyond κ=0.4, increasing screening likely weakens the interaction strength, reducing the anharmonic effects, which explains the overall decline in β. The context indicates that as κ approaches 0, the interaction resembles unscreened Coulomb potential, where anharmonic corrections are minimal, consistent with the lowest β value at κ=0.0. The fluctuations at higher κ might be due to numerical precision or specific structural effects in the crystal lattice not captured by simple trends.", "Reasoning_answer2": "Examining the relationship between the screening parameter κ and the anharmonic correction coefficient β around κ=0.4 to 0.6 reveals a notable transition. At κ=0.4, β reaches its peak value of 14.30, indicating a maximum anharmonic correction. However, as κ increases to 0.6, β drops significantly to 10.53, marking a sharp decline. This suggests that around this range, there is a critical shift in the interaction dynamics of the 2D Yukawa crystal. 
The peak at κ=0.4 likely indicates a point where the screening parameter maximizes deviations from harmonic behavior, possibly due to a specific balance between interaction range and particle correlations in the lattice. The subsequent decrease at κ=0.6 could reflect a reduction in effective interaction strength with increased screening, leading to weaker anharmonic effects. This behavior might be linked to changes in the lattice structure or thermal motion contributions becoming less dominant as screening alters the potential softness, aligning with discussions in the context about interaction strength weakening with higher κ." }, { "paper_id": "1704.00976.json", "table_id": "table_4", "table_content": "\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\mathrm{ex}}$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table3}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 1595.62 & 798.828 & 532.689 & 399.681 & 319.981 & 266.796 & 228.880 & 200.332 & 178.283 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1217.36 & 609.282 & 406.628 & 305.117 & 244.469 & 203.938 & 174.914 & 153.188 & 136.267 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 773.025 & 387.104 & 258.328 & 194.074 & 155.484 & 129.733 & 111.343 & 97.5607 & 86.8364 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 529.643 & 265.306 & 177.235 & 133.215 & 106.726 & 89.1490 & 76.5169 & 67.1314 & 59.7831 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 382.522 & 191.740 & 128.152 & 96.3972 & 77.2970 & 64.6022 & 55.5318 & 48.7317 & 43.4438 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 287.408 & 144.232 & 96.4804 & 72.5942 & 58.2862 & 48.7586 & 41.9386 & 36.8484 & 32.8838 
\\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 223.185 & 112.096 & 75.0671 & 56.5515 & 45.4466 & 38.0606 & 32.7681 & 28.8120 & 25.7391 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 178.133 & 89.6228 & 60.0889 & 45.3116 & 36.4631 & 30.5563 & 26.3521 & 23.1896 & 20.7451 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 145.774 & 73.3800 & 49.2712 & 37.2003 & 29.9641 & 25.1447 & 21.7011 & 19.1314 & 17.1275 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 121.609 & 61.3067 & 41.2021 & 31.1620 & 25.1352 & 21.1177 & 18.2517 & 16.1113 & 14.4385 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 102.908 & 51.9465 & 34.9672 & 26.4819 & 21.3920 & 17.9999 & 15.5706 & 13.7650 & 12.3602 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 87.4157 & 44.2324 & 29.8212 & 22.6181 & 18.2990 & 15.4212 & 13.3710 & 11.8300 & 10.6351 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 73.5771 & 37.3025 & 25.2028 & 19.1490 & 15.5271 & 13.1108 & 11.3865 & 10.0997 & 9.10597 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 60.2002 & 30.6118 & 20.7457 & 15.8118 & 12.8497 & 10.8840 & 9.47465 & 8.43053 & 7.65187 \\\\\n\t\t\\end{tabular}\n\\end{table}", "caption": "Reduced excess energy $u_{\\mathrm{ex}}$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.", "label": "Table3", "section_info": "3 Results\n\\section{Results}\n\n\\subsection{Weakly-coupled fluids}\n\n\\begin{figure}[!b]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R1.eps}\\\\\n \\caption{The excess energy $u_{\\rm ex}$ of 2D Yukawa weakly coupled fluids versus the screening parameter $\\kappa$ at a fixed coupling parameter $\\Gamma = 0.5$. The symbols correspond to the results of MD simulations, the solid curve is plotted using the analytical expression of Eq.~(\\ref{SVC}).\n }\n\\label{FigSC}\n\\end{figure}\n\nA simple and physically transparent approach to the thermodynamics of weakly coupled Yukawa systems for small deviations from the ideal gas behavior is to calculate the second virial coefficient. 
This has recently been shown to work well in 3D Yukawa systems.~\\cite{KhrapakPPCF2016} In the 2D geometry the excess free energy is expressed in this approximation as\n\\begin{equation}\\label{SVC}\nf_{\\rm ex}\\simeq \\pi n \\int\\left[1-e^{-\\varphi(r)/k_{\\rm B}T}\\right]r dr.\n\\end{equation}\nThe excess energy and pressure can be readily obtained from the excess free energy. We compare the values $u_{\\rm ex}$ at a fixed coupling parameter $\\Gamma=0.5$ obtained from Eq.~(\\ref{SVC}) and computed using MD simulations in Fig.~\\ref{FigSC}. The agreement is satisfactory: in the range of $\\kappa$ investigated, the deviations are within several percent. The agreement naturally improves with increasing $\\kappa$, because at a fixed $\\Gamma$ the actual interaction strength weakens as $\\kappa$ increases.\n\n\n\\subsection{Strongly-coupled fluids}\n\nThe excess energy and pressure of the 2D Yukawa fluids have been determined using MD simulations in a wide range of coupling and screening parameters. The results are summarized in Table~\\ref{Table1} of the Appendix. Here we describe simple analytical\napproximations that can be used to evaluate the energy and pressure for practical purposes.\n\nIn the strongly coupled fluid regime it is helpful to divide the thermodynamic quantities, such as energy and pressure, into static and thermal contributions. The static contribution corresponds to the value of the internal energy when the particles are frozen in some regular configuration; the thermal corrections arise from the deviations of the particles from these fixed positions due to thermal motion. Of course, such a division is only meaningful when the regular structure is specified. For crystals, the obvious choice is the corresponding lattice sum (Madelung energy). 
For fluids this choice is also meaningful and we use it here (note that in 3D Yukawa systems a slightly different definition of the static fluid energy is traditionally employed~\\cite{KhrapakPPCF2016, KhrapakISM}).\n\n\\begin{table}[!b]\n\\caption{\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ }\n\\begin{ruledtabular}\n\\begin{tabular}{cccc}\n$\\kappa$ & $M$ & $\\kappa$ & $M$ \\\\ \\hline\n0.5 & 1.11914 & 1.8 & 0.05449 \\\\\n0.6 & 0.82503 & 2.0 & 0.03660 \\\\\n0.8 & 0.48127 & 2.2 & 0.02470 \\\\\n1.0 & 0.29709 & 2.4 & 0.01672 \\\\\n1.2 & 0.18960 & 2.6 & 0.01135 \\\\\n1.4 & 0.12357 & 2.8 & 0.00772 \\\\\n1.6 & 0.08167 & 3.0 & 0.00525 \\\\ \n\\end{tabular}\n\\end{ruledtabular}\n\\end{table}\n\nThe excess internal energy is thus a sum of the static and thermal contributions,\n\\begin{equation}\nu_{\\rm ex} = u_{\\rm st} + u_{\\rm th},\n\\end{equation}\nwhere $u_{\\rm st} = M\\Gamma$ and $M$ is the Madelung constant.\nThe values of the Madelung constant for 2D Yukawa systems in the regime of relatively weak screening, $0.5 \\leq \\kappa\\leq 3.0$, are tabulated in Table~\\ref{TabM}. The dependence $M(\\kappa)$ can be fitted using a functional form similar to that proposed by Totsuji~\\emph{et al.}\\cite{PhysRevE.70.016405}\n\\begin{equation}\n\\label{Eq6}\nM = -1.1061+0.5038\\kappa-0.11053\\kappa^2+0.00968\\kappa^3+1/\\kappa.\n\\end{equation}\nThe last term in (\\ref{Eq6}) accounts for the absence of a neutralizing background in our case (but present in Ref.~\\onlinecite{PhysRevE.70.016405}), the energy of this background being simply $-\\Gamma/\\kappa$. The fit is chosen in such a way that when $\\kappa\\rightarrow 0$ and the neutralizing background is introduced, the Madelung constant reduces to the well-known value of the triangular lattice sum of the 2D one-component plasma (OCP) with Coulomb interactions, $M_{\\rm OCP}\\simeq -1.1061$. 
This fit is accurate to within a tiny fraction of a percent for $\\kappa\\lesssim 1.0$ and to within $\\sim 1\\%$ when screening becomes stronger ($\\kappa\\sim 3$).\n\nThe thermal part of the excess energy is expected to exhibit a quasi-universal scaling with respect to the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. This is a general property of classical particle systems with sufficiently soft interactions, which was first pointed out by Rosenfeld and Tarazona (RT scaling) for 3D systems.~\\cite{RT1,RT2} In the context of 3D Yukawa systems, the RT scaling has been proven to be very useful in Refs.~\\onlinecite{1.4921223,KhrapakPPCF2016,KhrapakPRE2015,KhrapakPRE03_2015}. The emergence of an RT-scaling analogue for 2D systems has been discussed in the context of OCP with Coulomb and logarithmic interactions, Yukawa systems near the OCP limit, and inverse-power-law interactions.~\\cite{KhrapakCPP2016,KhrapakPoP08_2015} The dependence of $u_{\\rm th}$ on $\\Gamma/\\Gamma_{\\rm m}$ in the strongly coupled regime is displayed in Fig.~\\ref{FigR1}. The quasi-universality is well pronounced, although there is clearly some systematic tendency for $u_{\\rm th}$ to decrease with $\\kappa$ at the same value of $\\Gamma/\\Gamma_{\\rm m}$. This tendency is expected when the potential steepness increases (see, e.g., Fig.~4 from Ref.~\\onlinecite{KhrapakPoP08_2015}). Overall, the data points corresponding to the dependence $u_{\\rm th}(\\Gamma/\\Gamma_{\\rm m})$ are confined to a relatively narrow range. The important point is that towards the side of soft interactions (sufficiently small $\\kappa$ in our case), the static component of the internal energy dominates over the thermal one. For example, at $\\kappa=1$ the thermal component contributes only about $2\\%$ of the total excess energy near the fluid-solid phase transition. 
Therefore, even moderately accurate fits for $u_{\\rm th}$ allow one to obtain high accuracy with respect to the total excess energy $u_{\\rm ex}$.\n\nThree fits are shown in Fig.~\\ref{FigR1}. The upper (lower) curve corresponds to the data portion for $\\kappa=0.5$ ($\\kappa = 3.0$).\nThe intermediate curve has been obtained using the entire set of data points (corresponding to the parameter regime shown). It can be considered representative for strongly coupled 2D Yukawa fluids in the vicinity of the freezing transition.\nThe functional form of the fit is the same as used previously~\\cite{KhrapakPoP08_2015}\n\\begin{equation} \\label{Fit1}\nu_{\\rm th} =A \\ln (1+B\\Gamma/\\Gamma_{\\rm m}).\n\\end{equation}\nThe use of the coefficients $A=0.257$ and $B=195.4$ determined here would somewhat improve previous approximations.\n\nThe excess free energy can be routinely calculated using the model for the excess energy formulated above and the second of Eqs.~(\\ref{pf}). The resulting expression is rather simple,\n\\begin{equation}\\label{fex}\nf_{\\rm ex}=M(\\kappa)\\Gamma - A{\\rm Li}_2(-B\\Gamma/\\Gamma_{\\rm m}),\n\\end{equation}\nwhere ${\\rm Li}_2(z)=\\int_z^0 dt \\ln(1-t)/t$ is the dilogarithm. Note that in deriving Eq.~(\\ref{fex}), the thermodynamic integration over the coupling parameter from 0 to $\\Gamma$ has been performed, while Eq.~(\\ref{Fit1}) is, strictly speaking, not applicable at $\\Gamma\\ll 1$.\nThe correct procedure would be to start the thermodynamic integration from some small but finite value $\\Gamma_0$, and then add the constant $f_{\\rm ex}(\\Gamma_0)$ evaluated using Eq.~(\\ref{SVC}). However, since the actual contribution from the weakly coupled regime is small, Eq.~(\\ref{fex}) remains rather accurate at strong coupling and we use it here.\n\nThe calculation of pressure from the excess free energy is straightforward, but rather cumbersome in the considered case. 
This is because the differentiation with respect to $\\kappa$ is involved, and the two fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are present. For this reason, the explicit expression for $p$ is not displayed. We verified that near freezing (at $\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) the derived expression yields pressures that deviate from the exact MD results by $\\sim 0.001\\%$ at $\\kappa=0.5$, $\\sim 0.1\\%$ at $\\kappa=1.0$, and $\\sim 1\\%$ at $\\kappa = 2.0-2.8$. The accuracy drops at the highest value $\\kappa=3.0$. This is not surprising, since the fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are only applicable for $\\kappa\\lesssim 3.0$ and, therefore, derivatives from these fits at $\\kappa=3.0$ can produce significant errors.\n\nWe also found that if better accuracy is required, the data for the excess thermal energy can be fitted by the following slightly modified expression,\n\\begin{equation}\n\\label{Eq7}\nu_{\\mathrm{th}} = A(\\kappa)\\ln\\left[ 1 + B(\\kappa) \\Gamma^{s(\\kappa)} \\right],\n\\end{equation}\nwhere $A$ and $B$ are now assumed $\\kappa$-dependent and a $\\kappa$-dependent exponent $s$ is introduced. Based on all the data points obtained in MD simulations, the following relations are identified:\n$A(\\kappa) = 0.35708 + 0.09397\\kappa$,\n$B(\\kappa)= 1.65491\\exp(-0.76911\\kappa)$,\n$s(\\kappa) = 0.68838 - 0.05183\\kappa$.\nSome representative examples are shown in Fig.~\\ref{FigR2}.\nThe fit of Eq.~(\\ref{Eq7}) is clearly more accurate and can be used in\nthe regime of weaker coupling, compared to the simple form (\\ref{Fit1}). 
However, it is also less practical in evaluating thermodynamic parameters other than the excess internal energy.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R2.eps}\\\\\n \\caption{\n Thermal component of the reduced excess energy, $u_{\\rm th}$, of 2D Yukawa fluids near the fluid-solid phase transition versus the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. Symbols correspond to MD simulations for different values of the screening parameter $\\kappa$. The curves are the analytical fits to these data using Eq.~(\\ref{Fit1}): the upper (lower) curve corresponds to fitting the MD results for $\\kappa=0.5$ ($\\kappa = 3.0$) and the intermediate (red) curve is obtained by fitting the entire set of data points.}\n\\label{FigR1}\n\\end{figure}\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R3.eps}\\\\\n \\caption{Dependence of the excess thermal energy $u_{\\rm th}$ on the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. All the data points from numerical simulations are plotted. Solid curves correspond to three representative fits using Eq.~\\eqref{Eq7}.}\n\\label{FigR2}\n\\end{figure}\n\n\n\n\\subsection{Relation between excess pressure and energy}\n\nIt is sometimes advantageous to operate with an equation of state written in the form of a relation between the pressure and internal energy of the system. For soft, purely repulsive potentials the simplest formulation of this kind can be written as\n\\begin{equation}\\label{gamma_ex}\np_{\\rm ex}=\\gamma_{\\rm ex}u_{\\rm ex}.\n\\end{equation}\nHere the parameter $\\gamma_{\\rm ex}$ generally depends both on the temperature and density, that is both on $\\Gamma$ and $\\kappa$ for Yukawa systems. 
Note that the parameter $\\gamma_{\\rm ex}$ introduced in this way is not directly related to the conventional definitions of either the density scaling exponent or the Gr\\\"uneisen parameter.~\\cite{HummelPRB2015} Nevertheless, it may be helpful in characterizing the softness of the repulsive potential. We recall that for inverse-power-law (IPL) repulsive potentials of the form $\\varphi(r)\\propto r^{-\\alpha}$ the relation between the excess pressure and energy is particularly simple, $p_{\\rm ex}=\\tfrac{\\alpha}{2} u_{\\rm ex}$ in 2D. Thus, an ``effective IPL exponent'' may be associated with the quantity $2\\gamma_{\\rm ex}$.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{gamma.eps}\\\\\n \\caption{Ratio of the excess pressure to the excess energy, $\\gamma_{\\rm ex}=p_{\\rm ex}/u_{\\rm ex}$, on the plane ($\\kappa$, $\\Gamma/\\Gamma_{\\rm m}$).\n }\n\\label{gamma}\n\\end{figure}\n\nHaving approximations for both $p_{\\rm ex}$ and $u_{\\rm ex}$ for 2D Yukawa fluids, we can easily estimate the value of $\\gamma_{\\rm ex}$. The corresponding plot of $\\gamma_{\\rm ex}$ as a function of the Yukawa system state variables $\\kappa$ and $\\Gamma/\\Gamma_{\\rm m}$ is shown in Fig.~\\ref{gamma}. To produce this plot, Eq.~(\\ref{Fit1}) for the thermal component of the excess energy has been used. Figure~\\ref{gamma} shows that in the strongly coupled regime $\\gamma_{\\rm ex}$ is very weakly dependent on the coupling strength (temperature), but exhibits considerable dependence on $\\kappa$ (density). 
Using the exact MD results for $p_{\\rm ex}/u_{\\rm ex}$ in the vicinity of the fluid-solid phase transition ($\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$), we have obtained a representative dependence $\\gamma_{\\rm ex}(\\kappa)$ in the strongly coupled regime:\n\\begin{equation}\n\\gamma_{\\rm ex}(\\kappa)=1+0.526\\kappa+0.13\\kappa^2-0.02\\kappa^3.\n\\end{equation}\nImportantly, $\\gamma_{\\rm ex}\\rightarrow 1$ as $\\kappa\\rightarrow 0$.\nThis seems counter-intuitive at first, because one would naturally expect $\\gamma_{\\rm ex}=\\tfrac{1}{2}$ in the OCP Coulomb interaction limit in 2D. The difference is attributed to the presence of the neutralizing background in the OCP model. In the limit of very soft interactions, the energy and pressure are dominated by their static contributions. As $\\kappa\\rightarrow 0$, the dominant contribution is the Madelung energy, so that $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M\\Gamma\\sim \\Gamma/\\kappa$ (without background). This implies $p_{\\rm ex}=\\tfrac{\\Gamma}{2}(\\partial f_{\\rm ex}/\\partial \\Gamma)-\\tfrac{\\kappa}{2}(\\partial f_{\\rm ex}/\\partial \\kappa)\\sim \\Gamma/\\kappa\\sim u_{\\rm ex}$. In the presence of the neutralizing background the term $\\Gamma/\\kappa$ disappears and we have $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M_{\\rm OCP}\\Gamma$. This yields $p_{\\rm ex}\\sim \\tfrac{1}{2}M_{\\rm OCP}\\Gamma\\sim \\tfrac{1}{2}u_{\\rm ex}$. This consideration demonstrates that Yukawa systems in the limit $\\kappa\\rightarrow 0$ are not fully equivalent to Coulomb systems with a neutralizing background.\n\n\n\\subsection{Crystals}\n\nIn a series of MD simulations for 2D Yukawa crystals, in addition to evaluating the excess energy and pressure (which are summarized in Tables~\\ref{Table3} and \\ref{Table4} of the Appendix), the mean squared displacements were calculated to find the anharmonic correction coefficient $\\beta$. 
The resulting dependence $\\beta(\\kappa)$ is shown in Figure~\\ref{FigR3} (the corresponding values are also tabulated in Table~\\ref{Table2} of the Appendix for completeness).\nThe inset in Fig.~\\ref{FigR3} presents the radial (isotropic) pair correlation function, $g(r) \\propto \\int{d\\varphi\\; g(\\mathbf{r})}$,\nand demonstrates excellent representation of the short- and long-distance correlations. The obtained anharmonic correction coefficient $\\beta(\\kappa)$ allows one to calculate analytically the pair correlation function and then the excess energy, pressure, and other thermodynamic parameters by thermodynamic integration with the help of the expressions given in Sec.~\\ref{Thermo}.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R5.eps}\\\\\n \\caption{ Dependence of the anharmonic correction coefficient $\\beta$ on the screening parameter $\\kappa$. The inset demonstrates a typical comparison between the radial distribution functions obtained in a direct MD simulation and computed using the shortest-graph method. For details see the text.}\n\\label{FigR3}\n\\end{figure}\n\nIt is worth pointing out the following observation:\nin the limit $\\kappa \\rightarrow 0$, the Yukawa interaction tends to the unscreened Coulomb interaction $\\varphi \\propto r^{-1}$. According to our previous MD simulations,~\\cite{1.4926945}\nthe finite-temperature phononic spectra differ weakly from zero-temperature ones for IPL potentials, $\\varphi \\propto r^{-\\alpha}$. Therefore, in the OCP limit ($\\kappa=0$ and $\\alpha=1$) we should obtain the smallest values of $\\beta(\\kappa)$. This is indeed observed in Fig.~\\ref{FigR3}.\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R6.eps}\\\\\n \\caption{ Dependence of the reduced pressure on the reduced excess energy. Open (solid) symbols are the results of MD simulations for fluids and solids, respectively. 
The solid and dashed curves correspond to the shortest-graph method for solids and to the fit of Eq.~(\\ref{Eq7}) for fluids.}\n\\label{FigR4}\n\\end{figure}\n\nIn Fig.~\\ref{FigR4} we plot the reduced pressure versus the reduced excess energy of 2D Yukawa fluids and solids. Symbols are the MD results; the solid and dashed curves correspond to the shortest-graph method [with the found anharmonic correction coefficient $\\beta(\\kappa)$] for the crystalline phase and the proposed fit of Eq.~\\eqref{Eq7} for the fluid phase, respectively. Excellent agreement is observed.\n\n\\subsection{Accuracy}\n\nThe relative difference between the excess energies calculated using the shortest-graph method and those evaluated using direct MD simulations in the solid phase amounts to $\\simeq5\\times 10^{-5}$, which is comparable to the values reported earlier.~\\cite{0953-8984-28-23-235401} The accurate fit of Eq.~\\eqref{Eq7}\nyields a relative error in the excess energy smaller than $5\\times10^{-4}$ and $2\\times10^{-3}$ for 72\\% and 95\\% of\nthe examined fluid data points, respectively. The maximal relative deviation, $5\\times 10^{-3}$, is observed near the melting line at large values of the screening parameter $\\kappa$. The simpler fit of Eq.~(\\ref{Fit1}) is applicable when relative deviations within $\\lesssim 1\\%$ are acceptable.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{Pressure_kappa05.eps}\\\\\n \\caption{Reduced pressure, $p$, as a function of the coupling parameter $\\Gamma$ for a Yukawa 2D fluid with the screening parameter $\\kappa=0.5$. 
The symbols are exact MD results, the solid (red) line corresponds to the fit of Eq.~(\\ref{Fit1}), the dashed (blue) line is the fit from Ref.~\\onlinecite{0022-3727-49-23-235203}.}\n\\label{FigPressure}\n\\end{figure}\n\nIn addition, we can compare our results with those recently reported in Refs.~\\onlinecite{0022-3727-49-23-235203,1.4962685}, where fits for the pressure of 2D Yukawa fluids in the $(\\kappa,\\Gamma)$ parameter space have been proposed. The case $\\kappa=0.5$ received special attention and a simple two-term fit has been proposed based on the results of a MD simulation,~\\cite{0022-3727-49-23-235203} $p=1.53\\Gamma+1.33$.\nWe plot our MD results along with the fit of Eq.~(\\ref{Fit1}) and the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} in Fig.~\\ref{FigPressure}. One can see that the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} overestimates the pressure systematically at high values of $\\Gamma$. At the strongest coupling in the fluid phase studied in this work, $\\Gamma=135.42$, the present MD simulation yields $p= 199.434$, fit by Eq.~(\\ref{Fit1}) yields $p=199.432$, while the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} yields $p=208.523$. On the other hand, the previous model for 2D Yukawa systems in the OCP (weakly screening) limit discussed in Refs.~\\onlinecite{KhrapakPoP08_2015,1.4935846}\nyields $p=199.445$, providing confidence in the accuracy of the present results. The reasons for deviations in Ref.~\\onlinecite{0022-3727-49-23-235203} have to be identified.\n\n5 MD results\n\\section{MD results}\n\\label{Appendix}\n\nIn the Appendix, we summarize main results from MD simulations performed in this study. Table \\ref{Table1} reports the reduced excess energies and pressures at different state points in the fluid phase. Table \\ref{Table2} summarizes the values of the anharmonic correction coefficient $\\beta$ evaluated using MD simulations of the crystalline phase. Finally, Tables \\ref{Table3} and \\ref{Table4} report the excess energies and pressures in the crystalline phase.\n\n\\begin{table}[h]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\rm ex}$ and pressure $p$ of two-dimensional Yukawa fluids evaluated using MD simulations for various coupling ($\\Gamma$) and screening ($\\kappa$) parameters.}\n\t\\label{Table1}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c}\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.5$}\\\\ \\hline\n\t\t$\\Gamma$ & 135.420 & 86.7254 & 52.7787 & 32.1811 & 19.6073 & 11.9310 & 7.27175 & 4.43126 & 2.69848 & 1.64302 & 1.00136 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 152.944 & 98.3115 & 60.1901 & 37.0087 & 22.8180 & 14.1176 & 8.79838 & 5.51964 & 3.48587 & 2.21772 & 1.42021 & 0.76495\\\\\n\t\t$p$ & 199.434 & 128.303 & 78.6946 & 48.5651 & 30.1485 & 18.8835 & 12.0216 & 7.81631 & 5.22964 & 3.63556 & 2.64961 & 1.85883\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.6$}\\\\\\hline\n\t\t$\\Gamma$ & 140.131 & 
89.5076 & 54.3171 & 32.9737 & 20.0017 & 12.1359 & 7.36665 & 4.47442 & 2.71053 & 1.64677 & 1.00106 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 116.984 & 75.1128 & 45.9415 & 28.2016 & 17.3768 & 10.7727 & 6.73045 & 4.24422 & 2.69421 & 1.72956 & 1.11776 & 0.61083\\\\\n\t\t$p$ \t\t\t\t\t& 160.369 & 103.050 & 63.1652 & 38.9451 & 24.1971 & 15.2284 & 9.76528 & 6.42899 & 4.37128 & 3.11015 & 2.32663 & 1.69701\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.8$}\\\\\\hline\n\t\t$\\Gamma$ & 152.277 & 96.5736 & 58.0604 & 34.9737 & 21.0334 & 12.6675 & 7.61503 & 4.58845 & 2.75830 & 1.66410 & 0.99914 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 74.6424 & 47.7340 & 29.0608 & 17.8181 & 10.9844 & 6.84185 & 4.30139 & 2.74217 & 1.76665 & 1.15293 & 0.75437 & 0.42469\\\\\n\t\t$p$ \t\t\t\t\t& 112.709 & 72.1411 & 44.0441 & 27.1658 & 16.9406 & 10.7731 & 7.01845 & 4.73986 & 3.33679 & 2.47393 & 1.92983 & 1.49910\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.0$}\\\\\\hline\n\t\t$\\Gamma$ & 169.071 & 105.975 & 63.1038 & 37.6027 & 22.4047 & 13.3361 & 7.94729 & 4.73129 & 2.81940 & 1.68034 & 0.99956 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 51.5786 & 32.7335 & 19.8556 & 12.1451 & 7.50279 & 4.68984 & 2.97702 & 1.91799 & 1.25426 & 0.82932 & 0.55059 & 0.31770\\\\\n\t\t$p$ \t& 85.4036 & 54.2492 & 33.0215 & 20.3527 & 12.7618 & 8.19406 & 5.44279 & 3.76791 & 2.74103 & 2.10336 & 1.70075 & 1.38135\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.2$}\\\\\\hline\n\t\t$\\Gamma$ & 191.126 & 118.398 & 69.6429 & 40.9597 & 24.1083 & 14.1893 & 8.34919 & 4.90490 & 2.88868 & 1.70019 & 0.99984 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 37.5852 & 23.6918 & 14.3026 & 8.72609 & 5.39936 & 3.39637 & 2.17547 & 1.41736 & 0.93933 & 0.62908 & 0.42281 & 0.24960\\\\\n\t\t$p$ \t& 67.9344 & 42.8619 & 25.9838 & 16.0024 & 10.0874 & 6.56025 & 4.44041 & 3.15023 & 2.36021 & 1.86635 & 1.55301 & 1.30594\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.4$}\\\\\\hline\n\t\t$\\Gamma$ & 220.172 & 134.441 & 77.9949 & 45.2452 
& 26.2578 & 15.2219 & 8.83634 & 5.12702 & 2.97137 & 1.72440 & 1.00140 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 28.5555 & 17.8503 & 10.7244 & 6.53392 & 4.05300 & 2.56405 & 1.65932 & 1.09552 & 0.73364 & 0.49726 & 0.33718 & 0.20253\\\\\n\t\t$p$ \t& 56.0915 & 35.0963 & 21.1892 & 13.0574 & 8.28303 & 5.45288 & 3.76392 & 2.73780 & 2.10241 & 1.70540 & 1.45171 & 1.25396\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.6$}\\\\\\hline\n\t\t$\\Gamma$ & 258.433 & 155.296 & 88.6297 & 50.6106 & 28.9099 & 16.4928 & 9.41249 & 5.37870 & 3.07317 & 1.75217 & 0.99889 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 22.4535 & 13.9136 & 8.31218 & 5.05719 & 3.14728 & 2.00498 & 1.30903 & 0.87473 & 0.59391 & 0.40446 & 0.27520 & 0.16486\\\\\n\t\t$p$ & 47.7294 & 29.6021 & 17.7849 & 10.9674 & 7.00739 & 4.67522 & 3.28559 & 2.44432 & 1.92230 & 1.58965 & 1.37647 & 1.15781\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.8$}\\\\\\hline\n\t\t$\\Gamma$ & 308.935 & 182.395 & 102.261 & 57.3435 & 32.1483 & 18.0355 & 10.1029 & 5.67241 & 3.17978 & 1.78359 & 0.99997 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 18.1745 & 11.1626 & 6.63304 & 4.02868 & 2.51560 & 1.61389 & 1.06328 & 0.71747 & 0.49051 & 0.33739 & 0.23058 & 0.14359\\\\\n\t\t$p$ \t& 41.6428 & 25.5932 & 15.3055 & 9.44338 & 6.07675 & 4.10949 & 2.93845 & 2.22906 & 1.78546 & 1.50402 & 1.32125 & 1.18748\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.0$}\\\\\\hline\n\t\t$\\Gamma$ & 375.818 & 217.422 & 119.600 & 65.7745 & 36.1611 & 19.8980 & 10.9232 & 6.01199 & 3.30681 & 1.81767 & 1.00051 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 15.0964 & 9.17319 & 5.42177 & 3.29200 & 2.06139 & 1.33276 & 0.88426 & 0.60261 & 0.41513 & 0.28650 & 0.19651 & 0.12379\\\\\n\t\t$p$ \t\t\t\t\t& 37.1333 & 22.5775 & 13.4413 & 8.30684 & 5.38337 & 3.68921 & 2.67835 & 2.06727 & 1.68347 & 1.43752 & 1.27850 & 1.16494\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.2$}\\\\\\hline\n\t\t$\\Gamma$ & 463.975 & 262.948 & 141.568 & 76.2338 & 41.0173 & 22.0958 & 11.9035 & 6.41082 
& 3.45056 & 1.85303 & 1.00113 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 12.7875 & 7.69994 & 4.52708 & 2.74830 & 1.72461 & 1.12217 & 0.75368 & 0.51642 & 0.35777 & 0.24734 & 0.17009 & 0.10850\\\\\n\t\t$p$ \t\t\t\t\t& 33.6575 & 20.2710 & 12.0118 & 7.43585 & 4.85060 & 3.36425 & 2.48426 & 1.94445 & 1.60450 & 1.38520 & 1.24473 & 1.14408\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.4$}\\\\\\hline\n\t\t$\\Gamma$ & 578.968 & 320.871 & 168.949 & 89.0382 & 46.8778 & 24.7092 & 12.9953 & 6.85634 & 3.60307 & 1.89919 & 0.99952 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 10.9709 & 6.56430 & 3.83850 & 2.33031 & 1.47100 & 0.96365 & 0.65089 & 0.44974 & 0.31141 & 0.21697 & 0.14862 & 0.09589\\\\\n\t\t$p$ & 30.8215 & 18.4175 & 10.8648 & 6.74135 & 4.43655 & 3.11369 & 2.32748 & 1.84722 & 1.53931 & 1.34446 & 1.21673 & 1.12942\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.6$}\\\\\\hline\n\t\t$\\Gamma$ & 723.656 & 392.384 & 202.051 & 104.080 & 53.5742 & 27.6270 & 14.2191 & 7.32182 & 3.76653 & 1.93971 & 1.00200 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 9.50055 & 5.63818 & 3.28596 & 1.99866 & 1.26783 & 0.83500 & 0.56905 & 0.39442 & 0.27600 & 0.19145 & 0.13130 & 0.08576\\\\\n\t\t$p$ \t& 28.3633 & 16.8096 & 9.89231 & 6.16190 & 4.09049 & 2.90245 & 2.19936 & 1.76426 & 1.48858 & 1.30961 & 1.19408 & 1.11954\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.8$}\\\\\\hline\n\t\t$\\Gamma$ & 893.746 & 474.549 & 239.143 & 120.685 & 60.8483 & 30.6642 & 15.4796 & 7.80951 & 3.93161 & 1.98042 & 1.00296 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 8.19448 & 4.82859 & 2.81518 & 1.71951 & 1.09985 & 0.73051 & 0.50093 & 0.35038 & 0.24489 & 0.17117 & 0.11700 & 0.07671\\\\\n\t\t$p$ \t& 25.9004 & 15.2521 & 8.98792 & 5.63831 & 3.78782 & 2.72194 & 2.08856 & 1.69631 & 1.44344 & 1.28133 & 1.17497 & 1.10201\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=3.0$}\\\\\\hline\n\t\t$\\Gamma$ & 1071.02 & 558.495 & 276.444 & 136.953 & 67.7922 & 33.5897 & 16.6383 & 8.22716 & 4.07874 & 2.02013 & 0.99949 & 0.5 
\\\\\n\t\t$u_{\\mathrm{ex}}$ & 6.93189 & 4.07091 & 2.38838 & 1.47193 & 0.95056 & 0.64023 & 0.44340 & 0.31146 & 0.21994 & 0.15395 & 0.10494 & 0.06958\\\\\n\t\t$p$ \t& 23.1181 & 13.5906 & 8.07317 & 5.12679 & 3.49444 & 2.55590 & 1.98879 & 1.63334 & 1.40554 & 1.25677 & 1.15868 & 1.09682\\\\\\hline\\hline\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Values of the anharmonic correction coefficient $\\beta$ for different screening parameter $\\kappa$.}\n\t\\label{Table2}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c c c c c c}\n\t\t$\\kappa$ & 0.0 & 0.2 & 0.3 & 0.4 & 0.6 & 0.8 & 1.0 & 1.2 & 1.4 & 1.6 & 1.8 & 2.0 & 2.2 & 2.4 & 2.6 & 2.8 & 3.0 \\\\\\hline\n\t\t$\\beta(\\kappa)$\t& 3.01 & 9.23 & 12.38 & 14.30 & 10.53 & 9.71 & 9.35 & 9.28 & 9.14 & 9.08 & 8.97 & 8.855 & 8.68 & 8.71 & 8.46 & 8.47 & 8.51\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\mathrm{ex}}$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table3}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 1595.62 & 798.828 & 532.689 & 399.681 & 319.981 & 266.796 & 228.880 & 200.332 & 178.283 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1217.36 & 609.282 & 406.628 & 305.117 & 244.469 & 203.938 & 174.914 & 153.188 & 136.267 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 773.025 & 387.104 & 258.328 & 194.074 & 155.484 & 129.733 & 111.343 & 97.5607 & 86.8364 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 529.643 & 265.306 & 177.235 & 133.215 & 106.726 & 89.1490 & 76.5169 & 67.1314 & 59.7831 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 382.522 & 191.740 & 128.152 & 96.3972 & 77.2970 & 
64.6022 & 55.5318 & 48.7317 & 43.4438 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 287.408 & 144.232 & 96.4804 & 72.5942 & 58.2862 & 48.7586 & 41.9386 & 36.8484 & 32.8838 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 223.185 & 112.096 & 75.0671 & 56.5515 & 45.4466 & 38.0606 & 32.7681 & 28.8120 & 25.7391 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 178.133 & 89.6228 & 60.0889 & 45.3116 & 36.4631 & 30.5563 & 26.3521 & 23.1896 & 20.7451 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 145.774 & 73.3800 & 49.2712 & 37.2003 & 29.9641 & 25.1447 & 21.7011 & 19.1314 & 17.1275 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 121.609 & 61.3067 & 41.2021 & 31.1620 & 25.1352 & 21.1177 & 18.2517 & 16.1113 & 14.4385 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 102.908 & 51.9465 & 34.9672 & 26.4819 & 21.3920 & 17.9999 & 15.5706 & 13.7650 & 12.3602 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 87.4157 & 44.2324 & 29.8212 & 22.6181 & 18.2990 & 15.4212 & 13.3710 & 11.8300 & 10.6351 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 73.5771 & 37.3025 & 25.2028 & 19.1490 & 15.5271 & 13.1108 & 11.3865 & 10.0997 & 9.10597 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 60.2002 & 30.6118 & 20.7457 & 15.8118 & 12.8497 & 10.8840 & 9.47465 & 8.43053 & 7.65187 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced pressure (compressibility) $p$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table4}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 2080.63 & 1041.70 & 694.789 & 521.370 & 417.454 & 348.100 & 298.669 & 261.442 & 232.679 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1669.06 & 835.485 & 557.680 & 418.523 & 335.380 & 279.814 & 240.022 & 210.233 & 187.022 
\\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 1168.03 & 585.024 & 390.480 & 293.406 & 235.104 & 196.197 & 168.410 & 147.583 & 131.370 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 878.208 & 440.005 & 294.000 & 221.023 & 177.106 & 147.964 & 127.016 & 111.450 & 99.2542 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 693.046 & 347.470 & 232.288 & 174.765 & 140.162 & 117.162 & 100.726 & 88.4011 & 78.8053 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 566.555 & 284.386 & 190.275 & 143.196 & 114.994 & 96.2113 & 82.7636 & 72.7234 & 64.8975 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 476.692 & 239.477 & 160.406 & 120.865 & 97.1465 & 81.3696 & 70.0608 & 61.6053 & 55.0288 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 410.580 & 206.621 & 138.561 & 104.505 & 84.1086 & 70.4915 & 60.7970 & 53.5005 & 47.8555 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 361.191 & 181.859 & 122.134 & 92.2267 & 74.2973 & 62.3524 & 53.8144 & 47.4405 & 42.4641 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 322.729 & 162.732 & 109.386 & 82.7430 & 66.7485 & 56.0825 & 48.4703 & 42.7821 & 38.3327 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 291.498 & 147.173 & 99.0847 & 75.0489 & 60.6307 & 51.0175 & 44.1300 & 39.0087 & 35.0158 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 263.437 & 133.325 & 89.9002 & 68.1935 & 55.1747 & 46.4976 & 40.3128 & 35.6615 & 32.0486 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 235.188 & 119.260 & 80.5872 & 61.2342 & 49.6540 & 41.9257 & 36.4074 & 32.2829 & 29.0897 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 203.533 & 103.516 & 70.1601 & 53.4777 & 43.4588 & 36.8063 & 32.0351 & 28.4887 & 25.8081 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\\end{widetext}\n\n\\bibliography{Ref-2D-Yukawa} \n\n\\end{document}\n", "Descriptive_question1": "What is the reduced excess energy value for kappa = 0.5 and Gamma_m/Gamma = 0.1 in table_4?", "Descriptive_question2": "What is the reduced excess energy value for kappa = 3.0 and Gamma_m/Gamma = 0.9 in table_4?", "Reasoning_question1": "How does the reduced excess energy change with increasing screening parameter kappa for a fixed Gamma_m/Gamma 
value in table_4?", "Reasoning_question2": "What trend can be observed in the reduced excess energy as Gamma_m/Gamma increases for a specific kappa value in table_4?", "Descriptive_answer1": "1595.62", "Descriptive_answer2": "7.65187", "Reasoning_answer1": "To analyze the change in reduced excess energy (u_ex) with increasing screening parameter kappa for a fixed Gamma_m/Gamma value, I observe the data in Table 3 (referred to as table_4 in the question). Taking Gamma_m/Gamma = 0.1 as an example, when kappa increases from 0.5 to 3.0, u_ex decreases significantly from 1595.62 to 60.2002. This pattern holds for other Gamma_m/Gamma values as well; for instance, at Gamma_m/Gamma = 0.5, u_ex decreases from 319.981 at kappa = 0.5 to 12.8497 at kappa = 3.0. The trend is consistent across the table, indicating that as kappa increases, the interaction strength weakens due to stronger screening, leading to a reduction in the excess energy. Therefore, the reduced excess energy decreases with increasing kappa for a fixed Gamma_m/Gamma value, reflecting the diminishing influence of particle interactions at higher screening levels.", "Reasoning_answer2": "To identify the trend in reduced excess energy (u_ex) as Gamma_m/Gamma increases for a specific kappa value, I examine the data in Table 3 (referred to as table_4 in the question). Using kappa = 1.0 as an example, as Gamma_m/Gamma increases from 0.1 to 0.9, u_ex decreases from 529.643 to 59.7831. This decrease is observed across all kappa values; for instance, at kappa = 2.0, u_ex decreases from 145.774 at Gamma_m/Gamma = 0.1 to 17.1275 at Gamma_m/Gamma = 0.9. The pattern suggests that increasing Gamma_m/Gamma implies a decrease in the coupling strength (since Gamma_m is fixed for a given kappa, a larger ratio corresponds to a smaller Gamma), which results in lower excess energy due to reduced interaction effects. 
Thus, the trend is that the reduced excess energy decreases as Gamma_m/Gamma increases for a specific kappa value, indicating weaker coupling effects at higher ratios." }, { "paper_id": "1704.00976.json", "table_id": "table_5", "table_content": "\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced pressure (compressibility) $p$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table4}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 2080.63 & 1041.70 & 694.789 & 521.370 & 417.454 & 348.100 & 298.669 & 261.442 & 232.679 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1669.06 & 835.485 & 557.680 & 418.523 & 335.380 & 279.814 & 240.022 & 210.233 & 187.022 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 1168.03 & 585.024 & 390.480 & 293.406 & 235.104 & 196.197 & 168.410 & 147.583 & 131.370 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 878.208 & 440.005 & 294.000 & 221.023 & 177.106 & 147.964 & 127.016 & 111.450 & 99.2542 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 693.046 & 347.470 & 232.288 & 174.765 & 140.162 & 117.162 & 100.726 & 88.4011 & 78.8053 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 566.555 & 284.386 & 190.275 & 143.196 & 114.994 & 96.2113 & 82.7636 & 72.7234 & 64.8975 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 476.692 & 239.477 & 160.406 & 120.865 & 97.1465 & 81.3696 & 70.0608 & 61.6053 & 55.0288 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 410.580 & 206.621 & 138.561 & 104.505 & 84.1086 & 70.4915 & 60.7970 & 53.5005 & 47.8555 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 361.191 & 181.859 & 122.134 & 92.2267 & 74.2973 & 62.3524 & 53.8144 & 47.4405 & 42.4641 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 322.729 & 162.732 & 109.386 & 82.7430 & 66.7485 & 56.0825 & 48.4703 & 
42.7821 & 38.3327 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 291.498 & 147.173 & 99.0847 & 75.0489 & 60.6307 & 51.0175 & 44.1300 & 39.0087 & 35.0158 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 263.437 & 133.325 & 89.9002 & 68.1935 & 55.1747 & 46.4976 & 40.3128 & 35.6615 & 32.0486 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 235.188 & 119.260 & 80.5872 & 61.2342 & 49.6540 & 41.9257 & 36.4074 & 32.2829 & 29.0897 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 203.533 & 103.516 & 70.1601 & 53.4777 & 43.4588 & 36.8063 & 32.0351 & 28.4887 & 25.8081 \\\\\n\t\t\\end{tabular}\n\\end{table}", "caption": "Reduced pressure (compressibility) $p$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.", "label": "Table4", "section_info": "2 Methods\n\\section{Methods}\n\n\n\\subsection{System description}\n\\label{SD}\n\nWe investigate a classical system of point-like particles in the 2D geometry interacting via the pairwise repulsive Yukawa potential of the form\n\\begin{equation*}\n\\varphi (r) = \\frac{\\varepsilon \\lambda}{r}\\exp\\left(-\\frac{r}{\\lambda}\\right),\n\\end{equation*}\nwhere $\\varepsilon$, and $\\lambda$ are the energy and (screening) length scales of the interaction. For charged particles immersed in a plasma-like screening environment, the energy scale is $\\varepsilon=Q^2/4\\pi\\epsilon_0\\lambda$ (in SI units), where $Q$ is the charge and $\\epsilon_0$ is the permittivity of free space. The properties of Yukawa systems are determined by the two dimensionless parameters. The first is the coupling parameter, $\\Gamma = (Q^2/4\\pi \\epsilon_0 a k_{\\mathrm{B}}T)$, where $k_{\\mathrm{B}}$ is the Boltzmann constant, $T$ is the temperature, $a=(\\pi n)^{-1/2}$ is the 2D Wigner-Seitz radius, and $n=N/V$ is the areal density of $N$ particles occupying the 2D volume $V$. The second is the screening parameter, $\\kappa = a/\\lambda$. 
Note that the coupling parameter is roughly the ratio of the potential energy of interaction between two neighbouring particles to their kinetic energy. The system is usually said to be in the strongly coupled state when this ratio is large, that is $\\Gamma\\gtrsim 1$.\n\nWhen the coupling increases, the system forms a strongly coupled fluid phase, which can crystallize upon further increase in $\\Gamma$. This fluid-solid transition can be characterized by the temperature and/or coupling parameter, $T_{\\rm m}$ and $\\Gamma_{\\rm m}$, where the subscript ``m'' refers to melting. Both $T_{\\rm m}$ and $\\Gamma_{\\rm m}$ are functions of the screening parameter $\\kappa$. The dependence $\\Gamma_{\\rm m}(\\kappa)$ has been approximated in Ref.~\\onlinecite{PhysRevE.72.026409} by the following fit:\n\\begin{equation}\\label{Melting2D}\n\\Gamma_{\\rm m}(\\kappa)\\simeq \\frac{131}{1-0.388\\kappa^2+0.138\\kappa^3-0.0138\\kappa^4}.\n\\end{equation}\nThis fit describes relatively well the melting points found from the bond angular correlation analysis (see Fig.~6 of Ref.~\\onlinecite{PhysRevE.72.026409}) up to $\\kappa = 3.0$ and should not be applied for larger $\\kappa$. In the limit $\\kappa = 0$ the system reduces to the 2D one-component-plasma (OCP) with the Coulomb interaction. In this case $\\Gamma_{\\rm m}\\simeq 131$ lies in the range predicted in earlier numerical simulations~\\cite{Gann1979} and obtained in experiments with a classical 2D sheet of electrons~\\cite{Grimes1979} (see also Ref.~\\onlinecite{KhrapakCPP2016} for a recent overview of OCP thermodynamics in 2D and 3D).\n\nFinally, it is worth commenting on the nature of the fluid-solid phase transition in 2D Yukawa systems. 
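The melting-line fit quoted above is straightforward to evaluate; a minimal sketch (not the authors' code; the function name is ours):

```python
def gamma_m(kappa: float) -> float:
    """Melting coupling parameter Gamma_m(kappa) of the 2D Yukawa system,
    from the fit of Eq. (Melting2D); applicable only for kappa <= 3.0."""
    if kappa > 3.0:
        raise ValueError("fit should not be applied for kappa > 3.0")
    return 131.0 / (1.0 - 0.388 * kappa**2 + 0.138 * kappa**3 - 0.0138 * kappa**4)

# In the OCP limit kappa -> 0 the fit recovers Gamma_m ~= 131.
print(gamma_m(0.0))  # 131.0
print(gamma_m(1.0))  # ~178: screening raises the coupling needed to freeze
```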
Recently, it has been demonstrated that the potential softness is a very important factor determining the melting scenario.~\\cite{KapferPRL2015}\nFor sufficiently steep repulsive interactions the hard-disk melting scenario holds: a first-order liquid-hexatic and a continuous\nhexatic-solid transition can be identified.~\\cite{PhysRevLett.107.155704, PhysRevE.87.042134} For softer interactions the liquid-hexatic transition is continuous, with correlations consistent with the Kosterlitz-Thouless-Halperin-Nelson-Young (KTHNY) scenario. (For example, in 2D colloidal systems, the hexatic phase was observed in the experiment by Zahn et al.~\\cite{PhysRevLett.82.2721}) For the Yukawa potential the transition between these two scenarios occurs at about $\\kappa\\simeq 6$.~\\cite{KapferPRL2015} Below we consider systems with $\\kappa$ in the range from $0.5$ to $3.0$ (this range is particularly relevant to 2D plasma crystals and fluids in laboratory experiments~\\cite{FortovUFN2004,FortovPR2005},\\cite{CTPP:CTPP201400099}), thus belonging to the soft interaction class. In this range of $\\kappa$, the hexatic phase occupies a rather narrow region on the phase diagram,~\\cite{KapferPRL2015} and the study of its properties is beyond the scope of the present investigation.\n\n\\subsection{Computational details}\n\\label{MDdetails}\n\nTo obtain the thermodynamic properties of the 2D Yukawa systems across the coupling regime, extensive MD simulations have been performed. The MD simulations have been done in the $NVT$ ensemble at different temperatures using $N=64 000$ particles and the Langevin thermostat. The numerical time step was chosen as $\\Delta t_c=5\\times 10^{-4}\\sqrt{m\\lambda^2/\\epsilon}$ for the crystalline phase and $\\Delta t_c \\sqrt{\\Gamma/\\Gamma_{\\rm m}}$ for the fluid phase. 
The simulations were run for $1.5\\times 10^6$ time steps to equilibrate the system and obtain the equilibrium properties. In the simulation run with $\\kappa = 0.5$ Ewald summation was implemented.\n\nThe simulations have been performed for a number of screening parameters $\\kappa$ ranging from $0.5$ to $3.0$. This corresponds to sufficiently soft interactions as discussed above. For each value of the screening parameter $\\kappa$, twelve simulation runs correspond to the fluid phase and nine runs to the crystalline phase. In the fluid phase the coupling parameter ranges from $\\Gamma=0.5$ to $\\simeq 0.95\\Gamma_{\\rm m}$. In the solid phase the values corresponding to $\\Gamma_{\\rm m}/\\Gamma=0.9,0.8,...,0.1$ are taken.\n\nThe main simulation results are summarized in Tables~\\ref{Table1}-\\ref{Table4} of the Appendix.\n\n\n\n\\subsection{Thermodynamic definitions and relations}\\label{Thermo}\n\nThe main thermodynamic quantities which will be required below are the internal energy $U$, Helmholtz free energy $F$, and pressure $P$ of the system. 
The following thermodynamic definitions exist~\\cite{LL}\n\\begin{eqnarray}\nU=-T^2\\left(\\frac{\\partial}{\\partial T}\\frac{F}{T}\\right)_V, \\\\\nP=-\\left(\\frac{\\partial F}{\\partial V}\\right)_T.\n\\end{eqnarray}\nIn addition, $U$ and $P$ can be calculated using the integral equations of state~\\cite{hansen-book, frenkel2001}\n\\begin{equation}\n\\begin{split}\n& U= N\\left(k_{\\rm B}T+ n\\int{d\\mathbf{r}\\; \\varphi(r)g(\\mathbf{r})}\\right),\\\\\n& PV = N\\left(k_{\\rm B}T - \\frac{n}{4}\\int{d\\mathbf{r}\\; r\\varphi'(r)g(\\mathbf{r})} \\right),\n\\end{split}\n\\end{equation}\nwhere $g(\\mathbf{r})$ denotes the radial distribution function, which is isotropic in gas and fluid phases and anisotropic in the crystalline phase.\n\nWe will use conventional reduced units: $u=U/Nk_{\\rm B}T$, $f=F/Nk_{\\rm B}T$, and $p=PV/Nk_{\\rm B}T$ and divide the thermodynamic quantities into the kinetic (ideal gas) and potential (excess) components, so that $u=1 + u_{\\rm ex}$ (in 2D), $f=f_{\\rm id}+f_{\\rm ex}$, and $p=1+p_{\\rm ex}$. Finally, it is useful to operate with the Yukawa system state variables $\\Gamma$ and $\\kappa$. In these variables the thermodynamic identities for 2D Yukawa fluids are~\\cite{KhrapakPoP08_2015, 1.4935846}\n\\begin{equation}\\label{pf}\np=1+\\frac{\\Gamma}{2}\\frac{\\partial f_{\\rm ex}}{\\partial\\Gamma}-\\frac{\\kappa}{2}\\frac{\\partial f_{\\rm ex}}{\\partial\\kappa}, \\qquad\nf_{\\rm ex} = \\int_0^{\\Gamma}{d\\Gamma'\\; \\frac{u_{\\mathrm{ex}}(\\kappa, \\Gamma')}{\\Gamma'}}.\n\\end{equation}\n\n\\subsection{The shortest-graph method}\n\nTo describe the thermodynamics of 2D Yukawa crystals analytically,\nwe employ the shortest-graph method, proposed and developed in Refs.~\\onlinecite{1.4869863, 1.4926945, 0953-8984-28-23-235401}.\nFollowing these papers, thermodynamical properties of classical crystals can be obtained very accurately from the following consideration. 
The anisotropic pair-correlation function $g(\\mathbf{r})$ of a crystal is written in the form\n\\begin{equation}\n\\label{Eq1}\ng(\\mathbf{r}) = \\frac{1}{n}\\sum_\\alpha{p_\\alpha(\\mathbf{r}-\\mathbf{r_\\alpha})},\n\\end{equation}\nwhere the summation is over all the nodes $\\alpha$, and\neach individual peak has the shape\n\\begin{equation}\n\\label{Eq2}\n\\begin{split}\n&p_\\alpha(\\mathbf{r}) \\propto\n \\exp\\left[-\\frac{\\varphi(\\mathbf{r}+\\mathbf{r_\\alpha})}{k_{\\rm B}T}-b_\\alpha(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})-\n\\right. \\\\\n& \\qquad\\qquad \\qquad \\qquad \\left.\n-\\frac{(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2}{2 a_{\\|\\alpha}^2}-\n\\frac{\\mathbf{r}^2-(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2}{2 a_{\\perp\\alpha}^2}\\right].\n\\end{split}\n\\end{equation}\nThe normalization constant as well as the parameters $a_{\\|,\\perp\\alpha}^2, b_\\alpha$ are defined by the following conditions\\cite{0953-8984-28-23-235401}\n\\begin{equation}\n\\label{Eq3}\n\\begin{split}\n& \\int{d\\mathbf{r}\\;p_\\alpha(\\mathbf{r})}=1, \\qquad \\int{d\\mathbf{r}\\;\\mathbf{r}p_\\alpha(\\mathbf{r})}=0, \\\\\n& \\int{d\\mathbf{r}\\;(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2 p_\\alpha(\\mathbf{r})}=\\sigma_{\\|\\alpha}^2,\\\\\n& \\int{d\\mathbf{r}\\;[\\mathbf{r}^2-(\\mathbf{e_\\alpha}\\cdot\\mathbf{r})^2] p_\\alpha(\\mathbf{r})}=(D-1) \\sigma_{\\perp\\alpha}^2,\n\\end{split}\n\\end{equation}\nwhere $D=2$ is the spatial dimensionality and $\\mathbf{e_\\alpha}=\\mathbf{r_\\alpha}/r_\\alpha$ is the unit vector in the\ndirection of $\\mathbf{r_\\alpha}$,\n$\\sigma_{\\|,\\perp}^2$ is the mean squared displacement for longitudinal and transversal directions, respectively, calculated using the finite-temperature phonon spectra,\ntaking into account the anharmonic effects.\\cite{0953-8984-28-23-235401} By using the pair correlation function $g(\\mathbf{r})$ the excess energy and pressure can then be obtained.\nHowever, calculation of the finite-temperature phonon spectra is 
a difficult problem, which is beyond the scope of the present paper.\nTherefore, we propose here a simpler approach, which yields very accurate results and can be used for practical calculations.\n\nDue to the anharmonicity of phonon spectra at finite temperatures,\nthe second-order term becomes more significant in the temperature expansion of the mean-squared displacements $\\sigma^2$.\nTo account for this effect, we propose the anharmonic correction of the mean-squared displacements\n\\begin{equation}\n\\label{Eq5}\n\\sigma_{\\|,\\perp\\alpha}^2 = \\widetilde{\\sigma}_{\\|,\\perp\\alpha}^2 \\left[1+\\beta(\\kappa)N\\widetilde{\\sigma}_{1}^2/V\\right],\n\\end{equation}\nwhere the tildes denote the mean-squared displacement calculated using zero-temperature phonon spectra (see Ref.\\onlinecite{1.4926945}),\n$\\widetilde{\\sigma}_1^2$ is the total mean-squared displacement for the nearest neighbours, and we have introduced the anharmonic correction coefficient $\\beta(\\kappa)$, which does not depend on the temperature and should be found using MD simulations for different screening parameters.\nThe correction given by Eq.\\eqref{Eq5} preserves the ratio $\\sigma_\\|^2/\\sigma_\\perp^2$ between the mean-squared displacements in the longitudinal and transversal directions.\n\\emph{A posteriori} comparison with MD results proves that this assumption allows one to obtain excellent accuracy.\n\n3 Results\n\\section{Results}\n\n\\subsection{Weakly-coupled fluids}\n\n\\begin{figure}[!b]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R1.eps}\\\\\n \\caption{The excess energy $u_{\\rm ex}$ of 2D Yukawa weakly coupled fluids versus the screening parameter $\\kappa$ at a fixed coupling parameter $\\Gamma = 0.5$. The symbols correspond to the results of MD simulations; the solid curve is plotted using the analytical expression of Eq.~(\\ref{SVC}).\n }\n\\label{FigSC}\n\\end{figure}\n\nA simple and physically transparent approach to the thermodynamics of weakly coupled Yukawa systems for small deviations from the ideal gas behavior is to calculate the second virial coefficient. This has recently been shown to work well in 3D Yukawa systems.~\\cite{KhrapakPPCF2016} In the 2D geometry the excess free energy is expressed in this approximation as\n\\begin{equation}\\label{SVC}\nf_{\\rm ex}\\simeq \\pi n \\int\\left[1-e^{-\\varphi(r)/k_{\\rm B}T}\\right]r dr.\n\\end{equation}\nThe excess energy and pressure can be readily obtained from the excess free energy. 
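In reduced units ($x=r/a$, $\pi n a^2=1$, $\varphi/k_{\rm B}T=\Gamma e^{-\kappa x}/x$), Eq.~(SVC) becomes $f_{\rm ex}\simeq\int_0^\infty[1-\exp(-\Gamma e^{-\kappa x}/x)]\,x\,dx$; a hedged numerical sketch of this rewriting (not the authors' code):

```python
import math

def f_ex_virial(gamma: float, kappa: float, xmax: float = 50.0, n: int = 100000) -> float:
    """Second-virial excess free energy of the 2D Yukawa gas, Eq. (SVC),
    in reduced units (x = r/a, pi*n*a**2 = 1):
        f_ex ~ int_0^inf [1 - exp(-Gamma*exp(-kappa*x)/x)] x dx.
    Trapezoidal rule on (0, xmax]; the integrand vanishes at x = 0."""
    h = xmax / n
    total = 0.0
    for i in range(1, n + 1):
        x = i * h
        w = 0.5 if i == n else 1.0
        total += w * (1.0 - math.exp(-gamma * math.exp(-kappa * x) / x)) * x
    return total * h

# Weak-screening point of the regime shown in Fig. FigSC (Gamma = 0.5):
print(f_ex_virial(0.5, 1.0))
```

In the weak-coupling limit the integrand linearizes and $f_{\rm ex}\to\Gamma/\kappa$, which is a convenient sanity check on the quadrature.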
We compare the values of $u_{\\rm ex}$ at a fixed coupling parameter $\\Gamma=0.5$ obtained from Eq.~(\\ref{SVC}) and computed using MD simulations in Fig.~\\ref{FigSC}. The agreement is satisfactory: in the range of $\\kappa$ investigated the deviations are within several percent. The agreement naturally improves with increasing $\\kappa$, because at a fixed $\\Gamma$ the actual interaction strength weakens as $\\kappa$ increases.\n\n\n\\subsection{Strongly-coupled fluids}\n\nThe excess energy and pressure of the 2D Yukawa fluids have been determined using MD simulations in a wide range of coupling and screening parameters. The results are summarized in Table~\\ref{Table1} of the Appendix. Here we describe simple analytical\napproximations, which can be used to evaluate the energy and pressure for practical purposes.\n\nIn the strongly coupled fluid regime it is helpful to divide the thermodynamic quantities, such as energy and pressure, into static and thermal contributions. The static contribution corresponds to the value of the internal energy when the particles are frozen in some regular configuration, and the thermal corrections arise due to the deviations of the particles from these fixed positions (due to thermal motion). Of course, such a division is only meaningful when the regular structure is specified. For crystals, the obvious choice is a corresponding lattice sum (Madelung energy). 
For fluids this choice is also meaningful and we use it here (note that in 3D Yukawa systems a slightly different definition of the static fluid energy is traditionally employed~\\cite{KhrapakPPCF2016, KhrapakISM}).\n\n\\begin{table}[!b]\n\\caption{\\label{TabM} Madelung constants of the 2D Yukawa crystals (triangular lattice) for various screening parameters in the range $0.5 \\leq \\kappa\\leq 3.0$ }\n\\begin{ruledtabular}\n\\begin{tabular}{cccc}\n$\\kappa$ & $M$ & $\\kappa$ & $M$ \\\\ \\hline\n0.5 & 1.11914 & 1.8 & 0.05449 \\\\\n0.6 & 0.82503 & 2.0 & 0.03660 \\\\\n0.8 & 0.48127 & 2.2 & 0.02470 \\\\\n1.0 & 0.29709 & 2.4 & 0.01672 \\\\\n1.2 & 0.18960 & 2.6 & 0.01135 \\\\\n1.4 & 0.12357 & 2.8 & 0.00772 \\\\\n1.6 & 0.08167 & 3.0 & 0.00525 \\\\ \n\\end{tabular}\n\\end{ruledtabular}\n\\end{table}\n\nThe excess internal energy is thus a sum of the static and thermal contributions,\n\\begin{equation}\nu_{\\rm ex} = u_{\\rm st} + u_{\\rm th},\n\\end{equation}\nwhere $u_{\\rm st} = M\\Gamma$ and $M$ is the Madelung constant.\nThe values of the Madelung constant for 2D Yukawa systems in the regime of relatively weak screening, $0.5 \\leq \\kappa\\leq 3.0$, are tabulated in Table~\\ref{TabM}. The dependence $M(\\kappa)$ can be fitted using a functional form similar to that proposed by Totsuji~\\emph{et al.}\\cite{PhysRevE.70.016405}\n\\begin{equation}\n\\label{Eq6}\nM = -1.1061+0.5038\\kappa-0.11053\\kappa^2+0.00968\\kappa^3+1/\\kappa.\n\\end{equation}\nThe last term in (\\ref{Eq6}) accounts for the absence of a neutralizing background in our case (the background is present in Ref.~\\onlinecite{PhysRevE.70.016405}), the energy of this background being simply $-\\Gamma/\\kappa$. The fit is chosen in such a way that when $\\kappa\\rightarrow 0$ and the neutralizing background is introduced, the Madelung constant reduces to the well-known value of the triangular lattice sum of the 2D one-component-plasma (OCP) with Coulomb interactions, $M_{\\rm OCP}\\simeq -1.1061$. 
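The fit of Eq.~(6) can be spot-checked directly against the tabulated lattice sums; a minimal sketch (the function name is ours):

```python
def madelung_fit(kappa: float) -> float:
    """Madelung constant M(kappa) of the 2D triangular Yukawa lattice, Eq. (6);
    the 1/kappa term reflects the absence of a neutralizing background."""
    return (-1.1061 + 0.5038 * kappa - 0.11053 * kappa ** 2
            + 0.00968 * kappa ** 3 + 1.0 / kappa)

# Spot checks against the tabulated lattice sums of Table TabM:
for k, m_tab in [(0.5, 1.11914), (1.0, 0.29709), (3.0, 0.00525)]:
    print(k, madelung_fit(k), m_tab)
```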
This fit is accurate to within a tiny fraction of a percent for $\\kappa\\lesssim 1.0$ and to within $\\sim 1\\%$ when screening becomes stronger ($\\kappa\\sim 3$).\n\nThe thermal part of the excess energy is expected to exhibit a quasi-universal scaling with respect to the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. This is a general property of classical particle systems with sufficiently soft interactions, which was first pointed out by Rosenfeld and Tarazona (RT scaling) for 3D systems.~\\cite{RT1,RT2} In the context of 3D Yukawa systems, the RT scaling has been proven to be very useful in Refs.~\\onlinecite{1.4921223,KhrapakPPCF2016,KhrapakPRE2015,KhrapakPRE03_2015} The emergence of an RT scaling analogue for 2D systems has been discussed in the context of OCP with Coulomb and logarithmic interactions, Yukawa systems near the OCP limit, and inverse-power-law interactions.~\\cite{KhrapakCPP2016,KhrapakPoP08_2015} The dependence of $u_{\\rm th}$ on $\\Gamma/\\Gamma_{\\rm m}$ in the strongly coupled regime is displayed in Fig.~\\ref{FigR1}. The quasi-universality is well pronounced, although there is clearly a systematic tendency for $u_{\\rm th}$ to decrease with $\\kappa$ at the same value of $\\Gamma/\\Gamma_{\\rm m}$. This tendency is expected when the potential steepness increases (see e.g. Fig.~4 from Ref.~\\onlinecite{KhrapakPoP08_2015}). Overall, the data points corresponding to the dependence $u_{\\rm th}(\\Gamma/\\Gamma_{\\rm m})$ are confined to a relatively narrow range. The important point is that towards softer interactions (sufficiently small $\\kappa$ in our case), the static component of the internal energy is dominant over the thermal one. For example, at $\\kappa=1$ the thermal component contributes only about $2\\%$ of the total excess energy near the fluid-solid phase transition. 
Therefore, even moderately accurate fits for $u_{\\rm th}$ allow one to obtain high accuracy with respect to the total excess energy $u_{\\rm ex}$.\n\nThree fits are shown in Fig.~\\ref{FigR1}. The upper (lower) curve corresponds to the data portion for $\\kappa=0.5$ ($\\kappa = 3.0$).\nThe intermediate curve has been obtained using the entire array of data points (corresponding to the parameter regime shown). It can be considered representative of strongly coupled 2D Yukawa fluids in the vicinity of the freezing transition.\nThe functional form of the fit is the same as used previously~\\cite{KhrapakPoP08_2015}\n\\begin{equation} \\label{Fit1}\nu_{\\rm th} =A \\ln (1+B\\Gamma/\\Gamma_{\\rm m}).\n\\end{equation}\nThe use of the coefficients $A=0.257$ and $B=195.4$ determined here would somewhat improve previous approximations.\n\nThe excess free energy can be routinely calculated using the model for the excess energy formulated above and the second of Eqs.~(\\ref{pf}). The resulting expression is rather simple,\n\\begin{equation}\\label{fex}\nf_{\\rm ex}=M(\\kappa)\\Gamma - A{\\rm Li}_2(-B\\Gamma/\\Gamma_{\\rm m}),\n\\end{equation}\nwhere ${\\rm Li}_2(z)=\\int_z^0 dt \\ln(1-t)/t$ is the dilogarithm. Note that in deriving Eq.~(\\ref{fex}), the thermodynamic integration over the coupling parameter from 0 to $\\Gamma$ has been performed, while Eq.~(\\ref{Fit1}) is strictly speaking not applicable at $\\Gamma\\ll 1$.\nThe correct procedure would be to start thermodynamic integration from some small but finite value $\\Gamma_0$, and then add the constant $f_{\\rm ex}(\\Gamma_0)$ evaluated using Eq.~(\\ref{SVC}). However, since the actual contribution from the weakly coupled regime is small, Eq.~(\\ref{fex}) remains rather accurate at strong coupling and we use it here.\n\nThe calculation of pressure from the excess free energy is straightforward, but rather cumbersome in the considered case. 
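Eq.~(fex) is easy to evaluate numerically; a hedged sketch (function names are ours; the dilogarithm is computed by direct quadrature of the definition given above, and the caller supplies $M$ and $\Gamma_{\rm m}$ for the chosen $\kappa$):

```python
import math

def li2(z: float, n: int = 20000) -> float:
    """Dilogarithm Li2(z) = -int_0^z ln(1-t)/t dt, trapezoidal rule
    (adequate for the negative arguments needed in Eq. (fex))."""
    def g(t: float) -> float:
        return 1.0 if t == 0.0 else -math.log(1.0 - t) / t
    h = z / n
    s = 0.5 * (g(0.0) + g(z)) + sum(g(i * h) for i in range(1, n))
    return s * h

def f_ex_fit(gamma: float, M: float, gamma_m: float) -> float:
    """Excess free energy of the strongly coupled 2D Yukawa fluid, Eq. (fex):
    f_ex = M*Gamma - A*Li2(-B*Gamma/Gamma_m) with A = 0.257, B = 195.4.
    M and gamma_m are the Madelung constant and melting coupling at the
    chosen kappa (e.g. M ~ 0.29709 and Gamma_m ~ 178 at kappa = 1)."""
    A, B = 0.257, 195.4
    return M * gamma - A * li2(-B * gamma / gamma_m)

print(li2(-1.0))  # ~ -pi**2/12 ~ -0.822467
print(f_ex_fit(160.0, 0.29709, 178.0))
```

Since ${\rm Li}_2$ is negative for negative arguments, the thermal term adds a positive correction on top of the static (Madelung) part $M\Gamma$.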
This is because the differentiation with respect to $\\kappa$ is involved, and the two fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are present. For this reason, the explicit expression for $p$ is not displayed. We verified that near freezing (at $\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$) the derived expression yields pressures that deviate from the exact MD results by $\\sim 0.001\\%$ at $\\kappa=0.5$, $\\sim 0.1\\%$ at $\\kappa=1.0$, and $\\sim 1\\%$ at $\\kappa = 2.0-2.8$. The accuracy drops at the highest value $\\kappa=3.0$. This is not surprising, since the fits for $M(\\kappa)$ and $\\Gamma_{\\rm m}(\\kappa)$ are only applicable for $\\kappa\\lesssim 3.0$ and, therefore, derivatives of these fits at $\\kappa=3.0$ can produce significant errors.\n\nWe also found that if better accuracy is required, the data for the excess thermal energy can be fitted by the following slightly modified expression\n\\begin{equation}\n\\label{Eq7}\nu_{\\mathrm{th}} = A(\\kappa)\\ln\\left[ 1 + B(\\kappa) \\Gamma^{s(\\kappa)} \\right],\n\\end{equation}\nwhere $A$ and $B$ are now assumed $\\kappa$-dependent and a $\\kappa$-dependent exponent $s$ is introduced. Based on all the data points obtained in MD simulations the following relations are identified:\n$A(\\kappa) = 0.35708 + 0.09397\\kappa$,\n$B(\\kappa)= 1.65491\\exp(- 0.76911\\kappa)$,\n$s(\\kappa) = 0.68838 - 0.05183\\kappa$.\nSome representative examples are shown in Fig.~\\ref{FigR2}.\nThe fit of Eq.~(\\ref{Eq7}) is clearly more accurate and can be used in\nthe regime of weaker coupling, compared to the simple form (\\ref{Fit1}). 
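The refined fit and its coefficient relations can be collected into a single helper (a sketch; the function name is ours):

```python
import math

def u_th_refined(gamma: float, kappa: float) -> float:
    """Thermal excess energy from the refined fit, Eq. (7), with the
    kappa-dependent coefficients A(kappa), B(kappa), s(kappa) quoted above."""
    A = 0.35708 + 0.09397 * kappa
    B = 1.65491 * math.exp(-0.76911 * kappa)
    s = 0.68838 - 0.05183 * kappa
    return A * math.log(1.0 + B * gamma ** s)

# The thermal part grows only logarithmically with the coupling parameter:
print(u_th_refined(10.0, 1.0), u_th_refined(100.0, 1.0))
```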
However, it is also less practical for evaluating thermodynamic parameters other than the excess internal energy.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R2.eps}\\\\\n \\caption{\n Thermal component of the reduced excess energy, $u_{\\rm th}$, of 2D Yukawa fluids near the fluid-solid phase transition versus the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. Symbols correspond to MD simulations for different values of the screening parameter $\\kappa$. The curves are the analytical fits to these data using Eq.~(\\ref{Fit1}): the upper (lower) curve corresponds to fitting the MD results for $\\kappa=0.5$ ($\\kappa = 3.0$), and the intermediate (red) curve is obtained by fitting the entire set of data points.}\n\\label{FigR1}\n\\end{figure}\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R3.eps}\\\\\n \\caption{Dependence of the excess thermal energy $u_{\\rm th}$ on the reduced coupling parameter $\\Gamma/\\Gamma_{\\rm m}$. All the data points from the numerical simulations are plotted. Solid curves correspond to three representative fits using Eq.~\\eqref{Eq7}.}\n\\label{FigR2}\n\\end{figure}\n\n\n\n\\subsection{Relation between excess pressure and energy}\n\nIt is sometimes advantageous to operate with an equation of state written in the form of a relation between the pressure and the internal energy of the system. For soft, purely repulsive potentials, the simplest formulation of this kind can be written as\n\\begin{equation}\\label{gamma_ex}\np_{\\rm ex}=\\gamma_{\\rm ex}u_{\\rm ex}.\n\\end{equation}\nHere the parameter $\\gamma_{\\rm ex}$ generally depends on both temperature and density, that is, on both $\\Gamma$ and $\\kappa$ for Yukawa systems. 
Note that the parameter $\\gamma_{\\rm ex}$ introduced in this way is not directly related to the conventional definitions of either the density scaling exponent or the Gr\\\"uneisen parameter.~\\cite{HummelPRB2015} Nevertheless, it may be helpful in characterizing the softness of the repulsive potential. We recall that for inverse-power-law (IPL) repulsive potentials of the form $\\varphi(r)\\propto r^{-\\alpha}$, the relation between the excess pressure and energy is particularly simple, $p_{\\rm ex}=\\tfrac{\\alpha}{2} u_{\\rm ex}$ in 2D. Thus, an ``effective IPL exponent'' may be associated with the quantity $2\\gamma_{\\rm ex}$.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{gamma.eps}\\\\\n \\caption{Ratio of the excess pressure to the excess energy, $\\gamma_{\\rm ex}=p_{\\rm ex}/u_{\\rm ex}$, on the plane ($\\kappa$, $\\Gamma/\\Gamma_{\\rm m}$).\n }\n\\label{gamma}\n\\end{figure}\n\nHaving approximations for both $p_{\\rm ex}$ and $u_{\\rm ex}$ for 2D Yukawa fluids, we can easily estimate the value of $\\gamma_{\\rm ex}$. The corresponding plot of $\\gamma_{\\rm ex}$ as a function of the Yukawa state variables $\\kappa$ and $\\Gamma/\\Gamma_{\\rm m}$ is shown in Fig.~\\ref{gamma}. To produce this plot, Eq.~(\\ref{Fit1}) for the thermal component of the excess energy has been used. Figure~\\ref{gamma} shows that in the strongly coupled regime $\\gamma_{\\rm ex}$ is very weakly dependent on the coupling strength (temperature), but exhibits considerable dependence on $\\kappa$ (density). 
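As an illustrative consistency check (ours, not from the original analysis), $\gamma_{\rm ex}$ can be estimated directly from the near-freezing MD entries of Table~\ref{Table1}, under the assumption that the tabulated reduced pressure $p$ includes the ideal-gas contribution, i.e. $p_{\rm ex} = p - 1$, and that the largest fluid-phase $\Gamma$ for each $\kappa$ lies near $\Gamma/\Gamma_{\rm m}\simeq 0.95$. The cubic fit $\gamma_{\rm ex}(\kappa)=1+0.526\kappa+0.13\kappa^2-0.02\kappa^3$ quoted below then reproduces these ratios to within about 1\%:

```python
# Near-freezing MD data points (kappa, reduced pressure p, excess energy u_ex),
# taken from Table I at the largest fluid-phase Gamma for each kappa
# (assumed to correspond to Gamma/Gamma_m ~ 0.95)
NEAR_FREEZING = [
    (0.5, 199.434, 152.944),
    (1.0, 85.4036, 51.5786),
    (2.0, 37.1333, 15.0964),
    (3.0, 23.1181, 6.93189),
]

def gamma_ex_fit(kappa):
    """Cubic fit for gamma_ex(kappa) in the strongly coupled regime."""
    return 1.0 + 0.526 * kappa + 0.13 * kappa**2 - 0.02 * kappa**3

def gamma_ex_md(p, u_ex):
    """MD estimate of gamma_ex = p_ex/u_ex, with p_ex = p - 1 assumed."""
    return (p - 1.0) / u_ex
```

The fit also satisfies $\gamma_{\rm ex}\rightarrow 1$ as $\kappa\rightarrow 0$ by construction.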
Using the exact MD results for $p_{\\rm ex}/u_{\\rm ex}$ in the vicinity of the fluid-solid phase transition ($\\Gamma/\\Gamma_{\\rm m}\\simeq 0.95$), we have obtained a representative dependence $\\gamma_{\\rm ex}(\\kappa)$ in the strongly coupled regime:\n\\begin{equation}\n\\gamma_{\\rm ex}(\\kappa)=1+0.526\\kappa+0.13\\kappa^2-0.02\\kappa^3.\n\\end{equation}\nImportantly, $\\gamma_{\\rm ex}\\rightarrow 1$ as $\\kappa\\rightarrow 0$.\nThis seems counter-intuitive at first, because one would naturally expect $\\gamma_{\\rm ex}=\\tfrac{1}{2}$ in the OCP Coulomb interaction limit in 2D. The difference is attributed to the presence of the neutralizing background in the OCP model. In the limit of very soft interactions, the energy and pressure are dominated by their static contributions. As $\\kappa\\rightarrow 0$, the dominant contribution is the Madelung energy, so that $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M\\Gamma\\sim \\Gamma/\\kappa$ (without background). This implies $p_{\\rm ex}=\\tfrac{\\Gamma}{2}(\\partial f_{\\rm ex}/\\partial \\Gamma)-\\tfrac{\\kappa}{2}(\\partial f_{\\rm ex}/\\partial \\kappa)\\sim \\Gamma/\\kappa\\sim u_{\\rm ex}$. In the presence of the neutralizing background, the term $\\Gamma/\\kappa$ disappears and we have $f_{\\rm ex}\\sim u_{\\rm ex}\\sim M_{\\rm OCP}\\Gamma$. This yields $p_{\\rm ex}\\sim \\tfrac{1}{2}M_{\\rm OCP}\\Gamma\\sim \\tfrac{1}{2}u_{\\rm ex}$. This consideration demonstrates that Yukawa systems in the limit $\\kappa\\rightarrow 0$ are not fully equivalent to Coulomb systems with a neutralizing background.\n\n\n\\subsection{Crystals}\n\nIn a series of MD simulations for 2D Yukawa crystals, in addition to evaluating the excess energy and pressure (which are summarized in Tables~\\ref{Table3} and \\ref{Table4} of the Appendix), the mean squared displacements were calculated to find the anharmonic correction coefficient $\\beta$. 
The resulting dependence $\\beta(\\kappa)$ is shown in Fig.~\\ref{FigR3} (the corresponding values are also tabulated in Table~\\ref{Table2} of the Appendix for completeness).\nThe inset in Fig.~\\ref{FigR3} presents the radial (isotropic) pair correlation function, $g(r) \\propto \\int{d\\varphi\\; g(\\mathbf{r})}$,\nand demonstrates excellent representation of the short- and long-distance correlations. The obtained anharmonic correction coefficient $\\beta(\\kappa)$ allows one to calculate analytically the pair correlation function and then the excess energy, pressure, and other thermodynamic parameters by thermodynamic integration with the help of the expressions given in Sec.~\\ref{Thermo}.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R5.eps}\\\\\n \\caption{ Dependence of the anharmonic correction coefficient $\\beta$ on the screening parameter $\\kappa$. The inset demonstrates a typical comparison between the radial distribution functions obtained in a direct MD simulation and computed using the shortest-graph method. For details see the text.}\n\\label{FigR3}\n\\end{figure}\n\nIt is worth pointing out the following observation:\nIn the limit $\\kappa \\rightarrow 0$, the Yukawa interaction tends to the unscreened Coulomb interaction $\\varphi \\propto r^{-1}$. According to our previous MD simulations,~\\cite{1.4926945}\nthe finite-temperature phononic spectra differ weakly from the zero-temperature ones for IPL potentials, $\\varphi \\propto r^{-\\alpha}$. Therefore, in the OCP limit ($\\kappa=0$ and $\\alpha=1$) we should obtain the smallest values of $\\beta(\\kappa)$. This is indeed observed in Fig.~\\ref{FigR3}.\n\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{2DYukawa-R6.eps}\\\\\n \\caption{ Dependence of the reduced pressure on the reduced excess energy. Open (solid) symbols are the results of MD simulations for fluids and solids, respectively. 
The solid and dashed curves correspond to the shortest-graph method for solids and to the fit of Eq.~(\\ref{Eq7}) for fluids.}\n\\label{FigR4}\n\\end{figure}\n\nIn Fig.~\\ref{FigR4} we plot the reduced pressure versus the reduced excess energy of 2D Yukawa fluids and solids. Symbols are the MD results; the solid and dashed curves correspond to the shortest-graph method [with the anharmonic correction coefficient $\\beta(\\kappa)$ found above] for the crystalline phase and the proposed fit of Eq.~\\eqref{Eq7} for the fluid phase, respectively. Excellent agreement is observed.\n\n\\subsection{Accuracy}\n\nThe relative difference between the excess energies calculated using the shortest-graph method and those evaluated using direct MD simulations in the solid phase amounts to $\\simeq5\\times 10^{-5}$, which is comparable to the values reported earlier.~\\cite{0953-8984-28-23-235401} The accurate fit of Eq.~\\eqref{Eq7}\nyields a relative error in the excess energy smaller than $5\\times10^{-4}$ and $2\\times10^{-3}$ for 72\\% and 95\\% of\nthe examined fluid data points, respectively. The maximal relative deviation, $5\\times 10^{-3}$, is observed near the melting line at large values of the screening parameter $\\kappa$. The simpler fit of Eq.~(\\ref{Fit1}) is applicable when relative deviations within $\\lesssim 1\\%$ are acceptable.\n\n\\begin{figure}[!t]\n \\centering\n \\includegraphics[width=85mm]{Pressure_kappa05.eps}\\\\\n \\caption{Reduced pressure, $p$, as a function of the coupling parameter $\\Gamma$ for a 2D Yukawa fluid with the screening parameter $\\kappa=0.5$. 
The symbols are exact MD results, the solid (red) line corresponds to the fit of Eq.~(\\ref{Fit1}), and the dashed (blue) line is the fit from Ref.~\\onlinecite{0022-3727-49-23-235203}.}\n\\label{FigPressure}\n\\end{figure}\n\nIn addition, we can compare our results with those recently reported in Refs.~\\onlinecite{0022-3727-49-23-235203,1.4962685}, where fits for the pressure of 2D Yukawa fluids in the $(\\kappa,\\Gamma)$ parameter space have been proposed. The case $\\kappa=0.5$ received special attention, and a simple two-term fit was proposed based on the results of an MD simulation,~\\cite{0022-3727-49-23-235203} $p=1.53\\Gamma+1.33$.\nWe plot our MD results along with the fit of Eq.~(\\ref{Fit1}) and the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} in Fig.~\\ref{FigPressure}. One can see that the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} systematically overestimates the pressure at high values of $\\Gamma$. At the strongest coupling in the fluid phase studied in this work, $\\Gamma=135.42$, the present MD simulation yields $p= 199.434$, the fit of Eq.~(\\ref{Fit1}) yields $p=199.432$, while the fit from Ref.~\\onlinecite{0022-3727-49-23-235203} yields $p=208.523$. On the other hand, the previous model for 2D Yukawa systems in the OCP (weak-screening) limit discussed in Refs.~\\onlinecite{KhrapakPoP08_2015,1.4935846}\nyields $p=199.445$, providing confidence in the accuracy of the present results. The reasons for the deviations in Ref.~\\onlinecite{0022-3727-49-23-235203} remain to be identified.\n\n\\section{MD results}\n\\label{Appendix}\n\nIn this Appendix, we summarize the main results from the MD simulations performed in this study. Table \\ref{Table1} reports the reduced excess energies and pressures at different state points in the fluid phase. Table \\ref{Table2} summarizes the values of the anharmonic correction coefficient $\\beta$ evaluated using MD simulations of the crystalline phase. Finally, Tables \\ref{Table3} and \\ref{Table4} report the excess energies and pressures in the crystalline phase.\n\n\\begin{table}[h]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\rm ex}$ and pressure $p$ of two-dimensional Yukawa fluids evaluated using MD simulations for various coupling ($\\Gamma$) and screening ($\\kappa$) parameters.}\n\t\\label{Table1}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c}\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.5$}\\\\ \\hline\n\t\t$\\Gamma$ & 135.420 & 86.7254 & 52.7787 & 32.1811 & 19.6073 & 11.9310 & 7.27175 & 4.43126 & 2.69848 & 1.64302 & 1.00136 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 152.944 & 98.3115 & 60.1901 & 37.0087 & 22.8180 & 14.1176 & 8.79838 & 5.51964 & 3.48587 & 2.21772 & 1.42021 & 0.76495\\\\\n\t\t$p$ & 199.434 & 128.303 & 78.6946 & 48.5651 & 30.1485 & 18.8835 & 12.0216 & 7.81631 & 5.22964 & 3.63556 & 2.64961 & 1.85883\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.6$}\\\\\\hline\n\t\t$\\Gamma$ & 140.131 & 
89.5076 & 54.3171 & 32.9737 & 20.0017 & 12.1359 & 7.36665 & 4.47442 & 2.71053 & 1.64677 & 1.00106 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 116.984 & 75.1128 & 45.9415 & 28.2016 & 17.3768 & 10.7727 & 6.73045 & 4.24422 & 2.69421 & 1.72956 & 1.11776 & 0.61083\\\\\n\t\t$p$ \t\t\t\t\t& 160.369 & 103.050 & 63.1652 & 38.9451 & 24.1971 & 15.2284 & 9.76528 & 6.42899 & 4.37128 & 3.11015 & 2.32663 & 1.69701\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=0.8$}\\\\\\hline\n\t\t$\\Gamma$ & 152.277 & 96.5736 & 58.0604 & 34.9737 & 21.0334 & 12.6675 & 7.61503 & 4.58845 & 2.75830 & 1.66410 & 0.99914 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 74.6424 & 47.7340 & 29.0608 & 17.8181 & 10.9844 & 6.84185 & 4.30139 & 2.74217 & 1.76665 & 1.15293 & 0.75437 & 0.42469\\\\\n\t\t$p$ \t\t\t\t\t& 112.709 & 72.1411 & 44.0441 & 27.1658 & 16.9406 & 10.7731 & 7.01845 & 4.73986 & 3.33679 & 2.47393 & 1.92983 & 1.49910\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.0$}\\\\\\hline\n\t\t$\\Gamma$ & 169.071 & 105.975 & 63.1038 & 37.6027 & 22.4047 & 13.3361 & 7.94729 & 4.73129 & 2.81940 & 1.68034 & 0.99956 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 51.5786 & 32.7335 & 19.8556 & 12.1451 & 7.50279 & 4.68984 & 2.97702 & 1.91799 & 1.25426 & 0.82932 & 0.55059 & 0.31770\\\\\n\t\t$p$ \t& 85.4036 & 54.2492 & 33.0215 & 20.3527 & 12.7618 & 8.19406 & 5.44279 & 3.76791 & 2.74103 & 2.10336 & 1.70075 & 1.38135\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.2$}\\\\\\hline\n\t\t$\\Gamma$ & 191.126 & 118.398 & 69.6429 & 40.9597 & 24.1083 & 14.1893 & 8.34919 & 4.90490 & 2.88868 & 1.70019 & 0.99984 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 37.5852 & 23.6918 & 14.3026 & 8.72609 & 5.39936 & 3.39637 & 2.17547 & 1.41736 & 0.93933 & 0.62908 & 0.42281 & 0.24960\\\\\n\t\t$p$ \t& 67.9344 & 42.8619 & 25.9838 & 16.0024 & 10.0874 & 6.56025 & 4.44041 & 3.15023 & 2.36021 & 1.86635 & 1.55301 & 1.30594\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.4$}\\\\\\hline\n\t\t$\\Gamma$ & 220.172 & 134.441 & 77.9949 & 45.2452 
& 26.2578 & 15.2219 & 8.83634 & 5.12702 & 2.97137 & 1.72440 & 1.00140 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 28.5555 & 17.8503 & 10.7244 & 6.53392 & 4.05300 & 2.56405 & 1.65932 & 1.09552 & 0.73364 & 0.49726 & 0.33718 & 0.20253\\\\\n\t\t$p$ \t& 56.0915 & 35.0963 & 21.1892 & 13.0574 & 8.28303 & 5.45288 & 3.76392 & 2.73780 & 2.10241 & 1.70540 & 1.45171 & 1.25396\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.6$}\\\\\\hline\n\t\t$\\Gamma$ & 258.433 & 155.296 & 88.6297 & 50.6106 & 28.9099 & 16.4928 & 9.41249 & 5.37870 & 3.07317 & 1.75217 & 0.99889 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 22.4535 & 13.9136 & 8.31218 & 5.05719 & 3.14728 & 2.00498 & 1.30903 & 0.87473 & 0.59391 & 0.40446 & 0.27520 & 0.16486\\\\\n\t\t$p$ & 47.7294 & 29.6021 & 17.7849 & 10.9674 & 7.00739 & 4.67522 & 3.28559 & 2.44432 & 1.92230 & 1.58965 & 1.37647 & 1.15781\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=1.8$}\\\\\\hline\n\t\t$\\Gamma$ & 308.935 & 182.395 & 102.261 & 57.3435 & 32.1483 & 18.0355 & 10.1029 & 5.67241 & 3.17978 & 1.78359 & 0.99997 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 18.1745 & 11.1626 & 6.63304 & 4.02868 & 2.51560 & 1.61389 & 1.06328 & 0.71747 & 0.49051 & 0.33739 & 0.23058 & 0.14359\\\\\n\t\t$p$ \t& 41.6428 & 25.5932 & 15.3055 & 9.44338 & 6.07675 & 4.10949 & 2.93845 & 2.22906 & 1.78546 & 1.50402 & 1.32125 & 1.18748\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.0$}\\\\\\hline\n\t\t$\\Gamma$ & 375.818 & 217.422 & 119.600 & 65.7745 & 36.1611 & 19.8980 & 10.9232 & 6.01199 & 3.30681 & 1.81767 & 1.00051 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 15.0964 & 9.17319 & 5.42177 & 3.29200 & 2.06139 & 1.33276 & 0.88426 & 0.60261 & 0.41513 & 0.28650 & 0.19651 & 0.12379\\\\\n\t\t$p$ \t\t\t\t\t& 37.1333 & 22.5775 & 13.4413 & 8.30684 & 5.38337 & 3.68921 & 2.67835 & 2.06727 & 1.68347 & 1.43752 & 1.27850 & 1.16494\\\\\\hline\\hline\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.2$}\\\\\\hline\n\t\t$\\Gamma$ & 463.975 & 262.948 & 141.568 & 76.2338 & 41.0173 & 22.0958 & 11.9035 & 6.41082 
& 3.45056 & 1.85303 & 1.00113 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 12.7875 & 7.69994 & 4.52708 & 2.74830 & 1.72461 & 1.12217 & 0.75368 & 0.51642 & 0.35777 & 0.24734 & 0.17009 & 0.10850\\\\\n\t\t$p$ \t\t\t\t\t& 33.6575 & 20.2710 & 12.0118 & 7.43585 & 4.85060 & 3.36425 & 2.48426 & 1.94445 & 1.60450 & 1.38520 & 1.24473 & 1.14408\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.4$}\\\\\\hline\n\t\t$\\Gamma$ & 578.968 & 320.871 & 168.949 & 89.0382 & 46.8778 & 24.7092 & 12.9953 & 6.85634 & 3.60307 & 1.89919 & 0.99952 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 10.9709 & 6.56430 & 3.83850 & 2.33031 & 1.47100 & 0.96365 & 0.65089 & 0.44974 & 0.31141 & 0.21697 & 0.14862 & 0.09589\\\\\n\t\t$p$ & 30.8215 & 18.4175 & 10.8648 & 6.74135 & 4.43655 & 3.11369 & 2.32748 & 1.84722 & 1.53931 & 1.34446 & 1.21673 & 1.12942\\\\\\hline\\hline\t\t\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.6$}\\\\\\hline\n\t\t$\\Gamma$ & 723.656 & 392.384 & 202.051 & 104.080 & 53.5742 & 27.6270 & 14.2191 & 7.32182 & 3.76653 & 1.93971 & 1.00200 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 9.50055 & 5.63818 & 3.28596 & 1.99866 & 1.26783 & 0.83500 & 0.56905 & 0.39442 & 0.27600 & 0.19145 & 0.13130 & 0.08576\\\\\n\t\t$p$ \t& 28.3633 & 16.8096 & 9.89231 & 6.16190 & 4.09049 & 2.90245 & 2.19936 & 1.76426 & 1.48858 & 1.30961 & 1.19408 & 1.11954\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=2.8$}\\\\\\hline\n\t\t$\\Gamma$ & 893.746 & 474.549 & 239.143 & 120.685 & 60.8483 & 30.6642 & 15.4796 & 7.80951 & 3.93161 & 1.98042 & 1.00296 & 0.5 \\\\\n\t\t$u_{\\mathrm{ex}}$ & 8.19448 & 4.82859 & 2.81518 & 1.71951 & 1.09985 & 0.73051 & 0.50093 & 0.35038 & 0.24489 & 0.17117 & 0.11700 & 0.07671\\\\\n\t\t$p$ \t& 25.9004 & 15.2521 & 8.98792 & 5.63831 & 3.78782 & 2.72194 & 2.08856 & 1.69631 & 1.44344 & 1.28133 & 1.17497 & 1.10201\\\\\\hline\\hline\n\t\t\\multicolumn{13}{c}{ $\\kappa=3.0$}\\\\\\hline\n\t\t$\\Gamma$ & 1071.02 & 558.495 & 276.444 & 136.953 & 67.7922 & 33.5897 & 16.6383 & 8.22716 & 4.07874 & 2.02013 & 0.99949 & 0.5 
\\\\\n\t\t$u_{\\mathrm{ex}}$ & 6.93189 & 4.07091 & 2.38838 & 1.47193 & 0.95056 & 0.64023 & 0.44340 & 0.31146 & 0.21994 & 0.15395 & 0.10494 & 0.06958\\\\\n\t\t$p$ \t& 23.1181 & 13.5906 & 8.07317 & 5.12679 & 3.49444 & 2.55590 & 1.98879 & 1.63334 & 1.40554 & 1.25677 & 1.15868 & 1.09682\\\\\\hline\\hline\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Values of the anharmonic correction coefficient $\\beta$ for different screening parameter $\\kappa$.}\n\t\\label{Table2}\n\t\t\\begin{tabular}{l c c c c c c c c c c c c c c c c c}\n\t\t$\\kappa$ & 0.0 & 0.2 & 0.3 & 0.4 & 0.6 & 0.8 & 1.0 & 1.2 & 1.4 & 1.6 & 1.8 & 2.0 & 2.2 & 2.4 & 2.6 & 2.8 & 3.0 \\\\\\hline\n\t\t$\\beta(\\kappa)$\t& 3.01 & 9.23 & 12.38 & 14.30 & 10.53 & 9.71 & 9.35 & 9.28 & 9.14 & 9.08 & 8.97 & 8.855 & 8.68 & 8.71 & 8.46 & 8.47 & 8.51\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced excess energy $u_{\\mathrm{ex}}$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table3}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 1595.62 & 798.828 & 532.689 & 399.681 & 319.981 & 266.796 & 228.880 & 200.332 & 178.283 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1217.36 & 609.282 & 406.628 & 305.117 & 244.469 & 203.938 & 174.914 & 153.188 & 136.267 \\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 773.025 & 387.104 & 258.328 & 194.074 & 155.484 & 129.733 & 111.343 & 97.5607 & 86.8364 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 529.643 & 265.306 & 177.235 & 133.215 & 106.726 & 89.1490 & 76.5169 & 67.1314 & 59.7831 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 382.522 & 191.740 & 128.152 & 96.3972 & 77.2970 & 
64.6022 & 55.5318 & 48.7317 & 43.4438 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 287.408 & 144.232 & 96.4804 & 72.5942 & 58.2862 & 48.7586 & 41.9386 & 36.8484 & 32.8838 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 223.185 & 112.096 & 75.0671 & 56.5515 & 45.4466 & 38.0606 & 32.7681 & 28.8120 & 25.7391 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 178.133 & 89.6228 & 60.0889 & 45.3116 & 36.4631 & 30.5563 & 26.3521 & 23.1896 & 20.7451 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 145.774 & 73.3800 & 49.2712 & 37.2003 & 29.9641 & 25.1447 & 21.7011 & 19.1314 & 17.1275 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 121.609 & 61.3067 & 41.2021 & 31.1620 & 25.1352 & 21.1177 & 18.2517 & 16.1113 & 14.4385 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 102.908 & 51.9465 & 34.9672 & 26.4819 & 21.3920 & 17.9999 & 15.5706 & 13.7650 & 12.3602 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 87.4157 & 44.2324 & 29.8212 & 22.6181 & 18.2990 & 15.4212 & 13.3710 & 11.8300 & 10.6351 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 73.5771 & 37.3025 & 25.2028 & 19.1490 & 15.5271 & 13.1108 & 11.3865 & 10.0997 & 9.10597 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 60.2002 & 30.6118 & 20.7457 & 15.8118 & 12.8497 & 10.8840 & 9.47465 & 8.43053 & 7.65187 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\n\\begin{table}[!t]\n\t\\centering\n\t\\small\n\t\\caption{Reduced pressure (compressibility) $p$ of the 2D Yukawa crystal obtained in MD simulations for various screening parameters $\\kappa$ and reduced coupling parameters $\\Gamma_{\\rm m}/\\Gamma$.}\n\t\\label{Table4}\n\t\t\\begin{tabular}{lccccccccc}\n\t\t\\multicolumn{1}{c|}{ $\\kappa$}& \\multicolumn{9}{c}{$\\Gamma_{\\rm m}/\\Gamma$} \\\\ \\hline\\hline\n\t\t\\multicolumn{1}{l|}{ }& 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 \\\\ \\cline{2-10}\n\t\t\\multicolumn{1}{l|}{0.5} & 2080.63 & 1041.70 & 694.789 & 521.370 & 417.454 & 348.100 & 298.669 & 261.442 & 232.679 \\\\\n\t\t\\multicolumn{1}{l|}{0.6} & 1669.06 & 835.485 & 557.680 & 418.523 & 335.380 & 279.814 & 240.022 & 210.233 & 187.022 
\\\\\n\t\t\\multicolumn{1}{l|}{0.8} & 1168.03 & 585.024 & 390.480 & 293.406 & 235.104 & 196.197 & 168.410 & 147.583 & 131.370 \\\\\n\t\t\\multicolumn{1}{l|}{1.0} & 878.208 & 440.005 & 294.000 & 221.023 & 177.106 & 147.964 & 127.016 & 111.450 & 99.2542 \\\\\n\t\t\\multicolumn{1}{l|}{1.2} & 693.046 & 347.470 & 232.288 & 174.765 & 140.162 & 117.162 & 100.726 & 88.4011 & 78.8053 \\\\\n\t\t\\multicolumn{1}{l|}{1.4} & 566.555 & 284.386 & 190.275 & 143.196 & 114.994 & 96.2113 & 82.7636 & 72.7234 & 64.8975 \\\\\n\t\t\\multicolumn{1}{l|}{1.6} & 476.692 & 239.477 & 160.406 & 120.865 & 97.1465 & 81.3696 & 70.0608 & 61.6053 & 55.0288 \\\\\n\t\t\\multicolumn{1}{l|}{1.8} & 410.580 & 206.621 & 138.561 & 104.505 & 84.1086 & 70.4915 & 60.7970 & 53.5005 & 47.8555 \\\\\n\t\t\\multicolumn{1}{l|}{2.0} & 361.191 & 181.859 & 122.134 & 92.2267 & 74.2973 & 62.3524 & 53.8144 & 47.4405 & 42.4641 \\\\\n\t\t\\multicolumn{1}{l|}{2.2} & 322.729 & 162.732 & 109.386 & 82.7430 & 66.7485 & 56.0825 & 48.4703 & 42.7821 & 38.3327 \\\\\n\t\t\\multicolumn{1}{l|}{2.4} & 291.498 & 147.173 & 99.0847 & 75.0489 & 60.6307 & 51.0175 & 44.1300 & 39.0087 & 35.0158 \\\\\n\t\t\\multicolumn{1}{l|}{2.6} & 263.437 & 133.325 & 89.9002 & 68.1935 & 55.1747 & 46.4976 & 40.3128 & 35.6615 & 32.0486 \\\\\n\t\t\\multicolumn{1}{l|}{2.8} & 235.188 & 119.260 & 80.5872 & 61.2342 & 49.6540 & 41.9257 & 36.4074 & 32.2829 & 29.0897 \\\\\n\t\t\\multicolumn{1}{l|}{3.0} & 203.533 & 103.516 & 70.1601 & 53.4777 & 43.4588 & 36.8063 & 32.0351 & 28.4887 & 25.8081 \\\\\n\t\t\\end{tabular}\n\\end{table}\n\\end{widetext}\n\n\\bibliography{Ref-2D-Yukawa} \n\n\\end{document}\n", "Descriptive_question1": "What is the reduced pressure value for kappa = 0.5 at Γm/Γ = 0.1 in table_5?", "Descriptive_question2": "What is the screening parameter value with the highest reduced pressure at Γm/Γ = 0.9 in table_5?", "Reasoning_question1": "How does the reduced pressure change with increasing screening parameter κ for a fixed reduced coupling parameter Γm/Γ 
= 0.5 in table_5, and what might this indicate about the interaction strength?", "Reasoning_question2": "Compare the trend of reduced pressure across different Γm/Γ values for κ = 1.0 and κ = 3.0 in table_5. What could explain the observed differences in pressure behavior?", "Descriptive_answer1": "2080.63", "Descriptive_answer2": "0.5", "Reasoning_answer1": "Let's analyze the trend of reduced pressure (p) with increasing screening parameter κ for a fixed reduced coupling parameter Γm/Γ = 0.5. From the table, for Γm/Γ = 0.5, as κ increases from 0.5 to 3.0, the reduced pressure decreases significantly: at κ=0.5, p=417.454; at κ=1.0, p=177.106; at κ=1.6, p=97.1465; at κ=2.0, p=74.2973; and at κ=3.0, p=43.4588. This consistent decrease suggests a strong inverse relationship between κ and reduced pressure. The screening parameter κ relates to the interaction range of the Yukawa potential, where a higher κ indicates stronger screening, effectively reducing the range and strength of the repulsive interaction between particles. This reduced interaction strength leads to lower pressure in the system, as particles are less strongly repelled, indicating a transition to softer interactions as κ increases, which aligns with the context provided on Yukawa systems.", "Reasoning_answer2": "Let's compare the trend of reduced pressure (p) across different Γm/Γ values for κ=1.0 and κ=3.0. For κ=1.0, as Γm/Γ increases from 0.1 to 0.9 (indicating decreasing coupling strength since Γm/Γ is inversely related to Γ), the pressure decreases from 878.208 to 99.2542. Specifically, the values are: 878.208 (0.1), 440.005 (0.2), 294.000 (0.3), 221.023 (0.4), 177.106 (0.5), 147.964 (0.6), 127.016 (0.7), 111.450 (0.8), and 99.2542 (0.9), showing a roughly proportional decrease with each step. 
For κ=3.0, the pressure also decreases from 203.533 to 25.8081 over the same Γm/Γ range: 203.533 (0.1), 103.516 (0.2), 70.1601 (0.3), 53.4777 (0.4), 43.4588 (0.5), 36.8063 (0.6), 32.0351 (0.7), 28.4887 (0.8), and 25.8081 (0.9). However, the absolute decrease is much smaller at κ=3.0 compared to κ=1.0, and the relative rate of decrease appears less steep at higher κ. This difference could be explained by the effect of screening on interaction strength. At κ=1.0, the interactions are less screened, meaning particles interact more strongly, leading to higher pressure values and a more pronounced decrease as coupling strength reduces (higher Γm/Γ). At κ=3.0, stronger screening softens the interactions significantly, resulting in lower pressure values overall and a less dramatic change with varying coupling strength, as the interaction range is already quite short. This reflects the context provided, where higher κ corresponds to softer potentials, influencing the system's response to changes in coupling parameter." } ]