Title: Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure
URL Source: https://arxiv.org/html/2301.10956
License: arXiv.org perpetual non-exclusive license
arXiv:2301.10956v4 [cs.LG] 23 Mar 2024

Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure

Ryoma Sato

Abstract
Graph Neural Networks (GNNs) are popular models for graph learning problems. GNNs show strong empirical performance in many practical tasks. However, the theoretical properties have not been completely elucidated. In this paper, we investigate whether GNNs can exploit the graph structure from the perspective of the expressive power of GNNs. In our analysis, we consider graph generation processes that are controlled by hidden (or latent) node features, which contain all information about the graph structure. A typical example of this framework is kNN graphs constructed from the hidden features. In our main results, we show that GNNs can recover the hidden node features from the input graph alone, even when all node features, including the hidden features themselves and any indirect hints, are unavailable. GNNs can further use the recovered node features for downstream tasks. These results show that GNNs can fully exploit the graph structure by themselves, and in effect, GNNs can use both the hidden and explicit node features for downstream tasks. In the experiments, we confirm the validity of our results by showing that GNNs can accurately recover the hidden features using a GNN architecture built based on our theoretical analysis.
Keywords: Graph Neural Networks, Expressive Power, Metric Recovery

1 Introduction

Figure 1: (a) Traditional View (Rich Node Features). GNNs filter features by mixing them with neighboring nodes. (b) Traditional View (Uninformative Features). Filters cannot generate informative features if the inputs are not informative, i.e., garbage in, garbage out. (c) Our Results. GNNs create informative node features by themselves, even when the input node features are uninformative, by absorbing information from the underlying graph. (d) Illustrations of the Problem Setting. (d.1) Nodes have hidden features from which the input graph is generated. (d.2) The input to GNNs is a vanilla graph without any additional features. Nodes have coordinates for visualization in this panel, but these coordinates are not fed to GNNs. (d.3) GNNs try to recover the hidden features.
Graph Neural Networks (GNNs) [15, 45] are popular machine learning models for processing graph data. GNNs take a graph with node features as input and output embeddings of nodes. At each node, GNNs send the node features to neighboring nodes, aggregate the received features, and output the new node features [14]. In this way, GNNs produce valuable node embeddings that take neighboring nodes into account. GNNs show strong empirical performance in machine learning and data mining tasks [65, 12, 21, 18].
Roughly speaking, GNNs smooth out the node features on the input graph by recursively mixing the node features of neighboring nodes, and GNNs thereby transform noisy features into clean ones (Figure 1 (a)). This smoothing effect has been observed empirically [4, 36] and shown theoretically [29, 37]. There are several GNN architectures that are inspired by the smoothing process [28, 36, 60]. It has also been pointed out that stacking too many layers harms the performance of GNNs due to the over-smoothing effect [29, 6], which is caused by too much mixing of the node features.
In this perspective, node features are the primary actors in GNNs, and graphs are secondary. If the node features are completely uninformative, GNNs should fail to obtain meaningful node embeddings no matter how they mix the features (Figure 1 (b)). This is in contrast to the opposite scenario: even if the graph is completely uninformative, GNNs can obtain meaningful node embeddings from informative node features just by ignoring edges, i.e., by not mixing node features at all. Therefore, node features are the first requirement of GNNs, and the graph only provides some boost to the quality of the node features [36]. This view indicates that GNNs cannot utilize the graph information without the aid of good node features.
The central research question of this paper is as follows:
Can GNNs utilize the graph information
without the aid of node features?
We positively answer this question through our theoretical analysis. We show that GNNs can recover the hidden node features that control the generation of the graph structure even without the help of informative node features (Figure 1 (c)). The recovered features contain all the information of the graph structure. The recovered node features can be further used for downstream tasks. These results show that GNNs can essentially use both given node features and graph-based features extracted from the graph structure. Our theoretical results provide a different perspective from the existing beliefs [58, 36] based on empirical observations that GNNs only mix and smooth out node features. In the experiments, we show that existing GNN architectures do not necessarily extract the hidden node features well, and special architectures are required to learn the recovery in empirical situations.
The contributions of this paper are summarized as follows:
•
We establish the theory of the feature recovery problem by GNNs for the first time. Our analysis provides a new perspective on the expressive power of GNNs.
•
We prove that GNNs can recover the hidden features solely from the graph structure (Theorem 4.4). These results show that GNNs have an inherent ability to extract information from the input graph.
•
We validate the theoretical results in the experiments by showing that GNNs can accurately recover the hidden features. We also show that existing GNN architectures are mediocre in this task. These results highlight the importance of inductive biases for GNNs.
Reproducibility: Our code is publicly available at https://github.com/joisino/gnnrecover.
2 Related Work

2.1 Graph Neural Networks and Their Theory
Graph Neural Networks (GNNs) [15, 45] are now de facto standard models for graph learning problems [27, 55, 65]. There are many applications of GNNs, including bioinformatics [30], physics [7, 38], recommender systems [12, 21], and transportation [59]. There are several formulations of GNNs, including spectral [8], spatial [14], and equivariant [33] ones. We use the message-passing formulation [14] in this paper.
The theory of GNNs has been studied extensively in the literature, including the generalization ability of GNNs [46, 13, 62] and their computational complexity [17, 5, 66, 44]. The most relevant topic to this paper is the expressive power of GNNs, which we review in the following.
Expressive Power (or Representation Power) means what kind of functional classes GNNs can realize. Originally, Morris et al. [34] and Xu et al. [61] showed that message-passing GNNs are at most as powerful as the 1-WL test, and they proposed GNNs that are as powerful as the 1-WL and 𝑘 -(set)WL tests. Sato [42, 43] and Loukas [31] also showed that message-passing GNNs are as powerful as a computational model of distributed local algorithms, and they proposed GNNs that are as powerful as port-numbering and randomized local algorithms. Loukas [31] showed that GNNs are Turing-complete under certain conditions (i.e., with unique node ids and infinitely increasing depths). There are various efforts to improve the expressive power of GNNs by non-message-passing architectures [33, 32, 35]. We refer the readers to survey papers [40, 24] for more details on the expressive power of GNNs.
The main difference between our analysis and existing ones is that the existing analyses focus on combinatorial characterizations of the expressive power of GNNs, e.g., the WL test, which are not necessarily aligned with the interests of realistic machine learning applications. By contrast, we consider the continuous task of recovering the hidden features from the input graph, which is an important topic in machine learning in its own right [53, 3, 52, 41]. To the best of our knowledge, this is the first paper that reveals the expressive power of GNNs in the context of feature recovery. Furthermore, the existing analyses of expressive power do not take into account the complexity of the models: they show that GNNs can solve certain problems, but the required solutions may be too complex to be learned in practice. By contrast, we show that the feature recovery problem can be solved with low complexity.
2.2 Feature Recovery
Estimation of hidden variables that control the generation process of data has been extensively studied in the machine learning literature [53, 52, 26]. These methods are sometimes used for dimensionality reduction, and the estimated features are fed to downstream models. In this paper, we consider the estimation of hidden embeddings from a graph observation [2, 57, 54, 20]. The critical difference between our analysis and the existing ones is that we investigate whether GNNs can represent a recovery algorithm, while the existing works propose general (non-GNN) algorithms that recover features. To the best of our knowledge, we are the first to establish the theory of feature recovery based on GNNs.
Many empirical works propose feature learning methods for GNNs [17, 56, 64, 22, 39]. The differences between these papers and ours are twofold. First, these methods are not proven to converge to the true features, while we consider a feature learning method that converges to the true features. Second, the existing methods rely heavily on the input node features while we do not assume any input node features. The latter point is important because how GNNs exploit the input graph structure is a central topic in the GNN literature, and sometimes GNNs are shown to NOT benefit from the graph structure [11]. By contrast, our results show that GNNs can extract meaningful information from the input graph from a different perspective than existing work.
3 Background and Problem Formulation
In this paper, we assume that each node 𝑣 has hidden (or latent) features 𝒛 𝑣 , and the graph is generated by connecting nodes with similar hidden features. For example, (i) 𝒛 𝑣 ∈ ℝ 𝑑 represents the preference of person 𝑣 in social networks, (ii) 𝒛 𝑣 represents the topic of paper 𝑣 in citation networks, and (iii) 𝒛 𝑣 represents the geographic location of point 𝑣 in spatial networks.
The critical assumption of our problem setting is that the features { 𝒛 𝑣 ∣ 𝑣 ∈ 𝑉 } , such as the true preference of people and the true topic of papers, are not observed, but only the resulting graph 𝐺 is observed.
Somewhat surprisingly, we will show that GNNs that take the vanilla graph 𝐺 with only simple synthetic node features, such as the degree 𝑑_𝑣 and the graph size 𝑛 = |𝑉|, can consistently estimate the hidden features { 𝒛_𝑣 ∣ 𝑣 ∈ 𝑉 } (Figure 1 (d)).
In the following, we describe the assumptions on data and models in detail.
3.1 Assumptions
In this paper, we deal with directed graphs. Directed graphs are general: undirected graphs can be converted to directed graphs by duplicating every edge in both directions. We assume that there is an arc from 𝑣 to 𝑢 if and only if ‖𝒛_𝑣 − 𝒛_𝑢‖ < 𝑠(𝒛_𝑣) for a threshold function 𝑠: ℝ^𝑑 → ℝ, i.e., nodes with similar hidden features are connected. It is also assumed that the hidden features {𝒛_𝑣} are sampled from an unknown distribution 𝑝(𝒛) in an i.i.d. manner. As we consider the consistency of estimators, i.e., the behavior of estimators in the limit of infinitely many samples (nodes), we assume that a node 𝑣_𝑖 and its features 𝒛_{𝑣_𝑖} ∼ 𝑝(𝒛) are generated one by one, and we consider a series of graphs 𝐺_1 = (𝑉_1 = {𝑣_1}, 𝐸_1), 𝐺_2 = (𝑉_2 = {𝑣_1, 𝑣_2}, 𝐸_2), …, 𝐺_𝑛 = (𝑉_𝑛, 𝐸_𝑛), … with an increasing number of nodes. Formally, the data generation process and the assumptions are summarized as follows.
Assumption 1 (Domain)
The domain 𝒵 of the hidden features is a convex compact domain in ℝ 𝑑 with smooth boundary ∂ 𝒵 .
Assumption 2 (Graph Generation)
For each 𝑖 ∈ ℤ₊, 𝒛_{𝑣_𝑖} is sampled from 𝑝(𝒛) in an i.i.d. manner. There is a directed edge from 𝑣 to 𝑢 in 𝐺_𝑛 if and only if ‖𝒛_𝑣 − 𝒛_𝑢‖ < 𝑠_𝑛(𝒛_𝑣).
Assumption 3 (Density)
The density 𝑝 ( 𝒛 ) is positive and differentiable with bounded ∇ log ( 𝑝 ( 𝒛 ) ) on 𝒵 .
Assumption 4 (Threshold Function)
There exists a deterministic continuous function 𝑠̄(𝑥) > 0 on 𝒵̄ such that 𝑔_𝑛^{−1} 𝑠_𝑛(𝑥) converges uniformly to 𝑠̄(𝑥) for some 𝑔_𝑛 ∈ ℝ with 𝑔_𝑛 → 0 and 𝑔_𝑛 𝑛^{1/(𝑑+2)} log^{−1/(𝑑+2)} 𝑛 → ∞ as 𝑛 → ∞ almost surely.
Assumption 5 (Stationary Distribution)
𝑛 𝜋_{𝐺_𝑛}(𝑣) is uniformly equicontinuous almost surely, where 𝜋_{𝐺_𝑛}(𝑣) is the stationary distribution of random walks on 𝐺_𝑛.
Note that these assumptions are common to [20]. It should be noted that the threshold functions 𝑠_𝑛 can be stochastic and/or dependent on the data as long as Assumption 4 holds. For example, 𝑘-NN graphs can be realized in this framework by setting 𝑠(𝒛_𝑣) to be the distance to the 𝑘-th nearest neighbor from 𝒛_𝑣. We also note that Assumption 4 implies that the degree is of order 𝜔(𝑛^{2/(𝑑+2)} log^{𝑑/(𝑑+2)} 𝑛). Thus, the degree increases as the number of nodes increases. This ensures that the graph is connected with high probability, which is consistent with our scenario.
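As a concrete illustration of this framework, a 𝑘-NN graph can be generated by thresholding pairwise distances at (just beyond) the distance to the 𝑘-th nearest neighbor. The sketch below is ours, not the paper's code; the sampler, seed, and tie-breaking tolerance are illustrative assumptions:

```python
import numpy as np

def knn_graph(Z, k):
    """Directed k-NN graph in the paper's threshold form: an arc v -> u iff
    ||z_v - z_u|| < s(z_v), with s(z_v) just beyond the k-th neighbor distance."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # exclude self-loops
    s = np.sort(dist, axis=1)[:, k - 1] + 1e-12    # threshold s(z_v) per node
    return dist < s[:, None]                       # boolean adjacency matrix

rng = np.random.default_rng(0)
Z = rng.uniform(size=(200, 2))   # hidden features (unobserved in the paper's setting)
A = knn_graph(Z, k=10)           # every node has out-degree exactly k
```

Note that the resulting graph is directed even though the metric is symmetric: 𝑢 may be among 𝑣's 𝑘 nearest neighbors without the converse holding, which is one reason the paper works with directed graphs.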
Remark (One-by-One Generation). The assumption of adding nodes one at a time may seem artificial. New users are indeed inserted into social networks one by one in some scenarios, but other graphs do not necessarily follow this process. This assumption is introduced for technical convenience, to consider the limit 𝑛 → ∞ and to prove consistency. In practice, the generation process of a dataset does not need to follow this assumption; we use a single fixed graph in the experiments. GNNs succeed in recovering the hidden features as long as the graph is sufficiently large.
3.2 Graph Neural Networks
We consider message-passing GNNs [14] in this paper. Formally, 𝐿-layer GNNs can be formulated as follows. Let 𝑿 = [𝒙_1, …, 𝒙_𝑛]^⊤ ∈ ℝ^{𝑛 × 𝑑_in} be the explicit (i.e., given, observed) node features, and

𝒉_𝑣^(0) = 𝒙_𝑣 (∀𝑣 ∈ 𝑉),

𝒂_𝑣^(𝑙) = 𝑓_agg^(𝑙)({{ 𝒉_𝑢^(𝑙−1) ∣ 𝑢 ∈ 𝒩⁻(𝑣) }}) (∀𝑙 ∈ [𝐿], 𝑣 ∈ 𝑉),

𝒉_𝑣^(𝑙) = 𝑓_upd^(𝑙)(𝒉_𝑣^(𝑙−1), 𝒂_𝑣^(𝑙)) (∀𝑙 ∈ [𝐿], 𝑣 ∈ 𝑉),
where {{ ⋅ }} denotes a multiset, and 𝒩⁻(𝑣) is the set of neighbors with outgoing edges to node 𝑣. We call 𝑓_agg^(𝑙) an aggregation function and 𝑓_upd^(𝑙) an update function. Let 𝜃 = [𝐿, 𝑓_agg^(1), 𝑓_upd^(1), …, 𝑓_agg^(𝐿), 𝑓_upd^(𝐿)] denote the list of all aggregation and update functions, i.e., 𝜃 specifies a model. Let 𝑓(𝑣, 𝐺, 𝑿; 𝜃) = 𝒉_𝑣^(𝐿) be the output of the GNN 𝜃 for node 𝑣 and input graph 𝐺. For notational convenience, 𝐿_𝜃, 𝑓_{agg,𝜃}^(𝑙), and 𝑓_{upd,𝜃}^(𝑙) denote the number of layers, the 𝑙-th aggregation function, and the 𝑙-th update function of model 𝜃, respectively.
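In code, one layer of this recursion looks as follows. This is a minimal sketch with toy aggregation and update functions of our choosing, not the specific constructions used later in the proofs:

```python
import numpy as np

def message_passing_layer(A, H, f_agg, f_upd):
    """One message-passing layer: a_v = f_agg({{h_u : u in N^-(v)}}),
    h_v = f_upd(h_v, a_v). A[u, v] = 1 encodes an arc u -> v."""
    H_new = np.empty_like(H)
    for v in range(H.shape[0]):
        neighbors = H[A[:, v] > 0]   # multiset of in-neighbor features
        H_new[v] = f_upd(H[v], f_agg(neighbors))
    return H_new

# Toy instantiation: sum aggregation and averaging update on a 3-cycle.
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
H = np.array([[1.0], [2.0], [3.0]])
H1 = message_passing_layer(A, H,
                           f_agg=lambda hs: hs.sum(axis=0),
                           f_upd=lambda h, a: (h + a) / 2)
```

Stacking 𝐿 such calls yields 𝑓(𝑣, 𝐺, 𝑿; 𝜃); the theorems below concern which (𝑓_agg, 𝑓_upd) pairs exist, not how to learn them.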
Typical applications of GNNs assume that each node has rich explicit features 𝒙 𝑣 . However, this is not the case in many applications, and only the graph structure 𝐺 is available. For example, when we analyze social networks, demographic features of users may be masked due to privacy concerns. In such a case, synthetic features that can be computed solely from the input graph, such as degree features and the number of nodes, are used as explicit node features 𝒙 𝑣 [11, 17, 61]. In this paper, we tackle this general and challenging setting to show how GNNs exploit the graph structure. Specifically, we do not assume any external node features but set
𝒙_𝑣 = [𝑑_𝑣, 𝑛]^⊤ ∈ ℝ², (1)

where 𝑑_𝑣 is the degree of node 𝑣, and 𝑛 = |𝑉| is the number of nodes in 𝐺.
The goal of this paper is to show that GNNs can recover the hidden features 𝒛_𝑣 even if the node features are as scarce as these simple synthetic features. In other words, we show that there exists a GNN 𝜃 that uses the explicit node features 𝑿 defined by Eq. (1) and outputs 𝑓(𝑣, 𝐺, 𝑿; 𝜃) ≈ 𝒛_𝑣. This result is surprising because GNNs have been considered to simply smooth out the input features along the input graph [29, 36]. Our results show that GNNs can imagine, from scratch, new features 𝒛_𝑣 that are not included in the explicit features 𝑿.
Remark (Expressive Power and Optimization). We note that the goal of this paper is to show the expressive power of GNNs, i.e., the existence of the parameters 𝜃 or the model specification that realizes some function; how to find them from the data, i.e., optimization, is outside the scope of this paper. The separation of the studies of expressive power and optimization is a convention in the literature [42, 31, 1], and this paper is in line with it. In the experiments, we briefly report empirical results on optimization.
3.3 Why Is Recovery Challenging?

Figure 2: Illustrations of the Difficulty of Recovery. The input graph is a 10-NN graph of the hidden features. The shortest-path distance between points A and B is 21 hops, and the shortest-path distance between points A and C is 18 hops. These distances suggest that point C is closer to point A than point B is, but this is not the case in the true feature space. Standard node embedding methods would embed node C closer to A than node B, which is inconsistent with the true features. Embedding nodes that are close in the input graph close to each other is the critical assumption in various embedding methods. This assumption does NOT hold in our situation. This disagreement is caused by the different scales of edges in sparse and dense regions. The difficulty lies in the fact that these scales are not directly available in the input information.
The difficulty lies in the fact that the graph distance on the unweighted graph is not consistent with the distance in the hidden feature space. In other words, two nodes can be far in the hidden feature space even if they are close in the input graph. This fact is illustrated in Figure 2. Most of the traditional node embedding methods rely on the assumption that close nodes in the graph should be embedded close. However, this assumption does not hold in our setting, and thus these methods fail to recover the hidden features from the vanilla graph. If the edge lengths ‖ 𝒛 𝑣 − 𝒛 𝑢 ‖ in the hidden feature space are taken into account, the shortest-path distance on the graph is a consistent estimator for the distance of nodes in the hidden feature space. However, the problem is that the edge length ‖ 𝒛 𝑣 − 𝒛 𝑢 ‖ , such as the quantitative intimacy between people in social networks and the distance between the true topics of two papers in citation networks, is not available, and what we observe is only the vanilla graph 𝐺 in many applications. If we cannot rely on the distance structure in the given graph, it seems impossible to estimate the distance structure of the hidden feature space.
A hop count does not reflect the distance in the feature space because one hop on the graph in a sparse region is longer in the feature space than a hop in a dense region. The problem is, we do not know whether the region around node 𝑣 is dense because we do not know the hidden features 𝒛 𝑣 . One might think that the density around a node could be estimated e.g., by the degree of the node, but this is not the case. Indeed, as the graph shown in Figure 2 is a 𝑘 -NN graph, the degrees of all nodes are the same, and therefore the degree does not provide any information about the density. In general, the density cannot be estimated from a local structure, as von Luxburg et al. [57] noted that “It is impossible to estimate the density in an unweighted 𝑘 -NN graph by local quantities alone.”
However, somewhat unexpectedly, it can be shown that the density function 𝑝(𝒛) can be estimated solely from the unweighted graph [20, 51]. Intuitively, the random walk on the unweighted graph converges to a diffusion process on the true feature space as the number of nodes increases, and we can estimate the density function 𝑝(𝒛) from it. Once the density is estimated, we can roughly estimate the threshold function 𝑠(𝒛), because edges in low-density regions are long and edges in dense regions are short. The scale function, which represents a typical scale of edges around 𝒛, can then be used as a surrogate value for the edge length.
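The first ingredient here, the stationary distribution of the random walk, can be approximated by power iteration over the transition matrix. The following is a bare-bones sketch of that single step, not the full density estimator of [20, 51]; the lazy step is our implementation convenience to guarantee convergence on periodic graphs and does not change the stationary distribution:

```python
import numpy as np

def stationary_distribution(A, iters=500):
    """Approximate the stationary distribution pi = pi T of the random walk
    on the graph with adjacency matrix A, via power iteration."""
    n = A.shape[0]
    T = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    T = 0.5 * np.eye(n) + 0.5 * T          # lazy walk: same pi, always converges
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ T
    return pi

# On an undirected graph, pi is proportional to the node degrees.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path 0-1-2
pi = stationary_distribution(A)   # approx. [0.25, 0.5, 0.25]
```

Each multiplication by 𝑇 is one round of neighbor averaging, which is why this global quantity is within reach of a message-passing GNN.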
Even if the scale function can be estimated in principle, it is another story whether GNNs can estimate it. We positively prove this. In the following, we first focus on estimating the threshold function 𝑠 ( 𝒛 ) by GNNs, and then, we show that GNNs can recover the hidden features 𝒛 𝑣 by leveraging the threshold function.
4 Main Results
We present our main results and their proofs in this section. At a high level, our results are summarized as follows:
•
We show in Section 4.1 that GNNs can estimate the threshold function 𝑠(𝒛) with the aid of the metric recovery theory of unweighted graphs [20, 2]. We use the tools on random walks and diffusion processes developed in [20, 51].
•
We show in Section 4.2 that GNNs can recover the hidden features up to rigid transformation with the aid of the theory of multidimensional scaling [50, 49] and random node features [43, 1].
•
We show in Theorems 4.2 and 4.5 that the number of functions to be learned is finite regardless of the number of nodes, which is important for learning and generalization [62].
4.1 Graph Neural Networks can Recover the Threshold Function
First, we show that GNNs can consistently estimate the threshold function 𝑠 . As we mentioned in the previous section, the density and threshold function cannot be estimated solely from the local structure. As we will show (and as is known in the classical context), they can be estimated by a PageRank-like global quantity of the input graph.
Theorem 4.1.
For any 𝑠 and 𝑔 that satisfy Assumptions 1–5, there exist 𝜃_1, 𝜃_2, … such that with the explicit node features 𝑿 defined by Eq. (1),

Pr[ 𝑓(𝑣, 𝐺_𝑛, 𝑿; 𝜃_𝑛) → 𝑠(𝒛_𝑣) as 𝑛 → ∞ ] = 1,

where the probability is with respect to the draw of samples 𝒛_1, 𝒛_2, ….
Proof Sketch.
We prove this theorem by construction. The key idea is that GNNs can simulate random walks on graphs [9]. Once the stationary distribution of random walks is estimated, we can recover the scale from it [20]. The full proof can be found in Appendix A. ∎
This theorem states that GNNs can represent a consistent estimator of 𝑠 ( 𝒛 𝑣 ) .
However, this theorem does not bound the number of layers, and the number of layers may grow infinitely as the number of nodes increases. This means that if the size of the graphs is not bounded, the number of functions to be learned grows infinitely. This is undesirable for learning. The following theorem resolves this issue.
Theorem 4.2.
There exist ℎ_agg^(1), ℎ_upd^(1), ℎ_agg^(2), ℎ_upd^(2) such that for any 𝑠 and 𝑔, there exists ℎ_upd^(3) such that Theorem 4.1 holds with

𝑓_{agg,𝜃_𝑛}^(1) = ℎ_agg^(1), 𝑓_{upd,𝜃_𝑛}^(1) = ℎ_upd^(1),

𝑓_{agg,𝜃_𝑛}^(𝑙) = ℎ_agg^(2), 𝑓_{upd,𝜃_𝑛}^(𝑙) = ℎ_upd^(2) (𝑙 = 2, …, 𝐿_{𝜃_𝑛} − 1),

𝑓_{agg,𝜃_𝑛}^(𝐿_{𝜃_𝑛}) = ℎ_agg^(2), 𝑓_{upd,𝜃_𝑛}^(𝐿_{𝜃_𝑛}) = ℎ_upd^(3).
Proof Sketch.
From the proof of Theorem 4.1, most of the layers in 𝜃_𝑛 are used for estimating the stationary distribution, which can be realized by repetition of the same layer. ∎
This theorem shows that the number of functions we need to learn is essentially five. This result indicates that learning the scale function has a good algorithmic alignment [62, Definition 3.4]. Moreover, these functions are the same regardless of the graph size. Therefore, in theory, one can fit these functions using small graphs and apply the resulting model to large graphs as long as the underlying law of the generation process, namely 𝑠 and 𝑔, is fixed. Note that the order of the logical quantification matters. As ℎ_agg^(1), ℎ_upd^(1), ℎ_agg^(2), ℎ_upd^(2) are universal and independent of the generation process, they can be learned using other graphs and transferred to other types of graphs. The construction of these layers (i.e., the computation of the stationary distribution) can also be used for introducing inductive biases into GNN architectures.
4.2 Graph Neural Networks can Recover the Hidden Features
As we have estimated the scale function, it seems easy to estimate the distance structure by applying the Bellman-Ford algorithm with edge lengths and to recover the hidden features, but this does not work well.
The first obstacle is that there is a degree of freedom of rigid transformation. As rotating and shifting the true hidden features does not change the observed graph, we cannot distinguish hidden features transformed by a rotation solely from the graph. To absorb this degree of freedom, we introduce the following measure of discrepancy between features.
Definition 4.3.
We define the distance between two feature matrices 𝑿, 𝒀 ∈ ℝ^{𝑛 × 𝑑} as

𝑑_𝐺(𝑿, 𝒀) := min_{𝑷 ∈ ℝ^{𝑑 × 𝑑}: 𝑷^⊤𝑷 = 𝑰_𝑑} (1/𝑛) ‖𝑪_𝑛 𝑿 − 𝑪_𝑛 𝒀 𝑷‖_𝐹², (4)

where 𝑪_𝑛 := 𝑰_𝑛 − (1/𝑛) 𝟙_𝑛 𝟙_𝑛^⊤ ∈ ℝ^{𝑛 × 𝑛} is the centering matrix, 𝑰_𝑛 ∈ ℝ^{𝑛 × 𝑛} is the identity matrix, and 𝟙_𝑛 ∈ ℝ^𝑛 is the vector of ones. We say that we recover the hidden features 𝑿 if we obtain features 𝒀 such that 𝑑_𝐺(𝑿, 𝒀) < 𝜀 for sufficiently small 𝜀 > 0.
In other words, the distance is the minimum average distance between two features after rigid transformation. This distance is sometimes referred to as the orthogonal Procrustes distance [23, 47, 49], and can be computed efficiently by SVD [47]. Note that if one further wants to recover the rigid transformation factor, one can recover it in a semi-supervised manner by the Procrustes analysis.
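Concretely, the minimizing 𝑷 in Eq. (4) is the orthogonal Procrustes solution obtained from one SVD. A minimal NumPy sketch (our variable names; the dense centering matrix is for clarity, not efficiency):

```python
import numpy as np

def procrustes_distance(X, Y):
    """d_G(X, Y) of Definition 4.3: mean squared discrepancy after
    centering and the best orthogonal alignment P (P^T P = I_d)."""
    n = X.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix C_n
    CX, CY = C @ X, C @ Y
    U, _, Vt = np.linalg.svd(CY.T @ CX)      # orthogonal Procrustes: P = U V^T
    P = U @ Vt
    return np.linalg.norm(CX - CY @ P) ** 2 / n

# A rotated and shifted copy of X has distance (numerically) zero to X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
c, s = np.cos(0.7), np.sin(0.7)
Y = X @ np.array([[c, -s], [s, c]]) + 3.0
```

The centering removes the translation, and the SVD step absorbs rotations and reflections, exactly the invariances the definition is meant to quotient out.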
The second obstacle is that GNNs cannot distinguish nodes. A naive solution is to include unique node ids in the node features. However, this lets the number of dimensions of the node features grow to infinity as the number of nodes tends to infinity, which is undesirable for learning and for generalization across graph sizes. Our solution is to randomly select a constant number 𝑚 of nodes and assign unique node ids only to the selected nodes. Specifically, let 𝑚 ∈ ℤ₊ be a constant hyperparameter. We first select 𝑚 nodes 𝒰 = {𝑢_1, …, 𝑢_𝑚} ⊂ 𝑉 uniformly at random and set the input node features 𝒙_𝑣 ∈ ℝ^{2+𝑚} as

𝒙_𝑣^syn = [𝑑_𝑣, 𝑛, 𝒆_𝑖^⊤]^⊤ (𝑣 = 𝑢_𝑖),

𝒙_𝑣^syn = [𝑑_𝑣, 𝑛, 𝟘_𝑚^⊤]^⊤ (𝑣 ∉ 𝒰), (5)
where 𝒆_𝑖 ∈ ℝ^𝑚 is the 𝑖-th standard basis vector, and 𝟘_𝑚 is the vector of zeros. Importantly, this approach does not increase the number of dimensions even as the number of nodes tends to infinity because 𝑚 is a constant with respect to 𝑛. From a technical point of view, this is a critical difference from existing analyses [31, 43, 1], which assume unique node ids. Our analysis strikes an excellent trade-off between small complexity (a constant number of dimensions) and strong expressive power (precise recovery). In addition, adding node ids has been thought to be valid only for transductive settings [16, Section 5.1.1], but our analysis is valid for the inductive setting as well (see also the experiments).
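Assembling the features of Eq. (5) is straightforward. A short sketch (the example graph and random seed are illustrative assumptions):

```python
import numpy as np

def synthetic_features(A, m, rng):
    """Eq. (5): x_v = [d_v, n, e_i] if v = u_i in the anchor set U,
    and x_v = [d_v, n, 0_m] otherwise."""
    n = A.shape[0]
    anchors = rng.choice(n, size=m, replace=False)   # U, drawn uniformly
    X = np.zeros((n, 2 + m))
    X[:, 0] = A.sum(axis=1)                          # degree d_v
    X[:, 1] = n                                      # graph size n = |V|
    X[anchors, 2 + np.arange(m)] = 1.0               # one-hot ids for anchors only
    return X, anchors

A = np.ones((6, 6)) - np.eye(6)                      # toy complete graph K_6
X, anchors = synthetic_features(A, m=3, rng=np.random.default_rng(0))
```

The feature dimension 2 + 𝑚 stays constant as 𝑛 grows, which is the point of the construction.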
We show that we can accurately estimate the distance structure and the hidden features by setting an appropriate number of the selected nodes.
Theorem 4.4.
For any 𝑠 and 𝑔 that satisfy Assumptions 1–5 and any 𝜀, 𝛿 > 0, there exist 𝑚 and 𝜃_1, 𝜃_2, … such that with the explicit node features 𝑿 defined by Eq. (5),

Pr[ limsup_{𝑛 → ∞} 𝑑_𝐺(𝒁̂_{𝜃_𝑛}, 𝒁) < 𝜀 ] ≥ 1 − 𝛿,

where 𝒁̂_{𝜃_𝑛} = [𝑓(𝑣_1, 𝐺_𝑛, 𝑿; 𝜃_𝑛), …, 𝑓(𝑣_𝑛, 𝐺_𝑛, 𝑿; 𝜃_𝑛)]^⊤ ∈ ℝ^{𝑛 × 𝑑} is the matrix of hidden features estimated by GNN 𝜃_𝑛, and 𝒁 = [𝒛_1, …, 𝒛_𝑛]^⊤ ∈ ℝ^{𝑛 × 𝑑} is the matrix of true hidden features. The probability is with respect to the draw of samples 𝒛_1, 𝒛_2, … and the random selection of 𝒰.
Proof Sketch.
We prove this theorem by construction. We estimate the threshold function 𝑠 by Theorem 4.1 and compute the shortest path distances from each selected node in 𝒰 with the estimated edge lengths. The computation of shortest path distances can be done by GNNs [62]. After this process, each node has the information of the (approximate) distance matrix among the selected nodes, which consists of 𝑚 2 dimensions. We then run multidimensional scaling in each node independently and recover the coordinates of the selected nodes. Lastly, the selected nodes announce their coordinates, and the non-selected nodes output the coordinates of the closest nodes in 𝒰 . With sufficiently large 𝑚 , the selected nodes 𝒰 form an 𝜀 ′ -covering of 𝒟 with high probability, and therefore, the mismatch of the non-selected nodes is negligibly small. The full proof can be found in Appendix B. ∎
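The multidimensional scaling step of this proof can be illustrated in isolation: classical MDS recovers coordinates, up to rigid transformation, from a matrix of pairwise Euclidean distances via double centering and an eigendecomposition. A minimal sketch (ours, outside any GNN):

```python
import numpy as np

def classical_mds(D, d):
    """Recover n x d coordinates (up to rigid transformation) from an
    n x n Euclidean distance matrix D by classical multidimensional scaling."""
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * C @ (D ** 2) @ C              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]         # keep the d largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# The recovered points reproduce the input distances (up to float error).
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 2))
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
Z_hat = classical_mds(D, 2)
```

In the proof, 𝐷 is the estimated 𝑚 × 𝑚 distance matrix among the anchor nodes, so this computation fits in a constant-size per-node update.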
As in Theorem 4.1, the statement of Theorem 4.4 does not bound the number of layers. However, as in Theorem 4.2, Theorem 4.4 can also be realized with a fixed number of functions.
Theorem 4.5.
For any 𝑠 and 𝑔 , there exist ℎ 𝑎𝑔𝑔 ( 1 ) , ℎ 𝑢𝑝𝑑 ( 1 ) , … , ℎ 𝑎𝑔𝑔 ( 8 ) , ℎ 𝑢𝑝𝑑 ( 8 ) such that Theorem 4.4 holds with these functions.
Therefore, the number of functions we need to learn is essentially a constant. This fact indicates that learning the hidden features has a good algorithmic alignment [62, Definition 3.4]. Besides, the components of these functions, i.e., the computation of the stationary distribution, shortest-path distances, and multidimensional scaling, are differentiable almost everywhere. By almost everywhere, we mean that non-differentiable points exist due to the min-operator of the shortest-path algorithm. Strictly speaking, this is non-differentiable only in the same mild sense that the ReLU function is, and the pipeline can be optimized in an end-to-end manner by backpropagation using automatic differentiation frameworks such as PyTorch.
Figure 3: Results for the Transductive Setting. Overall, the proposed method succeeds in recovering the ground-truth hidden features, while tSNE applied to 𝑿 (Eq. (5)) fails, and GINs and GATs are mediocre. (Top Left) The ground-truth hidden embeddings. The node ids are numbered based on the x-coordinate and shown in the node colors. These node ids are for visualization purposes only and are NOT shown to GNNs and downstream algorithms. (Top Mid) The input graph constructed from the hidden features. The positions in the visualization are NOT shown to GNNs. (Top Right) tSNE plot of the synthetic node features, i.e., Eq. (5). These results indicate that the node features are not informative for feature recovery, which makes the task challenging. (Bottom Left) The features recovered by the proposed method. They resemble the ground truth not only with respect to the cluster structure but also the x-coordinates (shown in the node colors), the curved moon shapes in the two-moon dataset, and the striped pattern in the Adult dataset. The 𝑑_𝐺 value (Eq. (4)) is small, which indicates the success of the recovery and validates the theory. (Bottom Mid) The features recovered by GINs. They do not resemble the true hidden features, and the 𝑑_𝐺 value is mediocre. (Bottom Right) The features recovered by GATs. They do not resemble the hidden features, but some clusters are detected (shown in the node colors), and the 𝑑_𝐺 value is mediocre. These results show that existing GNNs can extract some information from the graph structure, but they do not fully recover the hidden features.

Figure 4: Results for the Inductive Setting. The legends and tendencies are the same as in Figure 3. The proposed method succeeds in generalizing to different sizes and keeps 𝑑_𝐺 low even in the extrapolation setting. GINs and GATs partially succeed in extracting some of the graph information, but they are not perfect, and 𝑑_𝐺 is moderately high.
Remark (GNNs with Input Node Features). Many graph-related tasks provide node features $\{\boldsymbol{x}_v^{\text{given}} \mid v \in V\}$ as input. Theorem 4.4 shows that GNNs with $\boldsymbol{x}_v^{\text{syn}}$ as explicit node features can recover $\boldsymbol{z}_v$, where $\boldsymbol{x}_v^{\text{syn}}$ is defined by Eq. (5). Thus, if we feed $\boldsymbol{x}_v = [\boldsymbol{x}_v^{\text{given}\top}, \boldsymbol{x}_v^{\text{syn}\top}]^\top$ to GNNs as explicit node features, GNNs can implicitly use both $\boldsymbol{x}_v^{\text{given}}$ and $\boldsymbol{z}_v$.
The most straightforward method for node classification is to apply a feed-forward network to each node independently,

$$\hat{y}_v^{\text{MLP}} = h_\theta(\boldsymbol{x}_v^{\text{given}}). \tag{6}$$

This approach is fairly strong when $\boldsymbol{x}_v^{\text{given}}$ is rich [36] but ignores the graph. Our analysis shows that GNNs can classify nodes using both $\boldsymbol{x}_v^{\text{given}}$ and $\boldsymbol{z}_v$, i.e.,

$$\hat{y}_v^{\text{GNN}} = h_\theta(\boldsymbol{x}_v^{\text{given}}, \boldsymbol{z}_v). \tag{7}$$

Comparing Eqs. (6) and (7) highlights a strength of GNNs over feed-forward networks.
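The contrast between Eqs. (6) and (7) can be sketched concretely. In the toy example below, the head `h_theta`, the feature values, and the threshold are all hypothetical stand-ins, not from the paper: when $\boldsymbol{x}_v^{\text{given}}$ is uninformative, the node-wise classifier of Eq. (6) cannot distinguish nodes, while the same head applied to $[\boldsymbol{x}_v^{\text{given}}, \boldsymbol{z}_v]$ as in Eq. (7) can.

```python
# Sketch contrasting Eqs. (6) and (7). `h_theta` is a hypothetical fixed
# classification head; the feature values are illustrative only.

def h_theta(features):
    return 1 if sum(features) > 0.5 else 0   # hypothetical trained head

x_given = [0.0]            # identical, uninformative explicit features
z_a, z_b = [0.9], [0.1]    # recovered hidden features differ across nodes

pred_mlp_a = h_theta(x_given)            # Eq. (6): same output for both nodes
pred_mlp_b = h_theta(x_given)
pred_gnn_a = h_theta(x_given + z_a)      # Eq. (7): hidden features break the tie
pred_gnn_b = h_theta(x_given + z_b)
print(pred_mlp_a == pred_mlp_b, pred_gnn_a != pred_gnn_b)  # True True
```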
5 Experiments
In the experiments, we validate the theorems by empirically showing that GNNs can recover hidden features solely from the input graph.
5.1 Recovering Features
Datasets. We use the following synthetic and real datasets.
Two-Moon
is a synthetic dataset with a two-moon shape. We construct a $k$-nearest neighbor graph with $k = \lfloor \frac{1}{10} n^{1/2} \log n \rfloor$, which satisfies Assumption 4. As we know the ground-truth generation process and can generate different graphs with the same law of data generation, we use this dataset for validating the theorems and for showing generalization ability in the inductive setting.
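The construction above can be sketched in a few lines. This is a minimal pure-Python version, assuming Euclidean distances and directed edges to the $k$ nearest nodes (the paper's exact tie-breaking and edge orientation conventions are not specified here):

```python
import math

# Sketch: kNN graph construction with k = floor(n^{1/2} log n / 10),
# as used for the Two-Moon dataset. Directed edge (i, j) means j is one
# of the k nearest neighbors of i.

def knn_graph(points):
    n = len(points)
    k = max(1, math.floor(math.sqrt(n) * math.log(n) / 10))
    edges = set()
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add((i, j))
    return k, edges

k, edges = knn_graph([(0, 0), (1, 0), (2, 0), (10, 0)])
print(k, len(edges))  # 1 4: with n = 4, each node points to its nearest node
```

For realistic $n$, a spatial index (e.g., a k-d tree) would replace the quadratic scan, but the graph-generation law is the same.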
Adult
is a real census dataset. We use age and the logarithm of capital gain as the hidden features and construct a $k$-nearest neighbor graph, i.e., people with similar ages and incomes become friends.
Settings. We use two different problem settings.
Transductive
setting uses a single graph. We are given the true hidden features of some training nodes and estimate the hidden features of the other nodes. We use 70 percent of the nodes for training and the rest for testing.
Inductive
setting uses multiple graphs. In the training phase, we are given two-moon datasets with $n = 1000$ to $5000$ nodes and their true hidden features. In the test phase, we are given a new two-moon dataset with $n = 10000$ nodes and estimate the hidden features of the test graph. This setting is challenging because (i) we do not know any hidden features of the test graph, and (ii) models need to generalize by extrapolating in the size of the input graphs.
Methods. As we prove the theorems by construction, we know the configuration of GNNs that recover the hidden features up to the unknown parameters of the ground-truth data (i.e., the scale $g_n$ and the constant $c$ that depends on $p$ and $\bar{s}$). We therefore use the model architecture constructed in our proof and model the unknown scaling factor $g_n$ by a $3$-layer perceptron with $128$ hidden neurons that takes $n$ as input and outputs $g_n$. This model can be regarded as a GNN with the maximum inductive bias for recovering the hidden features. We fix the number of selected nodes to $m = 500$ throughout the experiments.
Baselines. We use 3 -layer Graph Attention Networks (GATs) [55] and Graph Isomorphism Networks (GINs) [61] as baselines. We feed the same explicit node features as in our method, i.e., Eq. (5), and use the hidden features as the target of the regression.
Details. We optimize all the methods with Adam [25] with a learning rate of 0.001 for 100 epochs. The loss function is 𝑑 𝐺 ( 𝒁 ^ 𝜃 [ train-mask ] , 𝒁 [ train-mask ] ) , where 𝒁 ^ 𝜃 ∈ ℝ 𝑛 × 𝑑 is the output of GNNs, 𝒁 is the ground truth hidden embeddings, and train-mask extracts the coordinates of the training nodes.
Table 1: Performance on Downstream Tasks. Each value represents accuracy; higher is better. The performance of the baseline method shows that $\boldsymbol{x}_v^{\text{syn}}$ does not contain any information for solving the downstream tasks. GNNs take only such uninformative node features as input. Nevertheless, the recovered features are highly predictive. This indicates that GNNs create completely new and useful node features by themselves even when the input node features are uninformative.

| Method | Cora | CiteSeer | PubMed | Coauthor CS | Coauthor Physics | Computers | Photo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline $\boldsymbol{x}_v^{\text{syn}}$ | 0.122 | 0.231 | 0.355 | 0.066 | 0.307 | 0.185 | 0.207 |
| Recovered feature $\hat{\boldsymbol{z}}_v$ | 0.671 | 0.640 | 0.653 | 0.492 | 0.745 | 0.528 | 0.566 |
Results. Figures 3 and 4 show the results. As the rigid transformation factor cannot be determined, we align the recovered features using the orthogonal Procrustes analysis in the postprocessing. We make the following observations.
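The orthogonal Procrustes alignment used in the post-processing has a closed-form SVD solution [47]. A minimal sketch (the rotation angle and data are illustrative, not the paper's):

```python
import numpy as np

# Sketch: aligning recovered features to the ground truth via the orthogonal
# Procrustes solution (Schönemann, 1966): the orthogonal P minimizing
# ||Z_hat @ P - Z||_F is P = U @ Vt, where U @ S @ Vt = SVD(Z_hat.T @ Z).

def procrustes_align(z_hat, z_true):
    u, _, vt = np.linalg.svd(z_hat.T @ z_true)
    p = u @ vt                       # optimal orthogonal alignment matrix
    return z_hat @ p

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 2))         # "ground truth" embeddings
theta = 0.7                          # arbitrary rotation angle
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
z_hat = z @ rot                      # "recovered" features, a rotated copy
aligned = procrustes_align(z_hat, z)
print(np.allclose(aligned, z))       # True: the rotation is undone exactly
```

Translation is handled separately by centering both point sets before solving for the rotation.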
Observation 1. Recovery Succeeded. As the lower left panels show, the proposed method succeeds in recovering the hidden features solely from the input graphs. Notably, not only coarse structures such as connected components are recovered but also details such as the curved moon shape in the two-moon dataset and the striped pattern in the Adult dataset are recovered.
Observation 2. Existing GNNs are mediocre. As the lower right panels show, GINs and GATs extract some information from the graph structure, e.g., they map nearby nodes to similar embeddings as shown by the node colors, but they fail to recover the hidden features accurately despite their strong expressive power. This is primarily because the input node features contain little information, which makes recovery difficult. These results highlight the importance of inductive biases of GNNs to exploit the hidden features.
Observation 3. tSNE fails. The upper right panels show that tSNE on the explicit node features 𝑿 failed to extract meaningful structures. These results indicate that the synthetic node features (Eq. (5)) do not tell anything about the hidden features, and GNNs recover the hidden features solely from the graph structure.
Observation 4. Recovery Succeeded in the Inductive Setting. Figure 4 shows that the proposed method succeeded in the inductive setting as well. This shows that the ability of recovery can be transferred to other graph sizes as long as the law of data generation is the same.
5.2 Performance on Downstream Tasks
We confirm that the recovered feature 𝒛 ^ 𝑣 is useful for downstream tasks using popular benchmarks.
We use the Planetoid datasets (Cora, CiteSeer, PubMed) [63], Coauthor datasets, and Amazon datasets [48]. First, we discard all the node features in the datasets (e.g., the text information of the citation network). We then feed the vanilla graph to GNNs in the proposed way and recover the hidden node features by GNNs. We fit a logistic regression that estimates the node label 𝑦 𝑣 from the recovered features 𝒛 ^ 𝑣 . As a baseline, we fit a logistic regression that estimates the node label 𝑦 𝑣 from the input node feature 𝒙 𝑣 syn , i.e., Eq. (5). We use the standard train/val/test splits of these datasets, i.e., 20 training nodes per class. The accuracy in the test sets is shown in Table 1.
These results show that the recovered features by GNNs are informative for downstream tasks while the input node features are not at all. This indicates that GNNs extract meaningful information solely from the graph structure. We stress that this problem setting where no node features are available is extremely challenging for GNNs. Recall that existing GNNs use the node features (e.g., the text information of the citation network) contained in these datasets, which we intentionally discard and do not use. The results above show that GNNs work well (somewhat unexpectedly) in such a challenging situation.
6 Conclusion
In this paper, we showed that GNNs can recover the hidden node features, which contain all information about the graph structure, solely from the graph input. These results provide a different perspective from the existing results, which indicate that GNNs simply mix and smooth out the given node features. In the experiments, GNNs accurately recover the hidden features in both transductive and inductive settings.
Acknowledgements. This work was supported by JSPS KAKENHI Grant Numbers 21J22490 and 20H04244.
References
[1] Ralph Abboud, İsmail İlkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power of graph neural networks with random node initialization. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI, pages 2112–2118, 2021.
[2] Morteza Alamgir and Ulrike von Luxburg. Shortest path distance in random k-nearest neighbor graphs. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.
[3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
[4] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI, pages 3438–3445, 2020.
[5] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In Proceedings of the 6th International Conference on Learning Representations, ICLR, 2018.
[6] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In Proceedings of the 37th International Conference on Machine Learning, ICML, pages 1725–1735, 2020.
[7] Miles D. Cranmer, Alvaro Sanchez-Gonzalez, Peter W. Battaglia, Rui Xu, Kyle Cranmer, David N. Spergel, and Shirley Ho. Discovering symbolic models from deep learning with inductive biases. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.
[8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, NeurIPS, pages 3837–3845, 2016.
[9] Nima Dehmamy, Albert-László Barabási, and Rose Yu. Understanding the representation power of graph neural networks in learning graph topology. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pages 15387–15397, 2019.
[10] Paul Erdős and Alfréd Rényi. On random graphs I. Publicationes Mathematicae, 6(1):290–297, 1959.
[11] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In Proceedings of the 8th International Conference on Learning Representations, ICLR, 2020.
[12] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Yihong Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social recommendation. In The Web Conference 2019, WWW, pages 417–426, 2019.
[13] Vikas K. Garg, Stefanie Jegelka, and Tommi S. Jaakkola. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML, pages 3419–3430, 2020.
[14] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 1263–1272, 2017.
[15] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the International Joint Conference on Neural Networks, IJCNN, volume 2, pages 729–734, 2005.
[16] William L. Hamilton. Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 14(3):1–159, 2020.
[17] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, NeurIPS, pages 1024–1034, 2017.
[18] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision GNN: An image is worth graph of nodes. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS, 2022.
[19] Peng Han, Peng Yang, Peilin Zhao, Shuo Shang, Yong Liu, Jiayu Zhou, Xin Gao, and Panos Kalnis. GCN-MF: Disease-gene association identification by graph convolutional networks and matrix factorization. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 705–713, 2019.
[20] Tatsunori B. Hashimoto, Yi Sun, and Tommi S. Jaakkola. Metric recovery from directed unweighted graphs. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, AISTATS, 2015.
[21] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 639–648, 2020.
[22] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. GPT-GNN: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 1857–1867, 2020.
[23] John R. Hurley and Raymond B. Cattell. The Procrustes program: Producing direct rotation to test a hypothesized factor structure. Behavioral Science, 7(2):258, 1962.
[24] Stefanie Jegelka. Theory of graph neural networks: Representation and learning. arXiv, abs/2204.07697, 2022.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, 2015.
[26] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR, 2014.
[27] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR, 2017.
[28] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized PageRank. In Proceedings of the 7th International Conference on Learning Representations, ICLR, 2019.
[29] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI, pages 3538–3545, 2018.
[30] Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong. Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 975–985, 2021.
[31] Andreas Loukas. What graph neural networks cannot learn: Depth vs width. In Proceedings of the 8th International Conference on Learning Representations, ICLR, 2020.
[32] Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pages 2153–2164, 2019.
[33] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In Proceedings of the 7th International Conference on Learning Representations, ICLR, 2019.
[34] Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI, pages 4602–4609, 2019.
[35] Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak A. Rao, and Bruno Ribeiro. Relational pooling for graph representations. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 4663–4673, 2019.
[36] Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters. arXiv, abs/1905.09550, 2019.
[37] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In Proceedings of the 8th International Conference on Learning Representations, ICLR, 2020.
[38] Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh-based simulation with graph networks. In Proceedings of the 9th International Conference on Learning Representations, ICLR, 2021.
[39] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 1150–1160, 2020.
[40] Ryoma Sato. A survey on the expressive power of graph neural networks. arXiv, abs/2003.04078, 2020.
[41] Ryoma Sato. Towards principled user-side recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM, pages 1757–1766, 2022.
[42] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Approximation ratios of graph neural networks for combinatorial problems. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pages 4083–4092, 2019.
[43] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM International Conference on Data Mining, SDM, pages 333–341, 2021.
[44] Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Constant time graph neural networks. ACM Trans. Knowl. Discov. Data, 16(5):92:1–92:31, 2022.
[45] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
[46] Franco Scarselli, Ah Chung Tsoi, and Markus Hagenbuchner. The Vapnik-Chervonenkis dimension of graph and recursive neural networks. Neural Networks, 108:248–259, 2018.
[47] Peter H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
[48] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv, 2018.
[49] Robin Sibson. Studies in the robustness of multidimensional scaling: Procrustes statistics. Journal of the Royal Statistical Society: Series B (Methodological), 40(2):234–238, 1978.
[50] Robin Sibson. Studies in the robustness of multidimensional scaling: Perturbational analysis of classical scaling. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):217–229, 1979.
[51] Daniel W. Stroock and S. R. Srinivasa Varadhan. Diffusion processes with boundary conditions. Communications on Pure and Applied Mathematics, 24(2):147–225, 1971.
[52] Daniel L. Sussman, Minh Tang, and Carey E. Priebe. Consistent latent position estimation and vertex classification for random dot product graphs. IEEE Trans. Pattern Anal. Mach. Intell., 36(1):48–57, 2014.
[53] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[54] Yoshikazu Terada and Ulrike von Luxburg. Local ordinal embedding. In Proceedings of the 31st International Conference on Machine Learning, ICML, pages 847–855, 2014.
[55] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR, 2018.
[56] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In Proceedings of the 7th International Conference on Learning Representations, ICLR, 2019.
[57] Ulrike von Luxburg and Morteza Alamgir. Density estimation from unweighted k-nearest neighbor graphs: A roadmap. In Advances in Neural Information Processing Systems 26: Annual Conference on Neural Information Processing Systems 2013, NeurIPS, pages 225–233, 2013.
[58] Xiao Wang, Meiqi Zhu, Deyu Bo, Peng Cui, Chuan Shi, and Jian Pei. AM-GCN: Adaptive multi-channel graph convolutional networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pages 1243–1253, 2020.
[59] Xiaoyang Wang, Yao Ma, Yiqi Wang, Wei Jin, Xin Wang, Jiliang Tang, Caiyan Jia, and Jian Yu. Traffic flow prediction via spatial temporal graph neural network. In The Web Conference 2020, WWW, pages 1082–1092, 2020.
[60] Felix Wu, Amauri H. Souza Jr., Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 6861–6871, 2019.
[61] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR, 2019.
[62] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? In Proceedings of the 8th International Conference on Learning Representations, ICLR, 2020.
[63] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pages 40–48, 2016.
[64] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. When does self-supervision help graph convolutional networks? In Proceedings of the 37th International Conference on Machine Learning, ICML, pages 10871–10880, 2020.
[65] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS, pages 5171–5181, 2018.
[66] Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pages 11247–11256, 2019.

Appendix A Proof of Theorems 4.1 and 4.2
First, we introduce the following lemma.
Lemma A.1.
[20] Under assumptions 1-5,
$$\left(\frac{c\, d_v}{n^2 g_n^d\, \pi_{G_n, v}}\right)^{\frac{1}{d+2}} \to \bar{s}(\boldsymbol{z}_v) \tag{8}$$
holds almost surely, where 𝑐 ∈ ℝ is a constant that depends on 𝑝 and 𝑠 ¯ , and 𝜋 𝐺 𝑛 , 𝑣 is the stationary distribution of random walks on 𝐺 𝑛 .
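The quantity $\pi_{G_n, v}$ in Lemma A.1 can be computed by power iteration, which is what the GNN layers below implement implicitly. A minimal sketch on a toy graph (the graph itself is illustrative, not from the paper):

```python
import numpy as np

# Sketch: estimating the stationary distribution pi of the random walk on a
# graph by power iteration. For an undirected, connected, non-bipartite
# graph, pi_v is proportional to the degree d_v.

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)  # triangle 0-1-2 plus pendant 3
deg = adj.sum(axis=1)
rw = adj / deg[:, None]            # row-stochastic random-walk matrix

pi = np.ones(4) / 4                # uniform initial distribution
for _ in range(200):               # power iteration: pi <- pi @ R
    pi = pi @ rw
print(np.round(pi, 3))             # converges to deg / deg.sum()
```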
We then prove Theorem 4.1.
Proof of Theorem 4.1.
We prove the theorem by construction. Let

$$L_n \stackrel{\text{def}}{=} \max_{G = (V, E),\, |V| = n}\ \min \left\{ l \in \mathbb{Z}_+ \,\middle|\, \left\| \boldsymbol{R}_G^l \mathbb{1}_n - n \boldsymbol{\pi}_G \right\| \le \frac{1}{n} \right\}, \tag{9}$$

where $\boldsymbol{R}_G \in \mathbb{R}^{n \times n}$ is the random walk matrix of graph $G$. As $\boldsymbol{R}_G^l \mathbb{1}_n \to n \boldsymbol{\pi}_G$ as $l \to \infty$, the set in the minimum is not empty. As $\{G = (V, E) \mid |V| = n\}$ is finite for every $n$, the outer max exists, and therefore $L_n$ exists for every $n$. We set $L_{\theta_n} = L_n$. In the following, we build a GNN whose embedding at the $l$-th layer ($1 \le l \le L_n - 1$) is

$$\boldsymbol{h}_v^{(l)} = [d_v,\, n,\, (\boldsymbol{R}_{G_n}^l \mathbb{1}_n)_v]^\top. \tag{10}$$

The aggregation function of the first layer is

$$f_{\text{agg}, \theta_n}^{(1)}(\{\!\!\{ [d_u, n] \mid u \in \mathcal{N}^-(v) \}\!\!\}) \stackrel{\text{def}}{=} \sum_{u \in \mathcal{N}^-(v)} \frac{1}{d_u}, \tag{11}$$

i.e., $f_{\text{agg}, \theta_n}^{(1)}$ computes $(\boldsymbol{R}_{G_n} \mathbb{1}_n)_v$, the sum of probabilities from the incoming edges. The update function of the first layer is

$$f_{\text{upd}, \theta_n}^{(1)}([d_v, n]^\top, \boldsymbol{a}_v^{(1)}) \stackrel{\text{def}}{=} [d_v, n, \boldsymbol{a}_v^{(1)}]^\top. \tag{12}$$

$\boldsymbol{h}_v^{(1)}$ satisfies condition (10) by construction. The aggregation function of the $l$-th layer ($2 \le l \le L_n$) is

$$f_{\text{agg}, \theta_n}^{(l)}(\{\!\!\{ [d_u, n, (\boldsymbol{R}_{G_n}^{l-1} \mathbb{1}_n)_u] \mid u \in \mathcal{N}^-(v) \}\!\!\}) \stackrel{\text{def}}{=} \sum_{u \in \mathcal{N}^-(v)} \frac{(\boldsymbol{R}_{G_n}^{l-1} \mathbb{1}_n)_u}{d_u} = (\boldsymbol{R}_{G_n}^l \mathbb{1}_n)_v, \tag{13}$$

and the update function of the $l$-th layer ($2 \le l \le L_n - 1$) is

$$f_{\text{upd}, \theta_n}^{(l)}([d_v, n, (\boldsymbol{R}_{G_n}^{l-1} \mathbb{1}_n)_v]^\top, \boldsymbol{a}_v^{(l)}) \stackrel{\text{def}}{=} [d_v, n, \boldsymbol{a}_v^{(l)}]^\top. \tag{14}$$

$\boldsymbol{h}_v^{(l)}$ ($2 \le l \le L_n - 1$) satisfies condition (10) by construction. Lastly, the update function of the $L_n$-th layer is

$$f_{\text{upd}, \theta_n}^{(L_n)}([d_v, n, (\boldsymbol{R}_{G_n}^{L_n - 1} \mathbb{1}_n)_v]^\top, \boldsymbol{a}_v^{(L_n)}) \stackrel{\text{def}}{=} \left(\frac{c\, d_v}{n\, g_n^d\, \boldsymbol{a}_v^{(L_n)}}\right)^{\frac{1}{d+2}}. \tag{15}$$

By Eqs. (9) and (10), $|\boldsymbol{a}_v^{(L_n)} - n \pi_{G_n, v}| \to 0$ as $n \to \infty$ surely. Combining this with Eqs. (8) and (15) yields $f(v, G_n, \boldsymbol{X}; \theta_n) = \boldsymbol{h}_v^{(L_n)} \to \bar{s}(\boldsymbol{z}_v)$ almost surely as $n \to \infty$. ∎
The definitions of the layers, i.e., equations (11), (12), (13), (14), and (15), prove Theorem 4.2.
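The core of this construction, Eqs. (11)-(14), can be sketched as message passing on a toy graph (the graph is illustrative, not from the paper):

```python
# Sketch of the layers in Eqs. (11)-(14): each node sums its in-neighbors'
# values divided by their degrees, i.e., layer l computes (R_{G_n}^l 1_n)_v,
# which tends to n * pi_{G_n, v} as l grows.

def step(values, neighbors, deg):
    return [sum(values[u] / deg[u] for u in neighbors[v])
            for v in range(len(values))]

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
deg = [2, 2, 3, 1]
h = [1.0] * 4                         # corresponds to R^0 1_n = 1_n
for _ in range(200):                  # stacking layers approaches n * pi
    h = step(h, neighbors, deg)
print([round(x, 3) for x in h])       # n * pi = n * deg / sum(deg)
```

The final layer of Eq. (15) would then plug each converged value into the closed-form expression from Lemma A.1.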
Appendix B Proof of Theorems 4.4 and 4.5
Proof of Theorem 4.4.
We prove the theorem by construction. Let $\mathcal{C}$ be an arbitrary $\frac{\varepsilon}{6}$-covering of $\mathcal{D}$. For each point $c \in \mathcal{C}$, let $\mathcal{B}(c; \frac{\varepsilon}{6}) \subset \mathbb{R}^d$ be the ball centered at $c$ with radius $\frac{\varepsilon}{6}$. The number $M_c$ of the selected points $\mathcal{U}$ in $\mathcal{B}(c; \frac{\varepsilon}{6})$ follows the binomial distribution $\mathrm{Bi}(q_c, m)$, where

$$q_c \stackrel{\text{def}}{=} \int_{\mathcal{B}(c; \frac{\varepsilon}{6})} p(x)\, dx$$

is positive. Therefore, $\Pr[M_c = 0] = (1 - q_c)^m$. Let

$$m \stackrel{\text{def}}{=} \left\lceil \frac{\log \frac{\delta}{2 |\mathcal{C}|}}{\log (1 - \min_{c \in \mathcal{C}} q_c)} \right\rceil,$$

then

$$(1 - q_c)^m \le \frac{\delta}{2 |\mathcal{C}|},$$

and

$$\Pr[\exists c \in \mathcal{C},\, M_c = 0] \le |\mathcal{C}| (1 - q_c)^m \le \frac{\delta}{2}$$

by the union bound. Therefore,

$$\Pr[\forall c \in \mathcal{C},\, M_c \ge 1] = 1 - \Pr[\exists c \in \mathcal{C},\, M_c = 0] \ge 1 - \frac{\delta}{2}.$$

In words, with probability at least $1 - \frac{\delta}{2}$, each ball $\mathcal{B}(c; \frac{\varepsilon}{6})$ contains at least one selected point. As $\mathcal{C}$ is an $\frac{\varepsilon}{6}$-covering of $\mathcal{D}$, $\mathcal{U}$ forms an $\frac{\varepsilon}{3}$-covering of $\mathcal{D}$ under $\forall c \in \mathcal{C}, M_c \ge 1$, i.e.,

$$\Pr\left[\mathcal{U} \text{ forms an } \tfrac{\varepsilon}{3} \text{ covering}\right] \ge 1 - \frac{\delta}{2}. \tag{16}$$
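This covering bound is easy to check empirically. In the sketch below, the "balls" are four hypothetical equal-mass cells (illustrative values, not from the paper), and we verify that drawing $m$ samples, with $m$ chosen by the formula above, rarely leaves a cell empty:

```python
import math
import random

# Sketch of the covering argument: with m = ceil(log(delta / (2|C|)) /
# log(1 - min_c q_c)) i.i.d. samples, every ball is hit with probability
# at least 1 - delta/2. The four equal-mass cells are hypothetical.

q = 0.25                      # each of the 4 cells has mass q_c = 0.25
n_balls, delta = 4, 0.1
m = math.ceil(math.log(delta / (2 * n_balls)) / math.log(1 - q))
print(m)                      # 16 samples suffice for delta = 0.1

random.seed(0)
trials, misses = 2000, 0
for _ in range(trials):
    hits = {min(int(random.random() * n_balls), n_balls - 1)
            for _ in range(m)}
    if len(hits) < n_balls:   # some cell received no sample
        misses += 1
print(misses / trials)        # empirical miss rate, below delta/2 = 0.05
```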
We set $L_{\theta_n} = L_n + 2n$. The first $L_n$ layers are almost the same as the construction in Theorem 4.1. The only difference is that, as we take

$$\boldsymbol{x}_v^{\text{id}} = \begin{cases} \boldsymbol{e}_i & (v = u_i) \\ \mathbb{0}_m & (v \notin \mathcal{U}) \end{cases}$$

as input (Eq. (5)), we retain this information in the update functions. Therefore, in the $L_n$-th layer, each node has $\boldsymbol{h}_v^{(L_n)} = [\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}]^\top$ in the embedding, where $\hat{\bar{s}}(\boldsymbol{z}_v)$ is the estimate of $\bar{s}(\boldsymbol{z}_v)$ computed by the GNN (Eq. (15)).
The next $n$ layers compute the shortest-path distances from each node in $\mathcal{U}$ using the Bellman-Ford algorithm. Specifically, in the $(L_n + 1)$-th layer, the aggregation function is

$$f_{\text{agg}, \theta_n}^{(L_n + 1)}(\{\!\!\{ [\hat{\bar{s}}(\boldsymbol{z}_u), \boldsymbol{x}_u^{\text{id}\top}]^\top \mid u \in \mathcal{N}^-(v) \}\!\!\}) \stackrel{\text{def}}{=} \min_{u \in \mathcal{N}^-(v)} \boldsymbol{m}_u \in \mathbb{R}^m, \tag{17}$$

$$\boldsymbol{m}_u \stackrel{\text{def}}{=} \begin{cases} [\text{INF}, \ldots, \text{INF}, \overbrace{g_n \hat{\bar{s}}(\boldsymbol{z}_u)}^{i\text{-th}}, \text{INF}, \ldots, \text{INF}] & (\boldsymbol{x}_{u, i}^{\text{id}} = 1) \\ [\text{INF}, \ldots, \text{INF}] & (\boldsymbol{x}_u^{\text{id}} = \mathbb{0}_m), \end{cases} \tag{18}$$

where $\min$ is the element-wise minimum, and $\text{INF}$ is a sufficiently large constant such as $\mathrm{diam}(\mathcal{D}) + 1$. The update function of the $(L_n + 1)$-th layer is

$$f_{\text{upd}, \theta_n}^{(L_n + 1)}([\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}]^\top, \boldsymbol{a}_v^{(L_n + 1)}) \stackrel{\text{def}}{=} [\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + 1)\top}]^\top \in \mathbb{R}^{2m + 1}, \tag{19}$$

$$\boldsymbol{d}_v^{(L_n + 1)} \stackrel{\text{def}}{=} \min(\boldsymbol{u}_v, \boldsymbol{a}_v^{(L_n + 1)}), \tag{20}$$

$$\boldsymbol{u}_v \stackrel{\text{def}}{=} \begin{cases} [\text{INF}, \ldots, \text{INF}, \overbrace{0}^{i\text{-th}}, \text{INF}, \ldots, \text{INF}] & (\boldsymbol{x}_{v, i}^{\text{id}} = 1) \\ [\text{INF}, \ldots, \text{INF}] & (\boldsymbol{x}_v^{\text{id}} = \mathbb{0}_m). \end{cases} \tag{21}$$
The aggregation function of the $(L_n + i)$-th layer ($2 \le i \le n$) is

$$f_{\text{agg}, \theta_n}^{(L_n + i)}(\{\!\!\{ [\hat{\bar{s}}(\boldsymbol{z}_u), \boldsymbol{x}_u^{\text{id}\top}, \boldsymbol{d}_u^{(L_n + i - 1)\top}]^\top \mid u \in \mathcal{N}^-(v) \}\!\!\}) \stackrel{\text{def}}{=} \min_{u \in \mathcal{N}^-(v)} \boldsymbol{d}_u^{(L_n + i - 1)} + g_n \hat{\bar{s}}(\boldsymbol{z}_u), \tag{22}$$

and the update function is

$$f_{\text{upd}, \theta_n}^{(L_n + i)}([\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + i - 1)\top}]^\top, \boldsymbol{a}_v^{(L_n + i)}) \stackrel{\text{def}}{=} [\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + i)\top}]^\top \in \mathbb{R}^{2m + 1}, \tag{23}$$

$$\boldsymbol{d}_v^{(L_n + i)} \stackrel{\text{def}}{=} \min(\boldsymbol{d}_v^{(L_n + i - 1)}, \boldsymbol{a}_v^{(L_n + i)}). \tag{24}$$

As the diameter of $G_n$ is at most $n$, the computation of the shortest-path distances is complete after $n$ iterations. Therefore, $(\boldsymbol{d}_v^{(L_n + n)})_j$ is the shortest-path distance from $u_j$ to $v$, where the length of an edge $(s, t)$ is $g_n \hat{\bar{s}}(\boldsymbol{z}_s)$.
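This Bellman-Ford-as-message-passing scheme can be sketched directly. The toy path graph and unit edge lengths below are illustrative (the edge length stands in for $g_n \hat{\bar{s}}(\boldsymbol{z}_u)$):

```python
# Sketch of Eqs. (17)-(24): Bellman-Ford via message passing. Each node keeps
# a vector of distances to the selected nodes and repeatedly takes the
# element-wise minimum of its neighbors' vectors plus the incoming edge length.

INF = float("inf")

def bellman_ford_mp(neighbors, edge_len, sources, n_iter):
    n = len(neighbors)
    dist = [[0.0 if v == s else INF for s in sources] for v in range(n)]
    for _ in range(n_iter):  # n iterations suffice (diameter <= n)
        agg = [[min(dist[u][j] + edge_len[u] for u in neighbors[v])
                for j in range(len(sources))] for v in range(n)]
        dist = [[min(dist[v][j], agg[v][j]) for j in range(len(sources))]
                for v in range(n)]
    return dist

# Path graph 0-1-2-3 with unit edge lengths; selected nodes U = {0, 3}.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
edge_len = [1.0, 1.0, 1.0, 1.0]       # stands in for g_n * s_bar_hat(z_u)
dist = bellman_ford_mp(neighbors, edge_len, sources=[0, 3], n_iter=4)
print(dist[1])  # [1.0, 2.0]: node 1 is 1 hop from u_1 = 0 and 2 hops from u_2 = 3
```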
The following $n$ layers propagate the distance matrices among $\mathcal{U}$. Specifically, the aggregation function of the $(L_n + n + 1)$-th layer is

$$f_{\text{agg}, \theta_n}^{(L_n + n + 1)}(\{\!\!\{ [\hat{\bar{s}}(\boldsymbol{z}_u), \boldsymbol{x}_u^{\text{id}\top}, \boldsymbol{d}_u^{(L_n + n)\top}]^\top \mid u \in \mathcal{N}^-(v) \}\!\!\}) \stackrel{\text{def}}{=} \min_{u \in \mathcal{N}^-(v)} \boldsymbol{M}_u \in \mathbb{R}^{m^2}, \tag{25}$$

$$\boldsymbol{M}_u \stackrel{\text{def}}{=} \begin{cases} [\text{INF}\, \mathbb{1}_m^\top, \ldots, \text{INF}\, \mathbb{1}_m^\top, \overbrace{\boldsymbol{d}_u^{(L_n + n)\top}}^{i\text{-th block}}, \text{INF}\, \mathbb{1}_m^\top, \ldots, \text{INF}\, \mathbb{1}_m^\top] & (\boldsymbol{x}_{u, i}^{\text{id}} = 1) \\ [\text{INF}\, \mathbb{1}_m^\top, \ldots, \text{INF}\, \mathbb{1}_m^\top] & (\boldsymbol{x}_u^{\text{id}} = \mathbb{0}_m). \end{cases} \tag{26}$$

The update function of the $(L_n + n + 1)$-th layer is

$$f_{\text{upd}, \theta_n}^{(L_n + n + 1)}([\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + n)\top}]^\top, \boldsymbol{a}_v^{(L_n + n + 1)}) \stackrel{\text{def}}{=} [\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + n)\top}, \boldsymbol{D}_v^{(L_n + n + 1)\top}]^\top \in \mathbb{R}^{2m + m^2 + 1}, \tag{27}$$

$$\boldsymbol{D}_v^{(L_n + n + 1)} \stackrel{\text{def}}{=} \min(\boldsymbol{U}_v, \boldsymbol{a}_v^{(L_n + n + 1)}) \in \mathbb{R}^{m^2}, \tag{28}$$

$$\boldsymbol{U}_v \stackrel{\text{def}}{=} \begin{cases} [\text{INF}\, \mathbb{1}_m^\top, \ldots, \text{INF}\, \mathbb{1}_m^\top, \overbrace{\boldsymbol{d}_v^{(L_n + n)\top}}^{i\text{-th block}}, \text{INF}\, \mathbb{1}_m^\top, \ldots, \text{INF}\, \mathbb{1}_m^\top] & (\boldsymbol{x}_{v, i}^{\text{id}} = 1) \\ [\text{INF}\, \mathbb{1}_m^\top, \ldots, \text{INF}\, \mathbb{1}_m^\top] & (\boldsymbol{x}_v^{\text{id}} = \mathbb{0}_m). \end{cases} \tag{29}$$
The aggregation function of the $(L_n + n + i)$-th layer ($2 \le i \le n$) is

$$f_{\text{agg}, \theta_n}^{(L_n + n + i)}(\{\!\!\{ [\hat{\bar{s}}(\boldsymbol{z}_u), \boldsymbol{x}_u^{\text{id}\top}, \boldsymbol{d}_u^{(L_n + n)\top}, \boldsymbol{D}_u^{(L_n + n + i - 1)\top}]^\top \mid u \in \mathcal{N}^-(v) \}\!\!\}) \stackrel{\text{def}}{=} \min_{u \in \mathcal{N}^-(v)} \boldsymbol{D}_u^{(L_n + n + i - 1)}, \tag{30}$$

and the update function of the $(L_n + n + i)$-th layer ($2 \le i \le n - 1$) is

$$f_{\text{upd}, \theta_n}^{(L_n + n + i)}([\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + n)\top}, \boldsymbol{D}_v^{(L_n + n + i - 1)\top}]^\top, \boldsymbol{a}_v^{(L_n + n + i)}) \stackrel{\text{def}}{=} [\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + n)\top}, \boldsymbol{D}_v^{(L_n + n + i)\top}]^\top, \tag{31}$$

$$\boldsymbol{D}_v^{(L_n + n + i)} \stackrel{\text{def}}{=} \min(\boldsymbol{D}_v^{(L_n + n + i - 1)}, \boldsymbol{a}_v^{(L_n + n + i)}). \tag{32}$$
The last update function is defined as follows:

$$f_{\text{upd}, \theta_n}^{(L_n + 2n)}([\hat{\bar{s}}(\boldsymbol{z}_v), \boldsymbol{x}_v^{\text{id}\top}, \boldsymbol{d}_v^{(L_n + n)\top}, \boldsymbol{D}_v^{(L_n + 2n - 1)\top}]^\top, \boldsymbol{a}_v^{(L_n + 2n)}) \stackrel{\text{def}}{=} \begin{cases} \mathrm{MDS}(\boldsymbol{D}_v^{(L_n + 2n)})_i & (\boldsymbol{x}_{v, i}^{\text{id}} = 1) \\ \mathrm{MDS}(\boldsymbol{D}_v^{(L_n + 2n)})_{k(v)} & (\boldsymbol{x}_v^{\text{id}} = \mathbb{0}_m), \end{cases} \tag{33}$$

$$\boldsymbol{D}_v^{(L_n + 2n)} \stackrel{\text{def}}{=} \min(\boldsymbol{D}_v^{(L_n + 2n - 1)}, \boldsymbol{a}_v^{(L_n + 2n)}), \tag{34}$$

$$k(v) \stackrel{\text{def}}{=} \operatorname*{argmin}_i\ (\boldsymbol{d}_v^{(L_n + n)})_i. \tag{35}$$

As the diameter of $G_n$ is at most $n$, the propagation is complete after $n$ iterations. Therefore, $(\boldsymbol{D}_v^{(L_n + 2n)})_{im + j}$ is the shortest-path distance from $u_i$ to $u_j$, where the length of an edge $(s, t)$ is $g_n \hat{\bar{s}}(\boldsymbol{z}_s)$. $\mathrm{MDS}\colon \mathbb{R}^{m^2} \to \mathbb{R}^{m \times d}$ runs multidimensional scaling. Note that in the $(L_n + 2n)$-th layer, each node has the distance matrix $\boldsymbol{D}^{(L_n + 2n)}$ in its embedding; therefore, $\mathrm{MDS}$ can be run at each node in parallel. If $v$ is in $\mathcal{U}$, $f_{\text{upd}, \theta_n}^{(L_n + 2n)}$ outputs the coordinates of $v$ recovered by $\mathrm{MDS}$. If $v$ is not in $\mathcal{U}$, $f_{\text{upd}, \theta_n}^{(L_n + 2n)}$ outputs the coordinates of the closest selected node, i.e., $u_{k(v)}$.
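The $\mathrm{MDS}$ step is classical multidimensional scaling, which admits a short eigendecomposition-based implementation. A minimal sketch on a hypothetical point set (the unit square is illustrative):

```python
import numpy as np

# Sketch of the MDS step in Eq. (33): classical multidimensional scaling
# recovers coordinates (up to rotation, reflection, and translation) from
# pairwise distances by double-centering the squared distance matrix and
# taking the top-d eigenpairs of the resulting Gram matrix.

def classical_mds(dist, d):
    m = dist.shape[0]
    j = np.eye(m) - np.ones((m, m)) / m          # centering matrix C_m
    b = -0.5 * j @ (dist ** 2) @ j               # Gram matrix of centered points
    w, v = np.linalg.eigh(b)
    idx = np.argsort(w)[::-1][:d]                # top-d eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))

pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
rec = classical_mds(dist, d=2)
rec_dist = np.linalg.norm(rec[:, None] - rec[None, :], axis=-1)
print(np.allclose(rec_dist, dist))               # True: distances are preserved
```

Because the recovery is only up to a rigid transformation, the metric $d_G$ (Eq. (4)) and the Procrustes post-processing quotient out exactly this ambiguity.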
We analyze the approximation error of the above processes. Let $\boldsymbol{Z}_{\mathcal{U}} = [\boldsymbol{z}_{u_1}, \ldots, \boldsymbol{z}_{u_m}]^\top \in \mathbb{R}^{m \times d}$ be the true hidden embeddings of the selected nodes. By Corollary 4.2 of [50], noise of magnitude $C$ in the distance matrix causes $O(C^4)$ misalignment of the coordinates. Therefore, if

$$\forall u_i, u_j \in \mathcal{U},\quad \left| (\boldsymbol{d}_{u_i}^{(L_n + n)})_j - \| \boldsymbol{z}_{u_i} - \boldsymbol{z}_{u_j} \| \right| < \varepsilon' \tag{36}$$

holds for some $\varepsilon' > 0$, then

$$d_G(\mathrm{MDS}(\boldsymbol{D}_v^{(L_n + 2n)}), \boldsymbol{Z}_{\mathcal{U}}) < \frac{\varepsilon^2}{9m}. \tag{37}$$

Let

$$\boldsymbol{P}_{\mathcal{U}} \stackrel{\text{def}}{=} \operatorname*{argmin}_{\boldsymbol{P} \in \mathbb{R}^{d \times d},\, \boldsymbol{P}^\top \boldsymbol{P} = \boldsymbol{I}_d} \left\| \boldsymbol{C}_m \mathrm{MDS}(\boldsymbol{D}_v^{(L_n + 2n)}) - \boldsymbol{C}_m \boldsymbol{Z}_{\mathcal{U}} \boldsymbol{P} \right\|_F^2.$$

If Eq. (37) holds, for all $i = 1, \ldots, m$,

$$\left\| (\boldsymbol{C}_m \mathrm{MDS}(\boldsymbol{D}_v^{(L_n + 2n)}))_i - (\boldsymbol{C}_m \boldsymbol{Z}_{\mathcal{U}} \boldsymbol{P}_{\mathcal{U}})_i \right\|_2 \le \frac{\varepsilon}{3}. \tag{38}$$

If $n$ is sufficiently large,

$$\Pr\left[ \forall v \in V, u_i \in \mathcal{U},\ \left| (\boldsymbol{d}_v^{(L_n + n)})_i - \| \boldsymbol{z}_v - \boldsymbol{z}_{u_i} \| \right| < \min\left(\varepsilon', \frac{\varepsilon}{6}\right) \right] \ge 1 - \frac{\delta}{2} \tag{39}$$

holds from Theorem S.4.5 of [20]. Note that although Theorem S.4.5 of [20] uses the true $\pi_{G_n}$ while we use $\boldsymbol{a}_v^{(L_n)}$, from Eqs. (9) and (10), $|\boldsymbol{a}_v^{(L_n)} - n \pi_{G_n, v}| \to 0$ as $n \to \infty$ surely; therefore, the theorem still holds because the mismatch diminishes as $n \to \infty$.
We suppose the following event 𝑃 :
𝒰 forms an 𝜀 3 covering and
(40)
∀ 𝑣 ∈ 𝑉 , 𝑢 𝑖 ∈ 𝒰 , | 𝒅 𝑣 ( 𝐿 𝑛 + 𝑛 ) 𝑖 − ‖ 𝒛 𝑣 − 𝒛 𝑢 𝑖 ‖ | < min ( 𝜀 ′ , 𝜀 6 ) .
(41)
The probability of this event is at least 1 − 𝛿 by Eq. (16) and (39). Under this event, for all 𝑖
1 , … , 𝑚 ,
‖ ( 𝒁 ^ − 1 𝑚 𝟙 𝑛 𝟙 𝑚 ⊤ MDS ( 𝑫 𝑣 ( 𝐿 𝑛 + 2 𝑛 ) ) ) 𝑢 𝑖 − ( ( 𝒁 − 1 𝑚 𝟙 𝑛 𝟙 𝑚 ⊤ 𝒁 𝒰 ) 𝑷 𝒰 ) 𝑖 ‖ 2
(42)
= ‖ ( 𝑪 𝑚 MDS ( 𝑫 𝑣 ( 𝐿 𝑛 + 2 𝑛 ) ) ) 𝑖 − ( 𝑪 𝑚 𝒁 𝒰 𝑷 𝒰 ) 𝑖 ‖ 2
(43)
< 𝜀 3 .
(44)
holds by Eq. (41), (36), (37), and (38), and the definitions of $\hat{\bm{Z}}$ (Eq. (33)) and $\bm{C}_m$. Under event $P$, for any $v \notin \mathcal{U}$, there exists $u_i \in \mathcal{U}$ such that $\|\bm{z}_v - \bm{z}_{u_i}\| \le \frac{\varepsilon}{3}$ by Eq. (40). By applying Eq. (41) twice,

$$\bm{d}^{(L_n+n)}_{v,\,k(v)} = \min_i \bm{d}^{(L_n+n)}_{v,\,i} < \frac{\varepsilon}{2}, \tag{45}$$

$$\left\| \bm{z}_v - \bm{z}_{u_{k(v)}} \right\| < \frac{2\varepsilon}{3}. \tag{46}$$
Then,

$$\begin{aligned} & \left\| \left(\hat{\bm{Z}} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)\right)_v - \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_v \right\|_2 \tag{47} \\ &\stackrel{\text{(a)}}{=} \left\| \left(\bm{C}_m \,\mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)\right)_{k(v)} - \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_v \right\|_2 \tag{48} \\ &\stackrel{\text{(b)}}{\le} \left\| \left(\bm{C}_m \,\mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)\right)_{k(v)} - \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_{u_{k(v)}} \right\|_2 \tag{49} \\ &\quad + \left\| \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_v - \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_{u_{k(v)}} \right\|_2 \tag{50} \\ &\stackrel{\text{(c)}}{=} \left\| \left(\bm{C}_m \,\mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)\right)_{k(v)} - \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_{u_{k(v)}} \right\|_2 + \left\| \bm{z}_v - \bm{z}_{u_{k(v)}} \right\|_2 \tag{51} \\ &\stackrel{\text{(d)}}{<} \left\| \left(\bm{C}_m \,\mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)\right)_{k(v)} - \left(\left(\bm{Z} - \tfrac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \bm{Z}_{\mathcal{U}}\right) \bm{P}_{\mathcal{U}}\right)_{u_{k(v)}} \right\|_2 + \frac{2\varepsilon}{3} \tag{52} \\ &\stackrel{\text{(e)}}{<} \varepsilon, \tag{53} \end{aligned}$$
where (a) follows because $\hat{\bm{Z}}_v = \mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)_{k(v)}$ by Eq. (33) and by the definition of $\bm{C}_m$, (b) follows by the triangle inequality, (c) follows because $\bm{P}_{\mathcal{U}}$ is an orthogonal matrix, (d) follows by Eq. (46), and (e) follows by Eq. (44). By Eq. (44) and (53), the distance between the true embedding $\bm{z}_v$ and the estimate $\hat{\bm{z}}_v$ is less than $\varepsilon$ under the rigid transformation $\bm{P}_{\mathcal{U}}$ and the translation $-\frac{1}{m} \mathbb{1}_n \mathbb{1}_m^\top \mathrm{MDS}\!\left(\bm{D}^{(L_n+2n)}_v\right)$. As these transformation parameters are suboptimal for Eq. (4), these distances are overestimated. Therefore, under event $P$, $d_G(\hat{\bm{Z}}, \bm{Z}) < \varepsilon$, which holds with probability at least $1 - \delta$.
∎
The definitions of the layers prove Theorem 4.5.
Appendix C Technical Remarks
Remark (Global Information). The existing analyses of the over-smoothing effect [29, 37] show that GNNs with too many layers fail. Therefore, GNNs cannot have wide receptive fields and cannot aggregate global information. By contrast, our analysis shows that GNNs can obtain global information, i.e., $\bm{z}_v$. This result provides a new insight into the understanding of GNNs. Note that the assumptions of the existing analyses [29, 37] do not hold for our GNN architectures; therefore, our results do not contradict the existing ones.
Remark (Positions as Explicit Node Features are Redundant). In some applications, the graph is constructed from observed features, and $\{\bm{z}_v\}$ are available as explicit node features [19, 58, 18]. For example, each node represents a position of interest in spatial data, the graph is constructed by nearest neighbors based on geographic positions, and the positions are included in the explicit node features $\bm{x}_v$. Our main results show that such position features are asymptotically redundant because GNNs can recover them solely from the graph structure. In practice, with finite samples, the position features can still be informative and can introduce a good inductive bias.
Limitation (High Node Degree). We assume high node degrees in Assumption 4 and in the experiments, i.e., $d_v = \omega\!\left(n^{\frac{2}{d+2}} \log^{\frac{d}{d+2}} n\right)$. Note that $d_v = \Theta(\log n)$ is required for random graphs to be connected [10], so we cannot reduce the node degrees much further from a technical point of view. Having said that, there is indeed room for improvement by a factor of $\left(\frac{n}{\log n}\right)^{\frac{2}{d+2}}$, which can indeed be large when $d$ is small. This bound is shared with [20], and improving it is important future work.
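To get a sense of this gap, the ratio between the assumed degree $n^{\frac{2}{d+2}} \log^{\frac{d}{d+2}} n$ and the connectivity threshold $\Theta(\log n)$ can be computed directly; it simplifies to $(n / \log n)^{\frac{2}{d+2}}$. A quick numeric check, with $n$ and $d$ chosen arbitrarily for illustration:

```python
import math

n, d = 10**6, 2
required = n ** (2 / (d + 2)) * math.log(n) ** (d / (d + 2))  # assumed degree
threshold = math.log(n)                                        # connectivity threshold
ratio = (n / math.log(n)) ** (2 / (d + 2))                     # room for improvement
# The ratio of the two degree bounds matches the closed form (n / log n)^(2/(d+2)).
assert math.isclose(required / threshold, ratio)
print(ratio)
```

As expected, the gap grows with $n$ and shrinks as the dimension $d$ increases.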
Remark (Dimensionality of the True Features). We need to specify the number of dimensions of the true features, which is not necessarily available in practice. Specifying a higher number of dimensions than the true one is not so problematic, as the lower-dimensional features are recovered in a subspace of the entire space. In practice, we can find a good dimensionality by measuring a reconstruction loss in an unsupervised manner: after we recover the features, we construct a nearest neighbor graph from them; if it does not resemble the input graph, the dimensionality may not be sufficient.
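This unsupervised check can be sketched as follows: build a kNN graph from the recovered features and measure how many of the input graph's edges it reproduces. A toy numpy sketch (the helper names and the brute-force kNN construction are illustrative, not the paper's procedure):

```python
import numpy as np

def knn_edges(Z, k):
    """Directed edge set of the k-nearest-neighbor graph on the rows of Z."""
    D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-loops
    nbrs = np.argsort(D, axis=1)[:, :k]
    return {(i, int(j)) for i in range(len(Z)) for j in nbrs[i]}

def reconstruction_overlap(Z_hat, input_edges, k):
    """Fraction of input edges reproduced by the kNN graph of Z_hat.
    A low overlap suggests the chosen dimensionality is insufficient."""
    rec = knn_edges(Z_hat, k)
    return len(rec & input_edges) / len(input_edges)

# Toy usage: features with the correct dimensionality reproduce the graph.
rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 2))
edges = knn_edges(Z, 3)
print(reconstruction_overlap(Z, edges, 3))       # 1.0
```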