Searching for Embeddings in a Haystack: Link Prediction on Knowledge Graphs with Subgraph Pruning

Embedding-based models of Knowledge Graphs (KGs) can be used to predict the existence of missing links by ranking the entities according to some likelihood scores. An exhaustive computation of all likelihood scores is very expensive if the KG is large. To counter this problem, we propose a technique to reduce the search space by identifying smaller subsets of promising entities. Our technique first creates embeddings of subgraphs using the embeddings from the model. Then, it ranks the subgraphs with some proposed ranking functions and considers only the entities in the top k subgraphs. Our experiments show that our technique is able to reduce the search space significantly while maintaining a good recall.


INTRODUCTION
Knowledge Graphs (KGs) are a popular representation model to publish factual knowledge on the Web. KGs are crucial assets for enhancing various tasks like question answering [7], ontology-based data access [5], task-oriented dialogs [19], data integration [16], or named entity recognition [24]. Although the largest public KGs contain billions of statements (e.g., Wikidata [26], DBpedia [18]), they are still far from being complete.
The problem of completing KGs is addressed by numerous techniques, which range from rule mining [11] and extraction from unstructured sources [15] to ontological reasoning [1]. In this paper we consider embedding-based models [20], i.e., models where the entities and relations in the KG are "embedded" into high-dimensional numerical vectors (called embeddings) and potential new links are identified by computing numerical likelihood scores.
Multiple studies have shown that these techniques return good results for KG completion (see surveys at [4,20]) but their application to large KGs is problematic for two reasons. First, large KGs can contain millions of entities and this leads to models with a huge number of parameters. The second problem concerns the discovery of links that complete partially-bounded statements like (?, bornIn, UK), which can be seen as a query answering problem. To solve this problem, embeddings are used to construct a numerical representation of the query, which is then combined with the embeddings of all entities to assign a likelihood score to each of them. Then, only the entities with the best scores are considered as potential answers. The problem is that if the KG has many entities, then computing the likelihood score for every entity is too expensive.
In this paper, we propose a technique that addresses these two problems. The main idea is to restrict the ranking only to the subset of the most promising entities. This subset is computed in two stages: First, we compute the likelihood scores considering embeddings that represent sets of entities contained in star-shaped subgraphs. We refer to these embeddings as subgraph embeddings. Then, we compute the likelihood scores considering only the entities in the top-ranked subgraphs. Since there are typically far fewer subgraphs than entities, computing the first ranking is fast and does not require loading the entire model. Therefore, our method introduces savings both in terms of runtime and resource utilization.
Our technique can be applied with several existing embedding models. We considered TransE [3], HolE [21], DistMult [30], and ConvE [6], which are currently among the most popular models. Our discussion focuses on three key aspects of our technique. Firstly, we study the benefits with two representations of the subgraph embeddings. The first representation views the subgraph embedding as the average of the embeddings of its entities. The second one constructs Gaussian embeddings, i.e., constructs a Gaussian probability distribution for each subgraph. Secondly, we introduce two different scoring functions to rank the best subgraphs. The first function reuses the likelihood score of the embedding model while the second one applies KL-divergence [17] between the distribution of the known answers of the query and of the subgraph embeddings. Thirdly, we describe how we can determine automatically the number of top k subgraphs using evidence in previous query executions.
Our experiments show that our method can reduce the search space to a fraction of all entities. In many cases this reduction does not compromise the recall, i.e., known correct answers are not ignored. In the best cases, the reduction can be up to one order of magnitude while preserving a recall ≥ 50%. This makes our method a valid alternative to performing an exhaustive ranking with all entities.
All the code and models are available at https://github.com/unmeshvrije/scikit-kge.

PRELIMINARIES
We view a KG as a directed labeled graph K = (V, E, R) where V is a set of nodes, E is a set of edges, and R is a set of named relations. We denote each edge in E as a triple (h, r, t) where h (head) is the outgoing node, t (tail) is the incoming one, and r ∈ R (relation name) is the label of the edge. Intuitively, nodes represent entities while edges indicate semantic relations between them, e.g., (London, capitalOf, UK). The set of triples is divided into three sets: training (E_train), validation (E_val) and test (E_test).
An embedding is a vector in R^d with d > 0. We use boldface fonts to denote them, i.e., we write e and r to refer to the vectors associated with the entity e and the relation r respectively. A model is a set of embeddings. To learn a model, techniques like TransE, ConvE, etc. define a score function ψ(h, r, t) to compute the likelihood that (h, r, t) is true and then use it to specify a loss function (e.g., pairwise hinge loss [3], binary cross entropy [6]) that should be minimized during training. Table 1 shows the score functions of the models that we considered in this paper.
Once the embeddings are trained, they can be used for tasks like link prediction. The goal of link prediction is to predict the head or the tail entity given the relation and the other entity. For instance, if r and t are given, then the goal is to predict the correct h, which can also be seen as answering a query of the form (?, r, t). This can be done by computing the score function for r, t, and every candidate entity e, and using the result as the likelihood score that e is an answer.
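The exhaustive ranking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a TransE-style score with the L2 norm (TransE can also be used with L1), and the function names, entity names, and 2-dimensional vectors are hypothetical.

```python
import math

def transe_score(h, r, t):
    """TransE-style likelihood: negative L2 distance between h + r and t (assumed norm)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def rank_heads(entities, r, t):
    """Score every entity as a candidate head for the query (?, r, t), best first."""
    scored = [(name, transe_score(vec, r, t)) for name, vec in entities.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-dimensional embeddings, for illustration only.
entities = {
    "London": [0.0, 1.0],
    "Paris": [2.0, 2.0],
    "UK": [1.0, 1.0],
}
capital_of = [1.0, 0.0]

# Query (?, capitalOf, UK): one score is computed per entity in the KG.
ranking = rank_heads(entities, capital_of, entities["UK"])
```

The cost of `rank_heads` grows linearly with the number of entities, which is the cost that the subgraph-based pruning aims to avoid.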
Computing the likelihood score for every entity can be computationally expensive. The goal of our technique is to reduce this cost with an approximate ranking. This problem can also be seen as a k-nearest neighbour (kNN) search problem. Given an embedding as input, the goal of a kNN technique is to find the top k embeddings which are closest according to a distance function. In our evaluation, we will compare the performance of our method against IVFADC [13], one state-of-the-art technique of this kind.

APPROACH
Overview. We propose a two-stage approach to speed up link prediction. In the first stage, we create embeddings of the subgraphs. In the second stage, we rank the subgraphs based on the likelihood scores between the embedding of the query and the subgraph embeddings. Optionally, a third stage can be applied to compute a suitable value of k to select the top k subgraphs.
Consider the example query shown in Figure 1. It is likely that answers for this query are persons that are members of the UK parliament, or persons who work in London. These are entities contained in star-shaped subgraphs rooted in the entities UK_Parliament and London respectively. It is much less likely that an answer can be found in a subgraph that contains vehicles. First, our approach uses the embeddings of the people who are members of the UK Parliament to construct a subgraph embedding for the subgraph rooted at UK_Parliament. Similarly, it creates subgraph embeddings for other subgraphs. Then, it ranks the likelihood scores computed using these embeddings and the query to identify the top k subgraphs that most likely contain answers for the query.
Creation of Subgraph Embeddings (First stage). We use subgraphs to identify groups of similar entities. In principle, our method can be applied with any type of subgraph, provided that we can construct an embedding for it. In this paper, we consider only star-shaped subgraphs, which are formed by sets of entities that share a common neighbor with a fixed relation. The reason for this restriction is that the similarity criterion in such subgraphs is naturally defined by the shared neighbor. For instance, all entities which share a connection to the entity London with the relation worksIn are similar to each other precisely because they all work in London. We distinguish two types of subgraphs: the ones which have an outgoing edge to the common neighbor and the ones that have an incoming one.
Definition 3.2. The outgoing subgraph rooted at entity e and relation r is defined as the set of entities {t | (e, r, t) ∈ E_train}, i.e., the set of entities which are reached from e through an arc with label r. Likewise, the incoming subgraph rooted at e and r is the set {t | (t, r, e) ∈ E_train}.
Since our goal is to reduce the search space for potential answers, we ignore subgraphs that are too small to lead to a significant reduction. To this end, we introduce a threshold value τ and ignore all subgraphs with fewer than τ entities.
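The extraction of star-shaped subgraphs in Definition 3.2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `star_subgraphs`, the `("out" | "in", e, r)` keys, and the toy triples are hypothetical and are not taken from the released code.

```python
from collections import defaultdict

def star_subgraphs(train_triples, tau):
    """Group entities into outgoing/incoming star-shaped subgraphs with at least tau members."""
    outgoing = defaultdict(set)  # (e, r) -> {t | (e, r, t) in E_train}
    incoming = defaultdict(set)  # (e, r) -> {t | (t, r, e) in E_train}
    for h, r, t in train_triples:
        outgoing[(h, r)].add(t)
        incoming[(t, r)].add(h)

    subgraphs = {}
    for (e, r), members in outgoing.items():
        if len(members) >= tau:  # drop subgraphs with fewer than tau entities
            subgraphs[("out", e, r)] = members
    for (e, r), members in incoming.items():
        if len(members) >= tau:
            subgraphs[("in", e, r)] = members
    return subgraphs

# Hypothetical training triples, for illustration only.
triples = [
    ("alice", "worksIn", "London"),
    ("bob", "worksIn", "London"),
    ("carol", "worksIn", "Paris"),
    ("London", "capitalOf", "UK"),
]
subgraphs = star_subgraphs(triples, tau=2)
```

With `tau=2`, only the incoming subgraph rooted at London and worksIn survives, since every other group contains a single entity.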
In order to compute the subgraph embeddings, we rely on an external model that computes the embeddings of the entities. We refer to this model as the embedding model. In this paper, we considered four methods (TransE, HolE, DistMult, ConvE) but we believe that our approach can also be used with other models. Note that the embedding model provides not only the entity embeddings but also the score functions to determine the similarity between them.
Given the embeddings provided by the embedding model, we consider two approaches for constructing the subgraph embeddings. The first approach computes the subgraph embedding as the average of the embeddings of the members of the subgraph. That is, given a subgraph S, the corresponding embedding S_µ is computed as
$$ \mathbf{S}_\mu = \frac{1}{|S|} \sum_{e \in S} \mathbf{e} \tag{1} $$
With this method, the subgraph is embedded as a single "point" in the high-dimensional space, but the position is sensitive to outliers as it is computed with the average. To counter this problem, the second approach represents the subgraphs using Gaussian embeddings [10,25] so that they are no longer represented by single points but rather as areas where the subgraphs are more or less likely to be located. The idea behind Gaussian embeddings is to represent each symbol as a multi-dimensional Gaussian probability distribution. The probability distribution determines the likelihood that the symbol is positioned at certain coordinates. This likelihood is high around the average but diminishes as we move away, at the rate defined by the normal distribution.
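With this method, every subgraph is defined by an average and a variance embedding, i.e., a tuple ⟨S_µ, S_σ⟩. The average embedding S_µ is the mean of the member embeddings, as in the first approach, while the variance embedding is computed as
$$ \mathbf{S}_\sigma = \frac{1}{|S| - 1} \sum_{e \in S} (\mathbf{e} - \mathbf{S}_\mu)^2 \tag{2} $$
where the square is taken element-wise. In (2) we divide by |S| − 1 instead of |S| following Bessel's correction [28].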
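The two subgraph representations can be sketched as follows. This is a minimal sketch that assumes a per-dimension (diagonal) variance; the function names and the example vectors are hypothetical.

```python
def average_embedding(vectors):
    """Per-dimension mean of the member embeddings (the average embedding S_mu)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def variance_embedding(vectors):
    """Per-dimension sample variance with Bessel's correction (divide by |S| - 1)."""
    n = len(vectors)
    if n < 2:
        raise ValueError("at least two member embeddings are needed")
    mu = average_embedding(vectors)
    return [
        sum((x - m) ** 2 for x in column) / (n - 1)
        for column, m in zip(zip(*vectors), mu)
    ]

# Hypothetical member embeddings of one subgraph.
members = [[1.0, 2.0], [3.0, 6.0]]
s_mu = average_embedding(members)      # [2.0, 4.0]
s_sigma = variance_embedding(members)  # [2.0, 8.0]
```

Because only subgraphs with a minimum number of entities are kept, the division by |S| − 1 is well defined whenever that threshold is at least 2.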
Ranking Subgraph Embeddings (Second stage). After we have computed the embeddings of all subgraphs with at least τ entities, our system is ready to perform link prediction. Let us consider an input query q of the form (?, r, e) where r ∈ R and e ∈ V, and let U be the set of all subgraphs that we extracted from K. Moreover, let us assume that there exists an ideal KG K′ ⊇ K which contains all true facts over V and R, and let A = {e′ | (e′, r, e) ∈ K′} be the set of all admissible answers for q (the case for q = (e, r, ?) is analogous). Our goal is to rank the entities in a list ρ = ⟨e_1, ..., e_n⟩ that contains exactly the entities in A. If we rank all the entity embeddings, then we obtain an approximate ρ′ where n = |V| and possibly not all the first entities in ρ′ are in A. Since we are not interested in the answers not in A, we would like to rank as few entities as possible without excluding the ones in A. In other words, our objective is to compute ρ such that: 1) n is as close as possible to |A| and 2) ρ contains as many entities in A as possible.
We use the subgraph embeddings to achieve our goal. First, we create an embedding q of query q as it is defined by the embedding model. Then, we compute the likelihood scores between q and the embeddings of the subgraphs. The computation of the likelihood scores depends on the representation used for the subgraphs. We use the functions score_µ and score_kl, defined below, when the subgraphs are computed as average and Gaussian embeddings respectively.
Score score_µ. If we use the subgraph embeddings computed as the average of the subgraph's members (first approach), then we use the score function ψ provided by the embedding model to compute the likelihood score between the subgraph embedding and the query, effectively treating the subgraph embeddings like any other entities.
Score score_kl. The score function ψ cannot be applied as-is on Gaussian embeddings. In this case, we proceed as follows. First, we consider up to t (default value is 50) known answers of q from E_train to construct a Gaussian embedding of the query. Then, we use the Kullback-Leibler (KL) divergence [17] as scoring function. This function measures to what extent the two distributions differ from each other (a returned value of 0 means that the two distributions are equivalent). Therefore, it is a suitable measure to quantify the likelihood score between the query and the subgraph.
More formally, score_kl for query q and subgraph S is defined as the KL divergence between the Gaussian distributions described by ⟨q_µ, q_σ⟩ and ⟨S_µ, S_σ⟩ (Equation 3), where ⟨q_µ, q_σ⟩ and ⟨S_µ, S_σ⟩ are the Gaussian embeddings of q and S respectively.
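A KL-based score between two Gaussian embeddings can be sketched with the standard closed form for Gaussians with diagonal covariance. Two points are assumptions of this sketch, since the text does not fix them: the covariance is taken to be diagonal (one variance per dimension), and the divergence is computed in the direction KL(query ‖ subgraph). The function name is hypothetical.

```python
import math

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) for strictly positive variances."""
    total = 0.0
    for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1):
        total += v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0)
    return 0.5 * total

# Hypothetical Gaussian embeddings of a query and of two subgraphs.
q_mu, q_var = [0.0, 1.0], [1.0, 1.0]
same = kl_diag_gaussians(q_mu, q_var, [0.0, 1.0], [1.0, 1.0])     # identical distributions
shifted = kl_diag_gaussians(q_mu, q_var, [1.0, 1.0], [1.0, 1.0])  # mean shifted by 1 in one dimension
```

Since a value of 0 means that the two distributions coincide, subgraphs would be ranked by ascending divergence. Variances equal to zero must be handled separately (for instance by adding a small constant), which this sketch does not do.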
Example 3.3. If we use TransE as embedding model and average subgraph embeddings, then the embedding for q = (?, r, e) is q = e − r. Then, for every subgraph S ∈ U, the likelihood score is computed from the distance ‖S_µ − q‖ = ‖S_µ + r − e‖, so that subgraphs whose average embedding is closer to q are ranked higher.
The ranking depends on a parameter k which determines the number of the top subgraphs that should be considered. Higher values of k will lead to higher recalls since they increase the chance that more entities are included. The downside is that the runtime will also increase. Lower values of k will have the opposite effect.
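The TransE case with average subgraph embeddings can be sketched end to end: embed the query, rank the subgraphs, and collect the entities of the top k subgraphs. This is a minimal sketch that assumes the L2 norm; all names, vectors, and subgraph members are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_subgraphs(query_vec, subgraph_means, k):
    """Names of the k subgraphs whose average embedding is closest to the query."""
    ranked = sorted(
        subgraph_means,
        key=lambda name: l2_distance(query_vec, subgraph_means[name]),
    )
    return ranked[:k]

def candidate_entities(selected, members):
    """Union of the entities contained in the selected subgraphs."""
    candidates = set()
    for name in selected:
        candidates |= members[name]
    return candidates

# Hypothetical data: the query (?, r, e) is embedded as q = e - r.
e = [1.0, 1.0]
r = [1.0, 0.0]
q = [ei - ri for ei, ri in zip(e, r)]

subgraph_means = {
    "worksIn_London": [0.1, 0.9],
    "memberOf_UK_Parliament": [0.0, 1.2],
    "vehicles": [5.0, 5.0],
}
members = {
    "worksIn_London": {"alice", "bob"},
    "memberOf_UK_Parliament": {"bob", "carol"},
    "vehicles": {"car"},
}

selected = top_k_subgraphs(q, subgraph_means, k=2)
candidates = candidate_entities(selected, members)
```

Only the entities in `candidates` are then ranked with the entity-level score function, instead of all entities in the KG.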
Computing k Dynamically (Optional third stage). Finding an optimal value for k might not be trivial. We propose the following procedure to dynamically compute such a value. For a given q, we select all known answers to q from E_train ∪ E_val. Then, we compute the position of the first subgraphs that contain known answers and take the maximum value (up to a maximum threshold value of 50% of |U|). If there is no subgraph that contains the answer, or there are no answers for q in E_train ∪ E_val, then we set k = max(10, 0.1 × |U|), i.e., we set k equal to 10% of the number of subgraphs with a minimum value of 10 if there are fewer than 100 subgraphs.
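The procedure for choosing k can be sketched as follows. This is one possible reading of the description, not the released implementation: the rounding of 10% and 50% of |U| to integers and the lower bound of 1 on the cap are assumptions, and the function name and example data are hypothetical.

```python
def dynamic_k(ranked_subgraphs, members, known_answers):
    """Pick k from the positions of the first subgraphs that contain known answers."""
    n = len(ranked_subgraphs)
    positions = []
    for answer in known_answers:
        # 1-based position of the first ranked subgraph that contains this answer.
        for pos, name in enumerate(ranked_subgraphs, start=1):
            if answer in members[name]:
                positions.append(pos)
                break
    if not positions:
        # No known answers, or none of them appears in any subgraph.
        return max(10, int(0.1 * n))
    # Maximum position, capped at 50% of the number of subgraphs (assumed rounding).
    return min(max(positions), max(1, n // 2))

# Hypothetical ranking of eight subgraphs for one query.
ranked = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
members = {
    "s1": {"a"}, "s2": {"b"}, "s3": {"c"}, "s4": set(),
    "s5": set(), "s6": {"d"}, "s7": set(), "s8": set(),
}

k_found = dynamic_k(ranked, members, {"a", "c"})  # answers first appear at positions 1 and 3
k_capped = dynamic_k(ranked, members, {"d"})      # position 6 exceeds the 50% cap
k_default = dynamic_k(ranked, members, set())     # fallback value
```

When the fallback value exceeds the number of available subgraphs, as in this toy example, selecting the top k subgraphs simply returns all of them.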
All models were trained for up to 500 epochs. To train the models for LUBM, YAGO, and FB15K-237, we used a machine equipped with 64GB of RAM and two 8-core 2.4GHz CPUs. The training terminated within a few hours in the longest case. The model of Wikidata is significantly larger than the other three. Therefore, we used another machine with 1TB of RAM and four 12-core Intel Xeon E5 CPUs. In this case, training the model with TransE and 32 HOGWILD! [23] threads took approximately 13 days.
We created subgraphs with different values of τ . The runtime for the creation with LUBM, FB15K-237, and YAGO is within a few seconds. For Wikidata, it took about five hours. Table 3d shows the number of created subgraphs.
To compare our method against approximate kNN techniques, we considered the state-of-the-art implementation of IVFADC [13] provided in the library FAISS by Facebook [14]. This technique depends on two important hyperparameters: the number of centroids (c) and the number of bytes (b) per code. Recommended settings are that the number of centroids is between 4√N and 16√N, where N is the number of entities, while the bytes per code should be between 5 and 25. We performed a grid search within these ranges and obtained the best results with c = 4
Subgraph-based predictions. We performed a number of experiments to evaluate the predictions with our method on the test sets (E_test). For each triple, we performed a head (H) prediction (i.e., we try to predict answers for queries of the form (?, x, y)) and a tail (T) prediction (i.e., queries of the form (x, y, ?)). In these experiments, we included all subgraphs created with τ ≥ 10. We considered three metrics: Recall, %Reduction, and Mean Rank. With the recall, we measure how many times the test answer (head or tail) was among the entities of the selected subgraphs. This metric is important because it indicates how many times our method did not exclude true answers. We define %Reduction as 100 − (|A|/|V|) × 100, where A is the union of all entities in the selected subgraphs and V is the set of all entities. This metric shows how effective our method is in reducing the search space, because a higher value indicates that our method selected a much smaller fraction of all entities as potential answers. The last metric measures the position of the first subgraph where the answer was found. The ideal case would occur when both recall and %Reduction are maximum and the mean rank is minimum, but higher values of k will favor the recall over the %Reduction.
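The recall and %Reduction metrics can be sketched as follows. The function names and the toy inputs are hypothetical; recall is expressed here as a fraction of the test queries.

```python
def percent_reduction(selected_entities, num_entities):
    """%Reduction = 100 - |A| / |V| * 100, with A the entities in the selected subgraphs."""
    return 100.0 - len(selected_entities) / num_entities * 100.0

def recall(test_answers, candidate_sets):
    """Fraction of test queries whose answer is among the selected entities."""
    hits = sum(
        1 for answer, candidates in zip(test_answers, candidate_sets)
        if answer in candidates
    )
    return hits / len(test_answers)

# Hypothetical example: 3 selected entities out of 12, and two test queries.
reduction = percent_reduction({"alice", "bob", "carol"}, 12)  # 75.0
found = recall(["alice", "dave"], [{"alice", "bob"}, {"carol"}])  # 0.5
```

A larger k adds more entities to the selected set, which can only increase the recall and decrease the %Reduction, which is the trade-off described above.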
We use the recall to compare the performance of our method against IVFADC. To this end, for every query we configure IVFADC to retrieve the top x similar entities, where x is the number of entities contained in the top k subgraphs. For instance, suppose that our method is called to select the top three subgraphs, and for a given query it selected the subgraphs g_1, g_2, g_3 which contain the sets of entities x_1, x_2, x_3 respectively. Then, IVFADC is configured to retrieve the top x = |x_1 ∪ x_2 ∪ x_3| entities. In this way, the comparison is fair because we check the recall with the same number of answers.
Figure 2a shows the recall with all models on LUBM, YAGO, and FB15K-237. The recall is shown for head and tail predictions using average and Gaussian embeddings (score_µ and score_kl resp.), and IVFADC (score_n). Our method used the top 5, 10, 5%, 10% subgraphs or the dynamically computed top k ("Dyn") subgraphs.
We make three observations. First, the recall with the average subgraph embeddings (score_µ) is higher than with the Gaussian embeddings and it outperforms IVFADC in all cases, while the Gaussian embeddings outperform IVFADC in more than 80% of the cases. Second, the recall is poor if we consider only ≤ 10 subgraphs, especially with FB15K-237, which is the most challenging dataset. However, it increases significantly after selecting >5% of the subgraphs. Third, with LUBM the recall of tail predictions is higher than that of head predictions. The reason is that LUBM is a highly regular dataset and there are fewer objects. With the other two datasets, the recall of head predictions is higher in 70% of the cases (YAGO) and 58% of the cases (FB15K-237).
Figure 2b reports the mean rank of the first subgraph that contained the right answer (with τ = 10). On LUBM and YAGO, score_kl returns lower ranks than score_µ in 86% and 76% of the cases respectively. On FB15K-237, the two subgraph embeddings return similar ranks. Note that the mean ranks using the dynamic strategy are always higher than the mean ranks using the fixed k. This is because a few high values of k that are selected dynamically can drastically increase the mean rank. From a more general perspective, we observe that the KL-divergence used by the Gaussian subgraph embeddings is more effective in discovering the subgraphs with potential answers than the scoring function used with the average subgraph embeddings (given the lower mean rank).
Finally, Figure 2c reports the %Reduction with TransE (the results with the other methods are analogous). The figure shows that our method is very effective in reducing the search space. For YAGO and FB15K-237, the reduction is >50% in all cases. For LUBM, score_µ gives a higher reduction than score_kl in five out of six cases. Moreover, we observe that the dynamic strategy returns the lowest reduction rates. This result was expected since this procedure is designed to give preference to recall over reduction.
Figures 3a, 3b, and 3c report the recall, %Reduction, and normalized mean ranks using Wikidata and TransE (with a normalized mean rank, 1 corresponds to the size of the selected subgraphs). With this dataset, the number of subgraphs is significantly higher and this makes the predictions more challenging. This leads to a lower recall than with the other KGs, but it still remains above 50% if we use our dynamic procedure for the top k. The reduction of the search space is significant as it is >80%. The observed mean ranks are higher than with the other datasets, which is a result that follows from the fact that this is a much more challenging dataset.
Changing τ. Figure 3e shows how the recall and %Reduction are affected when we consider more or fewer subgraphs on LUBM, YAGO and FB15K-237 with TransE. If τ is too high, then there will be only a few subgraphs and our technique will not be effective. If τ is too small, then there will be too many subgraphs and it will be equally ineffective. A threshold value of 10 returns better recalls but a higher value leads to better reductions. From our experiments, it appears that τ = 10 is a good value for better recall, otherwise τ = 50 returns better reductions (and thus faster runtimes).
Runtimes. The metrics that we used so far have the advantage that they are hardware-independent. We have also quantified the runtime that is saved if we use our method to rank entities instead of considering all entities or using IVFADC. The runtime comprises (1) the computation of the likelihood scores between the query and all the subgraphs, (2) the ranking of the subgraphs accordingly, and (3) the ranking of the entities in the top k subgraphs. Figure 3f reports the runtime needed for 150 random queries on FB15K-237, LUBM, and YAGO using TransE and our smaller machine. We considered all three likelihood scores (score_µ, score_kl, and score_n), k ≥ 10%, and the "Dyn" strategy. From these results, we observe that the KL divergence is slower than IVFADC because Equation 3 is much slower to compute than the other likelihood scores and this cancels the gain obtained by considering fewer entities. We micro-benchmarked these runtimes and observed that while the computation of score_µ takes 200 µs, the computation of score_kl takes 6000 µs. This makes score_kl worthwhile only if the aggregation into subgraphs filters out some noise that would appear if we ranked all entities, or if the input contains subgraphs whose exclusion from the top k list yields a search space reduction that compensates for the higher ranking cost.
In general, we observe that link prediction with the average-based embeddings is faster than both IVFADC and the exhaustive ranking of all entities. From the results reported in Figures 2 and 3, we conclude that our method is able to significantly reduce the search space for relevant embeddings without excessively compromising the recall. Changing the number of top-k subgraphs leads to either a better recall or a better reduction. The dynamic procedure that automatically selects k appears to be a good compromise and relieves the user of the burden of finding an optimal value for this parameter. With this technique, average subgraph embeddings perform slightly better than the Gaussian ones. Finally, we observe that the performance of our method is not tied to a specific model. This suggests that it is a general method that can be applied to even more embedding models than the ones considered in this paper.

RELATED WORK AND CONCLUSION
Related work. The usage of star-shaped subgraph embeddings for KG completion was first proposed by Pal and Urbani [22]. In contrast to our work, the method at [22] adds special subgraph nodes to the original KG and then learns their embeddings like any other nodes. The main limitation of [22] is that adding extra links changes the topology of the graph and this affects the quality of the embeddings. Our approach does not suffer from this limitation.
More recently, the work at [27] proposes to find approximate answers to SPARQL queries using KG embeddings. This work is similar to ours since it also creates subgraph embeddings. However, the context and challenges are different since the method at [27] creates the embeddings on-the-fly for answering SPARQL queries while we create "query-independent" embeddings for link prediction.
Statistical relational learning methods have been applied to unlabeled graphs as well. A survey is available at [4]. Some of these methods create embeddings of subgraphs (e.g. [2,29]) but the graphs are unlabeled; thus they are easier to handle.
Conclusion. In this paper, we showed how aggregations of KG embeddings in the form of subgraph embeddings can significantly speed up the search for similar embeddings. Thus, they can be used to perform link prediction on very large KGs. Moreover, our technique is particularly useful if the hardware resources (or other constraints) do not allow an extensive search that considers all embeddings.
Our experiments on realistic KGs (YAGO, Freebase, Wikidata) and a benchmark dataset (LUBM) show that our technique outperforms k-nearest neighbor search and that it is able to significantly reduce the number of candidate entities while maintaining a good recall. Our results on Wikidata are particularly interesting because, as far as we know, they show for the first time how an embedding-based link prediction (TransE) can be applied to very large KGs with billions of facts. This enables the application of these techniques at a much larger scale than is currently feasible.
It is interesting, as future work, to investigate whether there are other types of subgraphs that can reduce the search space. Moreover, our method returns, like all other similar methods, a ranked list of potential candidate entities, but we still need a procedure to make the final binary selection for link prediction. External sources (or other inference methods) can play a role in this process, and exploring such integration is another interesting topic for future work.