Adaptive Low-level Storage of Very Large Knowledge Graphs

The increasing availability and usage of Knowledge Graphs (KGs) on the Web calls for scalable and general-purpose solutions to store this type of data structure. We propose Trident, a novel storage architecture for very large KGs on centralized systems. Trident uses several interlinked data structures to provide fast access to nodes and edges, with the physical storage changing depending on the topology of the graph to reduce the memory footprint. In contrast to architectures designed for a single task, our approach offers an interface with a few low-level, general-purpose primitives that can be used to implement tasks like SPARQL query answering, reasoning, or graph analytics. Our experiments show that Trident can handle graphs with 10^11 edges using inexpensive hardware, delivering competitive performance on multiple workloads.

come at the price of higher communication cost and increased system complexity [71]. Moreover, distributed solutions sometimes cannot be used because of financial or privacy-related constraints.
Centralized architectures, in contrast, do not have network costs, are commonly affordable, and provide enough resources to load all but the largest graphs. Some centralized storage engines have demonstrated that they can handle large graphs, but they focus primarily on supporting one particular type of workload (e.g., Ringo [71] supports graph analytics, while RDF engines like Virtuoso [67] or RDFox [60] focus on SPARQL [39]). To the best of our knowledge, we still lack a single storage solution that can handle very large KGs as well as support multiple workloads.
Our approach. In this paper, we fill this gap by presenting Trident, a novel storage architecture that can store very large KGs on centralized architectures, supports multiple workloads, such as SPARQL querying, reasoning, or graph analytics, and is resource-savvy. It therefore meets our goal of combining scalability and general-purpose computation.
We started the development of Trident by studying which are the most frequent access types performed during the execution of tasks like SPARQL answering, reasoning, etc. Some of these access types are node-centric (i.e., access subsets of the nodes), while others are edge-centric (i.e., access subsets of the edges). From this study, we distilled a small set of low-level primitives that can be used to implement more complex tasks. Then, the research focused on designing an architecture that supports the execution of these primitives as efficiently as possible, resulting in Trident.
At its core, Trident uses a dedicated data structure (a B+Tree or an in-memory array) to support fast access to the nodes, and a series of binary tables to store subsets of the edges. Since there can be many binary tables (possibly billions with the largest KGs), handling them with a relational DBMS can be problematic. To avoid this problem, we introduce a light-weight storage scheme where the tables are serialized on byte streams with only a little overhead per table. In this way, tables can be quickly loaded from the secondary storage without expensive pre-processing and offloaded in case the size of the database exceeds the amount of available RAM.
Another important benefit of our approach is that it allows us to exploit the topology of the graph to reduce its physical storage. To this end, we introduce a novel procedure that analyses each binary table and decides, at loading time, whether the table should be stored in a row-by-row, column-by-column, or cluster-based fashion. In this way, the storage engine effectively adapts to the input. Finally, we introduce other dynamic procedures that decide, at loading time, whether some tables can be ignored due to their small sizes or whether the content of some tables can be aggregated to further reduce the space.
Since Trident offers low-level primitives, we built interfaces to several engines (RDF3X [63], VLog [89], SNAP [50]) to evaluate the performance of SPARQL query answering, Datalog reasoning, and graph analytics on various types of graphs. Our comparison against the state of the art shows that our approach can be highly competitive in multiple scenarios.
Contribution. We identify the following as the main contributions of this paper.
• We propose a new architecture to store very large KGs on a centralized system. In contrast to other engines that store the KG in a few data structures (e.g., relational tables), our architecture exhaustively decomposes the storage into many binary tables such that it supports both node- and edge-centric operations via a small number of primitives;
• Our storage solution adapts to the KG as it uses different layouts to store the binary tables depending on its topology. Moreover, some binary tables are either skipped or aggregated to save further space. The adaptation of the physical storage is, as far as we know, a unique feature which is particularly useful for highly heterogeneous graphs, such as the KGs on the Web;
• We present an evaluation with multiple workloads, and the results indicate highly competitive performance while maintaining good scalability. In some of our largest experiments, Trident was able to load and process KGs with up to 10^11 (100B) edges with hardware that costs less than $5K.
The source code of Trident is freely available with an open source license at https://github.com/karmaresearch/trident, along with links to the datasets and instructions to replicate our experiments.

PRELIMINARIES
A graph G = (V, E, L, ϕ_V, ϕ_E) is a tuple where V, E, L represent the sets of nodes, edges and labels respectively, ϕ_V is a bijection that maps each node to a label in L, while ϕ_E is a function that maps each edge to a label in L. We assume that there is at most one edge with the same label between any pair of nodes. Throughout, we use the notation r(s, d) to indicate the edge with label r from the node with label s (source) to the node with label d (destination). We say that the graph is undirected if r(s, d) ∈ E implies that also r(d, s) ∈ E. Otherwise, the graph is directed. A graph is unlabeled if all edges map to the same label. In this paper, we will mostly focus on labeled directed graphs since undirected or unlabeled graphs are special cases of labeled directed graphs.
In practice, it is inefficient to store the graph using the raw labels as identifiers. The most common strategy, which is the one we also follow, consists of assigning a numerical ID to each label in L and storing each edge r(s, d) as the tuple ⟨ι_s, ι_r, ι_d⟩, where ι_s, ι_r, and ι_d are the IDs associated to s, r, and d respectively.
The numerical IDs allow us to sort the edges, and by permuting ι_s, ι_r, ι_d we can define six possible ordering criteria. We use strings of three characters over the alphabet {s, r, d} to identify these orderings, e.g., srd specifies that the edges are ordered by source, relation, and destination. We denote with R = {srd, sdr, ...} the collection of six orderings while R′ = {s, r, d, sr, rs, sd, ds, dr, rd} specifies all partial orderings. We use the function isprefix to check whether string a is a prefix of b, i.e., isprefix(a, b) ∈ {true, false}, and the operator − to remove all characters of one string from another one (e.g., if a = srd and b = sd, then a − b = r).
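The notation above can be made concrete with a few lines of Python. This is only an illustrative sketch: the names `R`, `R_PRIME`, `minus`, and `sort_edges` are chosen here for readability and are not part of Trident's interface; edges are assumed to be encoded as tuples (ι_s, ι_r, ι_d).

```python
from itertools import permutations

FIELDS = "srd"

# R: the six full orderings; R_PRIME: the nine partial orderings (length 1 or 2).
R = ["".join(p) for p in permutations(FIELDS, 3)]
R_PRIME = ["".join(p) for k in (1, 2) for p in permutations(FIELDS, k)]


def isprefix(a: str, b: str) -> bool:
    """Return True if string a is a prefix of string b."""
    return b.startswith(a)


def minus(a: str, b: str) -> str:
    """The '-' operator: remove from a every character that occurs in b."""
    return "".join(ch for ch in a if ch not in b)


def sort_edges(edges, ordering):
    """Sort encoded edges (s, r, d) according to an ordering such as 'rds'."""
    pos = {"s": 0, "r": 1, "d": 2}
    return sorted(edges, key=lambda e: tuple(e[pos[c]] for c in ordering))
```

For instance, `minus("srd", "sd")` yields `"r"`, matching the example in the text.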
Let V be a set of variables. A simple graph pattern (or triple pattern) is an instance of (L ∪ V) × (L ∪ V) × (L ∪ V) and we denote it as (X, Y, Z) where X, Y, Z ∈ L ∪ V. A graph pattern is a finite set of simple graph patterns. Let σ : V → L be a partial function from variables to labels. With a slight abuse of notation, we also use σ as a postfix operator that replaces each occurrence of the variables in σ with the corresponding node. Given the graph G and a simple graph pattern q, the answers for q on G correspond to the set ans(G, q) = {r(s, d) | r(s, d) ∈ E ∧ qσ = (s, r, d)}. Function bound(p) returns the positions of the labels in the simple graph pattern p left-to-right, i.e., if p = (X, a, b) where X ∈ V and a, b ∈ L, then bound(p) = rd.
A Knowledge Graph (KG) is a directed labeled graph where nodes are entities and edges establish semantic relations between them, e.g., ⟨Sadiq_Khan, mayorOf, London⟩. Usually, KGs are published on the Web using the RDF data model [45]. In this model, data is represented as a set of triples of the form ⟨subject, predicate, object⟩ drawn from (I ∪ B) × I × (I ∪ L) where I, B, L denote sets of IRIs, blank nodes and literals respectively. Let T = I ∪ B ∪ L be the set of all RDF terms. RDF triples can be trivially seen as a graph where the subjects and objects are the nodes, triples map to edges labeled with their predicate name, and L = T.
SPARQL [39] is a language for querying knowledge graphs which has been standardized by W3C. It offers many SQL-like operators like UNION, FILTER, DISTINCT to specify complex queries and to further process the answers. Every query contains at its core a graph pattern, which is called Basic Graph Pattern (BGP) in the SPARQL terminology. SPARQL graph patterns are defined over T ∪ V and their answers are mappings σ from V to T. Therefore, answering a SPARQL graph pattern P over a KG G corresponds to computing ans(G, p_1) ∩ ... ∩ ans(G, p_|P|) and retrieving the corresponding labels.
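As a small illustration of ans and bound, the following Python sketch evaluates a simple graph pattern by brute force. It assumes a convention that is not in the paper: edges are stored as tuples (s, r, d) of labels, and variables are strings starting with "?".

```python
def is_var(term) -> bool:
    """Hypothetical convention: variables are strings that start with '?'."""
    return isinstance(term, str) and term.startswith("?")


def bound(p) -> str:
    """Positions of the constants in p = (X, Y, Z), read left to right.

    The three positions are named 's', 'r', 'd' (source, relation, destination).
    """
    return "".join(name for name, term in zip("srd", p) if not is_var(term))


def matches(p, edge) -> bool:
    """True if some substitution sigma maps pattern p onto the edge (s, r, d)."""
    sigma = {}
    for term, value in zip(p, edge):
        if is_var(term):
            # A repeated variable must be bound to the same value everywhere.
            if sigma.setdefault(term, value) != value:
                return False
        elif term != value:
            return False
    return True


def ans(edges, p):
    """All edges of the graph that are answers for the simple graph pattern p."""
    return [e for e in edges if matches(p, e)]
```

With p = ("?X", "a", "b"), `bound(p)` returns `"rd"`, as in the example above.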

GRAPH PRIMITIVES
We start our discussion with a description of the low-level primitives that we wish to support. We distilled these primitives considering four types of workloads: SPARQL [39] query answering, which is the most popular language for querying KGs; Rule-based reasoning [7], which is an important task in the Semantic Web to infer new knowledge from KGs; Algorithms for graph analytics, or network analysis, since these are widely applied on KGs either to study characteristics like the graph's topology or degree distribution, or within more complex pipelines; Statistical relational models [64], which are effective techniques to make predictions using the KG as prior evidence.
If we take a closer look at the computation performed in these tasks, we can make a first broad distinction between edge-centric and node-centric operations. The first ones can be defined as operations that retrieve subsets of edges that satisfy some constraints. In contrast, operations of the second type retrieve various data about the nodes, like their degree. Some tasks, like SPARQL query answering, depend more heavily on edge-centric operations while others depend more on node-centric operations (e.g., random walks).

Table 1
        Name             Output
f_1     lbl_n(G, n)      Label of node n (equal to ϕ_V(n)).
f_2     lbl_e(G, e)      Label of edge e (equal to ϕ_E(e)).
f_3     nodid(G, l)      ι_l, i.e., the ID of the node with label l.
f_4     edgid(G, l)      ι_l, i.e., the ID of the edge label l.
f_5     edg_srd(G, p)    ans(G, p) sorted by srd.
f_6     edg_sdr(G, p)    ans(G, p) sorted by sdr.
f_7     edg_drs(G, p)    ans(G, p) sorted by drs.
f_8     edg_dsr(G, p)    ans(G, p) sorted by dsr.
f_9     edg_rsd(G, p)    ans(G, p) sorted by rsd.
f_10    edg_rds(G, p)    ans(G, p) sorted by rds.
f_11    grp_s(G, p)      All s of ans(G, p).
f_12    grp_r(G, p)      All r of ans(G, p).

Graph Primitives. Following a RISC-like approach, we identified a small number of low-level primitives that can act as basic building blocks for implementing both node- and edge-centric operations. These primitives are reported in Table 1 and are described below.
f_1 − f_4. These primitives retrieve the numerical IDs associated with labels and vice-versa. The primitives f_1 and f_2 retrieve the labels associated with nodes and edges respectively. The primitives f_3 and f_4 retrieve the numerical IDs associated with labels.
This computation is useful in a number of cases: For instance, it can be used to optimize the computation of SPARQL queries by rearranging the join ordering depending on the cardinalities of the triple patterns or to compute the degree of nodes in the graph.
f_18 − f_23. These primitives return the i-th edge that would be returned by the corresponding primitives edg*. In practice, this operation is needed in several graph analytics algorithms or for mini-batching during the training of statistical relational models.
Table 1 to answer the SPARQL query of Example 1, assuming that the KG is called I.
• First, we retrieve the IDs of the labels isA, livesIn, and Rome. To this end, we can use the primitives f_3 and f_4.
• Then, we create two simple graph patterns p_1 and p_2 which map to the first and second triple patterns respectively. Then, we execute edg_rsd(I, p_1) and edg_drs(I, p_2) so that the edges are returned in an order suitable for a merge join.
• We invoke the primitive f_1 to retrieve all the labels of the nodes which are returned by the join algorithm. These labels are then used to construct the answers of the query.
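The three steps can be sketched in Python on a toy graph. Everything below is an assumption made for illustration: the data, the dictionary, the stand-in function `edg`, and the shape of the two patterns, taken here to be p_1 = (X, isA, Y) and p_2 = (X, livesIn, Rome), which is consistent with the orderings rsd and drs mentioned above but is not confirmed by the text.

```python
# Toy dictionary and encoded edges (s, r, d); all values are made up.
LABEL_TO_ID = {"alice": 0, "bob": 1, "carol": 2, "Person": 3,
               "Rome": 4, "Paris": 5, "isA": 6, "livesIn": 7}
ID_TO_LABEL = {i: label for label, i in LABEL_TO_ID.items()}
EDGES = [(0, 6, 3), (1, 6, 3), (2, 6, 3), (0, 7, 4), (1, 7, 5), (2, 7, 4)]
POS = {"s": 0, "r": 1, "d": 2}


def edg(edges, pattern, ordering):
    """Stand-in for the edg_* primitives; None marks a variable position."""
    hits = [e for e in edges
            if all(t is None or t == v for t, v in zip(pattern, e))]
    return sorted(hits, key=lambda e: tuple(e[POS[c]] for c in ordering))


def merge_join_on_source(left, right):
    """Merge join of two edge lists that are both sorted by source ID."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            s = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == s:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == s:
                j_end += 1
            out.extend((l, r) for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out


def answer_query():
    # Step 1: label -> ID lookups (the role of f_3 and f_4).
    is_a, lives_in, rome = (LABEL_TO_ID[x] for x in ("isA", "livesIn", "Rome"))
    # Step 2: both scans return edges sorted by source, ready for a merge join.
    left = edg(EDGES, (None, is_a, None), "rsd")
    right = edg(EDGES, (None, lives_in, rome), "drs")
    joined = merge_join_on_source(left, right)
    # Step 3: ID -> label lookups (the role of f_1).
    return [(ID_TO_LABEL[l[0]], ID_TO_LABEL[l[2]]) for l, _ in joined]
```

Because the relation is fixed in the first scan and both relation and destination are fixed in the second, the two result lists are ordered by source, which is what the merge join requires.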

ARCHITECTURE
One straightforward way to implement the primitives in Table 1 is to store the KG in many independent data structures that provide optimal access for each function. However, such a solution will require a large amount of space and updates will be slow. It is challenging to design a storage engine that uses fewer data structures without excessively compromising the performance. Moreover, KGs are highly heterogeneous objects where some subgraphs have a completely different topology than others. The storage engine should take advantage of this diversity and potentially store different parts of the KG in different ways, effectively adapting to its structure. This adaptation is lacking in current engines, which treat the KG as a single object to store.
Our architecture addresses these two problems with a compact storage layer that supports the execution of primitives f_1, ..., f_23 with a minimal compromise in terms of performance, and in such a way that the engine can adapt to the input KG by selecting the best strategy to store its parts. Figure 1 gives a graphical view of our approach. It uses a series of interlinked data structures that can be grouped in three components. The first one contains data structures for the mappings ID ⇔ label. The second component is called edge-centric storage and contains data structures for providing fast access to the edges. The third one is called node-centric storage and offers fast access to the nodes. Section 4.1 describes these components in more detail. Section 4.2 discusses how they allow an efficient execution of the primitives, while Section 4.3 focuses on loading and updating the database.

Architectural Components
Dictionary. We store the labels on a block-based byte stream on disk. We use one B+Tree called DICT_ι to index the mappings ID ⇒ label and another one called DICT_l for label ⇒ ID. Using B+Trees here is usual, so we will not discuss it further. It is important to note that assigning a certain ID to a term rather than another one might have a significant impact on the performance. For instance, Urbani et al. [88] have shown that a careful choice of the IDs can introduce important speedups due to the improved data locality. Typically, current graph engines assign unique IDs to all labels, irrespective of whether a label is used as an entity or as a relation. This is desirable for SPARQL query answering because all joins can operate on the IDs directly. There are cases, however, where unique ID assignments are not optimal. For instance, most implementations of techniques for creating KG embeddings (e.g., TransE [16]) store the embeddings for the entities and the ones for relations in two contiguous vectors, and use offsets in the vectors as IDs. If the labels for the relations share the IDs with the entities, then the two vectors must have the same number of elements. This is highly inefficient because KGs have many fewer relations than entities, which means that much space in the second vector will be unused. To avoid this problem, we can assign IDs to entities and relations independently. In this way, no space is wasted in storing the embeddings. Note that Trident supports both global ID assignments and independent entity/relation assignments with an additional index specifically for the relation labels. The first type of assignment is needed for tasks like SPARQL query answering while the second is useful for operations like learning graph embeddings [64].
Edge-centric storage. In order to adapt to the complex and non-uniform topology of current KGs, we do not store all edges in a single data structure, but store subsets of the edges independently. These subsets correspond to the edges which share a specific entity/relation. More specifically, let us assume that we must store the graph G = (V, E, L, ϕ_V, ϕ_E). For each l ∈ L, we consider three types of subsets:
E_s(l) = {r(s, d) ∈ E | s = l}, E_r(l) = {r(s, d) ∈ E | r = l}, E_d(l) = {r(s, d) ∈ E | d = l},
i.e., the subsets of edges that have l as source, edge label, or destination respectively.
The choice of separating the storage of various subsets allows us to choose the best data structure for a specific subset, but it hinders the execution of inter-table scans, i.e., scans where the content of multiple tables must be taken into account. To alleviate this problem, we organize the physical storage in such a way that all edges can still be retrieved by scanning a contiguous memory location.
We proceed as follows: Let Ω be the collection of all these sets. For each E_x(l) ∈ Ω, we construct two sets of tuples, F_x(l) and G_x(l), by extracting the free fields left-to-right and right-to-left respectively. For instance, the set E_s(l) results in the sets F_s(l) = {⟨r, d⟩ | r(l, d) ∈ E} and G_s(l) = {⟨d, r⟩ | r(l, d) ∈ E}. Since these sets contain pairs of elements, we view them as binary tables. These are grouped into six sets, one per kind of table: {F_s(l) | l ∈ L}, {G_s(l) | l ∈ L}, {F_r(l) | l ∈ L}, {G_r(l) | l ∈ L}, {F_d(l) | l ∈ L}, and {G_d(l) | l ∈ L}. The content of these six sets is serialized on disk in corresponding byte streams called TS, TS′, TR, TR′, TD, and TD′ respectively (see middle section of Figure 1). The serialization is done by first sorting the binary tables by their defining label IDs, and then serializing each table one-by-one.
At the beginning of the byte stream, we store the list of all IDs associated to the tables, pointers to the tables' physical location and instructions to parse them.
Since the binary tables and tuples are serialized on the byte stream in a specific order, we can retrieve all edges sorted with any ordering in R with a single scan of the corresponding byte stream, using the content stored at the beginning of the stream to decode the binary tables in it. For instance, we can scan TS to retrieve all edges sorted according to srd. The IDs stored at the beginning of the stream specify the sources of the edges (s) while the content of the tables specifies the remaining relations and destinations (r and d).
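The decomposition into binary tables can be illustrated with an in-memory Python sketch. Dictionaries stand in for the byte streams and their headers; the function names and the representation are assumptions made for this example and say nothing about Trident's actual on-disk format.

```python
from collections import defaultdict

STREAM_NAMES = ("TS", "TS'", "TR", "TR'", "TD", "TD'")


def build_tables(edges):
    """Decompose encoded edges (s, r, d) into six families of binary tables.

    Returns {stream name: {defining ID: sorted list of pairs}}, with the
    defining IDs in increasing order, mimicking the serialization order.
    """
    streams = {name: defaultdict(list) for name in STREAM_NAMES}
    for s, r, d in edges:
        streams["TS"][s].append((r, d))    # F_s: free fields left-to-right
        streams["TS'"][s].append((d, r))   # G_s: free fields right-to-left
        streams["TR"][r].append((s, d))    # F_r
        streams["TR'"][r].append((d, s))   # G_r
        streams["TD"][d].append((s, r))    # F_d
        streams["TD'"][d].append((r, s))   # G_d
    return {name: {key: sorted(table) for key, table in sorted(tables.items())}
            for name, tables in streams.items()}


def scan(stream):
    """Full scan of one stream: yield (defining ID, first field, second field)."""
    for key, table in stream.items():
        for first, second in table:
            yield (key, first, second)
```

Scanning the stand-in for TS yields the edges in srd order, while scanning TR′ yields triples (r, d, s) in rds order, which mirrors the single-scan property described above.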
Node-centric storage. In order to provide fast access to the nodes, we map each ID ι_l (i.e., the ID assigned to label l) to a tuple M_l that contains 15 fields:
• the cardinalities |E_s(l)|, |E_r(l)|, and |E_d(l)|;
• six pointers p_1, ..., p_6 to the physical storage of the six binary tables associated with l;
• six bytes m_1, ..., m_6 that contain instructions to read the data structures pointed to by p_1, ..., p_6. These instructions are necessary because the tables are stored in different ways (see Section 5).
We index all M_* tuples by the numerical IDs using one global data structure called NM (Node Manager), shown on the left side of Figure 1. This data structure is implemented either with an on-disk B+Tree or with an in-memory sorted vector (the choice is made at loading time). The B+Tree is preferable if the engine is used for edge-based computation because the B+Tree does not need to load all nodes in main memory and the nodes are accessed infrequently anyway. In contrast, the sorted vector provides much faster access (O(1) vs. O(log |L|)) but it requires that the entire vector is stored in main memory. Thus, it is suitable only if the application accesses the nodes very frequently and there are enough hardware resources.
Note that the coordinates of the binary tables are stored both in NM and in the meta-data in front of the byte streams. This means that a table can be accessed either by accessing NM, or by scanning the beginning of the byte stream. In our implementation, we consult NM when we need to answer graph patterns with at least one constant element (e.g., for answering the query in Example 1). In contrast, the meta-content at the beginning of the stream is used when we must perform a full scan.
The way we store the binary tables in six byte streams resembles six-permutation indexing schemes such as those proposed in engines like RDF3X [63] or Hexastore [93]. There are, however, two important differences: First, in our approach the edges are stored in multiple independent binary tables rather than a single series of ternary tuples (as, for instance, in RDF3X [63]). This division is important because it allows us to choose different serialization strategies for subgraphs or to avoid storing some tables (Section 5.3). The second difference is that in our case most access patterns go through a single B+Tree instead of six different data structures. This allows us to save space and to store additional information about the nodes, e.g., their degree, which is useful, for instance, for traversal algorithms like PageRank or random walks.

Primitive Execution
We now discuss how we can implement the primitives in Table 1 with our architecture.
Primitives f_1, ..., f_4 (lbl*, nodid, edgid). These are executed consulting either DICT_l or DICT_ι. Thus, the time complexity follows in a straightforward manner.
Primitives f_5, ..., f_10 (edg*). Let edg_ω(G, p) be a generic invocation of one of f_5, ..., f_10. First, we need to retrieve the numerical IDs associated to the labels in p (if any). Then, we select an ordering that 1) allows us to retrieve the answers for p with a range scan, and 2) complies with ω. The orderings that satisfy 1) are
Ω = {ω′ ∈ R | isprefix(bound(p), ω′) = true}.
An ordering ω′ ∈ Ω which also satisfies 2) is one for which
ω′ − bound(p) = ω − bound(p).
The selected ω′ is associated to one byte stream. If p contains one or more constants, then we can query NM to retrieve the appropriate binary table from that byte stream and (range-)scan it to retrieve the answers of p. In contrast, if p only contains variables, the results can be obtained by scanning all tables in the byte stream. Note that the cost of retrieving the IDs for the labels in p is O(log |L|) since we use B+Trees for the dictionary. This is an operation that is applied any time the input contains a graph pattern. If we ignore this cost and look at the remaining computation, then we can make the following observation.
Primitives f_11, ..., f_16 (grp*). Let grp_ω(G, p) be a general call to one of these primitives. Note that in this case ω ∈ R′, i.e., it is a partial ordering. These functions can be implemented by invoking f_5, ..., f_10 and then returning an aggregated version of the results. Thus, they have the same cost as the previous ones.
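The selection of the ordering can be sketched in Python. The sketch rests on one plausible reading of conditions 1) and 2), built only from the functions bound, isprefix and − defined in the preliminaries: condition 1) keeps the orderings that have bound(p) as a prefix, and condition 2) requires the remaining fields to appear in the same order as in ω. The mapping `STREAM_OF` from orderings to byte streams is likewise an assumption, derived from the field order of the tables F_x and G_x.

```python
from itertools import permutations

R = ["".join(p) for p in permutations("srd")]

# Assumed mapping: the first letter is the defining field of the tables,
# the other two give the column order inside each binary table.
STREAM_OF = {"srd": "TS", "sdr": "TS'", "rsd": "TR",
             "rds": "TR'", "dsr": "TD", "drs": "TD'"}


def minus(a: str, b: str) -> str:
    """Remove from a every character that occurs in b."""
    return "".join(ch for ch in a if ch not in b)


def select_ordering(omega: str, bound_p: str) -> str:
    """Pick the ordering that supports a range scan for p and complies with omega.

    bound_p is the string returned by bound(p), e.g. 'rd' for p = (X, a, b).
    """
    # Condition 1: the constants of p form a prefix of the ordering.
    candidates = [w for w in R if w.startswith(bound_p)]
    # Condition 2: the free fields appear in the same order as in omega.
    matching = [w for w in candidates
                if minus(w, bound_p) == minus(omega, bound_p)]
    return matching[0]
```

For the two scans of the earlier SPARQL example, this reading selects rsd for edg_rsd with bound(p_1) = r and rds for edg_drs with bound(p_2) = rd, i.e., the streams TR and TR′ under the assumed mapping.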
However, there are special cases where the computation is quicker, as shown in the next example.
Example 4. Consider a call to grp s (G, p) where p = ⟨a, X , Y ⟩. In this case, we can query NM with a and return at most one tuple with the cardinality stored in M a , which has a cost of O(log(|L|)).
If ω has length two or p contains a repeated variable, then we also need to access one or more binary tables, similarly as before.
Primitive f_17 (count). This primitive returns the cardinality of the output of f_5, ..., f_16. Therefore, it can be simply implemented by iterating over the results returned by these functions. However, there are cases when we can avoid this iteration. Some of these cases are listed below:
• If the input is edg_ω(G, p) and p contains neither constants nor repeated variables, then the output is |E|.
• If the input is edg_ω(G, p) and p contains only one constant c and no repeated variables, then the cardinality is stored in M_c.
• If the input is grp_ω(G, p), isprefix(ω, ω′) = true, and p contains at most one constant and no repeated variables, then the output can be obtained either by consulting NM or the metadata of one of the byte streams.
Otherwise, we also need to access one binary table to compute the results, which, in the worst case, takes O(|E|).
Primitives f 18 , . . . , f 23 (pos * ). In order to efficiently support these primitives, we need to provide a fast random access to the edges. Given a generic pos ω (G, p, i), we distinguish four cases: However, in practice tables have more than one row so we can advance more quickly despite the higher worst-case complexity.

Bulk Loading and Updates
Bulk Loading. Loading a large KG can be a lengthy process, especially if the resources are constrained. In Trident, we developed a loading routine which exploits the multi-core architecture and maximizes the (limited) I/O bandwidth. The main operations are shown in Figure 2. Our implementation can receive the input KG in multiple formats. Currently, we consider the N-Triples format (popular in the Semantic Web) and the SNAP format [49] (used for generic graphs). The first operation is encoding the graph, i.e., assigning unique IDs to the entities and relation labels. For this task, we adapted the MapReduce technique presented in [90] to work in a multi-core environment. This technique first deconstructs the triples, then assigns unique IDs to all the terms, and finally reconstructs the triples. If the graph is already encoded, then our procedure skips the encoding and proceeds to the second operation of the loading, the creation of the database.
The creation of the binary tables requires that the triples are presorted according to a given ordering. We use a disk-based parallel merge sort algorithm for this purpose. The tables are serialized one-by-one selecting the most efficient layout for each of them. After all the tables are created, the loading procedure will create the NM and the B+Trees with the dictionaries. The encoding and sorting procedures are parallelized using threads, which might need to communicate with the secondary storage. Modern architectures can have >64 cores, but such a number of threads can easily saturate the disk bandwidth and cause serious slowdowns. To avoid this problem, we have two types of threads: Processing threads, which perform computation like sorting, and I/O threads, which only read and write from disk. In this way, we can control the maximum number of concurrent accesses to the disks.
Updates. To avoid a complete re-loading of the entire KG after each change, our implementation supports incremental updates. Our procedure is built following the well-known advice by Jim Gray [29] that discourages in-place updates, and it is inspired by the idea of differential indexing [63], which proposes to create additional indices and perform a lazy merging with the main database when the number of indices becomes too high.
Our procedure first encodes the update, which can be either an addition or removal, and then stores it in a smaller "delta" database with its own NM and byte streams. Multiple updates will be stored in multiple databases, which are timestamped to remember the order of updates. Also, updates create an extra dictionary if they introduce new terms. Whenever the primitives are executed, the content of the updates is combined with the main KG so that the execution returns an updated view of the graph.
In contrast to differential indexing, our merging does not copy the updates in the main database, but only groups them in two updates, one for the additions and one for the removals. This is to avoid the process of rebuilding binary tables with possibly different layouts. If the size of the merged updates becomes too large, then we proceed with a full reload of the entire database.
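The update mechanism can be modeled with plain Python sets. This is a deliberately simplified, in-memory sketch: the representation of a delta as a triple (timestamp, kind, edges) and both function names are assumptions for illustration, whereas in Trident each delta is a small database with its own NM and byte streams.

```python
def current_view(main_edges, updates):
    """Combine the main KG with its timestamped deltas into an updated view.

    updates is an iterable of (timestamp, kind, edges) with kind equal to
    'add' or 'remove'; deltas are applied in timestamp order.
    """
    view = set(main_edges)
    for _, kind, edges in sorted(updates, key=lambda u: u[0]):
        if kind == "add":
            view |= set(edges)
        else:
            view -= set(edges)
    return view


def merge_updates(main_edges, updates):
    """Group many deltas into two: one set of additions, one of removals.

    The main database itself is left untouched, so no binary table has to
    be rebuilt.
    """
    main = set(main_edges)
    final = current_view(main, updates)
    return final - main, main - final
```

After merging, the view of the graph is (main ∪ additions) − removals, which is the same view that the individual deltas produced.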

ADAPTIVE STORAGE LAYOUT
The binary tables can be serialized in different ways. For instance, we can store them row-by-row or column-by-column. Using a single serialization strategy for the entire KG is inefficient because the tables can be very different from each other, so one strategy may be efficient with one table but inefficient with another. Our approach addresses this inefficiency by choosing the best serialization strategy for each table depending on its size and content.
For example, consider two tables T_1 and T_2. Table T_1 contains all the edges with label "isA", while T_2 contains all the edges with label "isbnValue". These two tables are not only different in terms of sizes, but also in the number of duplicated values. In fact, the second column of T_1 is likely to contain many more duplicate values than the second column of T_2 because there are (typically) many more instances than classes while "isbnValue" is a functional property, which means that every entity in the first column is associated with a unique ISBN code. In this case, it makes sense to serialize T_1 in a column-by-column fashion so that we can apply run-length encoding (RLE) [1], a well-known compression scheme for repeated values, to save space when storing the second column. This type of compression would be ineffective with T_2 since there each value appears only once. Therefore, T_2 can be stored row-by-row.
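To make the effect of RLE concrete, here is a minimal Python sketch of the scheme applied to a single column. The function names are illustrative, and the encoding as a list of (value, run length) pairs is one common variant, not necessarily the one used by Trident.

```python
from itertools import groupby


def rle_encode(column):
    """Run-length encode a column as a list of (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]
```

A column with long runs, such as `[3, 3, 3, 3, 5, 5]`, shrinks to two pairs, whereas a column of unique values, such as `[1, 2, 3]`, produces one pair per value and therefore gains nothing, which is the situation described for T_2.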
In our approach, we consider three different serialization strategies, which we call serialization layouts (or simply layouts) and employ an ad-hoc procedure to select, for each binary table, the best layout among these three.

Serialization Layouts
We refer to the three layouts that we consider as row, column, and cluster layouts respectively. The first layout stores the content row-by-row, the second column-by-column, while the third uses an intermediate representation.
Row layout. Let T = ⟨⟨t′_1, t′′_1⟩, ..., ⟨t′_n, t′′_n⟩⟩ be a binary table that contains n sorted pairs of elements. With this layout, the pairs are stored one after the other. In terms of space consumption, this layout is optimal if the two columns do not contain any duplicated value. Moreover, if each row takes a fixed number of bytes, then it is possible to perform binary search or perform a random access to a subset of rows. The disadvantage is that with this layout all values are explicitly written on the stream while the other layouts allow us to compress duplicate values.
Column layout. With this layout, the elements in T are serialized as ⟨t′_1, ..., t′_n⟩, ⟨t′′_1, ..., t′′_n⟩. The space consumption required by this layout is equal to the previous one but with the difference that here we can use RLE to reduce the space of ⟨t′_1, ..., t′_n⟩. In fact, if t′_1 = t′_2 = ... = t′_n, then we can simply write t′_1 × n. This layout also allows binary search and random access to the table. However, it is slightly less efficient than the row layout for full scans because here one row is not stored at contiguous locations, and the system needs to "jump" between columns in order to return the entire pair. On the other hand, this layout is more suitable than the row layout for aggregate reads (required, for instance, for executing grp primitives) because in this case we only need to read the content of one column which is stored at contiguous locations.
Cluster layout. Let g_t = ⟨⟨t, t′′_k⟩, ..., ⟨t, t′′_l⟩⟩ be the longest subsequence of pairs in T which share the first term t. With this layout, all groups are first ordered in the sequence ⟨g_t_1, ..., g_t_i, g_t_i+1, ..., g_t_m⟩ such that t_i ≤ t_i+1 for all 1 ≤ i < m. Then, they are serialized one-by-one. Each group g_t is serialized by first writing t, then |g_t|, and finally the list t′′_k, ..., t′′_l. This layout needs less space than the row layout if the groups contain multiple elements. Otherwise, it uses more space because it also stores the size of the groups, and this takes an extra ⌈log_2 n⌉ bits. Another disadvantage is that with this layout binary search is only possible within one group.
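The three layouts can be compared on a small table with the following Python sketch. It serializes to lists of integers rather than bytes, and the function names and the (value, run length) form of RLE are assumptions made to keep the example short.

```python
from itertools import groupby


def row_layout(table):
    """Row layout: t'_1, t''_1, ..., t'_n, t''_n written one pair after the other."""
    return [x for pair in table for x in pair]


def column_layout(table):
    """Column layout: first column (run-length encoded), then the second column."""
    first = [(v, len(list(run))) for v, run in groupby(a for a, _ in table)]
    second = [b for _, b in table]
    return first, second


def cluster_layout(table):
    """Cluster layout: for each group, write t, then |g_t|, then the second terms."""
    out = []
    for t, group in groupby(table, key=lambda pair: pair[0]):
        seconds = [b for _, b in group]
        out.extend([t, len(seconds), *seconds])
    return out
```

For the sorted table `[(1, 4), (1, 7), (1, 9), (2, 4), (5, 1)]`, the row layout writes 10 numbers and the cluster layout 11: the group of three pairs saves two numbers, but each of the two singleton groups costs one extra number for its size, which illustrates why the cluster layout only pays off when groups contain multiple elements.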

Dynamic Layout Selection
The procedure for selecting the best layout for each table is reported in Algorithm 1. Its goal is to select the layout which leads to the best compression without excessively compromising the performance. In our implementation, Algorithm 1 is applied by default, but the user can disable it and use one layout for all tables.
The procedure receives as input a binary table $T$ with $n$ rows and returns a tuple that specifies the layout that should be chosen. It proceeds as follows. First, it distinguishes tables that have fewer than τ rows (the default value of τ is 1M) and contain fewer than υ unique elements in the first column from tables that do not (line 2). We make this distinction because 1) if the number of rows is too high, then searching for the optimal layout becomes expensive and 2) if the number of unique elements in the first column (i.e., the number of groups) is too high, then the cluster layout should not be used due to the lack of support for binary search across groups. With small tables, this is not a problem because in these cases linear search is faster than binary search due to a better usage of the CPU's cache memory. The value for υ is automatically determined with a small routine that performs some micro-benchmarks to identify the threshold after which binary search becomes faster. In our experiments, this value ranged between 16 and 64 elements.
If the table satisfies the condition of line 2, then the algorithm selects either the ROW or the CLUSTER layout. The COLUMN layout is not considered because its main benefit over the other two is better compression (e.g., RLE), but this benefit is limited anyway if the table is small. The procedure scans the table and keeps track of the largest numbers and groups used in the table ($m_1$, $m_2$, $m_3$). Then, the function invokes the subroutine sizeof(·) to retrieve the number of bytes needed to store these numbers. It uses this information to compute the total number of bytes that would be needed to store the table with the ROW and CLUSTER layout, respectively (variables $t_r$ and $t_c$). Then, it selects the layout that leads to maximum compression.
If the condition in line 2 fails, then either the ROW or the COLUMN layout can be selected. An exact computation would be too expensive given the size of the table. Therefore, we always choose COLUMN, since ROW cannot be compressed with RLE.
Besides choosing the best layout, Algorithm 1 also returns the maximum number of bytes needed to store the values in the two fields of the table ($m_1$ and $m_2$) and, optionally, the number of bytes needed to store the cluster size ($m_3$; this last value is only needed for CLUSTER). The reason for doing so is that it would be wasteful to use four- or eight-byte integers to store small IDs. In the worst case, we assume that all IDs in both fields can be stored with five bytes, which means that we can store up to $2^{40} - 1$ terms. We decided to use byte-wise compression rather than bit-wise compression because the latter does not appear to be worthwhile [63]. Note that more complex compression schemes could also be used (e.g., VByte [94]), but this should be seen as future work.
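The benefit of fixed byte widths can be illustrated with a small sketch for the ROW layout: once every row takes the same number of bytes, row i starts at a known offset and the table can be searched with binary search. The helper names and the big-endian byte order are assumptions made for this example and are not taken from the actual implementation.

```python
def write_rows(pairs, w1, w2):
    """Serialize sorted pairs using w1 bytes for the first field and w2 for the second."""
    buf = bytearray()
    for a, b in pairs:
        buf += a.to_bytes(w1, "big") + b.to_bytes(w2, "big")
    return bytes(buf)


def read_row(buf, i, w1, w2):
    """Random access: row i starts at byte offset i * (w1 + w2)."""
    start = i * (w1 + w2)
    a = int.from_bytes(buf[start:start + w1], "big")
    b = int.from_bytes(buf[start + w1:start + w1 + w2], "big")
    return a, b


def find(buf, n, w1, w2, pair):
    """Binary search over n fixed-width rows; returns the row index or -1."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if read_row(buf, mid, w1, w2) < pair:
            lo = mid + 1
        else:
            hi = mid
    if lo < n and read_row(buf, lo, w1, w2) == pair:
        return lo
    return -1


pairs = [(1, 300), (2, 5), (2, 70000), (9, 1)]
buf = write_rows(pairs, 1, 3)  # 1 byte for the first field, 3 for the second
```

Here the four rows take 16 bytes in total, whereas eight-byte integers for both fields would take 64.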
The tuple returned by selectlayout contains the information necessary to properly read the content of the table from the byte stream. The first field is the chosen layout, while the other fields are the number of bytes that should be used to store the entries of the table. We store this tuple both in NM (in one of the $m_*$ fields) and at the beginning of the byte stream.
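The selection procedure described in this section can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration: the cost formulas for $t_r$ and $t_c$, the tie-breaking rule, the fixed value of υ (32 here, whereas it is determined by micro-benchmarks), and the use of the five-byte worst case for COLUMN tables are all assumptions made for the example. The sketch also assumes a non-empty table.

```python
def sizeof(value):
    """Number of bytes needed to store a non-negative integer (at least one)."""
    return max(1, (value.bit_length() + 7) // 8)


def select_layout(table, tau=1_000_000, upsilon=32):
    """Return (layout, bytes for field 1, bytes for field 2, bytes for group size).

    `table` is a non-empty list of pairs sorted by the first field.
    """
    n = len(table)
    groups = {}  # first term -> size of its group
    for first, _ in table:
        groups[first] = groups.get(first, 0) + 1

    if n < tau and len(groups) < upsilon:
        # Small table: compare the ROW and CLUSTER layouts exactly.
        b1 = sizeof(max(a for a, _ in table))   # bytes for the largest first term
        b2 = sizeof(max(b for _, b in table))   # bytes for the largest second term
        b3 = sizeof(max(groups.values()))       # bytes for the largest group size
        t_r = n * (b1 + b2)                     # assumed cost of ROW
        t_c = len(groups) * (b1 + b3) + n * b2  # assumed cost of CLUSTER
        if t_c < t_r:
            return ("CLUSTER", b1, b2, b3)
        return ("ROW", b1, b2, None)

    # Large table or too many groups: always COLUMN, with worst-case widths (assumption).
    return ("COLUMN", 5, 5, None)
```

For instance, a table with a single group of 300 pairs is assigned the CLUSTER layout, whereas a table with ten singleton groups is assigned ROW because storing the group sizes would cost more than it saves.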

Table Pruning
With Algorithm 1, the system adapts to the KG while storing a single table. We discuss two other forms of compression that consider multiple tables and decide whether some tables should be skipped or stored in aggregated form.
On-the-fly reconstruction (OFR). Every table in one stream $T_x$ maps to another table in $T'_x$ where the first column is swapped with the second column. If the tables are sufficiently small, one of them can be reconstructed on-the-fly from the other whenever needed. While this operation introduces some computational overhead, the saving in terms of space may justify it. Furthermore, the overhead [...] We refer to this strategy as on-the-fly reconstruction (OFR). If the user selects it at loading time, the system will not store any binary table in $T'_x$ which has fewer than η rows, η being a value passed by the user (the default value is 20, determined after microbenchmarking).
Aggregate Indexing. Finally, we can construct aggregate indices to further reduce the storage space. The usage of aggregate indices is not novel for KG storage [93]. Here, we limit their usage to the tables in $T'_r$ if they lead to a storage space reduction. To illustrate the main idea, consider a generic table $t$ that contains the set of tuples $F'_r(\mathit{isA})$. This table stores all the ⟨object, subject⟩ pairs of the triples with the predicate isA. Since there are typically many more instances than classes, the first column of $t$ (the classes) will contain many duplicate values. If we range-partition $t$ with the first field, then we can identify a copy of the values in the second field of $t$ in the partitions of tables in $T'_d$ where the first term is isA. With this technique, we avoid storing the same sequence of values twice and instead store a pointer to the partition in the other table.
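The on-the-fly reconstruction strategy described above can be sketched as follows. The class and function names are hypothetical, and tables are plain Python lists rather than byte streams; the sketch only shows that swapped tables with fewer than η rows are not stored and are instead rebuilt from their counterpart when requested.

```python
def reconstruct_swapped(table):
    """Rebuild the table with swapped columns from its stored counterpart."""
    return sorted((b, a) for a, b in table)


class PrunedStore:
    """Keeps every table, but stores the swapped copy only for tables with >= eta rows."""

    def __init__(self, tables, eta=20):
        self.tables = tables  # key -> sorted list of pairs
        self.swapped = {
            key: reconstruct_swapped(table)
            for key, table in tables.items()
            if len(table) >= eta
        }

    def get_swapped(self, key):
        if key in self.swapped:
            return self.swapped[key]  # stored copy
        # Small table: pay a sort at lookup time instead of storing a second copy.
        return reconstruct_swapped(self.tables[key])
```

The trade-off is visible in `get_swapped`: small tables cost a sort on each access, while larger tables are read directly from the stored copy.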

EVALUATION
Trident is developed in C++, is freely available, and works under Windows, Linux, and macOS. Trident is also released in the form of a Docker image. The user can interact via command line, web interface, or HTTP requests according to the SPARQL standard.
Integration with other systems. Since our system offers low-level primitives, we integrated it with the following engines using simple wrappers, so that we can evaluate it in multiple scenarios:
• RDF3X [63]. RDF3X is one of the fastest and most well-known SPARQL engines. We replaced its storage layer with ours so that we can reuse its SPARQL operators and query optimizations.
• SNAP [50]. Stanford Network Analysis Platform (SNAP) is a high-performance open-source library to execute over 100 different graph algorithms. As with RDF3X, we removed the SNAP storage layer and added an interface to our own engine.
• VLog [89]. VLog is one of the most scalable Datalog reasoners. We implemented an interface allowing VLog to reason using our system as the underlying database.
We also implemented a native procedure to answer basic graph patterns (BGPs) that applies greedy query optimization based on cardinalities and uses merge joins, falling back to index loop joins when merge joins cannot be used.
Testbed. We used a Linux machine (kernel 3.10, GCC 6.4, page size 4k) with dual Intel E5-2630v3 eight-core CPUs of 2.4 GHz, 64 GB of memory, and two 4TB SATA hard disks in RAID-0 mode. The commercial value is well below $5K. We compared against RDF3X and SNAP with their native storages, TripleBit [97], an in-memory state-of-the-art RDF database (in contrast to RDF3X, which uses disks), and SYSTEM_A, a widely used commercial SPARQL engine. As inputs, we considered a selection of real-world and artificial KGs, and other non-KG graphs from SNAP [49] (see Table 2). Trident was configured to use the B+Tree for NM and table pruning was disabled, unless otherwise stated.

Lookups
During the loading procedure, Trident applies Algorithm 1 to determine the best layout for each table. Figure 3a shows the number of tables of each type for the KGs. The vast majority of tables are stored with either the ROW or the CLUSTER layout. Only a few tables are stored with the COLUMN layout. These are mostly the ones in the $T_r$ and $T'_r$ byte streams. It is interesting to note that the distribution of layouts differs across KGs. For instance, with LUBM the number of row tables is twice the number of cluster tables. In contrast, with Wikidata there are more cluster tables than row ones. These differences show to what extent Trident adapted its physical storage to the structure of the KG.
One key operation of our system is to retrieve the answers of simple triple patterns. First, we generated all possible triple patterns that return non-empty answers from YAGO2S. We considered five types of patterns. Patterns of type 0 are full scans; patterns of type 2 contain one constant and two variables (e.g., X, type, Y), while patterns of type 4 contain two constants and one variable (e.g., X, type, person). These patterns are answered with edg_*. Patterns of type 1 request an aggregated version of a full scan (e.g., retrieve all subjects), while patterns of type 3 request an aggregation where the pattern contains one constant (e.g., return all objects of the predicate type). These two patterns are answered with grp_*.
Figure 3: Statistics using various layouts/configurations and runtimes of triple pattern lookups. Panel (c): size of the database with Trident with/without optimizations.
The number, types of patterns, and average number of answers per type is reported in Table 3. The first column reports the type of pattern and the orderings we can apply when we retrieve it. The second column reports an example pattern of this type. The third column contains the number of different queries that we can construct of this type. The fourth column reports the average number of answers that we get if we execute queries of this type.
For example, the first row describes the pattern of type 0, which is a full scan. For this type of pattern, we can retrieve the answers with all the orderings in R. There is only one possible query of this type (column 3) and if we execute it then we obtain about 76M answers (column 4). Patterns of type 1 correspond to full aggregated scans. An example pattern of this type is shown in the second row. If this query is executed, the system will return the list of all subjects with the count of triples that share each subject. With this input, this query will return about 8M results (i.e., the number of subjects). We can construct a similar query if we consider the variables in the second or third position. Details for these two cases are reported in the third and fourth rows.
Patterns of type 2 have one constant and two variables. As before, the constant can appear in three positions. Note that in this case we can construct many more queries by using different constants. For instance, we can construct 8.6M queries if the constant appears as subject, and 99 if it appears as predicate. Table 3 reports the same details for queries of type 3 and 4. By testing our system on all these types of patterns, we are effectively evaluating the performance over all possible queries of these types which would return non-empty answers.
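The five pattern types can be summarized with a small sketch. The function names are hypothetical, variables are represented with `None`, and the aggregation helper only mimics the kind of answer returned for a type 1 pattern (all subjects with the count of triples that share each subject); it is not the grp primitive itself.

```python
from collections import Counter


def pattern_type(s, p, o, aggregated=False):
    """Classify a triple pattern; None stands for a variable."""
    constants = sum(term is not None for term in (s, p, o))
    if constants == 0:
        return 1 if aggregated else 0  # (aggregated) full scan
    if constants == 1:
        return 3 if aggregated else 2  # one constant, two variables
    if constants == 2 and not aggregated:
        return 4                       # two constants, one variable
    raise ValueError("pattern not covered by the five types")


def aggregate_subjects(triples):
    """Type 1 example: every subject with the number of triples that share it."""
    counts = Counter(s for s, _, _ in triples)
    return sorted(counts.items())
```

For example, `pattern_type(None, "type", None)` is a type 2 pattern, and the same pattern with `aggregated=True` is of type 3.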
We used the primitives to retrieve the answers for these patterns with various configurations of our system, and compared against RDF3X, which was the competing system with the fastest runtimes. The median warm runtimes of all executions are reported in Figure 3b.
The row "Default" reports the results with the adaptive storage selected by Algorithm 1 but without table pruning. The rows "With OFR" and "With AGGR" use Algorithm 1 together with the two table pruning techniques discussed in Section 5.3, respectively. The rows "Only ROW (COLUMN)" use only the ROW (respectively, COLUMN) layout; CLUSTER is not competitive alone due to the lack of binary search. From these results, we see that if the two pruning strategies are enabled, then the runtimes increase, especially with OFR. This was expected since these two techniques trade speed for space. Their benefit is that they reduce the size of the database, as shown in Figure 3c. OFR is particularly effective, and the pruning techniques can reduce the size by 35%. Given the slower lookups, they should only be used if space is critical. The ROW layout returns competitive performance if used alone, but then the database size is about 9% larger due to the suboptimal compression. Figure 3c also reports the size of the databases with the other systems as reference. Note that the reported numbers for Trident do not include the size of the dictionary (764MB). This size should be added to the reported numbers for a fair comparison with the other systems' databases.
A comparison against RDF3X shows that the latter is faster with full scans (patterns of type 0) because our approach has to visit more tables stored with different configurations. However, our approach has comparable performance with the second pattern type and performs significantly better when the scan is limited to a single table, with, in the best case, improvements of more than two orders of magnitude (pattern 3). Note that in contexts like SPARQL query answering, patterns that contain at least one constant are much more frequent than full scans (e.g., see Table 2 of [76]).

SPARQL
Table 4 reports the average of five cold and warm runtimes with our system and with other state-of-the-art engines. For LUBM, DBPedia, Uniprot, and BTC2012, we considered queries used to evaluate previous systems [97]. For Wikidata, we designed five example queries of various complexity looking at published examples. The queries are reported in Appendix A. Unfortunately, we could not load Wikidata and BTC2012 with SYSTEM_A due to raised exceptions during the loading phase.
We can make a few observations from the obtained results. First, a direct comparison against TripleBit is problematic because TripleBit sometimes crashed or returned wrong results (verified by manual inspection). Looking at the other systems, we observe that our approach returned the best cold runtimes for 20 out of 25 queries, counting both the executions with our native SPARQL engine and the integration with RDF3X. If we compare the warm runtimes, our system is also faster 20 out of 25 times. Furthermore, we observe that Trident/N is faster than Trident/R mostly with selective queries that require only a few joins. Otherwise, Trident/R is faster. The reason is that RDF3X uses a sophisticated query optimizer that builds multiple plans in a bottom-up fashion. This procedure is costly if applied to simple queries, but it pays off for more complex ones because it can detect a better execution plan.
Graph analytics. Algorithms for graph analytics are used for path analysis (e.g., find the shortest paths), community analysis (e.g., triangle counting), or to compute centrality metrics (e.g., PageRank). They frequently use the primitives pos_* and count to traverse the graph or to obtain the nodes' degree. For these experiments, we used the sorted list as NODEMGR since these algorithms are node-centric. We selected ten well-known algorithms: HITS and PageRank compute centrality metrics; Breadth First Search (BFS) performs a search; MOD computes the modularity of the network, which is used for community detection; Triangle Counting counts all triangles; Random Walks extracts random paths; MaxWCC and MaxSCC compute the largest weakly and strongly connected components, respectively; Diameter computes the diameter of the graph, while ClustCoeff computes the clustering coefficient.
We executed these algorithms using the original SNAP library and in combination with our engine. Note that the implementation of the algorithms is the same; only the storage changes. Table 5 reports the runtimes. From it, we see that our engine is faster in most cases. It is only with random walks that our approach is slower. From these results, we conclude that our approach leads to competitive runtimes with this type of computation as well.
Reasoning and Learning. We also tested the performance of our system for rule-based reasoning. In this task, rules are used to materialize all possible derivations from the KG. First, we performed reasoning with VLog on top of Trident, using LUBM and two popular rulesets (LUBM-L and LUBM-LE) [60,89]. Then, we repeated the process with the native storage of VLog. The runtimes, reported in Table 6, show that our engine improves the performance (48% faster in the best case).
Finally, we considered statistical relational learning as another class of problems that could benefit from our engine. These techniques associate each entity and relation label in the KG to a numerical vector (called embedding) and then learn optimal values for the embeddings so that truth values of some unseen triples can be computed via algebraic operations on the vectors.
We implemented TransE [16], one of the most popular techniques of this kind, on top of Trident and compared its training runtime against that of OpenKE [35], a state-of-the-art library. Table 6 reports the runtime to train a model using as input a subset of YAGO which was used in other works [69]. The results indicate competitive performance also in this case.
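As a reminder of the kind of algebraic operation involved, TransE models a relation as a translation in the embedding space, so that a triple (h, r, t) is considered plausible when h + r is close to t. The following minimal sketch computes the L1 variant of this score on plain Python lists; the function name is hypothetical and the code is unrelated to the implementation evaluated here.

```python
def transe_score(h, r, t):
    """L1 distance between h + r and t; a lower score means a more plausible triple."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


head = [1, 0]
relation = [0, 1]
good_tail = [1, 1]  # equals head + relation
bad_tail = [0, 0]
```

Training adjusts the vectors so that triples in the KG obtain lower scores than corrupted ones.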

Scalability, updates, and bulk loading
We executed the five LUBM queries using our native SPARQL procedure on KGs of different sizes (between 1B and 100B triples). Due to lack of disk space on the first machine, we used another machine with 256GB of RAM for these experiments (which also costs <$5K). The warm runtimes are shown in Table 7. The runtime of the first two queries remains constant. This was expected since their selectivity does not decrease as the size of the KG increases. In contrast, the runtime of the other queries increases as the KG becomes larger.
Figure 4 shows the runtime of four SPARQL queries after we added five new sets of triples to the KG, merged them, removed five other sets of triples, and merged again. Each set of added triples does not contain triples contained in previous updates. Similarly, each set of removed triples contains only triples in the original KG and not in previous updates. We selected the queries so that the content of the updates is considered. We observe that the runtime increases (because more deltas are considered) and that it drops after they are merged into a single update.
Figure 5a reports the runtime to process five additions of ca. 1M novel triples, one merge, five removals of ca. 1M existing triples, and another merge. As we can see, with both datasets the runtime is much smaller than re-creating the database from scratch (>1h). Processing the updates is faster with LUBM8k than with Wikidata because the updates with the latter KG contained 4× more new entities.
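The effect of pending updates on query runtime can be illustrated with a toy model. The class below is a hypothetical sketch, not the update mechanism of the system: it keeps every addition or removal as a separate delta that each lookup must consult, and assumes that a merge collapses all pending deltas into a single combined one.

```python
class DeltaStore:
    """Base set of triples plus a list of pending deltas (oldest first)."""

    def __init__(self, triples):
        self.base = set(triples)
        self.deltas = []  # each delta is a pair (added, removed) of sets

    def add(self, triples):
        self.deltas.append((set(triples), set()))

    def remove(self, triples):
        self.deltas.append((set(), set(triples)))

    def contains(self, triple):
        present = triple in self.base
        for added, removed in self.deltas:  # one check per pending delta
            if triple in added:
                present = True
            elif triple in removed:
                present = False
        return present

    def merge(self):
        """Collapse all pending deltas into a single combined delta."""
        net_added, net_removed = set(), set()
        for added, removed in self.deltas:
            net_added = (net_added - removed) | added
            net_removed = (net_removed - added) | removed
        self.deltas = [(net_added, net_removed)] if self.deltas else []
```

In this model a lookup inspects one delta per pending update, so its cost grows with every addition or removal and falls back to a single check after `merge()`, which mirrors the trend described for Figure 4.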
In Figure 5b, we show the trace of the resource consumption during the loading of LUBM80k (10B triples). We plot the CPU (100% means all physical cores are used) and RAM usage. From it, we see that most of the runtime is spent on dictionary encoding, sorting the edges, and creating the binary tables.
In general, Trident has competitive loading times. Figure 5c shows the loading time of our system and of other systems on LUBM1k. With larger KGs, RDF3X becomes significantly slower than Trident (e.g., it takes ca. 7 hours to load LUBM8k on our smaller machine, while Trident needs 1 hour and 18 minutes) due to its lack of parallelism. TripleBit is an in-memory database and thus cannot scale to some of our largest inputs. In some of our largest experiments, Trident loaded LUBM400k (50B triples) in about 48 hours, a size that other systems cannot handle. If the graph is already encoded, then loading is faster. We loaded the Hyperlink Graph [58], a graph with 128B edges, in about 13 hours (with the larger machine), and the database required 1.4TB of space.

RELATED WORK
In this section, we describe the works most relevant to our problem. For a broader introduction to graph and RDF processing, we refer the reader to existing surveys [3,24,57,59,68,77,95]. Current approaches can be classified either as native (i.e., designed for this task) or non-native (i.e., adapting pre-existing technology). Native engines have better performance [17] but fewer functionalities [17,23]. Our approach belongs to the first category.
Research on native systems has focused on advanced indexing structures. The most popular approach is to extensively materialize a dedicated index for each permutation. This was initially proposed by YARS [43], and further explored in RDF3X [10,12,13,26,63]. Hexastore [93] also proposes a six-way permutation-based indexing, but implements it using hierarchical in-memory Java hash maps. Instead, we use on-disk data structures and therefore can scale to larger inputs. Recently, other types of indices, based on 2D or 3D bit matrices [8,97], hash maps [60], or data structures used for graph matching approaches [47,99], have been proposed. Compared with these works, our approach uses a novel arrangement of data structures and uses multiple layouts to store the subgraphs.
Non-native approaches offload the indexing to external engines (mostly DBMS). Here, the challenge is to find efficient partitioning/replication criteria to exploit the multi-table nature of relational engines. Existing partitioning criteria group the triples either by predicates [2,38,51,56,62,72,73], clusters of predicates [21], or by using other entity-based splitting criteria [17]. The various partitioning schemes are designed to create few tables to meet the constraints of relational engines [81]. Our approach differs because we group the edges at a much finer granularity, generating a number of binary tables that is too large for such engines.
Some popular commercial systems for graph processing are Virtuoso [67], BlazeGraph [85], Titan [22], Neo4J [61], Sparksee [83], and InfiniteGraph [66]. We compared Trident against one such leading commercial system and observed that ours has very competitive performance; other comparisons are presented in [5,81]. In general, a direct comparison is challenging because these systems provide end-to-end solutions tailored for specific tasks, while we offer general-purpose low-level APIs.
Finally, many works have focused on distributed graph processing [4,6,9,33,36,38,48,70,78,98]. We do not view these approaches as competitors since they operate on different hardware architectures. Instead, we view ours as a potential complement that can be employed by them to speed up distributed processing.
In our approach, we use numerical IDs to store the terms. This form of compression has been the subject of some studies. First, some systems use the hash code of the strings as IDs [37,38,40]. Most systems, however, use counters to assign new IDs [18,42,43,53,63]. It has been shown in [88] that assigning some IDs rather than others can improve query answering due to data locality. It is straightforward to include these procedures in our system. Finally, some approaches focused on compressing RDF collections [54] and on the management of the strings [11,55,82]. We adopted a conventional approach to store such strings. Replacing our dictionary with these proposals is an interesting direction for future work.

CONCLUSION
We proposed a novel centralized architecture for the low-level storage of very large KGs which provides both node- and edge-centric access to the KG. One of the main novelties of our approach is that it exhaustively decomposes the storage of the KGs in many binary tables, serializing them in multiple byte streams to facilitate inter-table scanning, akin to permutation-based approaches. Another main novelty is that the storage effectively adapts to the KG by choosing a different layout for each table depending on the graph topology. Our empirical evaluation in multiple scenarios shows that our approach offers competitive performance and that it can load very large graphs without expensive hardware.
Future work is necessary to apply or adapt our architecture for additional scenarios. In particular, we believe that our system can be used to support Triple Pattern Fragments [91], an emerging paradigm to query RDF datasets, and GraphQL [44], a more complex graph query language. Finally, it is also interesting to study whether the integration of additional compression techniques, like locality-based dictionary encoding [88] or HDT [25], can further improve the runtime and/or reduce the storage space.