Triples
A module to handle triples.
A knowledge graph can be thought of as a collection of facts, where each individual fact is represented as a triple of a head entity \(h\), a relation \(r\) and a tail entity \(t\). In order to operate efficiently on them, most of PyKEEN assumes that the set of entities has been transformed so that they are identified by integers from \([0, \ldots, E)\), where \(E\) is the number of entities. A similar assumption is made for relations, whose indices come from \([0, \ldots, R)\), where \(R\) is the number of relations.
This module includes classes and methods for loading and transforming triples from other formats into the index-based format, as well as advanced methods for creating leakage-free training and test splits and analyzing data distribution.
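For illustration, consider how a toy graph is transformed into this index-based format; the labels and mappings below are made up for the example:

    import numpy as np

    labeled = np.array([
        ["brussels", "locatedIn", "belgium"],
        ["belgium", "partOf", "eu"],
    ])

    # toy mappings; PyKEEN constructs these automatically
    entity_to_id = {"belgium": 0, "brussels": 1, "eu": 2}
    relation_to_id = {"locatedIn": 0, "partOf": 1}

    # the index-based format: a 3-column integer matrix, one row per triple
    mapped = np.array([
        [entity_to_id[h], relation_to_id[r], entity_to_id[t]]
        for h, r, t in labeled
    ])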
Basic Handling
The most basic information about a knowledge graph is stored in KGInfo. It contains the minimal information needed to create a knowledge graph embedding: the number of entities and relations, as well as information about the use of inverse relations (which artificially increases the number of relations).
To store information about triples, there is the CoreTriplesFactory. It extends KGInfo by additionally storing a set of index-based triples in the form of a 3-column matrix, and allows storing arbitrary metadata in the form of a (JSON-compatible) dictionary. It also adds support for serialization, i.e., saving and loading to a file, as well as filter operations and utility methods to create dataframes for further (external) processing.
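As a sketch of how this fits together, assuming the CoreTriplesFactory.create() factory method; the triples and metadata are toy values:

    import torch
    from pykeen.triples import CoreTriplesFactory

    # ID-based triples as a 3-column matrix of (head, relation, tail) indices
    mapped_triples = torch.as_tensor(
        [[0, 0, 1], [1, 1, 2], [0, 1, 2]],
        dtype=torch.long,
    )
    factory = CoreTriplesFactory.create(
        mapped_triples,
        metadata={"source": "toy example"},  # arbitrary JSON-compatible metadata
    )
    print(factory.num_entities, factory.num_relations, factory.num_triples)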
Finally, there is TriplesFactory, which adds a mapping of string-based entity and relation names to IDs. This class also provides rich factory methods that allow creating mappings from string-based triples alone, loading triples from sufficiently similar external file formats such as TSV or CSV, and converting back and forth between label-based and index-based formats. It also extends serialization to ensure that the string-to-index mappings are included along with the files.
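For example, a minimal sketch of the label-based workflow ("triples.tsv" is a placeholder path to a file of string triples):

    from pykeen.triples import TriplesFactory

    # builds the label-to-ID mappings and the ID-based triples in one step
    tf = TriplesFactory.from_path("triples.tsv")
    print(tf.entity_to_id)    # mapping from entity labels to integer IDs
    print(tf.relation_to_id)  # mapping from relation labels to integer IDs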
Splitting
To evaluate knowledge graph embedding models, we need training, validation, and test sets. In classical machine learning settings, we often have a large number of independent samples and can use simple random sampling. In graph learning settings, however, we are interested in learning relational patterns, i.e. patterns between different samples. In these settings, we need to be more careful.
For example, to evaluate models in a transductive setting, we need to make sure that all entities and relations of the
triples used in the evaluation are also present in the training triples.
PyKEEN includes methods to construct splits that ensure the presence of all entities and relations in the training part.
Those can be found in pykeen.triples.splitting.
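For example, a minimal sketch using TriplesFactory.split(); the ratios and the random_state keyword are assumptions about the interface, and "triples.tsv" is a placeholder path:

    from pykeen.triples import TriplesFactory

    tf = TriplesFactory.from_path("triples.tsv")
    # every entity and relation is guaranteed to occur in the training part
    training, testing, validation = tf.split([0.8, 0.1, 0.1], random_state=42)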
In addition, knowledge graphs may contain inverse relationships, such as a predecessor and successor relationship. In this case, careless splitting can lead to test leakage, where a model that merely checks whether the inverse relationship exists in training can achieve deceptively strong results, inflating scores without learning meaningful relational patterns.
PyKEEN includes methods to check knowledge graph splits for leakage, which can be found in pykeen.triples.leakage.
In pykeen.triples.remix, we offer methods to examine the effects of a particular choice of splits.
Analysis
We also provide methods for analyzing knowledge graphs. These include simple statistics, such as the number of entities or relations (in pykeen.triples.stats), as well as advanced analysis of relational patterns (in pykeen.triples.analysis).
Functions
- Get ID-based triples either directly, or from a factory.
Classes
- Base class for training instances.
- Triples and mappings to their indices for LCWA.
- Training instances for the sLCWA.
- An object storing information about the number of entities and relations.
- Create instances from ID-based triples.
- Create instances given the path to triples.
- Create multi-modal instances given the path to triples.
Class Inheritance Diagram
Utilities
Instance creation utilities.
- compute_compressed_adjacency_list(mapped_triples: Tensor, num_entities: int | None = None) tuple[Tensor, Tensor, Tensor] [source]
Compute compressed undirected adjacency list representation for efficient sampling.
The compressed adjacency list format is inspired by CSR sparse matrix format.
- Parameters:
mapped_triples (Tensor) – shape: (num_triples, 3) The ID-based triples.
num_entities (int | None) – The number of entities.
- Returns:
a tuple (degrees, offsets, compressed_adj_list) where
degrees: shape: (num_entities,)
offsets: shape: (num_entities,)
compressed_adj_list: shape: (2 * num_triples, 2)
with
adj_list[i] = compressed_adj_list[offsets[i]:offsets[i+1]]
- Return type:
tuple[Tensor, Tensor, Tensor]
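A usage sketch, assuming the function lives in pykeen.triples.utils; the toy triples are arbitrary:

    import torch
    from pykeen.triples.utils import compute_compressed_adjacency_list

    mapped_triples = torch.as_tensor([[0, 0, 1], [1, 1, 2]], dtype=torch.long)
    degrees, offsets, compressed_adj_list = compute_compressed_adjacency_list(
        mapped_triples,
    )
    # entries for entity 1; contiguous by construction of the offsets
    print(compressed_adj_list[offsets[1]:offsets[1] + degrees[1]])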
- load_triples(path: str | Path | TextIO, delimiter: str = '\t', encoding: str | None = None, column_remapping: Sequence[int] | None = None) ndarray [source]
Load triples saved as tab separated values.
- Parameters:
path (str | Path | TextIO) – The key for the data to be loaded. Typically, this will be a file path ending in .tsv that points to a file with three columns: the head, relation, and tail. This can also be used to invoke PyKEEN data importer entrypoints (see below).
delimiter (str) – The delimiter between the columns in the file.
encoding (str | None) – The encoding for the file. Defaults to utf-8.
column_remapping (Sequence[int] | None) – A remapping if the three columns do not follow the order head-relation-tail. For example, if the order is head-tail-relation, pass (0, 2, 1).
- Returns:
A numpy array representing “labeled” triples.
- Raises:
ValueError – if a column remapping was passed, but it was not a length 3 sequence
- Return type:
ndarray
Besides TSV handling, PyKEEN does not come with any importers pre-installed. A few can be found at:
bio2bel.io.pykeen
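A usage sketch, assuming the pykeen.triples.utils import path; "triples.tsv" is a placeholder file whose columns are assumed to be in head-tail-relation order:

    from pykeen.triples.utils import load_triples

    # reorder the columns into head-relation-tail while loading
    triples = load_triples("triples.tsv", column_remapping=(0, 2, 1))
    print(triples.shape)  # (num_triples, 3), string-valued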
- tensor_to_df(tensor: Tensor, **kwargs: Tensor | ndarray | Sequence) DataFrame [source]
Take a tensor of triples and make a pandas dataframe with labels.
- Parameters:
tensor (Tensor) – shape: (n, 3) The triples, ID-based and in format (head_id, relation_id, tail_id).
kwargs (Tensor | ndarray | Sequence) – Any number of additional columns. Each column needs to be of shape (n,). Reserved column names: {“head_id”, “head_label”, “relation_id”, “relation_label”, “tail_id”, “tail_label”}.
- Returns:
A dataframe with n rows, and 3 + len(kwargs) columns.
- Raises:
ValueError – If a reserved column name appears in kwargs.
- Return type:
DataFrame
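A usage sketch; the extra score column is hypothetical, and the import path is assumed:

    import torch
    from pykeen.triples.utils import tensor_to_df

    tensor = torch.as_tensor([[0, 0, 1], [1, 1, 2]], dtype=torch.long)
    scores = torch.as_tensor([0.9, 0.1])  # one value per triple, shape (n,)
    df = tensor_to_df(tensor, score=scores)
    # df has columns: head_id, relation_id, tail_id, score
    print(df)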
Triples Workflows
Splitting
Implementation of triples splitting functions.
Functions
- Split triples into clean groups.
- Normalize relative sizes.
- Compute absolute sizes of splits from given relative sizes.
Classes
- A cleanup method for ensuring that all entities are contained in the triples of the first split part.
- Clean up a triples array by randomly selecting testing triples and recalculating to minimize moves.
- Clean up a triples array (testing) with respect to another (training).
- A method for splitting triples.
- The cleanup splitter first randomly splits the triples and then cleans up.
- This splitter greedily selects training triples such that each entity is covered, and then splits the rest.
- An exception thrown when not all entities/relations are covered by triples.
Class Inheritance Diagram
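A minimal sketch of the functional interface, assuming a split() function in pykeen.triples.splitting that accepts relative ratios and a random_state keyword:

    import torch
    from pykeen.triples.splitting import split

    # toy ID-based triples; in practice these come from a triples factory
    mapped_triples = torch.randint(10, size=(100, 3))
    train, test, valid = split(mapped_triples, ratios=[0.8, 0.1, 0.1], random_state=0)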
Remixing
Remixing and dataset distance utilities.
Most datasets are distributed with a pre-defined split, but it's often not discussed how this split was created. This module contains utilities for investigating the effects of remixing pre-split datasets like pykeen.datasets.Nations.
Further, it defines a metric for the “distance” between two splits of a given dataset. Later, this will be used to map the landscape and see if there is a smooth, continuous relationship between datasets’ splits’ distances and their maximum performance.
Functions
- Remix the triples from the training, testing, and validation set.
Deterioration
Deterioration algorithm.
Functions
- Remove n triples from the reference set.
Generation
Utilities for generating triples.
Functions
- Generate random triples in a torch tensor.
- Generate a triples factory with random triples.
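A usage sketch, assuming generate_triples_factory() accepts the sizes as keyword arguments:

    from pykeen.triples.generation import generate_triples_factory

    # a random factory is handy for smoke tests and benchmarks
    tf = generate_triples_factory(num_entities=50, num_relations=5, num_triples=500)
    print(tf.num_triples)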
Analysis
Analysis utilities for (mapped) triples.
- add_entity_labels(*, df: DataFrame, add_labels: bool, label_to_id: Mapping[str, int] | None = None, triples_factory: TriplesFactory | None = None) DataFrame [source]
Add entity labels to a dataframe.
- Parameters:
df (DataFrame)
add_labels (bool)
label_to_id (Mapping[str, int] | None)
triples_factory (TriplesFactory | None)
- Return type:
DataFrame
- add_relation_labels(df: DataFrame, *, add_labels: bool, label_to_id: Mapping[str, int] | None = None, triples_factory: TriplesFactory | None = None) DataFrame [source]
Add relation labels to a dataframe.
- Parameters:
df (DataFrame)
add_labels (bool)
label_to_id (Mapping[str, int] | None)
triples_factory (TriplesFactory | None)
- Return type:
DataFrame
- entity_relation_co_occurrence(mapped_triples: Tensor) DataFrame [source]
Calculate entity-relation co-occurrence.
- Parameters:
mapped_triples (Tensor) – The ID-based triples.
- Returns:
A dataframe with columns ( entity_id | relation_id | type | count )
- Return type:
DataFrame
- get_entity_counts(mapped_triples: Tensor) DataFrame [source]
Create a dataframe of entity frequencies.
- Parameters:
mapped_triples (Tensor) – shape: (num_triples, 3) The mapped triples.
- Returns:
A dataframe with columns ( entity_id | type | count )
- Return type:
DataFrame
- get_relation_counts(mapped_triples: Tensor) DataFrame [source]
Create a dataframe of relation frequencies.
- Parameters:
mapped_triples (Tensor) – shape: (num_triples, 3) The mapped triples.
- Returns:
A dataframe with columns ( relation_id | count )
- Return type:
DataFrame
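A usage sketch for the two frequency helpers; the import path is assumed and the toy triples are arbitrary:

    import torch
    from pykeen.triples.analysis import get_entity_counts, get_relation_counts

    mapped_triples = torch.as_tensor(
        [[0, 0, 1], [1, 0, 2], [0, 1, 2]],
        dtype=torch.long,
    )
    print(get_entity_counts(mapped_triples))    # per-entity counts, by head/tail role
    print(get_relation_counts(mapped_triples))  # per-relation counts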
- get_relation_functionality(mapped_triples: Collection[tuple[int, int, int]], add_labels: bool = True, label_to_id: Mapping[str, int] | None = None) DataFrame [source]
Calculate relation functionalities.
- Parameters:
mapped_triples (Collection[tuple[int, int, int]]) – A collection of ID-based triples.
add_labels (bool)
label_to_id (Mapping[str, int] | None)
- Returns:
A dataframe with columns ( functionality | inverse_functionality )
- Return type:
DataFrame
- relation_cardinality_types(mapped_triples: Collection[tuple[int, int, int]], add_labels: bool = True, label_to_id: Mapping[str, int] | None = None) DataFrame [source]
Determine the relation cardinality types.
The possible types are given in relation_cardinality_types.
Note
In the current implementation, we have by definition
\[1 = \sum_{\text{type}} \operatorname{conf}(\text{relation}, \text{type})\]
Note
These relation types are also mentioned in [wang2014]. However, the paper provides neither details on their definition nor any code. Thus, their exact procedure is unknown and may not coincide with this implementation.
- Parameters:
mapped_triples (Collection[tuple[int, int, int]]) – A collection of ID-based triples.
add_labels (bool)
label_to_id (Mapping[str, int] | None)
- Returns:
A dataframe with columns ( relation_id | relation_type )
- Return type:
DataFrame
- relation_injectivity(mapped_triples: Collection[tuple[int, int, int]], add_labels: bool = True, label_to_id: Mapping[str, int] | None = None) DataFrame [source]
Calculate “soft” injectivity scores for each relation.
- Parameters:
mapped_triples (Collection[tuple[int, int, int]]) – A collection of ID-based triples.
add_labels (bool)
label_to_id (Mapping[str, int] | None)
- Returns:
A dataframe with one row per relation, its number of occurrences and head / tail injectivity scores.
- Return type:
DataFrame
- relation_pattern_types(mapped_triples: Collection[tuple[int, int, int]]) DataFrame [source]
Categorize relations based on patterns from RotatE [sun2019].
The relation classifications are based upon checking whether the corresponding rules hold with sufficient support and confidence. By default, we do not require a minimum support, but we do require a relatively high confidence.
The following four non-exclusive classes for relations are considered:
symmetry
anti-symmetry
inversion
composition
This method generally follows the terminology of association rule mining. The patterns are expressed as
\[X_1 \land \cdots \land X_k \implies Y\]
where \(X_i\) is of the form \(r_i(h_i, t_i)\), and some of the \(h_i / t_i\) may re-occur in other atoms. The support of a pattern is the number of distinct instantiations of all variables on the left-hand side. The confidence is the proportion of these instantiations for which the right-hand side also holds.
- Parameters:
mapped_triples (Collection[tuple[int, int, int]]) – A collection of ID-based triples.
- Returns:
A dataframe of relation categorization
- Return type:
DataFrame
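To make the support/confidence computation concrete, the following standalone sketch scores the symmetry pattern \(r(h, t) \implies r(t, h)\) with plain Python sets; it mirrors the definitions above but is not PyKEEN's implementation:

    from collections import defaultdict

    # toy ID-based triples: relation 0 is symmetric, relation 1 is not
    triples = {(0, 0, 1), (1, 0, 0), (0, 1, 2)}

    pairs_per_relation = defaultdict(set)
    for h, r, t in triples:
        pairs_per_relation[r].add((h, t))

    for r, pairs in sorted(pairs_per_relation.items()):
        support = len(pairs)  # distinct instantiations of the left-hand side
        confidence = sum((t, h) in pairs for h, t in pairs) / support
        print(f"relation {r}: support={support}, confidence={confidence:.2f}")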