Triples

A module to handle triples.

A knowledge graph can be thought of as a collection of facts, where each individual fact is represented as a triple of a head entity \(h\), a relation \(r\), and a tail entity \(t\). In order to operate efficiently on them, most of PyKEEN assumes that the set of entities has been transformed so that entities are identified by integers from \([0, \ldots, E)\), where \(E\) is the number of entities. A similar assumption is made for relations, whose indices come from \([0, \ldots, R)\), where \(R\) is the number of relations.

This module includes classes and methods for loading and transforming triples from other formats into the index-based format, as well as advanced methods for creating leakage-free training and test splits and analyzing data distribution.

Basic Handling

The most basic information about a knowledge graph is stored in KGInfo. It contains the minimal information needed to create a knowledge graph embedding: the number of entities and relations, as well as information about the use of inverse relations (which artificially increases the number of relations).
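
For example, a minimal sketch of constructing such an object directly (assuming KGInfo can be imported from pykeen.triples; the constructor arguments follow the class summary below):

from pykeen.triples import KGInfo

# 14 entities, 55 relations, no inverse relations
kg_info = KGInfo(num_entities=14, num_relations=55, create_inverse_triples=False)

# with inverse relations, each relation additionally gets an inverse,
# artificially increasing the number of relations used by the model
kg_info_inv = KGInfo(num_entities=14, num_relations=55, create_inverse_triples=True)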

To store information about triples, there is the CoreTriplesFactory. It extends KGInfo by additionally storing a set of index-based triples in the form of a 3-column matrix, and it allows storing arbitrary metadata in the form of a (JSON-compatible) dictionary. It also adds support for serialization, i.e. saving to and loading from a file, as well as filter operations and utility methods to create dataframes for further (external) processing.
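
As a minimal sketch, such a factory can be built from a tensor of ID-based triples (treat the create() classmethod and its inference of entity/relation counts as assumptions about the exact API):

import torch

from pykeen.triples import CoreTriplesFactory

# three ID-based triples in (head_id, relation_id, tail_id) format
mapped_triples = torch.as_tensor(
    [[0, 0, 1], [1, 1, 2], [0, 1, 2]],
    dtype=torch.long,
)
tf = CoreTriplesFactory.create(
    mapped_triples=mapped_triples,
    # arbitrary JSON-compatible metadata
    metadata={"source": "toy example"},
)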

Finally, there is TriplesFactory, which adds mapping of string-based entity and relation names to IDs. This class also provides rich factory methods that allow creating mappings from string-based triples alone, loading triples from sufficiently similar external file formats such as TSV or CSV, and converting back and forth between label-based and index-based formats. It also extends serialization to ensure that the string-to-index mappings are included along with the files.
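
A minimal sketch of the label-based workflow ("triples.tsv" is a placeholder path to a file with head, relation, and tail columns):

from pykeen.triples import TriplesFactory

# build the label-to-ID mappings and the ID-based triples from a TSV file
tf = TriplesFactory.from_path("triples.tsv")

print(tf.entity_to_id)        # mapping from entity labels to integer IDs
print(tf.relation_to_id)      # mapping from relation labels to integer IDs
print(tf.mapped_triples[:5])  # ID-based triples as a 3-column LongTensor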

Splitting

To evaluate knowledge graph embedding models, we need training, validation, and test sets. In classical machine learning settings, we often have a large number of independent samples and can use simple random sampling. In graph learning settings, however, we are interested in learning relational patterns, i.e. patterns between different samples. In these settings, we need to be more careful.

For example, to evaluate models in a transductive setting, we need to make sure that all entities and relations of the triples used in the evaluation are also present in the training triples. PyKEEN includes methods to construct splits that ensure the presence of all entities and relations in the training part. Those can be found in pykeen.triples.splitting.
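
A minimal sketch using the split() method of a triples factory ("triples.tsv" is a placeholder path; the random_state keyword is an assumption):

from pykeen.triples import TriplesFactory

tf = TriplesFactory.from_path("triples.tsv")

# 80/10/10 split; all entities and relations are guaranteed to
# appear in the training part (transductive setting)
training, testing, validation = tf.split([0.8, 0.1, 0.1], random_state=42)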

In addition, knowledge graphs may contain inverse relationships, such as a predecessor and successor relationship. In this case, careless splitting can lead to test leakage, where a model that merely checks whether the inverse triple is present in the training set can achieve deceptively strong results, inflating scores without learning meaningful relational patterns. PyKEEN includes methods to check knowledge graph splits for leakage, which can be found in pykeen.triples.leakage.
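
A heavily hedged sketch, assuming the module exposes an unleak() helper operating on the factories from the split above (the name, signature, and exact semantics may differ between PyKEEN versions):

from pykeen.triples.leakage import unleak  # existence of this helper is an assumption

# derive leakage-reduced factories; n limits how many candidate
# inverse relation pairs are removed (assumed semantics)
training, testing, validation = unleak(training, testing, validation, n=100)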

In pykeen.triples.remix, we offer methods to examine the effects of a particular choice of splits.

Analysis

We also provide methods for analyzing knowledge graphs. These include simple statistics such as the number of entities or relations (in pykeen.triples.stats), as well as advanced analysis of relational patterns (pykeen.triples.analysis).
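
For instance, a small sketch computing entity frequencies with get_entity_counts() (documented in the reference below; the import path and the use of the Nations toy dataset are assumptions):

from pykeen.datasets import Nations
from pykeen.triples.analysis import get_entity_counts  # import path assumed

dataset = Nations()
# dataframe with columns ( entity_id | type | count ), cf. the reference below
df = get_entity_counts(mapped_triples=dataset.training.mapped_triples)
print(df.head())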

Functions

get_mapped_triples([x, mapped_triples, ...])

Get ID-based triples either directly, or from a factory.

Classes

Instances()

Base class for training instances.

LCWAInstances(*, pairs, compressed)

Triples and mappings to their indices for LCWA.

SLCWAInstances(*, mapped_triples[, ...])

Training instances for the sLCWA.

KGInfo(num_entities, num_relations, ...)

An object storing information about the number of entities and relations.

CoreTriplesFactory(mapped_triples, ...[, ...])

Create instances from ID-based triples.

TriplesFactory(mapped_triples, entity_to_id, ...)

Create instances given the path to triples.

TriplesNumericLiteralsFactory(*, ...)

Create multi-modal instances given the path to triples.

Variables

AnyTriples

alias of tuple[str, str, str] | Sequence[tuple[str, str, str]] | ndarray | Tensor | CoreTriplesFactory

Class Inheritance Diagram

Inheritance diagram of pykeen.triples.instances.Instances, pykeen.triples.instances.LCWAInstances, pykeen.triples.instances.SLCWAInstances, pykeen.triples.triples_factory.KGInfo, pykeen.triples.triples_factory.CoreTriplesFactory, pykeen.triples.triples_factory.TriplesFactory, pykeen.triples.triples_numeric_literals_factory.TriplesNumericLiteralsFactory

Utilities

Instance creation utilities.

compute_compressed_adjacency_list(mapped_triples: Tensor, num_entities: int | None = None) tuple[Tensor, Tensor, Tensor][source]

Compute compressed undirected adjacency list representation for efficient sampling.

The compressed adjacency list format is inspired by the CSR sparse matrix format.

Parameters:
  • mapped_triples (Tensor) – the ID-based triples

  • num_entities (int | None) – the number of entities.

Returns:

a tuple (degrees, offsets, compressed_adj_list) where

  • degrees: shape: (num_entities,)

  • offsets: shape: (num_entities,)

  • compressed_adj_list: shape: (2 * num_triples, 2)

with

adj_list[i] = compressed_adj_list[offsets[i]:offsets[i+1]]

Return type:

tuple[Tensor, Tensor, Tensor]
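
A small usage sketch on toy triples (the import path is an assumption):

import torch

from pykeen.triples.utils import compute_compressed_adjacency_list  # import path assumed

mapped_triples = torch.as_tensor([[0, 0, 1], [1, 1, 2], [0, 1, 2]], dtype=torch.long)
degrees, offsets, compressed_adj_list = compute_compressed_adjacency_list(mapped_triples)

# undirected neighborhood of entity i; equivalent to the slicing rule above,
# since degrees[i] == offsets[i + 1] - offsets[i]
i = 0
adj_i = compressed_adj_list[offsets[i] : offsets[i] + degrees[i]]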

get_entities(triples: Tensor) set[int][source]

Get all entities from the triples.

Parameters:

triples (Tensor)

Return type:

set[int]

get_relations(triples: Tensor) set[int][source]

Get all relations from the triples.

Parameters:

triples (Tensor)

Return type:

set[int]

load_triples(path: str | Path | TextIO, delimiter: str = '\t', encoding: str | None = None, column_remapping: Sequence[int] | None = None) ndarray[source]

Load triples saved as tab-separated values.

Parameters:
  • path (str | Path | TextIO) – The key for the data to be loaded. Typically, this will be a file path ending in .tsv that points to a file with three columns - the head, relation, and tail. This can also be used to invoke PyKEEN data importer entrypoints (see below).

  • delimiter (str) – The delimiter between the columns in the file

  • encoding (str | None) – The encoding for the file. Defaults to utf-8.

  • column_remapping (Sequence[int] | None) – A remapping if the three columns do not follow the order head-relation-tail. For example, if the order is head-tail-relation, pass (0, 2, 1)

Returns:

A numpy array representing “labeled” triples.

Raises:

ValueError – if a column remapping was passed, but it was not a length 3 sequence

Return type:

ndarray

Besides TSV handling, PyKEEN does not come with any importers pre-installed; a few are available as separate packages that register PyKEEN data importer entrypoints.
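
A usage sketch for a file whose three columns are in head-tail-relation order ("triples.tsv" is a placeholder path; the import path is an assumption):

from pykeen.triples.utils import load_triples  # import path assumed

# remap head-tail-relation columns into the expected head-relation-tail order
triples = load_triples("triples.tsv", column_remapping=(0, 2, 1))
print(triples.shape)  # (num_triples, 3) array of label-based triples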

tensor_to_df(tensor: Tensor, **kwargs: Tensor | ndarray | Sequence) DataFrame[source]

Take a tensor of triples and make a pandas dataframe with labels.

Parameters:
  • tensor (Tensor) – shape: (n, 3) The triples, ID-based and in format (head_id, relation_id, tail_id).

  • kwargs (Tensor | ndarray | Sequence) – Any additional number of columns. Each column needs to be of shape (n,). Reserved column names: {“head_id”, “head_label”, “relation_id”, “relation_label”, “tail_id”, “tail_label”}.

Returns:

A dataframe with n rows, and 3 + len(kwargs) columns.

Raises:

ValueError – If a reserved column name appears in kwargs.

Return type:

DataFrame
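
A short sketch attaching one extra column (the import path is an assumption):

import torch

from pykeen.triples.utils import tensor_to_df  # import path assumed

triples = torch.as_tensor([[0, 0, 1], [1, 1, 2]], dtype=torch.long)
scores = torch.as_tensor([0.92, 0.17])

# adds a "score" column of shape (n,); a reserved column name would raise a ValueError
df = tensor_to_df(triples, score=scores)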

Triples Workflows

Splitting

Implementation of triples splitting functions.

Functions

split(mapped_triples[, ratios, ...])

Split triples into clean groups.

normalize_ratios(ratios[, epsilon])

Normalize relative sizes.

get_absolute_split_sizes(n_total, ratios)

Compute absolute sizes of splits from given relative sizes.
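
The split() function summarized above can also be applied directly to a tensor of ID-based triples; a minimal sketch (toy random triples, signature per the summary above):

import torch

from pykeen.triples.splitting import split

mapped_triples = torch.randint(high=10, size=(100, 3))  # toy ID-based triples
train, test, valid = split(mapped_triples, ratios=[0.8, 0.1, 0.1])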

Classes

Cleaner()

A cleanup method for ensuring that all entities are contained in the triples of the first split part.

RandomizedCleaner()

Clean up a triples array by randomly selecting testing triples and recalculating to minimize moves.

DeterministicCleaner()

Clean up a triples array (testing) with respect to another (training).

Splitter()

A method for splitting triples.

CleanupSplitter([cleaner])

The cleanup splitter first randomly splits the triples and then cleans up.

CoverageSplitter()

This splitter greedily selects training triples such that each entity is covered and then splits the rest.

TripleCoverageError(arr[, name])

An exception thrown when not all entities/relations are covered by triples.

Class Inheritance Diagram

Inheritance diagram of pykeen.triples.splitting.Cleaner, pykeen.triples.splitting.RandomizedCleaner, pykeen.triples.splitting.DeterministicCleaner, pykeen.triples.splitting.Splitter, pykeen.triples.splitting.CleanupSplitter, pykeen.triples.splitting.CoverageSplitter, pykeen.triples.splitting.TripleCoverageError

Remixing

Remixing and dataset distance utilities.

Most datasets come with a pre-defined split, but it is often not discussed how this split was created. This module contains utilities for investigating the effects of remixing pre-split datasets like pykeen.datasets.Nations.

Further, it defines a metric for the “distance” between two splits of a given dataset. Later, this will be used to map the landscape and see if there is a smooth, continuous relationship between datasets’ splits’ distances and their maximum performance.

Functions

remix(*triples_factories, **kwargs)

Remix the triples from the training, testing, and validation set.
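
A minimal sketch (the random_state keyword argument is an assumption):

from pykeen.datasets import Nations
from pykeen.triples.remix import remix

dataset = Nations()
train, test, valid = remix(
    dataset.training,
    dataset.testing,
    dataset.validation,
    random_state=42,
)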

Deterioration

Deterioration algorithm.

Functions

deteriorate(reference, *others, n[, ...])

Remove n triples from the reference set.
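
A hedged sketch, assuming deteriorate() lives in pykeen.triples.deteriorate and takes the number of triples to remove as n:

from pykeen.datasets import Nations
from pykeen.triples.deteriorate import deteriorate  # import path assumed

dataset = Nations()
# remove n=50 triples from the reference (training) set
train, test, valid = deteriorate(
    dataset.training,
    dataset.testing,
    dataset.validation,
    n=50,
)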

Generation

Utilities for generating triples.

Functions

generate_triples([num_entities, ...])

Generate random triples in a torch tensor.

generate_triples_factory([num_entities, ...])

Generate a triples factory with random triples.
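
A minimal sketch (num_triples and random_state are assumed keyword names):

from pykeen.triples.generation import generate_triples_factory  # import path assumed

tf = generate_triples_factory(
    num_entities=50,
    num_relations=10,
    num_triples=500,
    random_state=0,
)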

Analysis

Analysis utilities for (mapped) triples.

add_entity_labels(*, df: DataFrame, add_labels: bool, label_to_id: Mapping[str, int] | None = None, triples_factory: TriplesFactory | None = None) DataFrame[source]

Add entity labels to a dataframe.

Parameters:
  • df (DataFrame) – the dataframe to which the labels should be added

  • add_labels (bool) – whether to add labels

  • label_to_id (Mapping[str, int] | None) – the mapping from entity labels to IDs

  • triples_factory (TriplesFactory | None) – a triples factory from which the label-to-ID mapping can be taken instead

Return type:

DataFrame

add_relation_labels(df: DataFrame, *, add_labels: bool, label_to_id: Mapping[str, int] | None = None, triples_factory: TriplesFactory | None = None) DataFrame[source]

Add relation labels to a dataframe.

Parameters:
  • df (DataFrame) – the dataframe to which the labels should be added

  • add_labels (bool) – whether to add labels

  • label_to_id (Mapping[str, int] | None) – the mapping from relation labels to IDs

  • triples_factory (TriplesFactory | None) – a triples factory from which the label-to-ID mapping can be taken instead

Return type:

DataFrame

entity_relation_co_occurrence(mapped_triples: Tensor) DataFrame[source]

Calculate entity-relation co-occurrence.

Parameters:

mapped_triples (Tensor) – The ID-based triples.

Returns:

A dataframe with columns ( entity_id | relation_id | type | count )

Return type:

DataFrame

get_entity_counts(mapped_triples: Tensor) DataFrame[source]

Create a dataframe of entity frequencies.

Parameters:

mapped_triples (Tensor) – shape: (num_triples, 3) The mapped triples.

Returns:

A dataframe with columns ( entity_id | type | count )

Return type:

DataFrame

get_relation_counts(mapped_triples: Tensor) DataFrame[source]

Create a dataframe of relation frequencies.

Parameters:

mapped_triples (Tensor) – shape: (num_triples, 3) The mapped triples.

Returns:

A dataframe with columns ( relation_id | count )

Return type:

DataFrame

get_relation_functionality(mapped_triples: Collection[tuple[int, int, int]], add_labels: bool = True, label_to_id: Mapping[str, int] | None = None) DataFrame[source]

Calculate relation functionalities.

Parameters:
  • mapped_triples (Collection[tuple[int, int, int]]) – The ID-based triples.

  • add_labels (bool) – Should the labels be added to the dataframe?

  • label_to_id (Mapping[str, int] | None) – The label to index mapping.

Returns:

A dataframe with columns ( functionality | inverse_functionality )

Return type:

DataFrame

relation_cardinality_types(mapped_triples: Collection[tuple[int, int, int]], add_labels: bool = True, label_to_id: Mapping[str, int] | None = None) DataFrame[source]

Determine the relation cardinality types.

The possible types are given in relation_cardinality_types.

Note

In the current implementation, we have by definition

\[1 = \sum_{\text{type}} \mathrm{conf}(\text{relation}, \text{type})\]

Note

These relation types are also mentioned in [wang2014]. However, the paper does not provide any details on their definition, nor is any code provided. Thus, their exact procedure is unknown and may not coincide with this implementation.

Parameters:
  • mapped_triples (Collection[tuple[int, int, int]]) – The ID-based triples.

  • add_labels (bool) – Whether to add relation labels (if available).

  • label_to_id (Mapping[str, int] | None) – The label to index mapping.

Returns:

A dataframe with columns ( relation_id | relation_type )

Return type:

DataFrame

relation_injectivity(mapped_triples: Collection[tuple[int, int, int]], add_labels: bool = True, label_to_id: Mapping[str, int] | None = None) DataFrame[source]

Calculate “soft” injectivity scores for each relation.

Parameters:
  • mapped_triples (Collection[tuple[int, int, int]]) – The ID-based triples.

  • add_labels (bool) – Whether to add relation labels (if available).

  • label_to_id (Mapping[str, int] | None) – The label to index mapping.

Returns:

A dataframe with one row per relation, its number of occurrences and head / tail injectivity scores.

Return type:

DataFrame

relation_pattern_types(mapped_triples: Collection[tuple[int, int, int]]) DataFrame[source]

Categorize relations based on patterns from RotatE [sun2019].

The relation classifications are based upon checking whether the corresponding rules hold with sufficient support and confidence. By default, we do not require a minimum support, but we do require a relatively high confidence.

The following four non-exclusive classes for relations are considered:

  • symmetry

  • anti-symmetry

  • inversion

  • composition

This method generally follows the terminology of association rule mining. The patterns are expressed as

\[X_1 \land \cdots \land X_k \implies Y\]

where \(X_i\) is of the form \(r_i(h_i, t_i)\), and some of the \(h_i / t_i\) might re-occur in other atoms. The support of a pattern is the number of distinct instantiations of all variables on the left-hand side. The confidence is the proportion of these instantiations for which the right-hand side is also true.

Parameters:

mapped_triples (Collection[tuple[int, int, int]]) – A collection of ID-based triples.

Returns:

A dataframe of relation categorization

Return type:

DataFrame
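
For illustration, a hedged sketch applying this categorization to a small dataset (the import path is an assumption; the mapped-triples tensor is converted to integer tuples because the signature asks for a collection of tuples):

from pykeen.datasets import Nations
from pykeen.triples.analysis import relation_pattern_types  # import path assumed

dataset = Nations()
mapped = [tuple(map(int, row)) for row in dataset.training.mapped_triples.tolist()]

# one row per detected (relation, pattern) categorization
df = relation_pattern_types(mapped)
print(df.head())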