PyKEEN

PyKEEN is a Python package for reproducible, facile knowledge graph embeddings.

The fastest way to get up and running is to use the pykeen.pipeline.pipeline() function.

It provides a high-level entry point into the extensible functionality of this package. The following example shows how to train and evaluate the TransE model (pykeen.models.TransE) on the Nations dataset (pykeen.datasets.Nations) by referring to them by name. By default, the training loop uses the stochastic local closed world assumption (sLCWA) training approach (pykeen.training.SLCWATrainingLoop) and evaluates with rank-based evaluation (pykeen.evaluation.RankBasedEvaluator).

>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
...     model='TransE',
...     dataset='Nations',
... )

The results are returned in a pykeen.pipeline.PipelineResult instance, which has attributes for the trained model, the training loop, and the evaluation.
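
For example, the trained model and an evaluation metric can be accessed directly from the result. A minimal sketch; 'hits@10' is one of the standard rank-based metric keys:

>>> model = result.model  # the trained pykeen.models.TransE instance
>>> hits_at_10 = result.metric_results.get_metric('hits@10')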

PyKEEN has a function pykeen.env() that magically prints relevant version information about PyTorch, CUDA, and your operating system that can be used for debugging. If you’re in a Jupyter notebook, it will be pretty printed as an HTML table.

>>> import pykeen
>>> pykeen.env()

Installation

Linux and Mac Users

The latest stable version of PyKEEN can be downloaded and installed from PyPI with:

$ pip install pykeen

The latest version of PyKEEN can be installed directly from the source on GitHub with:

$ pip install git+https://github.com/pykeen/pykeen.git

Google Colab and Kaggle Users

Google Colab and Kaggle both provide a hosted version of Google’s custom Jupyter notebook environment that works similarly. After opening a new notebook on one of these services, start your notebook with the following lines:

! pip install git+https://github.com/pykeen/pykeen.git

import pykeen
pykeen.env()

This will install the latest code, then output relevant system and environment information with pykeen.env(). It works because Jupyter interprets any line beginning with a bang (!) as a shell command. If you want to make your notebook compatible with both hosted and local installations, change it slightly to check whether PyKEEN is already installed:

! python -c "import pykeen" || pip install git+https://github.com/pykeen/pykeen.git

import pykeen
pykeen.env()

Note

Old versions of PyKEEN that used class-resolver version 0.3.4 and below loaded datasets via entrypoints. This was unpredictable on Kaggle and Google Colab, so it was removed in https://github.com/pykeen/pykeen/pull/832. More information can also be found on PyKEEN issue #373.

To enable GPU usage, go to the Runtime -> Change runtime type menu to enable a GPU with your notebook.

Windows Users

We’ve added experimental support for Windows as of !95. However, be warned, it’s much less straightforward to install PyTorch and therefore PyKEEN on Windows.

First, to install PyTorch, you must install Anaconda and follow the instructions on the PyTorch website. Then, assuming your python and pip commands point to the same environment that conda manages, you can proceed with the normal installation (or the installation from GitHub as shown above):

$ pip install pykeen

If you’re having trouble with pip or sqlite, you might also have to use conda install pip setuptools wheel sqlite. See our GitHub Actions configuration on GitHub for inspiration.

If you know better ways to install on Windows or would like to share some references, we’d really appreciate it.

Development

The latest code can be installed in development mode with:

$ git clone https://github.com/pykeen/pykeen.git pykeen
$ cd pykeen
$ pip install -e .

If you’re interested in making contributions, please see our contributing guide.

To automatically ensure compliance with our style guide, please install pre-commit hooks by running the following in the same directory:

$ pip install pre-commit
$ pre-commit install

Extras

PyKEEN has several extras for installation that are defined in the [options.extras_require] section of the setup.cfg. They can be included with installation using the bracket notation like in pip install pykeen[docs] or pip install -e .[docs]. Several can be listed, comma-delimited like in pip install pykeen[docs,plotting].

Name           Description
templating     Building of templated documentation, like the README
plotting       Plotting with seaborn and generation of word clouds
mlflow         Tracking of results with mlflow
wandb          Tracking of results with wandb
neptune        Tracking of results with neptune
tensorboard    Tracking of results with tensorboard via torch.utils.tensorboard
transformers   Label-based initialization with transformers
tests          Code needed to run tests; typically handled with tox -e py
docs           Building of the documentation
opt_einsum     Improved performance of torch.einsum() by replacing it with opt_einsum.contract()
biomedicine    Use of pyobo for lookup of biomedical entity labels

First Steps

The easiest way to train and evaluate a model is with the pykeen.pipeline.pipeline() function.

It provides a high-level entry point into the extensible functionality of this package. Full reference documentation for the pipeline and related functions can be found at pykeen.pipeline.

Training a Model

The following example shows how to train and evaluate the pykeen.models.TransE model on the pykeen.datasets.Nations dataset. Throughout the documentation, you’ll notice that each asset has a corresponding class in PyKEEN. You can follow the links to learn more about each and see the reference on how to use them specifically. Don’t worry, in this part of the tutorial, the pykeen.pipeline.pipeline() function will take care of everything for you.

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
... )
>>> pipeline_result.save_to_directory('nations_transe')

The results are returned in a pykeen.pipeline.PipelineResult instance, which has attributes for the trained model, the training loop, and the evaluation.

In this example, the model was given as a string. A list of available models can be found in pykeen.models. Alternatively, the class corresponding to the implementation of the model could be used as in:

>>> from pykeen.pipeline import pipeline
>>> from pykeen.models import TransE
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model=TransE,
... )
>>> pipeline_result.save_to_directory('nations_transe')

In this example, the dataset was given as a string. A list of available datasets can be found in pykeen.datasets. Alternatively, a subclass of pykeen.datasets.Dataset could be used as in:

>>> from pykeen.pipeline import pipeline
>>> from pykeen.models import TransE
>>> from pykeen.datasets import Nations
>>> pipeline_result = pipeline(
...     dataset=Nations,
...     model=TransE,
... )
>>> pipeline_result.save_to_directory('nations_transe')

In each of the previous three examples, the training approach, optimizer, and evaluation scheme were omitted. By default, the model is trained under the stochastic local closed world assumption (sLCWA; pykeen.training.SLCWATrainingLoop). This can be explicitly given as a string:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
... )
>>> pipeline_result.save_to_directory('nations_transe')

Alternatively, the model can be trained under the local closed world assumption (LCWA; pykeen.training.LCWATrainingLoop) by giving 'LCWA'. No additional configuration is necessary, but it’s worth reading up on the differences between these training approaches. A list of available training assumptions can be found in pykeen.training.

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='LCWA',
... )
>>> pipeline_result.save_to_directory('nations_transe')

One of these differences is that the sLCWA relies on negative sampling. The type of negative sampling can be given as in:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler='basic',
... )
>>> pipeline_result.save_to_directory('nations_transe')

In this example, the negative sampler was given as a string. A list of available negative samplers can be found in pykeen.sampling. Alternatively, the class corresponding to the implementation of the negative sampler could be used as in:

>>> from pykeen.pipeline import pipeline
>>> from pykeen.sampling import BasicNegativeSampler
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler=BasicNegativeSampler,
... )
>>> pipeline_result.save_to_directory('nations_transe')

Warning

The negative_sampler keyword argument should not be used if the LCWA is being used. In general, all other options are available under either training approach.

The type of evaluation performed can be specified with the evaluator keyword. By default, rank-based evaluation is used. It can be given explicitly as in:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     evaluator='RankBasedEvaluator',
... )
>>> pipeline_result.save_to_directory('nations_transe')

In this example, the evaluator was given as a string. A list of available evaluators can be found in pykeen.evaluation. Alternatively, the class corresponding to the implementation of the evaluator could be used as in:

>>> from pykeen.pipeline import pipeline
>>> from pykeen.evaluation import RankBasedEvaluator
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     evaluator=RankBasedEvaluator,
... )
>>> pipeline_result.save_to_directory('nations_transe')

PyKEEN implements early stopping, which can be turned on with the stopper keyword argument as in:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     stopper='early',
... )
>>> pipeline_result.save_to_directory('nations_transe')
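
The early stopper can be configured with the stopper_kwargs keyword argument, whose entries are passed to pykeen.stoppers.EarlyStopper. A sketch using its frequency and patience parameters:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     stopper='early',
...     stopper_kwargs=dict(
...         frequency=10,  # check the stopping criterion every 10 epochs
...         patience=2,    # wait for 2 consecutive non-improvements
...     ),
... )
>>> pipeline_result.save_to_directory('nations_transe')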

In PyKEEN you can also use the learning rate schedulers provided by PyTorch. Turn one on with the lr_scheduler keyword argument, and use the lr_scheduler_kwargs keyword argument to specify its arguments, as in:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     lr_scheduler='ExponentialLR',
...     lr_scheduler_kwargs=dict(
...         gamma=0.99,
...     ),
... )
>>> pipeline_result.save_to_directory('nations_transe')

Deeper Configuration

Arguments for the model can be given as a dictionary using model_kwargs.

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     model_kwargs=dict(
...         scoring_fct_norm=2,
...     ),
... )
>>> pipeline_result.save_to_directory('nations_transe')

The entries in model_kwargs correspond to the arguments given to pykeen.models.TransE.__init__(). For a complete listing of models, see pykeen.models, where there are links to the reference for each model that explain what kwargs are possible. Each model’s default hyper-parameters were chosen based on the best reported values from the paper originally publishing the model unless otherwise noted on the model’s reference page.

Because the pipeline takes care of looking up classes and instantiating them, pykeen.pipeline.pipeline() accepts several other parameters that can be used to pass keyword arguments to those respective instantiations.

Arguments can be given to the dataset with dataset_kwargs. These are passed on to pykeen.datasets.Nations when it is instantiated.
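
For example, a sketch using the create_inverse_triples option common to PyKEEN datasets:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     dataset_kwargs=dict(
...         create_inverse_triples=False,
...     ),
...     model='TransE',
... )
>>> pipeline_result.save_to_directory('nations_transe')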

Loading a pre-trained Model

Many of the previous examples ended with saving the results using pykeen.pipeline.PipelineResult.save_to_directory(). One of the artifacts written to the given directory is the trained_model.pkl file. Because all PyKEEN models inherit from torch.nn.Module, we use the PyTorch mechanisms for saving and loading them. This means that you can use torch.load() to load a model like:

import torch

my_pykeen_model = torch.load('trained_model.pkl')

More information on PyTorch’s model persistence can be found at: https://pytorch.org/tutorials/beginner/saving_loading_models.html.
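
Once loaded, the model can be used directly, e.g., for scoring triples with its score_hrt() method. A minimal sketch continuing the snippet above; the identifiers are hypothetical and must come from the same triples factory that was used during training:

# a batch with a single (head, relation, tail) ID triple
hrt_batch = torch.as_tensor([[0, 0, 1]])
scores = my_pykeen_model.score_hrt(hrt_batch)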

Mapping Entity and Relation Identifiers to their Names

While PyKEEN internally maps entities and relations to contiguous identifiers, it’s still useful to be able to interact with datasets, triples factories, and models using the labels of the entities and relations.

We can map a triples factory’s entities to identifiers using TriplesFactory.entities_to_ids() like in the following example:

import torch

from pykeen.datasets import Nations

triples_factory = Nations().training

# Get a tensor of entity identifiers
entity_ids = torch.as_tensor(triples_factory.entities_to_ids(["china", "egypt"]))

Similarly, we can map a triples factory’s relations to identifiers using TriplesFactory.relations_to_ids() like in the following example:

relation_ids = torch.as_tensor(triples_factory.relations_to_ids(["independence", "embassy"]))

Warning

Note that you must use a triples factory with the same mapping that was used to train the model; otherwise, you might end up with incorrect IDs.

Using Learned Embeddings

The embeddings learned for entities and relations are not only useful for link prediction (see Prediction), but also for other downstream machine learning tasks like clustering, regression, and classification.

Knowledge graph embedding models can potentially have multiple entity representations and multiple relation representations, so they are respectively stored as sequences in the entity_representations and relation_representations attributes of each model. While the exact contents of these sequences are model-dependent, the first element of each is usually the “primary” representation for either the entities or relations.

Typically, the values in these sequences are instances of pykeen.nn.representation.Embedding. This implements a similar, but more powerful, interface to the built-in torch.nn.Embedding class. However, the values in these sequences can more generally be instances of any subclass of pykeen.nn.representation.Representation. This allows more powerful encoders, such as those in GNNs like pykeen.models.RGCN, to be implemented and used.

The entity representations and relation representations can be accessed like this:

from typing import List

import torch

import pykeen.nn
from pykeen.pipeline import pipeline

result = pipeline(model='TransE', dataset='UMLS')
model = result.model

entity_representation_modules: List['pykeen.nn.Representation'] = model.entity_representations
relation_representation_modules: List['pykeen.nn.Representation'] = model.relation_representations

Most models, like pykeen.models.TransE, only have one representation for entities and one for relations. This means that the entity_representations and relation_representations lists both have a length of 1. All of the entity embeddings can be accessed like:

entity_embeddings: pykeen.nn.Embedding = entity_representation_modules[0]
relation_embeddings: pykeen.nn.Embedding = relation_representation_modules[0]

Since all representations are subclasses of torch.nn.Module, you need to call them like functions to invoke forward() and get the values.

entity_embedding_tensor: torch.FloatTensor = entity_embeddings()
relation_embedding_tensor: torch.FloatTensor = relation_embeddings()

The forward() function of every pykeen.nn.representation.Representation subclass takes an indices parameter. By default, it is None, and all values are returned. More explicitly, this looks like:

entity_embedding_tensor: torch.FloatTensor = entity_embeddings(indices=None)
relation_embedding_tensor: torch.FloatTensor = relation_embeddings(indices=None)

If you’d like to only look up certain embeddings, you can use the indices parameter and pass a torch.LongTensor with their corresponding indices.
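
For example, continuing the snippets above (the identifiers here are hypothetical):

import torch

# look up only the representations for entities with IDs 1 and 3
entity_ids = torch.as_tensor([1, 3], dtype=torch.long)
selected_embeddings = entity_embeddings(indices=entity_ids)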

You might want to detach them from the computation graph, move them off the GPU, and convert them to a numpy.ndarray with

entity_embedding_tensor = model.entity_representations[0](indices=None).detach().cpu().numpy()

Warning

Some old-style models (e.g., ones inheriting from pykeen.models.EntityRelationEmbeddingModel) don’t fully implement the entity_representations and relation_representations interface. This means that they might have additional embeddings stored in attributes that aren’t exposed through these sequences. For example, pykeen.models.TransD has a secondary entity embedding in pykeen.models.TransD.entity_projections. Eventually, all models will be upgraded to new-style models and this won’t be a problem.

Beyond the Pipeline

While the pipeline provides a high-level interface, each aspect of the training process is encapsulated in classes that can be more finely tuned or subclassed. Below is an example of the code that is executed behind the scenes in one of the previous examples.

>>> # Get a training dataset
>>> from pykeen.datasets import Nations
>>> dataset = Nations()
>>> training_triples_factory = dataset.training

>>> # Pick a model
>>> from pykeen.models import TransE
>>> model = TransE(triples_factory=training_triples_factory)

>>> # Pick an optimizer from Torch
>>> from torch.optim import Adam
>>> optimizer = Adam(params=model.get_grad_params())

>>> # Pick a training approach (sLCWA or LCWA)
>>> from pykeen.training import SLCWATrainingLoop
>>> training_loop = SLCWATrainingLoop(
...     model=model,
...     triples_factory=training_triples_factory,
...     optimizer=optimizer,
... )

>>> # Train like Cristiano Ronaldo
>>> _ = training_loop.train(
...     triples_factory=training_triples_factory,
...     num_epochs=5,
...     batch_size=256,
... )

>>> # Pick an evaluator
>>> from pykeen.evaluation import RankBasedEvaluator
>>> evaluator = RankBasedEvaluator()

>>> # Get triples to test
>>> mapped_triples = dataset.testing.mapped_triples

>>> # Evaluate
>>> results = evaluator.evaluate(
...     model=model,
...     mapped_triples=mapped_triples,
...     batch_size=1024,
...     additional_filter_triples=[
...         dataset.training.mapped_triples,
...         dataset.validation.mapped_triples,
...     ],
... )
>>> # print(results)

Preview: Evaluation Loops

PyKEEN is currently transitioning to using torch’s data loaders for evaluation, too. While this is not yet active for the high-level pipeline, you can already use it explicitly:

>>> # get a dataset
>>> from pykeen.datasets import Nations
>>> dataset = Nations()

>>> # Pick a model
>>> from pykeen.models import TransE
>>> model = TransE(triples_factory=dataset.training)

>>> # Pick a training approach (sLCWA or LCWA)
>>> from pykeen.training import SLCWATrainingLoop
>>> training_loop = SLCWATrainingLoop(
...     model=model,
...     triples_factory=dataset.training,
... )

>>> # Train like Cristiano Ronaldo
>>> _ = training_loop.train(
...     triples_factory=dataset.training,
...     num_epochs=5,
...     batch_size=256,
...     # NEW: validation evaluation callback
...     callbacks="evaluation-loop",
...     callbacks_kwargs=dict(
...         prefix="validation",
...         factory=dataset.validation,
...     ),
... )

>>> # Pick an evaluation loop (NEW)
>>> from pykeen.evaluation import LCWAEvaluationLoop
>>> evaluation_loop = LCWAEvaluationLoop(
...     model=model,
...     triples_factory=dataset.testing,
... )

>>> # Evaluate
>>> results = evaluation_loop.evaluate()
>>> # print(results)

Training Callbacks

PyKEEN allows interaction with the training loop through callbacks. One particular use case is regular evaluation (outside of an early stopper). The following example shows how to evaluate on the training triples on every tenth epoch:

from pykeen.datasets import get_dataset
from pykeen.pipeline import pipeline

dataset = get_dataset(dataset="nations")
result = pipeline(
    dataset=dataset,
    model="mure",
    training_kwargs=dict(
        num_epochs=100,
        callbacks="evaluation",
        callbacks_kwargs=dict(
            evaluation_triples=dataset.training.mapped_triples,
            tracker="console",
            prefix="training",
        ),
    ),
)

For further information about different result trackers, take a look at the section on Result Trackers.

Next Steps

The first steps tutorial taught you how to train and use a model for some of the most common tasks. There are several other topic-specific tutorials in this section of the documentation. You might also want to jump ahead to the Troubleshooting section in case you’re having trouble, or look through questions and discussions that others have posted on GitHub.

Knowledge Graph Embedding Models

In PyKEEN, the base class for Knowledge Graph Embedding Models is pykeen.models.ERModel.

It combines entity and relation representations with an interaction function. At a very high level, triple scores are obtained by first extracting the representations corresponding to the head entity, relation, and tail entity (given as integer indices), and then using the interaction function to calculate a scalar score from them.

This tutorial gives a high-level overview of these components, and explains how to extend and modify them.
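
For example, a TransE-like model can be assembled from these components. A minimal sketch, assuming the interaction is resolved from the string 'TransE' as described below:

from pykeen.datasets import Nations
from pykeen.models import ERModel

# a training triples factory; Nations is used purely for illustration
tf = Nations().training

model = ERModel(
    triples_factory=tf,
    interaction='TransE',          # resolved to pykeen.nn.TransEInteraction
    interaction_kwargs=dict(p=2),  # the norm used by TransE
    # a single embedding of dimensionality 64 each for entities and relations
    entity_representations_kwargs=dict(shape=64),
    relation_representations_kwargs=dict(shape=64),
)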

Representation

A pykeen.nn.representation.Representation module provides a method to obtain representations, e.g., vectors, for given integer indices. These indices may correspond to entity or relation indices. The representations are chosen by providing appropriate inputs to the parameters

  • entity_representations / entity_representations_kwargs for entity representations, or

  • relation_representations / relation_representations_kwargs for relation representations.

These inputs are then used to instantiate the representations using pykeen.nn.representation_resolver.make_many(). Notice that the model class, pykeen.models.ERModel, takes care of filling in the max_id parameter into the …_kwargs. The default is to use a single pykeen.nn.Embedding for entities and relations, as encountered in many publications.

The following examples are for entity representations, but can be equivalently used for relation representations.

  • a single pykeen.nn.Embedding with dimensionality 64, suitable, e.g., for interactions such as pykeen.nn.TransEInteraction, or pykeen.nn.DistMultInteraction.

    model = ERModel(
        # the default:
        # entity_representations=None,
        # equivalent to
        # entity_representations=[None],
        # equivalent to
        # entity_representations=[pykeen.nn.Embedding],
        entity_representations_kwargs=dict(shape=64),
        ...,
    )
    
  • two pykeen.nn.Embedding instances with the same dimensionality 64, suitable, e.g., for interactions such as pykeen.nn.BoxEInteraction

    model = ERModel(
        entity_representations=[None, None],
        # note: ClassResolver.make_many supports "broadcasting" kwargs
        entity_representations_kwargs=dict(shape=64),
        # equivalent:
        # entity_representations_kwargs=[dict(shape=64), dict(shape=64)],
        ...,
    )
    

Note

If you are unsure which choices you have for entity representations, take a look at the subclasses of pykeen.nn.Representation.

Note

Internally, the class_resolver library is used to support various alternative parametrizations, e.g., the string name of a representation class, the class object itself, or an instance of the pykeen.nn.Representation class. You can also register your own classes with the resolver. Detailed information can be found in the documentation of the package or in Using Resolvers.

Interaction Function

An interaction function calculates scalar scores from head, relation and tail representations. These scores can be interpreted as the plausibility of a triple, i.e., the higher the score, the more plausible the triple is. Good models thus should output high scores for true triples, and low scores for false triples.

In PyKEEN, interactions are provided as subclasses of pykeen.nn.Interaction, which is a torch.nn.Module, i.e., it can hold additional (trainable) parameters, and can also be used outside of PyKEEN. Its core method is pykeen.nn.Interaction.forward(), which receives batches of head, relation and tail representations and calculates the corresponding triple scores.

As with the representations, interactions passed to pykeen.models.ERModel are resolved, this time using pykeen.nn.interaction_resolver.make(). Hence, we can provide, e.g., strings corresponding to the interaction function instead of an instantiated class. Further information can be found at Using Resolvers.
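
For example, an interaction module can also be used on its own with plain tensors. A minimal sketch with pykeen.nn.TransEInteraction applied to random representations:

import torch

from pykeen.nn import TransEInteraction

# TransE scores a triple by the negative distance -||h + r - t||_p
interaction = TransEInteraction(p=2)
# batches of three head, relation, and tail representations of dimensionality 64
h, r, t = torch.rand(3, 64), torch.rand(3, 64), torch.rand(3, 64)
scores = interaction(h, r, t)  # one scalar score per triple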

Note

Interaction functions can require different numbers or shapes of entity and relation representations. A symbolic description of the expected number of representations and their shape can be accessed by pykeen.nn.Interaction.entity_shape and pykeen.nn.Interaction.relation_shape.

Tracking Results during Training

Using MLflow

MLflow is a graphical tool for tracking the results of machine learning. PyKEEN integrates MLflow into the pipeline and HPO pipeline.

To use it, you’ll first have to install MLflow with pip install mlflow and run it in the background with mlflow ui. More information can be found on the MLflow Quickstart. It’ll be running at http://localhost:5000 by default.

Pipeline Example

This example shows using MLflow with the pykeen.pipeline.pipeline() function. Minimally, the tracking_uri and experiment_name are required in the result_tracker_kwargs.

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='mlflow',
    result_tracker_kwargs=dict(
        tracking_uri='http://localhost:5000',
        experiment_name='Tutorial Training of RotatE on Kinships',
    ),
)

If you navigate to the MLflow UI at http://localhost:5000, you’ll see the experiment appeared in the left column.

MLflow home

If you click on the experiment, you’ll see this:

MLflow experiment view

HPO Example

This example shows using MLflow with the pykeen.hpo.hpo_pipeline() function.

from pykeen.hpo import hpo_pipeline

pipeline_result = hpo_pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='mlflow',
    result_tracker_kwargs=dict(
        tracking_uri='http://localhost:5000',
        experiment_name='Tutorial HPO Training of RotatE on Kinships',
    ),
)

The same navigation through MLflow can be done for this example.

Reusing Experiments

In the MLflow UI, you’ll see that experiments are assigned an ID. This means you can re-use the same ID to group different sub-experiments together using the experiment_id keyword argument instead of experiment_name.

from pykeen.pipeline import pipeline

experiment_id = 4  # if it doesn't already exist, an error will be thrown!
pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='mlflow',
    result_tracker_kwargs=dict(
        tracking_uri='http://localhost:5000',
        experiment_id=experiment_id,
    ),
)

Adding Tags

Tags are additional key/value information that you might want to add to the experiment and store in MLflow. By default, MLflow adds the tags listed on https://www.mlflow.org/docs/latest/tracking.html#id41.

For example, if you’re using custom input, you might want to add which version of the input file produced the results as follows:

from pykeen.pipeline import pipeline

data_version = ...

pipeline_result = pipeline(
    model='RotatE',
    training=...,
    testing=...,
    validation=...,
    result_tracker='mlflow',
    result_tracker_kwargs=dict(
        tracking_uri='http://localhost:5000',
        experiment_name='Tutorial Training of RotatE on Kinships',
        tags={
            "data_version": data_version,
        },
    ),
)

Additional documentation of the valid keyword arguments can be found under pykeen.trackers.MLFlowResultTracker.

Using Neptune.ai

Neptune is a graphical tool for tracking the results of machine learning. PyKEEN integrates Neptune into the pipeline and HPO pipeline.

Preparation

  1. To use it, you’ll first have to install Neptune’s client with pip install neptune-client or install PyKEEN with the neptune extra with pip install pykeen[neptune].

  2. Create an account at Neptune.

    • Get an API token following this tutorial.

    • [Optional] Set the NEPTUNE_API_TOKEN environment variable to your API token.

  3. [Optional] Create a new project by following this tutorial for project and user management. Neptune automatically creates a project for all new users called sandbox which you can directly use.

Pipeline Example

This example shows using Neptune with the pykeen.pipeline.pipeline() function. Minimally, the project_qualified_name and experiment_name must be set.

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='neptune',
    result_tracker_kwargs=dict(
        project_qualified_name='cthoyt/sandbox',
        experiment_name='Tutorial Training of RotatE on Kinships',
    ),
)

Warning

If you haven’t set the NEPTUNE_API_TOKEN environment variable, the api_token becomes a mandatory key.

Reusing Experiments

In the Neptune web application, you’ll see that experiments are assigned an ID. This means you can re-use the same ID to group different sub-experiments together using the experiment_id keyword argument instead of experiment_name.

from pykeen.pipeline import pipeline

experiment_id = 4  # if it doesn't already exist, an error will be thrown!
pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='neptune',
    result_tracker_kwargs=dict(
        project_qualified_name='cthoyt/sandbox',
        experiment_id=experiment_id,
    ),
)

Don’t worry - you can keep using the experiment_name argument and the experiment’s identifier will be automatically looked up each time.

Adding Tags

Tags are additional information that you might want to add to the experiment and store in Neptune. Note this is different from MLflow, which considers tags as key/value pairs.

For example, if you’re using custom input, you might want to add some labels about whether the experiment is cool or not.

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    training=...,
    testing=...,
    validation=...,
    result_tracker='neptune',
    result_tracker_kwargs=dict(
        project_qualified_name='cthoyt/sandbox',
        experiment_name='Tutorial Training of RotatE on Kinships',
        tags={'cool', 'doggo'},
    ),
)

Additional documentation of the valid keyword arguments can be found under pykeen.trackers.NeptuneResultTracker.

Using Weights and Biases

Weights and Biases (WANDB) is a service for tracking experimental results and various artifacts appearing when training ML models.

After registering for WANDB, do the following:

  1. Create a project in WANDB, for example, with the name pykeen_project at https://app.wandb.ai/<your username>/new

  2. Install WANDB on your machine with pip install wandb

  3. Setup your computer for use with WANDB by using either of the following two instructions from https://github.com/wandb/client#running-your-script:

    1. Navigate to https://app.wandb.ai/settings, copy your API key, and set the WANDB_API_KEY environment variable

    2. Interactively run wandb login

Now you can simply specify this project name when initializing a pipeline, and everything else will work automatically!

Pipeline Example

This example shows using WANDB with the pykeen.pipeline.pipeline() function.

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='wandb',
    result_tracker_kwargs=dict(
        project='pykeen_project',
    ),
)

You can navigate to the created project in WANDB and observe a running experiment. Further tweaking of appearance, charts, and other settings is described in the official documentation.

You can also specify an optional experiment name, which will appear on the website instead of a randomly generated label. All further keyword arguments are passed to wandb.init().

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='wandb',
    result_tracker_kwargs=dict(
        project='pykeen_project',
        experiment='experiment-1',
    ),
)

HPO Example

This example shows using WANDB with the pykeen.hpo.hpo_pipeline() function.

from pykeen.hpo import hpo_pipeline

pipeline_result = hpo_pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='wandb',
    result_tracker_kwargs=dict(
        project='pykeen_project',
        experiment='new run',
        reinit=True,
    ),
)

It’s safe to specify the experiment name during HPO; several runs will be sent to the same experiment under different hashes. However, specifying the experiment name is more advisable for single runs than for batches of multiple runs.

Additional documentation of the valid keyword arguments can be found under pykeen.trackers.WANDBResultTracker.

Using Tensorboard

Tensorboard is a service for tracking experimental results during or after training. It is part of the larger TensorFlow project but can be used independently of it.

Installing Tensorboard

The tensorboard package can either be installed directly with pip install tensorboard or with PyKEEN by using the tensorboard extra in pip install pykeen[tensorboard].

Note

Tensorboard logs can be created without actually installing tensorboard itself. However, if you want to view and interact with the data created via the tracker, it must be installed.

Starting Tensorboard

The tensorboard web application can be started from the command line with

$ tensorboard --logdir=~/.data/pykeen/logs/tensorboard/

where the value passed to --logdir is the location of the log directory. By default, PyKEEN logs to ~/.data/pykeen/logs/tensorboard/, but this is configurable. Tensorboard can then be accessed in a browser at http://localhost:6006/.

Note

It is not required for the Tensorboard process to be running while the training is happening. Indeed, it only needs to be started once you want to interact with and view the logs. It can be stopped at any time and the logs will persist in the filesystem.

Minimal Pipeline Example

The tensorboard tracker can be used during training with the pykeen.pipeline.pipeline() as follows:

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='tensorboard',
)

The logs are placed in a subdirectory of the PyStow default data directory for PyKEEN called tensorboard, which will likely be at ~/.data/pykeen/logs/tensorboard on your system. The file is named based on the current time if no alternative is provided.

Specifying a Log Name

If you want to specify the name of the log file in the default directory, use the experiment_name keyword argument like:

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='tensorboard',
    result_tracker_kwargs=dict(
        experiment_name='rotate-kinships',
    ),
)

Specifying a Custom Log Directory

If you want to specify a custom directory to store the tensorboard logs, use the experiment_path keyword argument like:

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='tensorboard',
    result_tracker_kwargs=dict(
        experiment_path='tb-logs/rotate-kinships',
    ),
)

Warning

Please be aware that if you re-run an experiment using the same directory, then the logs will be combined. It is advisable to use a unique sub-directory for each experiment to allow for easy comparison.

Minimal HPO Pipeline Example

Tensorboard tracking can also be used in conjunction with a HPO pipeline as follows:

from pykeen.hpo import hpo_pipeline

hpo_pipeline_result = hpo_pipeline(
    n_trials=30,
    dataset='Nations',
    model='TransE',
    result_tracker='tensorboard',
)

This provides a way to compare directly between different trials and parameter configurations. Please note that it is recommended to leave the experiment name as the default value here to allow a directory to be created per trial.

Using File-Based Tracking

Rather than logging to an external backend like MLflow or W&B, file-based trackers write to local files.

Minimal Pipeline Example with CSV

A CSV log file can be generated with the following:

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='csv',
)

The log file is placed in a subdirectory of the PyStow default data directory for PyKEEN called logs, which will likely be at ~/.data/pykeen/logs/ on your system. The file is named based on the current time.

Specifying a Name

If you want to specify the name of the log file in the default directory, use the name keyword argument like:

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='csv',
    result_tracker_kwargs=dict(
        name='test.csv',
    ),
)

Additional keyword arguments are passed through to csv.writer(). These can include delimiter, dialect, quotechar, etc.

Warning

If you specify the file name, it will overwrite the previous log file there.

The path argument can be used instead of the name to specify an absolute path to the log file rather than using the PyStow directory.
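
For example (a sketch; the absolute path below is hypothetical):

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='csv',
    result_tracker_kwargs=dict(
        path='/tmp/pykeen-logs/test.tsv',  # absolute path instead of name
        delimiter='\t',                    # passed through to csv.writer()
    ),
)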

Combining with tail

If you know the name of a file, you can monitor it with tail and the -f flag like in:

$ tail -f ~/.data/pykeen/logs/test.csv | grep "hits_at_10"

Pipeline Example with JSON

The JSON writer creates a JSONL file in which each line is a valid JSON object. As with the CSV writer, the name argument can be omitted to create a time-based file name or given to choose the name. The path argument can still be used to specify an absolute path.

from pykeen.pipeline import pipeline

pipeline_result = pipeline(
    model='RotatE',
    dataset='Kinships',
    result_tracker='json',
    result_tracker_kwargs=dict(
        name='test.json',
    ),
)

The same concepts can be applied in the HPO pipeline as in the previous tracker tutorials. Additional documentation of the valid keyword arguments can be found under pykeen.trackers.CSVResultTracker and pykeen.trackers.JSONResultTracker.

Saving Checkpoints during Training

Training may take days to weeks in extreme cases when using models with many parameters or big datasets. This introduces a large array of possible errors, e.g., session timeouts, server restarts, etc., which would lead to a complete loss of all progress made so far. To avoid this, PyKEEN supports built-in checkpoints that allow straightforward saving of the current training loop state and resumption from saved checkpoints, shown in Regular Checkpoints, as well as checkpoints on failure that are only saved when the training loop fails, shown in Checkpoints on Failure. To understand in more detail how the checkpoints work and how they can be used programmatically, please look at Checkpoints beyond the Pipeline and Technicalities. For fixing possible errors and safety fallbacks, please also look at Word of Caution and Possible Errors.

Regular Checkpoints

The tutorial First Steps showed how the pykeen.pipeline.pipeline() function can be used to set up an entire KGEM for training and evaluation in just two lines of code. A slightly extended example is shown below:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=1000,
...     ),
... )

To enable checkpoints, all you have to do is add a checkpoint_name argument to the training_kwargs. This argument should be the name you would like the checkpoint file saved on your computer to have.

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=1000,
...         checkpoint_name='my_checkpoint.pt',
...     ),
... )

Furthermore, you can set the checkpoint frequency, i.e., how often checkpoints should be saved (given in minutes), by setting the argument checkpoint_frequency to an integer. The default frequency is 30 minutes, and setting it to 0 will cause the training loop to save a checkpoint after each epoch. Let’s look at an example.

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=1000,
...         checkpoint_name='my_checkpoint.pt',
...         checkpoint_frequency=5,
...     ),
... )

Here we have defined a pipeline that will save training loop checkpoints in the checkpoint file called my_checkpoint.pt every time an epoch finishes and at least 5 minutes have passed since the previous save. Assuming that this pipeline crashes after 200 epochs, you can simply execute the same code, and the pipeline will load the last state from the checkpoint file and continue training as if nothing happened. The results will be exactly the same as if you had run the pipeline for 1000 epochs without interruption.

Another nice feature is that, when using checkpoints, the training loop will save its state whenever it finishes or the early stopper stops it. Assuming that you successfully trained the KGEM above for 1000 epochs, but now decide that you would like to train the model for 2000 epochs, all you have to do is change the number of epochs and execute the code like:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=2000,  # more epochs than before
...         checkpoint_name='my_checkpoint.pt',
...         checkpoint_frequency=5,
...     ),
... )

The above code will load the saved state after finishing 1000 epochs and continue to train to 2000 epochs, giving the exact same results as if you had run it for 2000 epochs in the first place.

By default, your checkpoints will be saved in the PYKEEN_HOME directory that is defined in pykeen.constants, which is a subdirectory of your home directory, e.g., ~/.data/pykeen/checkpoints (configured via pystow). Optionally, you can set the path where you want the checkpoints to be saved by setting the checkpoint_directory argument to a string or a pathlib.Path object containing your desired root path, as shown in this example:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=2000,
...         checkpoint_name='my_checkpoint.pt',
...         checkpoint_directory='doctests/checkpoint_dir',
...     ),
... )

Checkpoints on Failure

In cases where you would only like to save checkpoints when the training loop fails, you can use the argument checkpoint_on_failure=True, like:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=2000,
...         checkpoint_on_failure=True,
...     ),
... )

This option differs from regular checkpoints, since regular checkpoints are only saved after a successful epoch. When saving a checkpoint due to a failure of the training loop, there is no guarantee that all random states can be recovered correctly, which might cause problems with regard to the reproducibility of that specific training loop. Therefore, these checkpoints are saved with a distinct checkpoint name, which will be PyKEEN_just_saved_my_day_{datetime}.pt in the given checkpoint_directory, even when you also opted to use regular checkpoints as defined above, e.g., with this code:

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='Nations',
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=2000,
...         checkpoint_name='my_checkpoint.pt',
...         checkpoint_on_failure=True,
...     ),
... )

Note: Use this argument with caution, since every failed training loop will create a distinct checkpoint file.

Checkpoints When Bringing Your Own Data

When continuing the training or in general using the model after resuming training, it is critical that the entity label to identifier (entity_to_id) and relation label to identifier (relation_to_id) mappings are the same as the ones that were used when saving the checkpoint. If they are not, then any downstream usage will be nonsense.

If you’re using a dataset provided by PyKEEN, you’re automatically covered. However, when using your own datasets (see Bring Your Own Data), you are responsible for making sure this is the case. Below are two typical examples of combining bringing your own data with checkpoints.

Resuming Training

The following example shows using custom triples factories for the training, validation, and testing datasets derived from files containing labeled triples. Note how the entity_to_id and relation_to_id arguments are used when creating the validation and testing triples factories in order to ensure that those datasets are created with the same mappings as the training dataset. Because the checkpoint_name is set to 'my_checkpoint.pt', PyKEEN saves the checkpoint in ~/.data/pykeen/checkpoints/my_checkpoint.pt.

>>> from pykeen.pipeline import pipeline
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.datasets.nations import NATIONS_TEST_PATH, NATIONS_TRAIN_PATH, NATIONS_VALIDATE_PATH
>>> training = TriplesFactory.from_path(
...     path=NATIONS_TRAIN_PATH,
... )
>>> validation = TriplesFactory.from_path(
...     path=NATIONS_VALIDATE_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
... )
>>> testing = TriplesFactory.from_path(
...     path=NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
... )
>>> pipeline_result = pipeline(
...     training=training,
...     validation=validation,
...     testing=testing,
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=2000,
...         checkpoint_name='my_checkpoint.pt',
...     ),
... )

When you are sure that the datasets shown above are the same, you can simply rerun that code and PyKEEN will automatically resume the training where it left off. However, if you have changed the dataset or you sample from it, you need to make sure that the mappings are correct when resuming training from the checkpoint. This can be done by loading the mappings from the checkpoint in the following way.

>>> import torch
>>> from pykeen.constants import PYKEEN_CHECKPOINTS
>>> checkpoint = torch.load(PYKEEN_CHECKPOINTS.joinpath('my_checkpoint.pt'))

You have now loaded the checkpoint that contains the mappings, which can now be used to create triples factories that match the model saved in the checkpoint in the following way:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.datasets.nations import NATIONS_TEST_PATH, NATIONS_TRAIN_PATH, NATIONS_VALIDATE_PATH
>>> training = TriplesFactory.from_path(
...     path=NATIONS_TRAIN_PATH,
...     entity_to_id=checkpoint['entity_to_id_dict'],
...     relation_to_id=checkpoint['relation_to_id_dict'],
... )
>>> validation = TriplesFactory.from_path(
...     path=NATIONS_VALIDATE_PATH,
...     entity_to_id=checkpoint['entity_to_id_dict'],
...     relation_to_id=checkpoint['relation_to_id_dict'],
... )
>>> testing = TriplesFactory.from_path(
...     path=NATIONS_TEST_PATH,
...     entity_to_id=checkpoint['entity_to_id_dict'],
...     relation_to_id=checkpoint['relation_to_id_dict'],
... )

Now you can simply resume the pipeline with the same code as above:

>>> pipeline_result = pipeline(
...     training=training,
...     validation=validation,
...     testing=testing,
...     model='TransE',
...     optimizer='Adam',
...     training_kwargs=dict(
...         num_epochs=2000,
...         checkpoint_name='my_checkpoint.pt',
...     ),
... )

In case you feel that this is too much work, we’ve still got you covered: PyKEEN checks in the background whether the provided triples factory mappings match those in the checkpoint and will warn you if that is not the case.

Loading Models Manually

Instead of just resuming training with checkpoints as shown above, you can also manually load models from checkpoints for investigation or performing prediction tasks. This can be done in the following way:

>>> import torch
>>> from pykeen.constants import PYKEEN_CHECKPOINTS
>>> from pykeen.pipeline import pipeline
>>> from pykeen.triples import TriplesFactory
>>> checkpoint = torch.load(PYKEEN_CHECKPOINTS.joinpath('my_checkpoint.pt'))

You have now loaded the checkpoint that contains the model as well as the entity_to_id and relation_to_id mappings from the example above. To load these into PyKEEN, you just have to do the following:

>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> train = TriplesFactory.from_path(
...     path=NATIONS_TRAIN_PATH,
...     entity_to_id=checkpoint['entity_to_id_dict'],
...     relation_to_id=checkpoint['relation_to_id_dict'],
... )

… now load the model and pass the train triples factory to the model

>>> from pykeen.models import TransE
>>> my_model = TransE(triples_factory=train)
>>> my_model.load_state_dict(checkpoint['model_state_dict'])

Now you have loaded the model and ensured that the mapping in the triples factory is aligned with the model weights. Enjoy!

Todo

Tutorial on recovery from hpo_pipeline.

Word of Caution and Possible Errors

When using checkpoints and trying out several configurations, which in turn result in multiple different checkpoints, there is an inherent risk of overwriting checkpoints. This would naturally happen when you change the configuration of the KGEM but don’t change the checkpoint_name argument. To prevent this from happening, PyKEEN makes a hash-sum comparison of the configuration stored in the checkpoint and the current configuration at hand. When these don’t match, PyKEEN won’t accept the checkpoint and will raise an error.

In case you want to overwrite the previous checkpoint file with a new configuration, you have to delete it explicitly. The reason for this behavior is three-fold:

  1. This allows a very easy and user friendly way of resuming an interrupted training loop by simply re-running the exact same code.

  2. By explicitly requiring to name the checkpoint files the user controls the naming of the files and thus makes it easier to keep an overview.

  3. Creating new checkpoint files implicitly for each run would lead most users to inadvertently spam their file systems with unused checkpoints that can easily add up to hundreds of GBs when running many experiments.

Checkpoints beyond the Pipeline and Technicalities

Currently, PyKEEN only supports checkpoints for training loops, implemented in the class pykeen.training.TrainingLoop. When using the pykeen.pipeline.pipeline() function as defined above, the pipeline actually uses the training loop functionality. Accordingly, those checkpoints save the states of the training loop and not the pipeline itself. Therefore, the checkpoints won’t contain evaluation results, which reside in the pipeline. However, PyKEEN makes sure that the final results of a pipeline using training loop checkpoints are exactly the same as when running uninterrupted without checkpoints, including the evaluation results!

To show how to use the checkpoint functionality without the pipeline, we define a KGEM first:

>>> from pykeen.datasets import Nations
>>> from pykeen.models import TransE
>>> from pykeen.training import SLCWATrainingLoop
>>> from torch.optim import Adam
>>> triples_factory = Nations().training
>>> model = TransE(
...     triples_factory=triples_factory,
...     random_seed=123,
... )
>>> optimizer = Adam(params=model.get_grad_params())
>>> training_loop = SLCWATrainingLoop(
...     model=model,
...     triples_factory=triples_factory,
...     optimizer=optimizer,
... )

At this point we have a model, dataset, and optimizer all set up in a training loop and are ready to train the model with the training loop’s method pykeen.training.TrainingLoop.train(). To enable checkpoints, all you have to do is set the function argument checkpoint_name to the name you would like the checkpoint file to have. Furthermore, you can set the checkpoint frequency, i.e., how often checkpoints should be saved (given in minutes), by setting the argument checkpoint_frequency to an integer. The default frequency is 30 minutes, and setting it to 0 will cause the training loop to save a checkpoint after each epoch. Optionally, you can set the path where you want the checkpoints to be saved by setting the checkpoint_directory argument to a string or a pathlib.Path object containing your desired root path. If you don’t set the checkpoint_directory argument, your checkpoints will be saved in the PYKEEN_HOME directory that is defined in pykeen.constants, which is a subdirectory of your home directory, e.g., ~/.data/pykeen/checkpoints.

Here is an example:

>>> losses = training_loop.train(
...     triples_factory=triples_factory,
...     num_epochs=1000,
...     checkpoint_name='my_checkpoint.pt',
...     checkpoint_frequency=5,
... )

With this code we have started the training loop with the above defined KGEM. The training loop will save a checkpoint in the my_checkpoint.pt file, which will be saved in the ~/.data/pykeen/checkpoints/ directory, since we haven’t set the argument checkpoint_directory. A checkpoint will be saved whenever an epoch finishes and at least 5 minutes have passed since starting the training loop or since the last checkpoint was saved, i.e., when one epoch takes 10 minutes, the checkpoint will be saved after 10 minutes. In addition, checkpoints are always saved when the early stopper stops the training loop or the last epoch has finished.

Let’s assume you were anticipative, saved checkpoints and your training loop crashed after 200 epochs. Now you would like to resume from the last checkpoint. All you have to do is to rerun the exact same code as above and PyKEEN will smoothly start from the given checkpoint. Since PyKEEN stores all random states as well as the states of the model, optimizer and early stopper, the results will be exactly the same compared to running the training loop uninterruptedly. Of course, PyKEEN will also continue saving new checkpoints even when resuming from a previous checkpoint.

On top of resuming interrupted training loops, you can also resume training loops that finished successfully. E.g., the above training loop finished successfully after 1000 epochs, but you would like to train the same model from that state for 2000 epochs. All you have to do is change the argument num_epochs in the above code to:

>>> losses = training_loop.train(
...     triples_factory=triples_factory,
...     num_epochs=2000,
...     checkpoint_name='my_checkpoint.pt',
...     checkpoint_frequency=5,
... )

and now the training loop will resume from the state at 1000 epochs and continue to train until 2000 epochs.

As shown in Checkpoints on Failure, you can also save checkpoints only in cases where the training loop fails. To do this you just have to set the argument checkpoint_on_failure=True, like:

>>> losses = training_loop.train(
...     triples_factory=triples_factory,
...     num_epochs=2000,
...     checkpoint_directory='/my/secret/dir',
...     checkpoint_on_failure=True,
... )

This code will save a checkpoint in case the training loop fails. Note how we also chose a new checkpoint directory by setting the checkpoint_directory argument to /my/secret/dir.

A Toy Example with Translational Distance Models

The following tutorial is based on a question originally posed by Heiko Paulheim on the PyKEEN Issue Tracker #97.

Given the following toy example comprising three entities in a triangle, a translational distance model like pykeen.models.TransE should be able to exactly learn the geometric structure.

Head      Relation    Tail
Brussels  locatedIn   Belgium
Belgium   partOf      EU
EU        hasCapital  Brussels

from pykeen.pipeline import pipeline
tf = ...
results = pipeline(
    training=tf,
    testing=...,
    model='TransE',
    model_kwargs=dict(embedding_dim=2),
    random_seed=1,
    device='cpu',
)
results.plot()
[Figure: Troubleshooting Image 1]

First, check whether the model is converging using results.plot_losses(). Qualitatively, this means that the loss is smoothly decreasing and eventually levels out. If the loss does not decrease, you might need to tune some parameters passed via optimizer_kwargs and training_kwargs to the pipeline() function.

For example, you can decrease the optimizer's learning rate to make the loss curve less bumpy, and you can increase the number of training epochs.

results = pipeline(
    training=tf,
    testing=...,
    model='TransE',
    model_kwargs=dict(embedding_dim=2),
    optimizer_kwargs=dict(lr=1.0e-1),
    training_kwargs=dict(num_epochs=128, use_tqdm_batch=False),
    evaluation_kwargs=dict(use_tqdm=False),
    random_seed=1,
    device='cpu',
)
results.plot()
[Figure: Troubleshooting Image 2]

Note that there is some stochasticity in training, since we sample negative examples for the positive ones. Thus, the loss may fluctuate naturally. To better see the trend, you can smooth the loss by averaging over a window of epochs.
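For example, a rolling mean over the recorded losses (a quick sketch using pandas, assuming results.losses holds the per-epoch loss values of the pipeline result):

import pandas as pd

# smooth the per-epoch losses with a rolling mean over ten epochs
smoothed = pd.Series(results.losses).rolling(window=10).mean()
smoothed.plot()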

We use a margin-based loss with TransE by default. Thus, it suffices if the model predicts scores such that the scores of positive triples and negative triples are at least one margin apart. Once the model has reached this state, it will not improve further on these examples, as the embeddings are "good enough". Hence, an optimal solution under a margin-based loss might not look like the exact geometric solution. If you want to change that, you can switch to a loss function which does not use a margin, e.g. the softplus loss, by passing loss="softplus" to the pipeline.

toy_results = pipeline(
    training=tf,
    testing=...,
    model='TransE',
    loss='softplus',
    model_kwargs=dict(embedding_dim=2),
    optimizer_kwargs=dict(lr=1.0e-1),
    training_kwargs=dict(num_epochs=128, use_tqdm_batch=False),
    evaluation_kwargs=dict(use_tqdm=False),
    random_seed=1,
    device='cpu',
)
toy_results.plot()
[Figure: Troubleshooting Image 3]

There was a lot of interesting follow-up discussion at !99, during which this code was implemented for re-use. One of the interesting points is that the relation plot is only applicable for translational distance models like TransE. Further, for models whose embedding dimension is higher than 2, a dimensionality reduction method must be used. For this, one of the many tools from scikit-learn can be chosen. However, to make sure that the entities and relations are projected onto the same axes, the dimensionality reduction model is first trained on the entity embeddings, then applied to both the entity embeddings and relation embeddings. Further, non-linear models like KPCA should not be used when plotting relations, since relations _should_ correspond to linear transformations in embedding space.

Understanding the Evaluation

This part of the tutorial aims to help you understand the evaluation of knowledge graph embeddings. In particular, it explains the rank-based evaluation metrics reported in pykeen.evaluation.RankBasedMetricResults.

Knowledge graph embedding models are usually evaluated on the task of link prediction. To this end, an evaluation set of triples \(\mathcal{T}_{eval} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E}\) is provided, and for each triple \((h, r, t) \in \mathcal{T}_{eval}\) in this set, two tasks are solved:

  • Right-Side In the right-side prediction task, a head entity and a relation are given, and the aim is to predict the tail, i.e. \((h, r, ?)\). To this end, the knowledge graph embedding model is used to score each of the possible choices \((h, r, e)\) for \(e \in \mathcal{E}\). Higher scores indicate higher plausibility.

  • Left-Side Analogously, in the left-side prediction task, a relation and a tail entity are given, and the aim is to predict the head, i.e. \((?, r, t)\). Again, each possible choice \((e, r, t)\) for \(e \in \mathcal{E}\) is scored according to the knowledge graph embedding model.

Note

In practice, many embedding models allow computing the scores \((e, r, t)\) for all \(e \in \mathcal{E}\) much faster than passing each triple individually through the model's score function. As an example, consider DistMult with the score function \(score(h,r,t)=\sum_{i=1}^d \mathbf{h}_i \cdot \mathbf{r}_i \cdot \mathbf{t}_i\). Here, all entities can be scored as candidate heads for a given tail and relation by first computing the element-wise product of tail and relation, and then performing a matrix multiplication with the matrix of all entity embeddings.
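A minimal sketch of this trick (all shapes and values are illustrative):

import torch

num_entities, dim = 100, 32
entity_embeddings = torch.rand(num_entities, dim)  # one row per entity
r = torch.rand(dim)  # relation embedding
t = torch.rand(dim)  # tail embedding

# score(h, r, t) = <h, r * t>, so all candidate heads are scored with a
# single matrix-vector product instead of a loop over all entities
scores = entity_embeddings @ (r * t)  # shape: (num_entities,)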

In the rank-based evaluation protocol, the scores are used to sort the list of possible choices by decreasing score, and determine the rank of the true choice, i.e. the index in the sorted list. Smaller ranks indicate better performance. Based on these individual ranks, which are obtained for each evaluation triple and each side of the prediction (left/right), there exist several aggregation measures to quantify the performance of a model in a single number.

Note

There are theoretical implications based on whether the indexing is 0-based or 1-based (natural). PyKEEN uses 1-based indexing to conform with related work.

As an example, consider that we trained a KGEM on the "countries" dataset, e.g., using

from pykeen.datasets import get_dataset
from pykeen.pipeline import pipeline
dataset = get_dataset(dataset="countries")
result = pipeline(dataset=dataset, model="mure")

At evaluation time, we evaluate head and tail prediction, i.e., whether we can recover the correct head/tail entity given the remainder of a triple. The first triple in the test split of this dataset is ('belgium', 'locatedin', 'europe'). Thus, for tail prediction, we aim to answer ('belgium', 'locatedin', ?). We can see the results using the prediction workflow:

from pykeen.models.predict import get_tail_prediction_df

df = get_tail_prediction_df(
    model=result.model,
    head_label="belgium",
    relation_label="locatedin",
    triples_factory=result.training,
    add_novelties=False,
)

which returns a dataframe of all tail candidate entities sorted according to the predicted score. The index in this sorted list is essentially the rank of the correct answer.

Rank-Based Metrics

Given the set of individual rank scores for each head/tail entity from the evaluation triples, there are various aggregation metrics which summarize different aspects of the set of ranks in a single number. For more details, please refer to their documentation.
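As a toy illustration, the following sketch applies the textbook definitions of three common aggregations to a handful of hypothetical ranks:

import numpy as np

# hypothetical 1-based ranks collected over evaluation triples and both sides
ranks = np.array([1, 3, 2, 10, 1, 50])

mean_rank = ranks.mean()           # arithmetic mean rank
mrr = (1.0 / ranks).mean()         # mean reciprocal rank
hits_at_10 = (ranks <= 10).mean()  # proportion of ranks of at most 10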

Ranking Types

While the aforementioned definition of the rank as “the index in the sorted list” is intuitive, it does not specify what happens when there are multiple choices with exactly the same score. Therefore, in previous work, different variants have been implemented, which yield different results in the presence of equal scores.

  • The optimistic rank assumes that the true choice is on the first position of all those with equal score.

  • The pessimistic rank assumes that the true choice is on the last position of all those with equal score.

  • The realistic rank is the mean of the optimistic and the pessimistic rank, and moreover the expected value over all permutations respecting the sort order.

  • The non-deterministic rank delegates the decision to the sort algorithm. Thus, the result depends on the internal tie breaking mechanism of the sort algorithm’s implementation.

PyKEEN supports the first three: optimistic, pessimistic, and realistic. When only reporting a single score, the realistic rank should be used. The pessimistic and optimistic ranks, or more specifically the deviation between them, can be used to detect whether a model predicts exactly equal scores for many choices. There are a few possible causes, such as:

  • finite-precision arithmetic in conjunction with explicitly using sigmoid activation

  • clamping of scores, e.g. by using a ReLU activation or similar.
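The following sketch illustrates the three supported rank definitions on a toy score vector with ties (values are hypothetical; higher scores are better):

import torch

# scores of five candidates; the true choice (index 2) ties with two others
scores = torch.tensor([0.3, 0.5, 0.5, 0.5, 0.1])
true_score = scores[2]

optimistic = 1 + (scores > true_score).sum().item()  # true choice first among ties -> 1
pessimistic = (scores >= true_score).sum().item()    # true choice last among ties -> 3
realistic = 0.5 * (optimistic + pessimistic)         # expected rank over tie orders -> 2.0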

Ranking Sidedness

Besides the different rank definitions, PyKEEN also reports scores for the individual prediction sides.

Side    Explanation
head    The rank-based metric evaluated only for the head / left-side prediction.
tail    The rank-based metric evaluated only for the tail / right-side prediction.
both    The rank-based metric evaluated on both predictions.

In publications, usually only "both" is reported. However, the side-specific scores can give interesting insights, such as a difference in difficulty between head and tail prediction, or a model's inability to solve one of the two tasks.

Ranking Aggregation Scope

Real-world graphs are often scale-free, i.e., there are a few nodes/entities with a high degree, often called hubs, while the majority of nodes have only a few neighbors. This also impacts the evaluation triples: since hub nodes occur in a large number of triples, they are also more likely to be part of evaluation triples. Thus, performing well on triples containing hub entities contributes strongly to the overall performance.

As an example, we can inspect the pykeen.datasets.WD50KT dataset, where a single (relation, tail)-combination, (“instance of”, “human”), is present in 699 evaluation triples.

from pykeen.datasets import get_dataset
dataset = get_dataset(dataset="wd50kt")
unique_relation_tail, counts = dataset.testing.mapped_triples[:, 1:].unique(return_counts=True, dim=0)
# c = 699
c = counts.max()
r, t = unique_relation_tail[counts.argmax()]
# https://www.wikidata.org/wiki/Q5 -> "human"
t = dataset.testing.entity_id_to_label[t.item()]
# https://www.wikidata.org/wiki/Property:P31 -> "instance of"
r = dataset.testing.relation_id_to_label[r.item()]

There are arguments for wanting these entities to have a strong effect on evaluation: since they occur often, they are seemingly important, and evaluation should reflect that. However, sometimes we do not want this effect, but rather want to measure performance evenly across nodes. A similar phenomenon exists in multi-class classification with imbalanced classes, where frequent classes can dominate performance measures. In a similar vein to the macro \(F_1\)-score (cf. sklearn.metrics.f1_score()) known from that area, PyKEEN implements a pykeen.evaluation.MacroRankBasedEvaluator, which weights triples such that each unique ranking task, e.g., a (head, relation)-pair for tail prediction, contributes evenly.

Technically, we solve this by implementing variants of the existing rank-based metrics which support weighting individual ranks differently. The evaluator computes weights inversely proportional to the frequency of the "query" part of the ranking task, e.g., the (head, relation)-pair for tail prediction.
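The following sketch illustrates the weighting idea on hypothetical query ids; it is a conceptual illustration, not PyKEEN's internal implementation:

import numpy as np

# query ids of five tail-prediction ranking tasks; query 0 occurs three times
query_ids = np.array([0, 0, 0, 1, 2])
_, inverse, counts = np.unique(query_ids, return_inverse=True, return_counts=True)

# inverse-frequency weights: each unique query contributes a total weight of one
weights = 1.0 / counts[inverse]  # -> [1/3, 1/3, 1/3, 1.0, 1.0]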

Filtering

The rank-based evaluation allows using the "filtered setting" proposed by [bordes2013], which is enabled by default. When evaluating tail prediction for a triple \((h, r, t)\), i.e. scoring all triples \((h, r, e)\), there may be additional known triples \((h, r, t')\) with \(t' \neq t\). If the model predicts a higher score for \((h, r, t')\), the rank will increase, and hence the measured model performance will decrease. However, giving \((h, r, t')\) a high score (and thus a low rank) is desirable, since it is a true triple as well. Thus, the filtered evaluation setting ignores the scores of all other known true triples \((h, r, t')\) when computing the rank for \((h, r, t)\).
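As a toy illustration of the effect on the rank (all scores are hypothetical):

import torch

scores = torch.tensor([0.9, 0.8, 0.7, 0.6])  # scores of all candidate tails for (h, r, ?)
true_idx = 2       # the evaluation triple's tail has score 0.7
other_known = [0]  # another known true tail (h, r, t') has score 0.9

# unfiltered rank: two candidates score higher -> rank 3
unfiltered_rank = 1 + (scores > scores[true_idx]).sum().item()

# filtered rank: mask the other known true triple before ranking -> rank 2
filtered_scores = scores.clone()
filtered_scores[other_known] = float("-inf")
filtered_rank = 1 + (filtered_scores > filtered_scores[true_idx]).sum().item()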

Below, we present the philosophy from [bordes2013] and how it is implemented in PyKEEN:

HPO Scenario

During training/optimization with pykeen.hpo.hpo_pipeline(), the set of known positive triples comprises the training and validation sets. After optimization is finished and the final evaluation is done, the set of known positive triples comprises the training, validation, and testing set. PyKEEN explicitly does not use test triples for filtering during HPO to avoid any test leakage.

Early Stopper Scenario

When early stopping is used during training, it periodically uses the validation set for calculating the loss and evaluation metrics. During this evaluation, the set of known positive triples comprises the training and validation sets. When final evaluation is done with the testing set, the set of known positive triples comprises the training, validation, and testing set. PyKEEN explicitly does not use test triples for filtering when early stopping is being used to avoid any test leakage.

Pipeline Scenario

During vanilla training with pykeen.pipeline.pipeline(), which involves no optimization, no early stopping, and no post-hoc choices using the validation set, the set of known positive triples comprises the training and testing sets. This scenario is very atypical; regardless, the filter should be augmented with the validation triples to make the results more comparable to published results that do not consider this scenario.

Custom Training Loops

In case the validation triples should not be filtered when evaluating the test dataset, the argument filter_validation_when_testing=False can be passed to either the pykeen.hpo.hpo_pipeline() or pykeen.pipeline.pipeline().

If you're rolling your own pipeline, keep the following in mind: in the filtered setting (filtered=True), the pykeen.evaluation.Evaluator will always use the evaluation set (regardless of whether it is the testing set or validation set) for filtering. Any other triples that should be filtered must be passed to additional_filter_triples in pykeen.evaluation.Evaluator.evaluate(). Typically, this minimally includes the training triples. With the [bordes2013] technique, where the testing set is used for evaluation, additional_filter_triples should include both the training triples and validation triples, as in the following example:

from pykeen.datasets import FB15k237
from pykeen.evaluation import RankBasedEvaluator
from pykeen.models import TransE

# Get FB15k-237 dataset
dataset = FB15k237()

# Define model
model = TransE(
    triples_factory=dataset.training,
)

# Train your model (code is omitted for brevity)
...

# Define evaluator
evaluator = RankBasedEvaluator(
    filtered=True,  # Note: this is True by default; we're just being explicit
)

# Evaluate your model on the testing triples, additionally
# filtering on the training and validation triples
results = evaluator.evaluate(
    model=model,
    mapped_triples=dataset.testing.mapped_triples,
    additional_filter_triples=[
        dataset.training.mapped_triples,
        dataset.validation.mapped_triples,
    ],
)

Entity and Relation Restriction

Sometimes, we are only interested in a certain set of entities and/or relations, \(\mathcal{E}_I \subset \mathcal{E}\) and \(\mathcal{R}_I \subset \mathcal{R}\) respectively, but have additional information available in the form of triples with other entities/relations. As an example, say we would like to predict whether an actor stars in a movie. Thus, we are only interested in the relation stars_in between entities which are actors/movies. However, we may have additional information available, e.g. who directed the movie or the movie's language, which may help in the prediction task. Thus, we would like to train the model on the full dataset, including all available relations and entities, but restrict the evaluation to the task we are aiming at.

In order to restrict the evaluation, we proceed as follows:

  1. We filter the evaluation triples \(\mathcal{T}_{eval}\) to contain only those triples which are of interest, i.e. \(\mathcal{T}_{eval}' = \{(h, r, t) \in \mathcal{T}_{eval} \mid h, t \in \mathcal{E}_I, r \in \mathcal{R}_I\}\)

  2. During tail prediction/evaluation for a triple \((h, r, t)\), we restrict the candidate tail entities \(t'\) to \(t' \in \mathcal{E}_I\). Similarly, for head prediction/evaluation, we restrict the candidate head entities \(h'\) to \(h' \in \mathcal{E}_I\).

Example

The pykeen.datasets.Hetionet dataset is a biomedical knowledge graph containing drugs, genes, diseases, other biological entities, and their interrelations. It was described by Himmelstein et al. in Systematic integration of biomedical knowledge prioritizes drugs for repurposing to support drug repositioning, a task that translates to link prediction between drug and disease nodes.

The edges in the graph are listed here, but we will focus on only the compound treats disease (CtD) and compound palliates disease (CpD) relations during evaluation. This can be done with the following:

from pykeen.pipeline import pipeline

evaluation_relation_whitelist = {'CtD', 'CpD'}
pipeline_result = pipeline(
    dataset='Hetionet',
    model='RotatE',
    evaluation_relation_whitelist=evaluation_relation_whitelist,
)

By restricting evaluation to the edges of interest, models more appropriate for drug repositioning can be identified during hyper-parameter optimization instead of models that are good at predicting all types of relations. The HPO pipeline accepts the same arguments:

from pykeen.hpo import hpo_pipeline

evaluation_relation_whitelist = {'CtD', 'CpD'}
hpo_pipeline_result = hpo_pipeline(
    n_trials=30,
    dataset='Hetionet',
    model='RotatE',
    evaluation_relation_whitelist=evaluation_relation_whitelist,
)

Optimizing a Model’s Hyper-parameters

The easiest way to optimize a model is with the pykeen.hpo.hpo_pipeline() function.

All of the following examples are about getting the best model when training pykeen.models.TransE on the pykeen.datasets.Nations dataset. Each gives a bit of insight into usage of the hpo_pipeline() function.

The minimal usage of the hyper-parameter optimization is to specify the dataset, the model, and how much to run. The following example shows how to optimize the TransE model on the Nations dataset a given number of times using the n_trials argument.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
... )

Alternatively, the timeout can be set. In the following example, as many trials as possible will be run in 60 seconds.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     timeout=60,
...     dataset='Nations',
...     model='TransE',
... )

The hyper-parameter optimization pipeline has the ability to optimize hyper-parameters for the corresponding *_kwargs arguments in the pykeen.pipeline.pipeline():

  • model

  • loss

  • regularizer

  • optimizer

  • negative_sampler

  • training

Defaults

Each component's hyper-parameters have reasonable default values. For example, every model in PyKEEN has defaults for its hyper-parameters chosen from the best-reported values in the model's original paper, unless otherwise stated on the model's reference page. In cases where hyper-parameters for a model on a specific dataset were not available, we chose the hyper-parameters based on the findings of our large-scale benchmarking [ali2020a]. For most components (e.g., models, losses, regularizers, negative samplers, training loops), these values are stored as the default values of the respective classes' __init__() functions. They can be viewed in the corresponding reference section of the docs.
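For example, you can inspect these defaults with the standard library's inspect module:

>>> import inspect
>>> from pykeen.models import TransE
>>> print(inspect.signature(TransE.__init__))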

Some components contain strategies for doing hyper-parameter optimization. When you call pykeen.hpo.hpo_pipeline(), the following steps are taken to determine what happens for each hyper-parameter in each component:

  1. If an explicit value was passed, use it.

  2. If no explicit value was passed and an HPO strategy was passed, use the explicit strategy.

  3. If no explicit value was passed and no HPO strategy was passed and there is a default HPO strategy, use the default strategy.

  4. If no explicit value was passed, no HPO strategy was passed, and there is no default HPO strategy, use the default hyper-parameter value.

  5. If no explicit value was passed, no HPO strategy was passed, there is no default HPO strategy, and there is no default hyper-parameter value, raise a TypeError.

For example, the TransE model's default HPO strategy for its embedding_dim argument is to search within \([16, 256]\) with a step size of 16, and its \(l_p\) norm is set to search over \(\{1, 2\}\). The embedding_dim strategy is overridden with the fixed value 50 in the following code:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     model_kwargs=dict(embedding_dim=50),
... )

The strategy can be explicitly overridden with:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     model_kwargs_ranges=dict(
...         embedding_dim=dict(type=int, low=16, high=256, step=32),
...     ),
... )

Each model, loss, regularizer, negative sampler, and training loop specifies a class variable called hpo_default, a dictionary containing all of the default strategies. Its keys match the arguments of the respective __init__() functions.

Since optimizers aren't re-implemented in PyKEEN, there's a specific dictionary at pykeen.optimizers.optimizers_hpo_defaults containing their strategies. It's debatable whether you should optimize the optimizers (yo dawg), so you can always choose to set the learning rate lr to a constant value.

Strategies

An HPO strategy is a Python dict with a type key corresponding to a categorical variable, boolean variable, integer variable, or floating point number variable. The value itself for type should be one of the following:

  1. "categorical"

  2. bool or "bool"

  3. int or "int"

  4. float or "float"

Several strategies can be grouped together in a dictionary where the key is the name of the hyper-parameter for the component in the *_kwargs_ranges arguments to the HPO pipeline.

Categorical

The only additional key used inside a categorical strategy is choices. For example, if you want to choose between Kullback-Leibler divergence and expected likelihood as the similarity used in the KG2E model, you can write a strategy like:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='KG2E',
...     model_kwargs_ranges=dict(
...         dist_similarity=dict(type='categorical', choices=['KL', 'EL']),
...     ),
... )

Boolean

The boolean variable actually doesn’t need any extra keys besides the type, so a strategy for a boolean variable always looks like dict(type='bool'). Under the hood, this is automatically translated to a categorical variable with choices=[True, False].

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler_kwargs_ranges=dict(
...         filtered=dict(type=bool),
...     ),
... )

Integers and Floating Point Numbers

The integer and floating point number strategies share several aspects. Both require a low and high entry like in dict(type=float, low=0.0, high=1.0) or dict(type=int, low=1, high=10).

Linear Scale

By default, you don't need to specify a scale, but you can be explicit by setting scale='linear'. This behavior should be self-explanatory: there is no rescaling, and you get a uniform distribution within the bounds specified by the low and high arguments. This applies to both type=int and type=float. The following example uniformly chooses from \([1, 100]\):

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler_kwargs_ranges=dict(
...         num_negs_per_pos=dict(type=int, low=1, high=100),
...     ),
... )
Power Scale (type=int only)

The power scale was originally implemented as scale='power_two' to support pykeen.models.ConvE's output_channels parameter. However, using two as a base is a bit limiting, so we also implemented a more general scale='power' where you can set the base. Here's an example that optimizes the number of negatives per positive using base=10:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler_kwargs_ranges=dict(
...         num_negs_per_pos=dict(type=int, scale='power', base=10, low=0, high=2),
...     ),
... )

The power scale can only be used with type=int, not bool, categorical, or float. I like this scale because it can quickly discretize a large search space. In this example, you will get [10**0, 10**1, 10**2] as choices, from which one is picked uniformly.

Logarithmic Reweighting

The evil twin to the power scale is logarithmic reweighting on the linear scale. This is applicable to type=int and type=float. Rather than changing the choices themselves, the log scale uses Optuna's built-in log functionality to reassign the probabilities uniformly over the log-transformed distribution. The same example as above could be accomplished with:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler_kwargs_ranges=dict(
...         num_negs_per_pos=dict(type=int, low=1, high=100, log=True),
...     ),
... )

but this time, the range is not discretized. However, you're just as likely to pick from \([1, 10]\) as from \([10, 100]\).

Stepping

With the linear scale, you can specify the step size. This discretizes the distribution in linear space, so if you want to pick from \(10, 20, ... 100\), you can do:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler_kwargs_ranges=dict(
...         num_negs_per_pos=dict(type=int, low=10, high=100, step=10),
...     ),
... )

This also works with logarithmic reweighting, since the values are still technically on a linear scale, just with probabilities reweighted logarithmically. So now you'd pick from \([10]\) or from \([20, 30, 40, ..., 100]\) with the same probability:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     training_loop='sLCWA',
...     negative_sampler_kwargs_ranges=dict(
...         num_negs_per_pos=dict(type=int, low=10, high=100, step=10, log=True),
...     ),
... )

Custom Strategies

While the default values for hyper-parameters are encoded with the Python syntax for default values in each model's __init__() function, the ranges/scales can be found in the class variable pykeen.models.Model.hpo_default. For example, the range for TransE's embedding dimension is set to optimize between 50 and 350 at increments of 25 in pykeen.models.TransE.hpo_default. TransE also has a scoring function norm that will be optimized by a categorical selection of {1, 2} by default.
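Since hpo_default is a plain class-level dictionary, you can inspect a model's pre-defined strategies directly:

>>> from pykeen.models import TransE
>>> for name, strategy in TransE.hpo_default.items():
...     print(name, strategy)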

Note

These hyper-parameter ranges were chosen as reasonable defaults for the benchmark datasets FB15k-237 / WN18RR. When using different datasets, the ranges might be suboptimal.

All hyper-parameters defined in the hpo_default of your chosen model will be optimized by default. If you already have a value that you’re happy with for one of them, you can specify it with the model_kwargs attribute. In the following example, the embedding_dim for a TransE model is fixed at 200, while the rest of the parameters will be optimized using the pre-defined HPO strategies in the model. For TransE, that means that the scoring function norm will be optimized as 1 or 2.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     model='TransE',
...     model_kwargs=dict(
...         embedding_dim=200,
...     ),
...     dataset='Nations',
...     n_trials=30,
... )

If you would like to set your own HPO strategy for the model’s hyperparameters, you can do so with the model_kwargs_ranges argument. In the example below, the embeddings are searched over a larger range (low and high), but with a higher step size (q), such that 100, 200, 300, 400, and 500 are searched.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     model_kwargs_ranges=dict(
...         embedding_dim=dict(type=int, low=100, high=500, q=100),
...     ),
... )

Warning

If the given range is not divisible by the step size, then the upper bound will be omitted.

If you want to optimize the entity initializer, you can use type='categorical', which requires a choices=[...] key with a list of choices. This works for strings, integers, floats, etc.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     model_kwargs_ranges=dict(
...         entity_initializer=dict(type='categorical', choices=[
...             'xavier_uniform',
...             'xavier_uniform_norm',
...             'uniform',
...         ]),
...     ),
... )

The same could be used for constrainers, normalizers, and regularizers over both entities and relations. However, different models might have different names for the initializer, normalizer, constrainer and regularizer since there could be multiple representations for either the entity, relation, or both. Check your desired model’s documentation page for the kwargs that you can optimize over.

Keys of pykeen.nn.representation.initializers can be passed as initializers by string, and keys of pykeen.nn.representation.constrainers can be passed as constrainers by string.
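To see which strings are available, you can list the keys of these dictionaries:

>>> from pykeen.nn.representation import constrainers, initializers
>>> print(sorted(initializers))
>>> print(sorted(constrainers))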

The HPO pipeline does not support optimizing over the hyper-parameters for each initializer. If you are interested in this, consider rolling your own ablation study pipeline.

Optimizing the Loss

While each model has its own default loss, you can explicitly specify a loss the same way as in pykeen.pipeline.pipeline().

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     loss='MarginRankingLoss',
... )

As stated in the documentation for pykeen.pipeline.pipeline(), each model specifies its own default loss function in pykeen.models.Model.loss_default. For example, the TransE model defines the margin ranking loss as its default in pykeen.models.TransE.loss_default.

Each model also specifies default hyper-parameters for the loss function in pykeen.models.Model.loss_default_kwargs. For example, DistMultLiteral explicitly sets the margin to 0.0 in pykeen.models.DistMultLiteral.loss_default_kwargs.

Unlike the model's hyper-parameters, the models don't store the strategies for optimizing the loss functions' hyper-parameters. These pre-configured strategies are stored in the loss function's class variable pykeen.losses.Loss.hpo_default.

However, similarly to how you would specify model_kwargs_ranges, you can specify loss_kwargs_ranges explicitly, as in the following example.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     loss='MarginRankingLoss',
...     loss_kwargs_ranges=dict(
...         margin=dict(type=float, low=1.0, high=2.0),
...     ),
... )

Optimizing the Negative Sampler

When the stochastic local closed world assumption (sLCWA) training approach is used for training, a negative sampler (subclass of pykeen.sampling.NegativeSampler) is chosen. Each has a strategy stored in pykeen.sampling.NegativeSampler.hpo_default.

Like models and regularizers, the rules are the same for specifying negative_sampler, negative_sampler_kwargs, and negative_sampler_kwargs_ranges.

Optimizing the Optimizer

Yo dawg, I heard you liked optimization, so we put an optimizer around your optimizer so you can optimize while you optimize. Since all optimizers used in PyKEEN come from the PyTorch implementations, they obviously do not have an hpo_default class variable. Instead, every optimizer has a default optimization strategy stored in pykeen.optimizers.optimizers_hpo_defaults, the same way that the default strategies for losses are stored externally.
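As with other components, you can override an optimizer's strategy via optimizer_kwargs_ranges, e.g., to search the learning rate on a log scale (a sketch following the conventions above):

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     optimizer='SGD',
...     optimizer_kwargs_ranges=dict(
...         lr=dict(type=float, low=1.0e-4, high=1.0e-1, log=True),
...     ),
... )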

Optimizing the Optimized Optimizer - a.k.a. Learning Rate Schedulers

If optimizing your optimizer doesn't cut it for you, you can turn it up a notch and use learning rate schedulers (lr_scheduler) that vary the learning rate of the optimizer. This can be useful, e.g., to use a more aggressive learning rate in the beginning to make quick progress, while lowering the learning rate over time to allow the model to smoothly converge to the optimum.

PyKEEN allows you to use the learning rate schedulers provided by PyTorch, which you can simply specify as you would in the pykeen.pipeline.pipeline().

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     lr_scheduler='ExponentialLR',
... )
>>> hpo_pipeline_result.save_to_directory('nations_transe')

In the same way that the optimizers don't come with hpo_default class variables, lr_schedulers rely on their own optimization strategies, provided in pykeen.lr_schedulers.lr_schedulers_hpo_defaults. In case you are ready to explore even more, you can of course also set your own ranges with the lr_scheduler_kwargs_ranges keyword argument, as in:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     dataset='Nations',
...     model='TransE',
...     lr_scheduler='ExponentialLR',
...     lr_scheduler_kwargs_ranges=dict(
...         gamma=dict(type=float, low=0.8, high=1.0),
...     ),
... )
>>> hpo_pipeline_result.save_to_directory('nations_transe')

Optimizing Everything Else

Likewise, the following arguments to pykeen.pipeline.pipeline() have corresponding *_kwargs and *_kwargs_ranges:

  • training_loop (only kwargs, not kwargs_ranges)

  • evaluator

  • evaluation

Early Stopping

Early stopping can be baked directly into the Optuna optimization.

The important keys are stopper='early' and stopper_kwargs. When using early stopping, the hpo_pipeline() automatically takes care of adding appropriate callbacks to interface with Optuna.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     stopper='early',
...     stopper_kwargs=dict(frequency=5, patience=2, relative_delta=0.002),
... )

These stopper kwargs were chosen to make the example run faster. You will likely want to use different ones.

Configuring Optuna

Choosing a Search Algorithm

Because PyKEEN’s hyper-parameter optimization pipeline is powered by Optuna, it can directly use all of Optuna’s built-in samplers listed on optuna.samplers or any custom subclass of optuna.samplers.BaseSampler.

By default, PyKEEN uses the Tree-structured Parzen Estimator (TPE; optuna.samplers.TPESampler), a probabilistic search algorithm. You can explicitly set the sampler using the sampler argument (not to be confused with the negative sampler used when training under the sLCWA):

>>> from pykeen.hpo import hpo_pipeline
>>> from optuna.samplers import TPESampler
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     sampler=TPESampler,
...     dataset='Nations',
...     model='TransE',
... )

You can alternatively pass a string so you don’t have to worry about importing Optuna. PyKEEN knows that sampler classes always end in “Sampler” so you can pass either “TPE” or “TPESampler” as a string. This is case-insensitive.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     sampler="tpe",
...     dataset='Nations',
...     model='TransE',
... )

It’s also possible to pass a sampler instance directly:

>>> from pykeen.hpo import hpo_pipeline
>>> from optuna.samplers import TPESampler
>>> sampler = TPESampler(prior_weight=1.1)
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     sampler=sampler,
...     dataset='Nations',
...     model='TransE',
... )

If you’re working in a JSON-based configuration setting, you won’t be able to instantiate the sampler with your desired settings like this. As a solution, you can pass the keyword arguments via the sampler_kwargs argument in combination with specifying the sampler as a string/class to the HPO pipeline like in:

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     sampler="tpe",
...     sampler_kwargs=dict(prior_weight=1.1),
...     dataset='Nations',
...     model='TransE',
... )

To emulate most hyper-parameter optimizations that have used random sampling, use optuna.samplers.RandomSampler like in:

>>> from pykeen.hpo import hpo_pipeline
>>> from optuna.samplers import RandomSampler
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     sampler=RandomSampler,
...     dataset='Nations',
...     model='TransE',
... )

Grid search can be performed using optuna.samplers.GridSampler. Notice that this sampler expects an additional search_space argument in its sampler_kwargs, e.g.,

>>> from pykeen.hpo import hpo_pipeline
>>> from optuna.samplers import GridSampler
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     sampler=GridSampler,
...     sampler_kwargs=dict(
...         search_space={
...             "model.embedding_dim": [32, 64, 128],
...             "model.scoring_fct_norm": [1, 2],
...             "loss.margin": [1.0],
...             "optimizer.lr": [1.0e-03],
...             "negative_sampler.num_negs_per_pos": [32],
...             "training.num_epochs": [100],
...             "training.batch_size": [128],
...         },
...     ),
...     dataset='Nations',
...     model='TransE',
... )

Also notice that the search space of a grid search grows quickly with the number of studied hyper-parameters, so grid search is less efficient than other search strategies at finding good configurations, cf. https://jmlr.csail.mit.edu/papers/v13/bergstra12a.html.

Full Examples

The examples above have shown the permutation of one setting at a time. This section has some more complete examples.

The following example sets the optimizer, loss, training, negative sampling, evaluation, and early stopping settings.

>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
...     n_trials=30,
...     dataset='Nations',
...     model='TransE',
...     model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
...     optimizer='SGD',
...     optimizer_kwargs=dict(lr=0.01),
...     loss='marginranking',
...     loss_kwargs=dict(margin=1),
...     training_loop='slcwa',
...     training_kwargs=dict(num_epochs=100, batch_size=128),
...     negative_sampler='basic',
...     negative_sampler_kwargs=dict(num_negs_per_pos=1),
...     evaluator_kwargs=dict(filtered=True),
...     evaluation_kwargs=dict(batch_size=128),
...     stopper='early',
...     stopper_kwargs=dict(frequency=5, patience=2, relative_delta=0.002),
... )

If you have the configuration as a dictionary:

>>> from pykeen.hpo import hpo_pipeline_from_config
>>> config = {
...     'optuna': dict(
...         n_trials=30,
...     ),
...     'pipeline': dict(
...         dataset='Nations',
...         model='TransE',
...         model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
...         optimizer='SGD',
...         optimizer_kwargs=dict(lr=0.01),
...         loss='marginranking',
...         loss_kwargs=dict(margin=1),
...         training_loop='slcwa',
...         training_kwargs=dict(num_epochs=100, batch_size=128),
...         negative_sampler='basic',
...         negative_sampler_kwargs=dict(num_negs_per_pos=1),
...         evaluator_kwargs=dict(filtered=True),
...         evaluation_kwargs=dict(batch_size=128),
...         stopper='early',
...         stopper_kwargs=dict(frequency=5, patience=2, relative_delta=0.002),
...     )
... }
>>> hpo_pipeline_result = hpo_pipeline_from_config(config)

If you have a configuration (in the same format) in a JSON file:

>>> import json
>>> from pykeen.hpo import hpo_pipeline_from_path
>>> config = {
...     'optuna': dict(
...         n_trials=30,
...     ),
...     'pipeline': dict(
...         dataset='Nations',
...         model='TransE',
...         model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
...         optimizer='SGD',
...         optimizer_kwargs=dict(lr=0.01),
...         loss='marginranking',
...         loss_kwargs=dict(margin=1),
...         training_loop='slcwa',
...         training_kwargs=dict(num_epochs=100, batch_size=128),
...         negative_sampler='basic',
...         negative_sampler_kwargs=dict(num_negs_per_pos=1),
...         evaluator_kwargs=dict(filtered=True),
...         evaluation_kwargs=dict(batch_size=128),
...         stopper='early',
...         stopper_kwargs=dict(frequency=5, patience=2, relative_delta=0.002),
...     )
... }
>>> with open('config.json', 'w') as file:
...     json.dump(config, file, indent=2)
>>> hpo_pipeline_result = hpo_pipeline_from_path('config.json')

Running an Ablation Study

Do you want to find out which loss function and training approach are best suited for your interaction model (model architecture)? Then performing an ablation study is the way to go!

In general, an ablation study is a set of experiments in which components of a machine learning system are removed/replaced in order to measure the impact of these components on the system's performance. In the context of knowledge graph embedding models, typical ablation studies involve investigating different loss functions, training approaches, negative samplers, and the explicit modeling of inverse relations. For a specific model composition based on these components, the best set of hyper-parameter values, e.g., embedding dimension, learning rate, batch size, and loss function-specific hyper-parameters such as the margin value in the margin ranking loss, needs to be determined. This is accomplished by a process called hyper-parameter optimization, for which different approaches have been proposed; random search and grid search are among the most popular.

In PyKEEN, we can define and execute an ablation study within our own program or from the command line interface using a configuration file (file_name.json).

First, we show how to run an ablation study within your program. For this purpose, we provide the function pykeen.ablation.ablation_pipeline(), which requires the datasets, models, losses, optimizers, training_loops, and directory arguments. These define the datasets, models, loss functions, optimizers (e.g., Adam), and training approaches for our ablation study, as well as the output directory in which the experimental artifacts should be saved. In the following, we define an ablation study for pykeen.models.ComplEx over the pykeen.datasets.Nations dataset in order to assess the effect of different loss functions (in our example, the binary cross entropy loss and the margin ranking loss) and the effect of explicitly modeling inverse relations.

Now, let’s start with defining the minimal requirements, i.e., the dataset(s), interaction model(s), the loss function(s), training approach(es), and the optimizer(s) in order to run the ablation study.

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex01_minimal"
>>> ablation_pipeline(
...     directory=directory,
...     models=["ComplEx"],
...     datasets=["Nations"],
...     losses=["BCEAfterSigmoidLoss", "MarginRankingLoss"],
...     training_loops=["LCWA"],
...     optimizers=["Adam"],
...     # The following are not part of minimal configuration, but are necessary
...     # for demonstration/doctests. You should make these numbers bigger when
...     # you're using PyKEEN's ablation framework
...     epochs=1,
...     n_trials=1,
... )

We can provide arbitrary additional information about our study with the metadata keyword. Some keys, such as title, are special and are used by PyKEEN and Optuna.

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex02_metadata"
>>> ablation_pipeline(
...     directory=directory,
...     models=["ComplEx"],
...     datasets=["Nations"],
...     losses=["BCEAfterSigmoidLoss", "MarginRankingLoss"],
...     training_loops=["LCWA"],
...     optimizers=["Adam"],
...     # Add metadata with:
...     metadata=dict(
...         title="Ablation Study Over Nations for ComplEx.",
...     ),
...     # Fast testing configuration, make bigger in prod
...     epochs=1,
...     n_trials=1,
... )

As mentioned above, we also want to measure the effect of explicitly modeling inverse relations on the model’s performance. Therefore, we extend the ablation study by including the create_inverse_triples argument:

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex03_inverse"
>>> ablation_pipeline(
...     directory=directory,
...     models=["ComplEx"],
...     datasets=["Nations"],
...     losses=["BCEAfterSigmoidLoss"],
...     training_loops=["LCWA"],
...     optimizers=["Adam"],
...     # Add inverse triples with
...     create_inverse_triples=[True, False],
...     # Fast testing configuration, make bigger in prod
...     epochs=1,
...     n_trials=1,
... )

Note

Unlike models, datasets, losses, training_loops, and optimizers, create_inverse_triples has a default value, which is False.

If there is only one value for the models, datasets, losses, training_loops, optimizers, or create_inverse_triples argument, it can be given as a single value instead of a list.

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex04_terse_kwargs"
>>> ablation_pipeline(
...     directory=directory,
...     models="ComplEx",
...     datasets="Nations",
...     losses=["BCEAfterSigmoidLoss", "MarginRankingLoss"],
...     training_loops="LCWA",
...     optimizers="Adam",
...     create_inverse_triples=[True, False],
...     # Fast testing configuration, make bigger in prod
...     epochs=1,
...     n_trials=1,
... )

Note

It doesn’t make sense to run an ablation study if all of these values are fixed.

For each component of a knowledge graph embedding model (KGEM) that requires hyper-parameters, i.e., the interaction model, loss function, and training approach, PyKEEN provides default hyper-parameter optimization (HPO) ranges. Therefore, the definition of our ablation study would be complete at this stage. However, because hyper-parameter ranges are dataset-dependent, users can and should define their own HPO ranges; we show how to accomplish this later. To finalize the ablation study, we recommend defining early stopping, which is done as follows:

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex05_stopper"
>>> ablation_pipeline(
...     directory=directory,
...     models=["ComplEx"],
...     datasets=["Nations"],
...     losses=["BCEAfterSigmoidLoss", "MarginRankingLoss"],
...     training_loops=["LCWA"],
...     optimizers=["Adam"],
...     stopper="early",
...     stopper_kwargs={
...         "frequency": 5,
...         "patience": 20,
...         "relative_delta": 0.002,
...         "metric": "hits@10",
...     },
...     # Fast testing configuration, make bigger in prod
...     epochs=1,
...     n_trials=1,
... )

We define the early stopper using the argument stopper and provide instantiation arguments to it through stopper_kwargs. We specify that the early stopper should evaluate on the validation set every 5 epochs, with a patience of 20 epochs. In order to continue training, we expect the model to obtain an improvement greater than 0.2% in Hits@10.

After defining the ablation study, we need to define the HPO settings for each experiment within it. Remember that for each ablation experiment, we perform an HPO in order to determine the best hyper-parameters for the currently investigated model. PyKEEN uses Optuna as its HPO framework. Again, we provide default values for the Optuna-related arguments, but they define a very limited HPO search that is meant for testing purposes. Therefore, we define the arguments required by Optuna ourselves:

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex06_optuna_kwargs"
>>> ablation_pipeline(
...     directory=directory,
...     models="ComplEx",
...     datasets="Nations",
...     losses=["BCEAfterSigmoidLoss", "MarginRankingLoss"],
...     training_loops="LCWA",
...     optimizers="Adam",
...     # Fast testing configuration, make bigger in prod
...     epochs=1,
...     # Optuna-related arguments
...     n_trials=2,
...     timeout=300,
...     metric="hits@10",
...     direction="maximize",
...     sampler="random",
...     pruner="nop",
... )

Using the argument n_trials, we set the number of HPO iterations for each experiment to 2, and we set a timeout of 300 seconds (the HPO is terminated after n_trials trials or timeout seconds, whichever occurs first). We also set the metric to optimize, define whether it should be maximized or minimized using the argument direction, select random search as the HPO algorithm using the argument sampler, and finally specify that we do not use a pruner to prune unpromising trials (note that we use early stopping instead).

To measure the variance in performance, we can additionally define how often we want to re-train and re-evaluate the best model of each ablation-experiment using the argument best_replicates:

>>> from pykeen.ablation import ablation_pipeline
>>> directory = "doctests/ablation/ex5"
>>> ablation_pipeline(
...     directory=directory,
...     models=["ComplEx"],
...     datasets=["Nations"],
...     losses=["BCEAfterSigmoidLoss", "MarginRankingLoss"],
...     training_loops=["LCWA"],
...     optimizers=["Adam"],
...     create_inverse_triples=[True, False],
...     stopper="early",
...     stopper_kwargs={
...         "frequency": 5,
...         "patience": 20,
...         "relative_delta": 0.002,
...         "metric": "hits@10",
...     },
...     # Fast testing configuration, make bigger in prod
...     epochs=1,
...     # Optuna-related arguments
...     n_trials=2,
...     timeout=300,
...     metric="hits@10",
...     direction="maximize",
...     sampler="random",
...     pruner="nop",
...     best_replicates=5,
... )

Eager to check out the results? Then navigate to your output directory path/to/output/directory. Within it, you will find subdirectories, e.g., 0000_nations_complex, each of which contains all experimental artifacts of one specific ablation experiment of the defined ablation study. The most relevant subdirectory is best_pipeline, which comprises the artifacts of the best-performing experiment, including its definition in pipeline_config.json, the obtained results, and the trained model(s) in the subdirectory replicates. The number of replicates in replicates corresponds to the number provided through the best_replicates argument (or -r on the command line). Additionally, you are provided with further information about the ablation study in the root directory: study.json describes the ablation experiment, hpo_config.json describes its HPO setting, and trials.tsv provides an overview of each HPO trial.

Define Your Own HPO Ranges

As mentioned above, we provide default hyper-parameter values/ranges for each hyper-parameter. However, these defaults do not ensure good performance. Therefore, it is time to define your own ranges, and we show you how to do it! For the definition of hyper-parameter values/ranges, two kinds of dictionaries are essential: kwargs, which assigns fixed values to hyper-parameters, and kwargs_ranges, which defines ranges of values from which to sample.

Let’s start with assigning HPO ranges to hyper-parameters belonging to the interaction model. This can be achieved by using the dictionary model_to_model_kwargs_ranges:

...

# Define HPO ranges
>>> model_to_model_kwargs_ranges = {
...    "ComplEx": {
...        "embedding_dim": {
...            "type": "int",
...            "low": 4,
...            "high": 6,
...            "scale": "power_two"
...        }
...    }
... }

...

We defined an HPO range for the embedding dimension. Because the scale is power_two with a lower bound (low) of 4 and an upper bound (high) of 6, the embedding dimension is sampled from the set \(\{2^4, 2^5, 2^6\}\).
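In plain Python, this power_two range corresponds to the following candidate values:

>>> [2 ** exponent for exponent in range(4, 6 + 1)]
[16, 32, 64]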

Next, we fix the number of training epochs to 50 using the argument model_to_training_loop_to_training_kwargs, and define ranges for the label smoothing and the batch size using model_to_training_loop_to_training_kwargs_ranges. We use these two dictionaries because the defined hyper-parameters are hyper-parameters of the training function (which is a method of the training_loop):

...

>>> model_to_model_kwargs_ranges = {
...    "ComplEx": {
...        "embedding_dim": {
...            "type": "int",
...            "low": 4,
...            "high": 6,
...            "scale": "power_two"
...        }
...    }
... }

>>> model_to_training_loop_to_training_kwargs = {
...    "ComplEx": {
...        "lcwa": {
...            "num_epochs": 50
...        }
...    }
... }

>>> model_to_training_loop_to_training_kwargs_ranges= {
...    "ComplEx": {
...        "lcwa": {
...            "label_smoothing": {
...                "type": "float",
...                "low": 0.001,
...                "high": 1.0,
...                "scale": "log"
...            },
...            "batch_size": {
...                "type": "int",
...                "low": 7,
...                "high": 9,
...                "scale": "power_two"
...            }
...        }
...    }
... }

...

Finally, we define a range for the learning rate which is a hyper-parameter of the optimizer:

...

>>> model_to_model_kwargs_ranges = {
...    "ComplEx": {
...        "embedding_dim": {
...            "type": "int",
...            "low": 4,
...            "high": 6,
...            "scale": "power_two"
...        }
...    }
... }

>>> model_to_training_loop_to_training_kwargs = {
...    "ComplEx": {
...        "lcwa": {
...            "num_epochs": 50
...        }
...    }
... }

>>> model_to_training_loop_to_training_kwargs_ranges= {
...    "ComplEx": {
...        "lcwa": {
...            "label_smoothing": {
...                "type": "float",
...                "low": 0.001,
...                "high": 1.0,
...                "scale": "log"
...            },
...            "batch_size": {
...                "type": "int",
...                "low": 7,
...                "high": 9,
...                "scale": "power_two"
...            }
...        }
...     }
... }

>>> model_to_optimizer_to_optimizer_kwargs_ranges= {
...    "ComplEx": {
...        "adam": {
...            "lr": {
...                "type": "float",
...                "low": 0.001,
...                "high": 0.1,
...                "scale": "log"
...            }
...        }
...    }
... }

...

We decided to use Adam as an optimizer, and defined a log scale for the learning rate, i.e., the learning rate is sampled from the interval \([0.001, 0.1)\).

Now that we defined our own hyper-parameter values/ranges, let’s have a look at the overall configuration:

>>> from pykeen.ablation import ablation_pipeline
>>> metadata = dict(title="Ablation Study Over Nations for ComplEx.")
>>> models = ["ComplEx"]
>>> datasets = ["Nations"]
>>> losses = ["BCEAfterSigmoidLoss"]
>>> training_loops = ["lcwa"]
>>> optimizers = ["adam"]
>>> create_inverse_triples= [True, False]
>>> stopper = "early"
>>> stopper_kwargs = {
...    "frequency": 5,
...    "patience": 20,
...    "relative_delta": 0.002,
...    "metric": "hits@10",
... }

# Define HPO ranges
>>> model_to_model_kwargs_ranges = {
...    "ComplEx": {
...        "embedding_dim": {
...            "type": "int",
...            "low": 4,
...            "high": 6,
...            "scale": "power_two"
...        }
...    }
... }

>>> model_to_training_loop_to_training_kwargs = {
...    "ComplEx": {
...        "lcwa": {
...            "num_epochs": 50
...        }
...    }
... }

>>> model_to_training_loop_to_training_kwargs_ranges= {
...    "ComplEx": {
...        "lcwa": {
...            "label_smoothing": {
...                "type": "float",
...                "low": 0.001,
...                "high": 1.0,
...                "scale": "log"
...            },
...            "batch_size": {
...                "type": "int",
...                "low": 7,
...                "high": 9,
...                "scale": "power_two"
...            }
...        }
...    }
... }


>>> model_to_optimizer_to_optimizer_kwargs_ranges= {
...    "ComplEx": {
...        "adam": {
...            "lr": {
...                "type": "float",
...                "low": 0.001,
...                "high": 0.1,
...                "scale": "log"
...            }
...        }
...    }
... }

# Run ablation experiment
>>> ablation_pipeline(
...    models=models,
...    datasets=datasets,
...    losses=losses,
...    training_loops=training_loops,
...    optimizers=optimizers,
...    create_inverse_triples=create_inverse_triples,
...    stopper=stopper,
...    stopper_kwargs=stopper_kwargs,
...    model_to_model_kwargs_ranges=model_to_model_kwargs_ranges,
...    model_to_training_loop_to_training_kwargs=model_to_training_loop_to_training_kwargs,
...    model_to_training_loop_to_training_kwargs_ranges=model_to_training_loop_to_training_kwargs_ranges,
...    model_to_optimizer_to_optimizer_kwargs_ranges=model_to_optimizer_to_optimizer_kwargs_ranges,
...    metadata=metadata,
...    directory="doctests/ablation/ex6",
...    best_replicates=5,
...    n_trials=2,
...    timeout=300,
...    metric="hits@10",
...    direction="maximize",
...    sampler="random",
...    pruner="nop",
... )

We are required to provide the arguments datasets, models, losses, optimizers, and training_loops to pykeen.ablation.ablation_pipeline(). For all other components and hyper-parameters, PyKEEN provides default values/ranges. However, to achieve optimal performance, we should carefully define the hyper-parameter values/ranges ourselves, as explained above. Note that there are many more ranges to configure, e.g., for the hyper-parameters of the loss functions or the negative samplers. Check out the examples provided in tests/resources/hpo_complex_nations.json to see how to define the ranges for other components.

Run an Ablation Study With Your Own Data

We showed how to run an ablation study with a PyKEEN-integrated dataset. Now you may be asking yourself whether you can run ablation studies with your own data. Yes, you can! It requires only a minimal change compared to the previous configuration:

>>> datasets = [
...    {
...        "training": "/path/to/your/train.txt",
...        "validation": "/path/to/your/validation.txt",
...        "testing": "/path/to/your/test.txt"
...    }
... ]

In the datasets field, you don’t provide a list of dataset names but rather a list of dictionaries containing the paths to your train-validation-test splits.

Run an Ablation Study From The Command Line Interface

If you want to start an ablation study from the command line interface, we provide the function pykeen.experiments.cli.ablation(), which expects as an argument the path to a JSON configuration file. The configuration file consists of a dictionary with the sub-dictionaries ablation and optuna, in which the ablation study and the Optuna-related configuration are defined. In addition, as in the programmatic interface, a metadata dictionary can be provided. The configuration file corresponding to the ablation study that we previously defined within our program would look as follows:

{
    "metadata": {
        "title": "Ablation Study Over Nations for ComplEx."
    },
    "ablation": {
        "datasets": ["nations"],
        "models":   ["ComplEx"],
        "losses": ["BCEAfterSigmoidLoss", "CrossEntropyLoss"]
        "training_loops": ["lcwa"],
        "optimizers": ["adam"],
        "create_inverse_triples": [true,false],
        "stopper": "early"
        "stopper_kwargs": {
            "frequency": 5,
            "patience": 20,
            "relative_delta": 0.002,
            "metric": "hits@10"
        },
        "model_to_model_kwargs_ranges":{
            "ComplEx": {
                "embedding_dim": {
                    "type": "int",
                    "low": 4,
                    "high": 6,
                    "scale": "power_two"
                }
            }
        },
        "model_to_training_loop_to_training_kwargs": {
            "ComplEx": {
                "lcwa": {
                    "num_epochs": 50
                }
            }
        },
        "model_to_training_loop_to_training_kwargs_ranges": {
            "ComplEx": {
                "lcwa": {
                    "label_smoothing": {
                        "type": "float",
                        "low": 0.001,
                        "high": 1.0,
                        "scale": "log"
                    },
                    "batch_size": {
                        "type": "int",
                        "low": 7,
                        "high": 9,
                        "scale": "power_two"
                    }
                }
            }
        },
        "model_to_optimizer_to_optimizer_kwargs_ranges": {
            "ComplEx": {
                "adam": {
                    "lr": {
                        "type": "float",
                        "low": 0.001,
                        "high": 0.1,
                        "scale": "log"
                    }
                }
            }
        }
    },
    "optuna": {
        "n_trials": 2,
        "timeout": 300,
        "metric": "hits@10",
        "direction": "maximize",
        "sampler": "random",
        "pruner": "nop"
    }
}

The ablation study can be started as follows:

$ pykeen experiments ablation path/to/complex_nation.json -d path/to/output/directory

To re-train and re-evaluate the best model of each ablation experiment \(n\) times in order to measure the variance in performance, use the -r/--best-replicates option:

$ pykeen experiments ablation path/to/complex_nation.json -d path/to/output/directory -r 5

In this tutorial, we showed how to define and start an ablation study within your program and how to execute it from the command line interface. Furthermore, we showed how you can define an ablation study using your own data.

Performance Tricks

PyKEEN uses a combination of techniques to promote efficient calculations during training/evaluation and tries to maximize the utilization of the available hardware (currently focused on single GPU usage).

Entity and Relation IDs

Entities and relations in triples are usually stored as strings. Because KGEMs aim at learning vector representations for these entities and relations such that the chosen interaction function learns a useful scoring on top of them, we need a mapping from the string representations to vectors. Moreover, for computational efficiency, we would like to store all entity/relation embeddings in matrices. Thus, the mapping process comprises two parts: Mapping strings to IDs, and using the IDs to access the embeddings (=row indices).

In PyKEEN, the mapping process takes place in pykeen.triples.TriplesFactory. The triples factory maintains the sets of unique entity and relation labels and ensures that they are mapped to unique integer IDs on \([0, \text{num_unique_entities})\) for entities and \([0, \text{num_unique_relations})\) for relations. The mappings are respectively accessible via the attributes pykeen.triples.TriplesFactory.entity_label_to_id and pykeen.triples.TriplesFactory.relation_label_to_id.

To improve performance, the mapping process takes place only once, and the ID-based triples are stored in a tensor, pykeen.triples.TriplesFactory.mapped_triples.
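
As a small sketch of these attributes in action (assuming the Nations dataset, whose training triples contain the entity "uk" and the relation "exports3"):

from pykeen.datasets import Nations

tf = Nations().training
# string labels are mapped to contiguous integer IDs
print(tf.entity_label_to_id["uk"])          # an ID in [0, num_unique_entities)
print(tf.relation_label_to_id["exports3"])  # an ID in [0, num_unique_relations)
# the ID-based triples are stored once as a tensor of shape (num_triples, 3)
print(tf.mapped_triples.shape)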

Tuple Broadcasting

Interaction functions are usually only given for the standard case of scoring a single triple \((h, r, t)\). In PyKEEN, this function is implemented in the pykeen.models.base.Model.score_hrt() method of each model, e.g., pykeen.models.DistMult.score_hrt() for pykeen.models.DistMult. When training under the local closed world assumption (LCWA), evaluating a model, or performing the link prediction task, the goal is to score all entities/relations for a given tuple, i.e., \((h, r)\), \((r, t)\), or \((h, t)\). In these cases, a single tuple is used many times with different entities/relations.

For example, suppose we want to rank all entities for a single tuple \((h, r)\) with pykeen.models.DistMult on pykeen.datasets.FB15k237. This dataset contains 14,505 entities, which means that there are 14,505 \((h, r, t)\) combinations in which \(h\) and \(r\) are constant. Looking at the interaction function of pykeen.models.DistMult, we can observe that the \(h \odot r\) part causes half of the mathematical operations needed to calculate \(h \odot r \odot t\). Therefore, calculating the \(h \odot r\) part only once and reusing it spares us half of the mathematical operations for the other 14,504 remaining entities, making the calculations roughly twice as fast in total. The speed-up might be significantly higher in cases where the broadcasted part has a high relative complexity compared to the overall interaction function, e.g., pykeen.models.ConvE.

To make this technique possible, PyKEEN models have to provide an explicit broadcasting function via the following methods in the model class:

  • pykeen.models.base.Model.score_h() - Scoring all possible head entities for a given \((r, t)\) tuple

  • pykeen.models.base.Model.score_r() - Scoring all possible relations for a given \((h, t)\) tuple

  • pykeen.models.base.Model.score_t() - Scoring all possible tail entities for a given \((h, r)\) tuple

The PyKEEN architecture natively supports these methods and makes use of this technique wherever possible without any additional modifications. Providing these methods is completely optional and not required when implementing new models.
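
As a small sketch (training only briefly on the Nations dataset; the IDs below are arbitrary), the broadcasted methods accept ID-based batches and return one score per candidate:

import torch

from pykeen.pipeline import pipeline

result = pipeline(model="DistMult", dataset="Nations", training_kwargs=dict(num_epochs=1))
model = result.model

# score a single (h, r, t) triple; input shape: (1, 3), output shape: (1, 1)
hrt_scores = model.score_hrt(torch.as_tensor([[0, 0, 1]]))

# score all tail entities for a single (h, r) tuple in one broadcasted call;
# input shape: (1, 2), output shape: (1, num_entities)
t_scores = model.score_t(torch.as_tensor([[0, 0]]))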

Filtering with Index-based Masking

In this section, we are given a knowledge graph \(\mathcal{K} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}\) and a disjoint union of \(\mathcal{K}\) into training triples \(\mathcal{K}_{train}\), testing triples \(\mathcal{K}_{test}\), and validation triples \(\mathcal{K}_{val}\). The same operations are performed on \(\mathcal{K}_{test}\) and \(\mathcal{K}_{val}\), but only \(\mathcal{K}_{test}\) is used as the example in this section.

Two calculations are performed for each test triple \((h, r, t) \in \mathcal{K}_{test}\) during standard evaluation of a knowledge graph embedding model with interaction function \(f:\mathcal{E} \times \mathcal{R} \times \mathcal{E} \rightarrow \mathbb{R}\) for the link prediction task:

  1. \((h, r)\) is combined with all possible tail entities \(t' \in \mathcal{E}\) to make triples \(T_{h,r} = \{(h,r,t') \mid t' \in \mathcal{E}\}\)

  2. \((r, t)\) is combined with all possible head entities \(h' \in \mathcal{E}\) to make triples \(H_{r,t} = \{(h',r,t) \mid h' \in \mathcal{E}\}\)

Finally, the ranking of \((h, r, t)\) is calculated against all \((h, r, t') \in T_{h,r}\) and \((h', r, t) \in H_{r,t}\) triples with respect to the interaction function \(f\).

In the filtered setting, \(T_{h,r}\) must not contain tail entities leading to triples \((h, r, t') \in \mathcal{K}_{train}\), and \(H_{r,t}\) must not contain head entities leading to triples \((h', r, t) \in \mathcal{K}_{train}\), i.e., triples found in the train dataset. Therefore, their definitions can be amended as follows:

  • \(T^{\text{filtered}}_{h,r} = \{(h,r,t') \mid t' \in \mathcal{E}\} \setminus \mathcal{K}_{train}\)

  • \(H^{\text{filtered}}_{r,t} = \{(h',r,t) \mid h' \in \mathcal{E}\} \setminus \mathcal{K}_{train}\)

While this is easily defined in theory, it poses several practical challenges. For example, all new possible triples \((h, r, t') \in T_{h,r}\) and \((h', r, t) \in H_{r,t}\) must be enumerated and then checked for existence in \(\mathcal{K}_{train}\). Considering a dataset like pykeen.datasets.FB15k237 with almost 15,000 entities, each test triple \((h,r,t) \in \mathcal{K}_{test}\) leads to \(2 \cdot | \mathcal{E} | \approx 30{,}000\) possible new triples, which have to be checked against the train dataset and then removed.

To obtain very fast filtering, PyKEEN combines the techniques presented above in Entity and Relation IDs and Tuple Broadcasting with the following mechanism, which in our case has led to a 600,000-fold speed-up of the filtered evaluation compared to the mechanism used in previous versions.

As a starting point, PyKEEN will always compute scores for all triples in \(H_{r,t}\) and \(T_{h,r}\), even in the filtered setting. Because the number of positive triples is on average very low, few results have to be removed. Additionally, due to the technique presented in Tuple Broadcasting, scoring extra entities has a marginally low cost. Therefore, we start with the score vectors from pykeen.models.base.Model.score_t() for all triples \((h, r, t') \in T_{h,r}\) and from pykeen.models.base.Model.score_h() for all triples \((h', r, t) \in H_{r,t}\).

Next, the sparse filters \(\mathbf{f}_t \in \mathbb{B}^{| \mathcal{E}|}\) and \(\mathbf{f}_h \in \mathbb{B}^{| \mathcal{E}|}\) are created, which state which of the entities would lead to triples found in the train dataset. To achieve this, we rely on the technique presented in Entity and Relation IDs, i.e., all entity/relation IDs correspond to their exact position in the respective embedding tensor. As an example, we take the tuple \((h, r)\) from the test triple \((h, r, t) \in \mathcal{K}_{test}\) and are interested in all tail entities \(t'\) that should be removed from \(T_{h,r}\) in order to obtain \(T^{\text{filtered}}_{h,r}\). This is achieved by performing the following steps:

  1. Take \(r\) and compare it to the relations of all triples in the train dataset, leading to a boolean vector of the size of number of triples contained in the train dataset, being true where any triple had the relation \(r\)

  2. Take \(h\) and compare it to the head entities of all triples in the train dataset, leading to a boolean vector of the size of number of triples contained in the train dataset, being true where any triple had the head entity \(h\)

  3. Combine both boolean vectors, leading to a boolean vector of the size of number of triples contained in the train dataset, being true where any triple had both the head entity \(h\) and the relation \(r\)

  4. Convert the boolean vector to a non-zero index vector, stating at which indices the train dataset contains triples that contain both the head entity \(h\) and the relation \(r\), having the size of the number of non-zero elements

  5. The index vector is now applied on the tail entity column of the train dataset, returning all tail entity IDs \(t'\) that combined with \(h\) and \(r\) lead to triples contained in the train dataset

  6. Finally, the \(t'\) tail entity ID index vector is applied on the initially mentioned vector returned by pykeen.models.base.Model.score_t() for all possible triples \((h, r, t')\) and all affected scores are set to float('nan') following the IEEE-754 specification, which makes these scores non-comparable, effectively leading to the score vector for all possible novel triples \((h, r, t') \in T^{\text{filtered}}_{h,r}\).

\(H^{\text{filtered}}_{r,t}\) is obtained from \(H_{r,t}\) in a similar fashion.
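
The steps above translate almost directly into tensor operations. The following is a minimal sketch with a hypothetical toy training set, using a random vector as a stand-in for the output of pykeen.models.base.Model.score_t():

import torch

# hypothetical ID-based training triples, shape: (num_triples, 3)
train = torch.as_tensor([[0, 0, 1], [0, 0, 2], [1, 0, 2]])
num_entities = 3
h, r = 0, 0  # the (h, r) tuple from the test triple

# steps 1-3: boolean vector over all training triples matching both h and r
mask = (train[:, 0] == h) & (train[:, 1] == r)
# steps 4-5: tail entity IDs t' such that (h, r, t') is a training triple
known_tails = train[mask, 2]
# step 6: set the affected scores to NaN, making them non-comparable
scores = torch.randn(num_entities)  # stand-in for the model.score_t() output
scores[known_tails] = float("nan")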

Sub-batching & Slicing

With growing model and dataset sizes, the KGEM at hand is likely to exceed the memory provided by GPUs. Especially during training, it might be desirable to use a certain batch size even when it is too big for the hardware at hand. In this case, PyKEEN allows setting a sub-batch size in the range of \([1, \text{batch_size}]\). When the sub-batch size is set, PyKEEN automatically accumulates the gradients after each sub-batch and clears the computational graph during training. This makes it possible to train KGEMs that would otherwise be too big for the hardware at hand, while the obtained results are identical to training without sub-batching.

Note

Not all models support sub-batching, since certain components, e.g., batch normalization, require the entire batch to be calculated in one pass to avoid altering statistics; for such models, equivalent results could not be guaranteed.

Note

Sub-batching is sometimes also called gradient accumulation, e.g., by Hugging Face’s transformers library, since we accumulate the gradients over multiple sub-batches before updating the parameters.

For some large configurations, even after applying the sub-batching trick, out-of-memory errors may still occur. In this case, PyKEEN implements another technique, called slicing. Note that we often compute more than one score for each batch element: in sLCWA, we have \(1 + \text{num_negative_samples}\) scores, and in LCWA, we have \(\text{num_entities}\) scores for each batch element. In slicing, we do not compute all of these scores at once, but rather in smaller “batches”. For old-style models, i.e., those subclassing from pykeen.models.base._OldAbstractModel, slicing has to be implemented individually for each model. New-style models, i.e., those deriving from pykeen.models.nbase.ERModel, have a generic implementation enabling slicing for all interactions.

Note

Slicing computes the scores in smaller batches, but still needs to compute the gradient over all scores, since some loss functions require access to them.
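
Both options are passed through the training loop; the following is a minimal sketch via the pipeline (the values are placeholders, and sub-batching requires a model that supports it, see the note above):

from pykeen.pipeline import pipeline

result = pipeline(
    dataset="Nations",
    model="TransE",
    training_loop="lcwa",
    training_kwargs=dict(
        batch_size=256,
        sub_batch_size=64,  # accumulate gradients over sub-batches of 64
        # slice_size=...,   # optionally, also compute scores in smaller slices
    ),
)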

Automated Memory Optimization

Allowing high computational throughput while ensuring that the available hardware memory is not exceeded during training and evaluation requires knowledge of the maximum possible training and evaluation batch size for the current model configuration. However, determining these batch sizes is a tedious process, and not feasible when a large set of heterogeneous experiments is run. Therefore, PyKEEN has an automatic memory optimization step that computes the maximum possible training and evaluation batch sizes for the current model configuration and available hardware before the actual calculation starts. If the user-provided batch size is too large for the available hardware, automatic memory optimization determines the maximum sub-batch size for training and accumulates the gradients with the process described above in Sub-batching & Slicing. The batch sizes are determined using binary search, taking the CUDA architecture into consideration to ensure that the chosen batch size is CUDA-efficient.
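
In practice, this means the batch size can simply be omitted; a minimal sketch:

from pykeen.pipeline import pipeline

# without an explicit batch_size, PyKEEN searches for the largest
# batch size that still fits into the available memory
result = pipeline(dataset="Nations", model="TransE")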

Evaluation Fallback

Usually, the evaluation is performed on the GPU for faster speeds. In addition, users might choose a batch size upfront in their evaluation configuration to fully utilize the GPU and achieve the fastest possible evaluation. However, in larger setups that test different model configurations and dataset partitions, e.g., during HPO, the hardware requirements might change drastically, which might mean that the evaluation can no longer be run with the pre-set batch size, or, for larger datasets and memory-intense models, not on the GPU at all. Since PyKEEN abides by the user’s configuration, the evaluation would crash in these cases even though the training finished successfully, losing the progress achieved and/or leaving trials unfinished. Given that the batch size and the device have no impact on the evaluation results, PyKEEN offers a way to overcome this problem through the evaluation fallback option of the pipeline. It causes the evaluation to fall back to a smaller batch size when evaluation on the GPU with the pre-set batch size fails and, as a last resort, to evaluate on the CPU if even the smallest possible batch size is too big for the GPU. Note: this can lead to significantly longer evaluation times in cases where the evaluation falls back to the CPU.
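
A minimal sketch of enabling the fallback via the pipeline:

from pykeen.pipeline import pipeline

result = pipeline(
    dataset="Nations",
    model="TransE",
    evaluation_fallback=True,  # retry with smaller batch sizes and, finally, on the CPU
)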

Representations

In PyKEEN, a pykeen.nn.representation.Representation is used to map integer indices to numeric representations. A simple example is the pykeen.nn.representation.Embedding class, where the mapping is a simple lookup. However, more advanced representation modules are available, too.

Message Passing

Message passing representation modules enrich the representations of entities by aggregating information from their graph neighborhood. Example implementations from PyKEEN include pykeen.nn.representation.RGCNRepresentation, which uses RGCN layers for enrichment, and pykeen.nn.representation.SingleCompGCNRepresentation, which enriches via CompGCN layers.

Another way to utilize message passing is via the modules provided in pykeen.nn.pyg, which allow using the message passing layers from PyTorch Geometric to enrich base representations.

Decomposition

Since knowledge graphs may contain a large number of entities, having independent trainable embeddings for each of them may result in an excessive number of trainable parameters. Therefore, methods have been developed that do not learn independent representations, but rather have a set of base representations and create individual representations by combining them.

Low-Rank Factorization

A simple method to reduce the number of parameters is to use a low-rank decomposition of the embedding matrix, as implemented in pykeen.nn.representation.LowRankEmbeddingRepresentation. Here, each representation is a linear combination of shared base representations. Typically, the number of bases is chosen to be smaller than the dimension of each base representation.
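
To illustrate the idea independently of the PyKEEN class, here is a minimal sketch of a low-rank factorization in plain PyTorch:

import torch

num_entities, num_bases, dim = 1000, 16, 64
# shared base representations and per-entity combination weights
bases = torch.nn.Parameter(torch.randn(num_bases, dim))
weights = torch.nn.Parameter(torch.randn(num_entities, num_bases))
# each entity representation is a linear combination of the shared bases,
# requiring num_entities * num_bases + num_bases * dim parameters
# instead of num_entities * dim
entity_repr = weights @ bases  # shape: (num_entities, dim)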

NodePiece

Another example is NodePiece, which takes inspiration from tokenization as encountered in, e.g., NLP, and represents each entity as a set of tokens. The implementation in PyKEEN, pykeen.nn.representation.NodePieceRepresentation, implements a simple yet effective variant thereof, which uses a set of randomly chosen incident relations (including inverse relations) as tokens.

Text-based

Text-based representations use the entities’ (or relations’) labels to derive representations. To this end, pykeen.nn.representation.TextRepresentation uses a (pre-trained) transformer model from the transformers library to encode the labels. Since the transformer models have been trained on huge corpora of text, their text encodings often contain semantic information, i.e., labels with similar semantic meaning get similar representations. While we can also benefit from these strong features by just initializing a pykeen.nn.representation.Embedding with the vectors, e.g., using pykeen.nn.init.LabelBasedInitializer, pykeen.nn.representation.TextRepresentation includes the transformer model as part of the KGE model and thus allows fine-tuning the language model for the KGE task. This is beneficial, e.g., because it enables a simple form of inductive modeling, which can make predictions for entities not seen during training.

from pykeen.pipeline import pipeline
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation
from pykeen.models import ERModel

dataset = get_dataset(dataset="nations")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
)
result = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="ermlpe",
        interaction_kwargs=dict(
            embedding_dim=entity_representations.shape[0],
        ),
        entity_representations=entity_representations,
        relation_representations_kwargs=dict(
            shape=entity_representations.shape,
        ),
    ),
    training_kwargs=dict(
        num_epochs=1,
    ),
)
model = result.model

We can use the label-encoder part to generate representations for unknown entities with labels. For instance, “uk” is an entity in Nations, but we can also put in “united kingdom” and get a roughly equivalent vector representation:

entity_representation = model.entity_representations[0]
label_encoder = entity_representation.encoder
uk, united_kingdom = label_encoder(labels=["uk", "united kingdom"])

Thus, if we were to pass the resulting representations to the interaction function, we would get similar scores:

import torch

# true triple from train: ['brazil', 'exports3', 'uk']
relation_representation = model.relation_representations[0]
h_repr = entity_representation.get_in_more_canonical_shape(
    dim="h",
    indices=torch.as_tensor(dataset.entity_to_id["brazil"]).view(1),
)
r_repr = relation_representation.get_in_more_canonical_shape(
    dim="r",
    indices=torch.as_tensor(dataset.relation_to_id["exports3"]).view(1),
)
scores = model.interaction(
    h=h_repr,
    r=r_repr,
    t=torch.stack([uk, united_kingdom]),
)
print(scores)

As a downside, this will usually substantially increase the computational cost of computing triple scores.

Biomedical Entities

If your dataset is labeled with compact uniform resource identifiers (CURIEs) for biomedical entities like chemicals, proteins, diseases, and pathways, then the pykeen.nn.representation.BiomedicalCURIERepresentation representation can use pyobo to look up names by CURIE via the pyobo.get_name() function and then encode them using the text encoder.
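
For illustration, the underlying name lookup works roughly as follows (a sketch; pyobo downloads and caches ontology data on first use, and the ChEBI CURIE below is just an example):

import pyobo

# resolve the CURIE chebi:15377 to its name
name = pyobo.get_name("chebi", "15377")  # "water"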

Unfortunately, at the time of adding this representation, none of the biomedical knowledge graphs in PyKEEN use CURIEs for referencing biomedical entities. We hope this will change in the future.

To learn more about CURIEs, please take a look at the Bioregistry and this blog post on CURIEs.

Getting Started with NodePiece

This page gives more practical examples on using and configuring NodePiece.

Basic Usage

We’ll use pykeen.datasets.FB15k237 for illustration purposes throughout the following examples.

from pykeen.models import NodePiece
from pykeen.datasets import FB15k237

# inverses are necessary for the current version of NodePiece
dataset = FB15k237(create_inverse_triples=True)

In the simplest usage of pykeen.models.NodePiece, we’ll only use relations for tokenization. We can do this with the following arguments:

  1. Set tokenizers="RelationTokenizer" to use pykeen.nn.node_piece.RelationTokenizer. We can simply refer to the class name, and it gets automatically resolved to the correct subclass of pykeen.nn.node_piece.Tokenizer by the class_resolver.

  2. Set num_tokens=12 to sample 12 unique relations per node. If, for some entities, there are fewer than 12 unique relations, the difference will be padded with the auxiliary padding token.

Here’s how the code looks:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers="RelationTokenizer",
    num_tokens=12,
    embedding_dim=64,
)

Next, we’ll use a combination of tokenizers (pykeen.nn.node_piece.AnchorTokenizer and pykeen.nn.node_piece.RelationTokenizer) to replicate the full NodePiece tokenization with \(k\) anchors and \(m\) relational context. It’s as easy as sending a list of tokenizers to tokenizers and sending a list of arguments to num_tokens:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    embedding_dim=64,
)

The class resolver will automatically instantiate pykeen.nn.node_piece.AnchorTokenizer with 20 anchors per node and pykeen.nn.node_piece.RelationTokenizer with 12 relations per node, so the order of specifying tokenizers and num_tokens matters here.

Anchor Selection and Searching

The pykeen.nn.node_piece.AnchorTokenizer has two fields:

  1. selection controls how we sample anchors from the graph (32 anchors by default)

  2. searcher controls how we tokenize nodes using selected anchors (pykeen.nn.node_piece.CSGraphAnchorSearcher by default)

By default, our models above use 32 anchors selected as top-degree nodes with pykeen.nn.node_piece.DegreeAnchorSelection (these are the default values for the anchor selection resolver), and nodes are tokenized using pykeen.nn.node_piece.CSGraphAnchorSearcher - it uses scipy.sparse to explicitly compute the shortest paths from all nodes in the graph to all anchors in a deterministic manner. We can afford that for relatively small graphs of FB15k237’s size.

For larger graphs, we recommend using the breadth-first search (BFS) procedure in pykeen.nn.node_piece.ScipySparseAnchorSearcher - it applies BFS by iteratively expanding the node neighborhood until it finds the desired number of anchors, which dramatically saves compute time on graphs the size of pykeen.datasets.OGBWikiKG2.

32 unique anchors might be too few for FB15k237 with its 15k nodes - so let’s create a pykeen.models.NodePiece model with 100 anchors selected with the top-degree strategy by sending the tokenizers_kwargs list:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    tokenizers_kwargs=[
        dict(
            selection="Degree",
            selection_kwargs=dict(
                num_anchors=100,
            ),
            searcher="CSGraph",
        ),
        dict(),  # empty dict for the RelationTokenizer - it doesn't need any kwargs
    ],
    embedding_dim=64,
)

tokenizers_kwargs expects the same number of dictionaries as the number of tokenizers you used, so we have 2 dicts here - one for the AnchorTokenizer and another one for the RelationTokenizer (which doesn’t need any kwargs, so we just put an empty dict there).

Let’s create a model with 500 top-pagerank anchors selected with the BFS strategy - we’ll just modify the selection and searcher args:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    tokenizers_kwargs=[
        dict(
            selection="PageRank",
            selection_kwargs=dict(
                num_anchors=500,
            ),
            searcher="ScipySparse",
        ),
        dict(),  # empty dict for the RelationTokenizer - it doesn't need any kwargs
    ],
    embedding_dim=64,
)

Looks nice, but fasten your seatbelts 🚀 - we can use several anchor selection strategies sequentially to select more diverse anchors! Mindblowing 😍

Let’s create a model with 500 anchors where 50% of them will be top degree nodes and another 50% will be top PageRank nodes - for that we have a pykeen.nn.node_piece.MixtureAnchorSelection class!

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    tokenizers_kwargs=[
        dict(
            selection="MixtureAnchorSelection",
            selection_kwargs=dict(
                selections=["degree", "pagerank"],
                ratios=[0.5, 0.5],
                num_anchors=500,
            ),
            searcher="ScipySparse",
        ),
        dict(),  # empty dict for the RelationTokenizer - it doesn't need any kwargs
    ],
    embedding_dim=64,
)

Now selection_kwargs controls which strategies we’ll be using and how many anchors each of them will sample - in our case selections=['degree', 'pagerank']. Using the ratios argument, we control the share of each strategy in the total anchor pool - in our case ratios=[0.5, 0.5], which means that the degree and pagerank strategies will each sample 50% of the total number of anchors. Since the total number is 500, there will be 250 top-degree anchors and 250 top-pagerank anchors. ratios must sum up to 1.0.

Important: sampled anchors are unique - that is, if a node appears in both the top-K degree and the top-K pagerank nodes, it will be used only once; the sampler will just skip it in the subsequent strategies.

At the moment, we have 3 anchor selection strategies: degree, pagerank, and random. The latter just samples random nodes as anchors.

Let’s create a tokenization setup reported in the original NodePiece paper for FB15k237 with 40% top degree anchors, 40% top pagerank, and 20% random anchors:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    tokenizers_kwargs=[
        dict(
            selection="MixtureAnchorSelection",
            selection_kwargs=dict(
                selections=["degree", "pagerank", "random"],
                ratios=[0.4, 0.4, 0.2],
                num_anchors=500,
            ),
            searcher="ScipySparse",
        ),
        dict(),  # empty dict for the RelationTokenizer - it doesn't need any kwargs
    ],
    embedding_dim=64,
)

Note on Anchor Distances: As of now, the anchor distances are considered implicitly, i.e., when performing the actual tokenization via shortest paths or BFS, we do sort anchors by proximity and keep the top-K nearest ones. The anchor distance embedding as a positional feature to be added to the anchor embedding is not yet implemented.

How many total anchors num_anchors and anchors & relations num_tokens do I need for my graph?

This is a good question with deep theoretical implications related to NP-hard problems like k-dominating sets and vertex covers. We don’t have a closed-form solution for each possible dataset, but we found some empirical heuristics:

  • keeping num_anchors as 1-10% of total nodes in the graph is a good start

  • graph density is a major factor: the denser the graph, the fewer num_anchors you’d need. For the dense FB15k237, 100 total anchors (over 15k total nodes) seem to be good enough, while for the sparser WN18RR we needed at least 500 anchors (over 40k total nodes). For the dense OGB WikiKG2 with 2.5M nodes, a vocabulary of 20K anchors (< 1%) already leads to SOTA results

  • the same applies to anchors per node: you’d need more tokens for sparser graphs and fewer for denser

  • the size of the relational context depends on the density and the number of unique relations in the graph, e.g., in FB15k237 we have 237 * 2 = 474 unique relations but only 11 * 2 = 22 in WN18RR. If we select too large a context, most tokens will be PADDING_TOKEN, and we don’t want that.

  • the relational context sizes (relations per node) reported in the NodePiece paper are the 66th percentile of the number of unique incident relations per node, e.g., 12 for FB15k237 and 5 for WN18RR

In some tasks, you might not need anchors at all and could use RelationTokenizer only! Check the paper for more results.

  • In inductive link prediction tasks we don’t use anchors as inference graphs are disconnected from training ones;

  • in relation prediction we found that just a relational context is better than anchors + relations;

  • in node classification (currently, this pipeline is not available in PyKEEN) on dense relation-rich graphs like Wikidata, we found that just a relational context is better than anchors + relations.

Using NodePiece with pykeen.pipeline.pipeline()

Let’s pack the last NodePiece model into the pipeline:

import torch.nn

from pykeen.models import NodePiece
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="fb15k237",
    dataset_kwargs=dict(
        create_inverse_triples=True,
    ),
    model=NodePiece,
    model_kwargs=dict(
        tokenizers=["AnchorTokenizer", "RelationTokenizer"],
        num_tokens=[20, 12],
        tokenizers_kwargs=[
            dict(
                selection="MixtureAnchorSelection",
                selection_kwargs=dict(
                    selections=["degree", "pagerank", "random"],
                    ratios=[0.4, 0.4, 0.2],
                    num_anchors=500,
                ),
                searcher="ScipySparse",
            ),
            dict(),  # empty dict for the RelationTokenizer - it doesn't need any kwargs
        ],
        embedding_dim=64,
        interaction="rotate",
    ),
)

Pre-Computed Vocabularies

We have a pykeen.nn.node_piece.PrecomputedPoolTokenizer that can be instantiated with a precomputed vocabulary either from a local file or using a downloadable link.

For a local file, specify path:

from pathlib import Path

from pykeen.nn.node_piece import tokenizer_resolver

precomputed_tokenizer = tokenizer_resolver.make(
    "precomputedpool", path=Path("path/to/vocab.pkl")
)

model = NodePiece(
    triples_factory=dataset.training,
    num_tokens=[20, 12],
    tokenizers=[precomputed_tokenizer, "RelationTokenizer"],
)

For a remote file, specify the url:

precomputed_tokenizer = tokenizer_resolver.make(
    "precomputedpool", url="http://link/to/vocab.pkl"
)

Generally, pykeen.nn.node_piece.PrecomputedPoolTokenizer can use any pykeen.nn.node_piece.PrecomputedTokenizerLoader as a custom processor of vocabulary formats. Right now, there is one such loader, pykeen.nn.node_piece.GalkinPrecomputedTokenizerLoader, which expects a dictionary of the following format:

node_id: {
    "ancs": [a list of used UNMAPPED anchor nodes sorted from nearest to farthest],
    "dists": [a list of anchor distances for each anchor in ancs, ascending]
}

As of now, we don’t use anchor distances, but we expect the anchors in ancs to already be sorted from nearest to farthest, so an example of a precomputed vocab could be:

1: {'ancs': [3, 10, 5, 9, 220, ...]}  # anchor 3 is the nearest for node 1
2: {'ancs': [22, 37, 14, 10, ...]}  # anchor 22 is the nearest for node 2

Unmapped anchors means that the anchor IDs are node IDs from the total set of entities 0 ... N-1. During the pickle processing, we’ll convert them to a contiguous range 0 ... num_anchors-1. Any negative indices in the lists will be treated as padding tokens (we used -99 in the precomputed vocabularies).

The original NodePiece repo has an example of building such a vocabulary format for OGB WikiKG 2.

Configuring the Interaction Function

You can use literally any interaction function available in PyKEEN as a scoring function! By default, NodePiece uses DistMult, but it’s easy to change, as in any pykeen.models.ERModel. Let’s use the RotatE interaction:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    interaction="rotate",
    embedding_dim=64,
)

Well, for RotatE we might want to initialize relations as phases (init_phases), use an additional relation constrainer to keep \(|r| = 1\) (complex_normalize), and use xavier_uniform_ for anchor embedding initialization - let’s add that, too:

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    embedding_dim=64,
    interaction="rotate",
    relation_initializer="init_phases",
    relation_constrainer="complex_normalize",
    entity_initializer="xavier_uniform_",
)

Configuring the Aggregation Function

This section is about the aggregation keyword argument. This is an encoder function that actually builds entity representations from token embeddings. It is supposed to be a function that maps a set of tokens (anchors, relations, or both) to a single vector:

\[f([a_1, a_2, \dots, a_k, r_1, r_2, \dots, r_m]) \in \mathbb{R}^{d}, \qquad f \colon \mathbb{R}^{(k+m) \times d} \rightarrow \mathbb{R}^{d}\]

Right now, by default, we use a simple 2-layer MLP (pykeen.nn.perceptron.ConcatMLP) that concatenates all tokens into one long vector and projects it down to the model’s embedding dimension:

# inside pykeen.nn.perceptron.ConcatMLP: concatenate all token embeddings
# and project down to the model's embedding dimension
hidden_dim = int(ratio * embedding_dim)
super().__init__(
    nn.Linear(num_tokens * embedding_dim, hidden_dim),
    nn.Dropout(dropout),
    nn.ReLU(),
    nn.Linear(hidden_dim, embedding_dim),
)

Aggregation can be parameterized with any neural network (torch.nn.Module) that would return a single vector from a set of inputs. Let’s be fancy 😎 and create a DeepSet encoder:

import torch


class DeepSet(torch.nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
        )
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x, dim=-2):
        x = self.encoder(x).mean(dim)
        x = self.decoder(x)
        return x


model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    embedding_dim=64,
    interaction="rotate",
    relation_initializer="init_phases",
    relation_constrainer="complex_normalize",
    entity_initializer="xavier_uniform_",
    aggregation=DeepSet(hidden_dim=64),
)

We can even put a Transformer with pooling here. The only thing to keep in mind is the complexity of the encoder - we found pykeen.nn.perceptron.ConcatMLP to be a good balance between speed and final performance, although at the cost of not being permutation-invariant to the input set of tokens.

The aggregation function resembles the aggregation step of GNNs. Non-parametric avg/min/max did not work that well in the current tokenization setup, so some non-linearity is definitely useful - hence the choice of an MLP / DeepSet / Transformer as the aggregation function.

Let’s wrap our cool NodePiece model with 40/40/20 degree/pagerank/random tokenization with the BFS searcher and DeepSet aggregation into a pipeline:

result = pipeline(
    dataset="fb15k237",
    dataset_kwargs=dict(
        create_inverse_triples=True,
    ),
    model=NodePiece,
    model_kwargs=dict(
        tokenizers=["AnchorTokenizer", "RelationTokenizer"],
        num_tokens=[20, 12],
        tokenizers_kwargs=[
            dict(
                selection="MixtureAnchorSelection",
                selection_kwargs=dict(
                    selections=["degree", "pagerank", "random"],
                    ratios=[0.4, 0.4, 0.2],
                    num_anchors=500,
                ),
                searcher="ScipySparse",
            ),
            dict(),  # empty dict for the RelationTokenizer - it doesn't need any kwargs
        ],
        embedding_dim=64,
        interaction="rotate",
        relation_initializer="init_phases",
        relation_constrainer="complex_normalize",
        entity_initializer="xavier_uniform_",
        aggregation=DeepSet(hidden_dim=64),
    ),
)

NodePiece + GNN

It is also possible to add a message passing GNN on top of the obtained NodePiece representations to further enrich node states - we found this shows even better results in inductive link prediction tasks. We have that implemented in pykeen.models.InductiveNodePieceGNN, which uses a 2-layer CompGCN encoder - please check the Inductive Link Prediction tutorial.

Tokenizing Large Graphs with METIS

Mining anchors and running tokenization on whole graphs larger than 1M nodes might be computationally expensive. Due to the inherent locality of NodePiece, i.e., tokenization via nearest anchors and incident relations, we recommend using graph partitioning to reduce time and memory costs of tokenization. With graph partitioning, anchor search and tokenization can be performed independently within each partition with a final merging of all results into a single vocabulary.

We designed the partitioning tokenization strategy using METIS, a min-cut graph partitioning algorithm with an efficient implementation available in torch-sparse. Along with METIS, we leverage torch-sparse to offer a new, faster BFS procedure that can run on a GPU.

The main tokenizer class is pykeen.nn.node_piece.MetisAnchorTokenizer. You can use it in place of the vanilla AnchorTokenizer. With the METIS-based tokenizer, we first partition the input training graph into k separate partitions and then run anchor selection and anchor search sequentially and independently for each partition.

You can use any of the existing anchor selection and anchor search strategies described above, although for larger graphs we recommend the new pykeen.nn.node_piece.SparseBFSSearcher as the anchor searcher - it implements faster sparse matrix multiplication kernels and can be run on a GPU. The only difference from the vanilla tokenizer is that the num_anchors argument now defines how many anchors will be mined for each partition.

The new tokenizer has two special arguments:

  • num_partitions - number of partitions the graph will be divided into. You can expect METIS to produce partitions of about the same size, e.g., num_partitions=10 for a graph of 1M nodes would produce 10 partitions with about 100K nodes in each. The total number of mined anchors will be num_partitions * num_anchors

  • device - the device on which to run METIS. It can be different from the device on which an AnchorSearcher will run. We found that device="cpu" works faster on larger graphs and does not consume limited GPU memory, although you can keep the device to be resolved automatically or pass device="cuda" to try running it on a GPU.

It is still advisable to run large graph tokenization using pykeen.nn.node_piece.SparseBFSSearcher on a GPU thanks to more efficient sparse CUDA kernels. If a GPU is available, it will be used automatically by default.

Let’s use the new tokenizer for the Wikidata5M graph of 5M nodes and 20M edges.

from pykeen.datasets import Wikidata5M

dataset = Wikidata5M(create_inverse_triples=True)

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["MetisAnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],  # 20 anchors per node for the METIS strategy
    embedding_dim=64,
    interaction="rotate",
    tokenizers_kwargs=[
        dict(
            num_partitions=20,  # each partition will be of about 5M / 20 = 250K nodes
            device="cpu",  # METIS on cpu tends to be faster
            selection="MixtureAnchorSelection",  # we can use any anchor selection strategy here
            selection_kwargs=dict(
                selections=['degree', 'random'],
                ratios=[0.5, 0.5],
                num_anchors=1000,  # overall, we will have 20 * 1000 = 20000 anchors
            ),
            searcher="SparseBFSSearcher",  # a new efficient anchor searcher
            searcher_kwargs=dict(
                max_iter=5  # each node will be tokenized with anchors in the 5-hop neighborhood
            )
        ),
        dict()
    ],
    aggregation="mlp"
)

# we can save the vocabulary of tokenized nodes
from pathlib import Path
model.entity_representations[0].base[0].save_assignment(Path("./anchors_assignment.pt"))

On a machine with 32 GB RAM and 32 GB GPU, processing of Wikidata5M takes about 10 minutes:

  • ~ 3 min for partitioning into 20 clusters on a CPU;

  • ~ 7 min overall for anchor selection and search in each partition

How many partitions do I need for my graph?

It largely depends on the hardware and memory at hand, but as a rule of thumb we would recommend having partitions of fewer than 500K nodes each.

PyTorch Lightning Integration


PyTorch Lightning offers an alternative way to implement training and evaluation loops for knowledge graph embedding models, and it has some nice features:

  • mixed precision training

  • multi-GPU training

import pytorch_lightning

from pykeen.contrib.lightning import LCWALitModule

model = LCWALitModule(
    dataset="fb15k237",
    dataset_kwargs=dict(create_inverse_triples=True),
    model="mure",
    model_kwargs=dict(embedding_dim=128, loss="bcewithlogits"),
    batch_size=128,
)
trainer = pytorch_lightning.Trainer(
    accelerator="auto",  # automatically choose accelerator
    logger=False,  # defaults to TensorBoard; explicitly disabled here
    precision=16,  # mixed precision training
)
trainer.fit(model=model)

Classes

LitModule([dataset, dataset_kwargs, mode, ...])

A base module for training models with PyTorch Lightning.

LCWALitModule([dataset, dataset_kwargs, ...])

A PyTorch Lightning module for training a model with LCWA training loop.

SLCWALitModule(*[, negative_sampler, ...])

A PyTorch Lightning module for training a model with sLCWA training loop.

Using Resolvers

As PyKEEN is a heavily modular and extensible library, we make use of the class_resolver library to allow simple configuration of components. In this part of the tutorial, we explain how to use these configuration options, and how to figure out what values you can pass here.

We use the initialization method of the base model class pykeen.models.base.Model() and its handling of loss functions as an example. Its signature is given as

def __init__(
    self,
    *,
    ...,
    loss: HintOrType[Loss] = None,
    loss_kwargs: OptionalKwargs = None,
    ...,
) -> None:

Notice the two related parameters loss: HintOrType[Loss] = None and loss_kwargs: OptionalKwargs = None. The loss_kwargs parameter takes values of type OptionalKwargs, which is an abbreviation of Union[None, Mapping[str, Any]]. Hence, we can either pass a mapping of string keys to some values, or None. Passing None is interpreted as an empty dictionary.

The loss parameter takes inputs of type HintOrType[Loss], which is an abbreviation of Union[None, str, Type[Loss], Loss]. Thus, we can either pass

  1. an instance of the pykeen.losses.Loss, e.g., pykeen.losses.MarginRankingLoss(margin=2.0). If an instance of pykeen.losses.Loss is passed, it is used without modification. In this case, loss_kwargs will be ignored.

  2. a subclass of pykeen.losses.Loss, e.g., pykeen.losses.MarginRankingLoss. In this case, the class is instantiated with the given loss_kwargs as (keyword-based) parameters. For instance,

    loss = MarginRankingLoss
    loss_kwargs = None  # equivalent to {}
    

    translates to MarginRankingLoss(). We can also choose different instantiation parameters by

    loss = MarginRankingLoss
    loss_kwargs = dict(margin=2)  # or {"margin": 2}
    

    which translates to MarginRankingLoss(margin=2)

  3. a string. This string is used to search for a matching class using the class-resolver’s lookup function. The lookup function performs some string normalization and compares the resulting key to the normalized names of the classes it is associated with. The found class is then used to instantiate the object as if we had passed the class instead of the string. For instance, we can obtain instances of MarginRankingLoss by passing “MarginRankingLoss”, “marginrankingloss”, “marginranking”, “margin-ranking”, or “MRL”.

  4. None. In this case, we use the default class set in the class-resolver, which happens to be MarginRankingLoss for the loss_resolver. If no default is set, an exception will be raised.
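
Putting the four options together, here is a small sketch using pykeen.models.TransE, whose default loss happens to be the margin ranking loss:

from pykeen.datasets import Nations
from pykeen.losses import MarginRankingLoss
from pykeen.models import TransE

training = Nations().training

# 1. an instance: used as-is; loss_kwargs would be ignored
model = TransE(triples_factory=training, loss=MarginRankingLoss(margin=2.0))
# 2. a class: instantiated with the given loss_kwargs
model = TransE(triples_factory=training, loss=MarginRankingLoss, loss_kwargs=dict(margin=2.0))
# 3. a string: resolved via normalized name lookup, then instantiated
model = TransE(triples_factory=training, loss="margin-ranking", loss_kwargs=dict(margin=2.0))
# 4. None: the resolver's default class is used
model = TransE(triples_factory=training, loss=None)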

Determining Allowed Inputs

To keep PyKEEN easily extensible and maintainable, we often use None for the choice, e.g., loss, and the keyword-based parameters. This can sometimes make it hard to read what default values are used, what valid choices are available, and what parameters are allowed with these different choices. In the following, we describe a few ways how to find this information.

First, you should take a look at the type annotation. HintOrType[X] = None tells you that you can pass any subclass of X. Moreover, you can always pass the string of the class name instead, which is often easier to set up for result tracking, command line arguments, or hyper-parameter search. All resolvers for classes used in PyKEEN are instantiated using the ClassResolver.from_subclasses factory function, which automatically registers all subclasses of a given base class as valid choices. Moreover, it allows you to pass class names without the base class’ name as a suffix, e.g., loss_resolver accepts MarginRanking instead of MarginRankingLoss, since the base class’ name Loss is removed as a suffix during normalization. To utilize this feature, we try to follow an appropriate naming scheme for all configurable parts, e.g., pykeen.nn.representation.Representation or pykeen.nn.modules.Interaction.

The allowed parameters for …_kwargs: OptionalKwargs are a bit harder to determine, since they vary with the choice of the component! For instance, MarginRankingLoss has a margin parameter, while pykeen.losses.BCEWithLogitsLoss does not. Hence, you should consult the documentation of the individual classes to inform yourself about the available parameters and allowed values.
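
You can also query a resolver programmatically to explore the valid choices; a small sketch with the loss resolver:

from pykeen.losses import loss_resolver

# normalized string lookup: resolves to the MarginRankingLoss class
loss_cls = loss_resolver.lookup("margin-ranking")
# instantiate through the resolver, passing keyword-based parameters
loss = loss_resolver.make("marginranking", margin=2.0)
# list the names of all registered choices
print(sorted(loss_resolver.lookup_dict))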

Troubleshooting

Loading a Model from an Old Version of PyKEEN

If your model was trained on a different version of PyKEEN, you might have difficulty loading the model using torch.load('trained_model.pkl').

This could be due to one or both of the following:

  1. The model class structure might have changed.

  2. The model weight names might have changed.

Note that PyKEEN currently cannot support model migration. Please attempt the following steps to load the model.

If the model class structure has changed

You will likely see an exception like this one: ModuleNotFoundError: No module named ...

In this case, try to instantiate the model class directly and only load the state dict from the model file.

  1. Save the model’s state_dict using the version of PyKEEN used for training:

    import torch
    from pykeen.pipeline import pipeline
    
    result = pipeline(dataset="Nations", model="RotatE")
    torch.save(result.model.state_dict(), "v1.7.0/model.state_dict.pt")
    
  2. Load the model using the version of PyKEEN you want to use. First instantiate the model, then load the state dict:

    import torch
    from pykeen.datasets import get_dataset
    from pykeen.models import RotatE
    
    dataset = get_dataset(dataset="Nations")
    model = RotatE(triples_factory=dataset.training)
    state_dict = torch.load("v1.7.0/model.state_dict.pt")
    model.load_state_dict(state_dict)
    

If the model weight names have changed

You will likely see an exception similar to this one:

RuntimeError: Error(s) in loading state_dict for RotatE:
Missing key(s) in state_dict: "entity_representations.0._embeddings.weight", "relation_representations.0._embeddings.weight".
Unexpected key(s) in state_dict: "regularizer.weight", "regularizer.regularization_term", "entity_embeddings._embeddings.weight", "relation_embeddings._embeddings.weight".

In this case, you need to inspect the state dicts in the different versions and try to match the keys. Then, modify the state dict accordingly before loading it. For example:

import torch
from pykeen.datasets import get_dataset
from pykeen.models import RotatE

dataset = get_dataset(dataset="Nations")
model = RotatE(triples_factory=dataset.training)
state_dict = torch.load("v1.7.0/model.state_dict.pt")
# these are some example changes in weight names for RotatE between two different pykeen versions
for old_name, new_name in [
    (
        "entity_embeddings._embeddings.weight",
        "entity_representations.0._embeddings.weight",
    ),
    (
        "relation_embeddings._embeddings.weight",
        "relation_representations.0._embeddings.weight",
    ),
]:
    state_dict[new_name] = state_dict.pop(old_name)
# in this example, the new model does not have a regularizer, so we need to delete corresponding data
for name in ["regularizer.weight", "regularizer.regularization_term"]:
    state_dict.pop(name)
model.load_state_dict(state_dict)

Warning

Even if the state dict can be loaded, there is still a risk that the weights are used differently. This can lead to a difference in model behavior. To be sure that the model still functions the same way, you should also check some model predictions and inspect how the model definition has changed.

Bring Your Own Data

As an alternative to using a pre-packaged dataset, the training and testing can be set explicitly by file path or with instances of pykeen.triples.TriplesFactory. Throughout this tutorial, the paths to the training, testing, and validation sets for built-in pykeen.datasets.Nations will be used as examples.

Pre-stratified Dataset

You’ve got a training and testing file as 3-column TSV files, all ready to go. You’re sure that there aren’t any entities or relations appearing in the testing set that don’t appear in the training set. Load them in the pipeline like this:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

PyKEEN will take care of making sure that the entities are mapped from their labels to appropriate integer (technically, 0-dimensional torch.LongTensor) indexes and that the different sets of triples share the same mapping.

This is equally applicable for the pykeen.hpo.hpo_pipeline(), which has a similar interface to the pykeen.pipeline.pipeline() as in:

>>> from pykeen.hpo import hpo_pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH, NATIONS_VALIDATE_PATH
>>> result = hpo_pipeline(
...     n_trials=3,  # you probably want more than this
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     validation=NATIONS_VALIDATE_PATH,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_hpo_pre_stratified_transe')

The remainder of the examples will be for pykeen.pipeline.pipeline(), but all work exactly the same for pykeen.hpo.hpo_pipeline().

If you want to add dataset-wide arguments, you can use the dataset_kwargs argument of pykeen.pipeline.pipeline() to enable options like create_inverse_triples=True.

>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     dataset_kwargs={'create_inverse_triples': True},
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

If you want finer control over how the triples are created, for example, if they are not all coming from TSV files, you can use the pykeen.triples.TriplesFactory interface.

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> testing = TriplesFactory.from_path(
...     NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
... )
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

Warning

In the instantiation of the testing factory, we used the entity_to_id and relation_to_id keyword arguments. This is because PyKEEN automatically assigns numeric identifiers to all entities and relations for each triples factory. However, we want the identifiers to be exactly the same for the testing set as for the training set, so we just reuse the training set’s mappings. If we didn’t have the same identifiers, then the testing set would get mixed up with the wrong identifiers in the training set during evaluation, and we’d get nonsense results.

The dataset_kwargs argument is ignored when passing your own pykeen.triples.TriplesFactory, so be sure to include create_inverse_triples=True in the instantiation of those classes if that’s your desired behavior, as in:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(
...     NATIONS_TRAIN_PATH,
...     create_inverse_triples=True,
... )
>>> testing = TriplesFactory.from_path(
...     NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
...     create_inverse_triples=True,
... )
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

Triples factories can also be instantiated using the triples keyword argument instead of the path argument if you already have triples loaded in a numpy.ndarray.
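A sketch of this, assuming label-based triples in an (n, 3)-shaped array and using the from_labeled_triples() constructor:

>>> import numpy as np
>>> from pykeen.triples import TriplesFactory
>>> triples = np.array([
...     ('a', 'r1', 'b'),
...     ('b', 'r2', 'c'),
... ])
>>> tf = TriplesFactory.from_labeled_triples(triples=triples)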

Unstratified Dataset

It’s more realistic that your real-world dataset is not already stratified into training and testing sets. PyKEEN has you covered with pykeen.triples.TriplesFactory.split(), which will allow you to create a stratified dataset.

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing = tf.split()
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_unstratified_transe')

By default, this is an 80/20 split. If you want to use early stopping, you’ll also need a validation set, so you should specify the splits:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing, validation = tf.split([.8, .1, .1])
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     validation=validation,
...     model='TransE',
...     stopper='early',
...     epochs=5,  # short epochs for testing - you should go
...                # higher, especially with early stopper enabled
... )
>>> result.save_to_directory('doctests/test_unstratified_stopped_transe')

Bring Your Own Data with Checkpoints

For a tutorial on how to use your own data together with checkpoints, see Checkpoints When Bringing Your Own Data and Loading Models Manually.

Bring Your Own Interaction

This is a tutorial about how to implement your own interaction modules (also known as scoring functions) as subclasses of pykeen.nn.modules.Interaction for use in PyKEEN.

Implementing your first Interaction Module

Imagine you’ve taken a time machine back to 2013 and you have just invented TransE, defined as:

\[f(h, r, t) = -\| \mathbf{e}_h + \mathbf{r}_r - \mathbf{e}_t \|_2\]

where \(\mathbf{e}_i\) is the \(d\)-dimensional representation for entity \(i\), \(\mathbf{r}_j\) is the \(d\)-dimensional representation for relation \(j\), and \(\|...\|_2\) is the \(L_2\) norm.

To implement TransE in PyKEEN, you need to subclass the pykeen.nn.modules.Interaction. This class is itself a subclass of torch.nn.Module, which means that you need to provide an implementation of torch.nn.Module.forward(). However, the arguments are predefined as h, r, and t, which correspond to the representations of the head, relation, and tail, respectively.

from pykeen.nn.modules import Interaction

class TransEInteraction(Interaction):
    def forward(self, h, r, t):
        return -(h + r - t).norm(p=2, dim=-1)

Note the dim=-1 because this operation is actually defined over an entire batch of head, relation, and tail representations.

See also

A reference implementation is provided in pykeen.nn.modules.TransEInteraction

As a researcher who just invented TransE, you might wonder what would happen if you replaced the addition + with multiplication *. You might then end up with a new interaction like this (which happens to be DistMult, published just a year after TransE):

\[f(h, r, t) = \mathbf{e}_h^T diag(\mathbf{r}_r) \mathbf{e}_t\]

where \(\mathbf{e}_i\) is the \(d\)-dimensional representation for entity \(i\), and \(\mathbf{r}_j\) is the \(d\)-dimensional representation for relation \(j\).

from pykeen.nn.modules import Interaction

class DistMultInteraction(Interaction):
    def forward(self, h, r, t):
        return (h * r * t).sum(dim=-1)

See also

A reference implementation is provided in pykeen.nn.modules.DistMultInteraction

Interactions with Hyper-Parameters

While we previously defined TransE with the \(L_2\) norm, it could be calculated with a different value for \(p\):

\[f(h, r, t) = -\| \mathbf{e}_h + \mathbf{r}_r - \mathbf{e}_t \|_p\]

This could be incorporated into the interaction definition by using the __init__(), storing the value for \(p\) in the instance, then accessing it in forward().

from pykeen.nn.modules import Interaction

class TransEInteraction(Interaction):
    def __init__(self, p: int):
        super().__init__()
        self.p = p

    def forward(self, h, r, t):
        return -(h + r - t).norm(p=self.p, dim=-1)

In general, you can put whatever you want in __init__() to support the calculation of scores.
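For example, here is a minimal sketch of using this module standalone on a batch of random representations (the shapes are illustrative):

import torch

interaction = TransEInteraction(p=1)
# a batch of two 50-dimensional head, relation, and tail representations
h, r, t = torch.randn(3, 2, 50).unbind()
scores = interaction(h, r, t)  # shape: (2,)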

Interactions with Trainable Parameters

In ER-MLP, the multi-layer perceptron consists of an input layer with \(3 \times d\) neurons, a hidden layer with \(y\) neurons, and an output layer with one neuron. The input is the concatenation of the head, relation, and tail embeddings. It is defined as:

\[f(h, r, t) = W_2 ReLU(W_1 cat(h, r, t) + b_1) + b_2\]

with hidden dimension \(y\), \(W_1 \in \mathcal{R}^{3d \times y}\), \(W_2 \in \mathcal{R}^y\), and biases \(b_1 \in \mathcal{R}^y\) and \(b_2 \in \mathcal{R}\).

\(W_1\), \(W_2\), \(b_1\), and \(b_2\) are global parameters, meaning that they are trainable but are attached to neither the entities nor the relations. Unlike the \(p\) in TransE, these global trainable parameters are not considered hyper-parameters. However, like hyper-parameters, they can also be defined in the __init__ function of your pykeen.nn.modules.Interaction class. They are trained jointly with the entity and relation embeddings during training.

import torch.nn
from pykeen.nn.modules import Interaction
from pykeen.utils import broadcast_cat

class ERMLPInteraction(Interaction):
    def __init__(self, embedding_dim: int, hidden_dim: int):
        super().__init__()
        # The weights of this MLP will be learned.
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_features=3 * embedding_dim, out_features=hidden_dim, bias=True),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=hidden_dim, out_features=1, bias=True),
        )

    def forward(self, h, r, t):
        x = broadcast_cat([h, r, t], dim=-1)
        return self.mlp(x)

Note that pykeen.utils.broadcast_cat() was used instead of the standard torch.cat() because of the standardization of shapes of head, relation, and tail vectors.
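To illustrate the expected behavior (a sketch): broadcast_cat() broadcasts the leading batch dimensions of its inputs against each other before concatenating along the given dimension, which plain torch.cat() would reject:

import torch
from pykeen.utils import broadcast_cat

h = torch.randn(4, 1, 3)  # four heads
r = torch.randn(1, 5, 3)  # five relations
t = torch.randn(1, 1, 3)  # a single tail
x = broadcast_cat([h, r, t], dim=-1)
print(x.shape)  # expected: torch.Size([4, 5, 9])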

See also

A reference implementation is provided in pykeen.nn.modules.ERMLPInteraction

Interactions with Different Shaped Vectors

The Structured Embedding uses a 2-tensor for representing each relation, with an interaction defined as:

\[f(h, r, t) = - \|\textbf{M}_{r}^{head} \textbf{e}_h - \textbf{M}_{r}^{tail} \textbf{e}_t\|_p\]

where \(\mathbf{e}_i\) is the \(d\)-dimensional representation for entity \(i\), \(\mathbf{M}^{head}_j\) is the \(d \times d\)-dimensional representation for relation \(j\) for head entities, \(\mathbf{M}^{tail}_j\) is the \(d \times d\)-dimensional representation for relation \(j\) for tail entities, and \(\|...\|_p\) is the \(L_p\) norm.

For the purposes of this tutorial, we will propose a simplification of Structured Embedding (also similar to TransR) where the same relation 2-tensor is used to project both the head and tail entities, as in:

\[f(h, r, t) = - \|\textbf{M}_{r} \textbf{e}_h - \textbf{M}_{r} \textbf{e}_t\|_2\]

where \(\mathbf{e}_i\) is the \(d\)-dimensional representation for entity \(i\), \(\mathbf{M}_j\) is the \(d \times d\)-dimensional representation for relation \(j\), and \(\|...\|_2\) is the \(L_2\) norm.

from pykeen.nn.modules import Interaction

class SimplifiedStructuredEmbeddingInteraction(Interaction):
    relation_shape = ('dd',)

    def forward(self, h, r, t):
        h_proj = r @ h.unsqueeze(dim=-1)
        t_proj = r @ t.unsqueeze(dim=-1)
        return -(h_proj - t_proj).squeeze(dim=-1).norm(p=2, dim=-1)

Note the definition of the relation_shape. By default, the entity_shape and relation_shape are both equal to ('d', ), which uses eigen-notation to show that they both are 1-tensors with the same shape. In this simplified version of Structured Embedding, we need to denote that the shape of the relation is \(d \times d\), so it’s written as dd.
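As a sketch, this module can be called directly with a batch of entity vectors and \(d \times d\) relation matrices:

import torch

interaction = SimplifiedStructuredEmbeddingInteraction()
h = torch.randn(2, 5)     # batch of two entity vectors with d=5
r = torch.randn(2, 5, 5)  # batch of two d x d relation matrices
t = torch.randn(2, 5)
scores = interaction(h, r, t)  # shape: (2,)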

See also

Reference implementations are provided in pykeen.nn.modules.StructuredEmbeddingInteraction and in pykeen.nn.modules.TransRInteraction.

Interactions with Multiple Representations

Sometimes, like in the canonical version of Structured Embedding, you need more than one representation for entities and/or relations. To specify this, you just need to extend the tuple for relation_shape with more entries, one for each representation.

from pykeen.nn.modules import Interaction

class StructuredEmbeddingInteraction(Interaction):
    relation_shape = (
        'dd',  # Corresponds to $\mathbf{M}^{head}_j$
        'dd',  # Corresponds to $\mathbf{M}^{tail}_j$
    )

    def forward(self, h, r, t):
        # Since the relation_shape is more than length 1, the r value is given as a sequence
        # of the representations defined there. You can use tuple unpacking to get them out
        r_h, r_t = r
        h_proj = r_h @ h.unsqueeze(dim=-1)
        t_proj = r_t @ t.unsqueeze(dim=-1)
        return -(h_proj - t_proj).squeeze(dim=-1).norm(p=2, dim=-1)

Interactions with Different Dimension Vectors

TransD is an example of an interaction module that not only uses two representations for each entity and two representations for each relation, but also allows the entity and relation representations to have different dimensions.

It can be implemented by choosing a different letter for use in the entity_shape and/or relation_shape tuples. Ultimately, the letters used are arbitrary, but you need to remember what they are when using the pykeen.models.make_model(), pykeen.models.make_model_cls(), or pykeen.pipeline.interaction_pipeline() functions to instantiate a model, make a model class, or run the pipeline using your custom interaction module (respectively).

from pykeen.nn.modules import Interaction
from pykeen.utils import project_entity

class TransDInteraction(Interaction):
    entity_shape = ("d", "d")
    relation_shape = ("e", "e")

    def forward(self, h, r, t):
        h, h_p = h
        r, r_p = r
        t, t_p = t
        h_bot = project_entity(
            e=h,
            e_p=h_p,
            r_p=r_p,
        )
        t_bot = project_entity(
            e=t,
            e_p=t_p,
            r_p=r_p,
        )
        return -(h_bot + r - t_bot).norm(p=2, dim=-1)

Note

The pykeen.utils.project_entity() function was used in this implementation to reduce the complexity. So far, it’s the case that all of the models using multiple different representation dimensions are quite complicated and don’t fall into the paradigm of presenting simple examples.

See also

A reference implementation is provided in pykeen.nn.modules.TransDInteraction

Differences between pykeen.nn.modules.Interaction and pykeen.models.Model

The high-level pipeline() function allows you to pass pre-defined subclasses of pykeen.models.Model such as pykeen.models.TransE or pykeen.models.DistMult. These classes are high-level wrappers around the interaction functions pykeen.nn.modules.TransEInteraction and pykeen.nn.modules.DistMultInteraction that are better suited for running benchmarking experiments or practical applications of knowledge graph embeddings, since they include lots of information about default hyper-parameters, recommended hyper-parameter optimization strategies, and more complex applications of regularization schemas.

As a researcher, the pykeen.nn.modules.Interaction is a way to quickly translate ideas into new models that can be used without all of the overhead of defining a pykeen.models.Model. These components are also completely reusable throughout PyKEEN (e.g., in self-rolled training loops) and can be used as standalone components outside of PyKEEN.

If you are happy with your interaction module and would like to take the next step toward making it generally reusable, check the “Extending the Models” tutorial.

Ad hoc Models from Interactions

A pykeen.models.ERModel can be constructed from pykeen.nn.modules.Interaction.

The new-style class, pykeen.models.ERModel, abstracts the interaction away from the representations such that different interactions can be used interchangeably. A new model can be constructed directly from the interaction module, given a dimensions mapping. Each pykeen.nn.modules.Interaction has fields called entity_shape and relation_shape that use eigen-notation for defining the different dimensions of the model. Most models share the d dimensionality for both the entity and relation vectors. Some (but not all) exceptions are pykeen.nn.modules.StructuredEmbeddingInteraction, whose relations are \(d \times d\) matrices (a relation_shape of 'dd'), and pykeen.nn.modules.TransDInteraction, which uses a second letter e for the relation dimension.

With this in mind, you’ll have to investigate the dimensions of the vectors through the PyKEEN documentation. If you’re implementing your own, you have control over this and will know which dimensions to specify (though using d for both entities and relations is standard). As a shorthand for {'d': value}, you can directly pass value for the dimension, and it will automatically be interpreted as {'d': value}.
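For an interaction with more than one dimension letter, such as pykeen.nn.modules.TransDInteraction from the earlier tutorial, a sketch might look like this (the dimension values are illustrative):

>>> from pykeen.models import make_model_cls
>>> from pykeen.nn.modules import TransDInteraction
>>> model_cls = make_model_cls(
...     dimensions={"d": 64, "e": 32},
...     interaction=TransDInteraction,
... )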

Make a model class from lookup of an interaction module class:

>>> from pykeen.models import make_model_cls
>>> embedding_dim = 3
>>> model_cls = make_model_cls(
...     dimensions={"d": embedding_dim},
...     interaction='TransE',
...     interaction_kwargs={'p': 2},
... )

If there’s only one dimension in the entity_shape and relation_shape, it can be directly given as an integer as a shortcut.

>>> # Implicitly can also be written as:
>>> model_cls_alt = make_model_cls(
...     dimensions=embedding_dim,
...     interaction='TransE',
...     interaction_kwargs={'p': 2},
... )

Make a model class from an interaction module class:

>>> from pykeen.nn.modules import TransEInteraction
>>> from pykeen.models import make_model_cls
>>> embedding_dim = 3
>>> model_cls = make_model_cls({"d": embedding_dim}, TransEInteraction, {'p': 2})

Make a model class from an instantiated interaction module:

>>> from pykeen.nn.modules import TransEInteraction
>>> from pykeen.models import make_model_cls
>>> embedding_dim = 3
>>> model_cls = make_model_cls({"d": embedding_dim}, TransEInteraction(p=2))

All of these model classes can be passed directly into the model argument of pykeen.pipeline.pipeline().
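For example (a short sketch using the class created above):

>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
...     dataset='Nations',
...     model=model_cls,
... )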

Interaction Pipeline

The pykeen.pipeline.pipeline() also allows passing an interaction module directly, such that the following code block can be compressed:

from pykeen.pipeline import pipeline
from pykeen.models import make_model_cls
from pykeen.nn.modules import TransEInteraction

model = make_model_cls(
    interaction=TransEInteraction,
    interaction_kwargs={'p': 2},
    dimensions={'d': 100},
)
results = pipeline(
    dataset='Nations',
    model=model,
    ...
)

into:

from pykeen.pipeline import pipeline
from pykeen.nn.modules import TransEInteraction

results = pipeline(
    dataset='Nations',
    interaction=TransEInteraction,
    interaction_kwargs={'p': 2},
    dimensions={'d': 100},
    ...
)

This can be used with any subclass of the pykeen.nn.modules.Interaction, not only ones that are implemented in the PyKEEN package.

Extending the Datasets

While the core of PyKEEN uses the pykeen.triples.TriplesFactory for handling sets of triples, the definition of a training, validation, and testing trichotomy for a given dataset is very useful for reproducible benchmarking. The internal pykeen.datasets.base.Dataset class can be considered as a three-tuple of triples factories (though it’s implemented as a class such that it can be extended). There are several datasets included in PyKEEN already, each coming from sources that look different. This tutorial gives some insight into implementing your own Dataset class.

Pre-split Datasets

Unpacked Remote Dataset

Use this tutorial if you have three separate URLs for the respective training, testing, and validation sets that are each 3-column TSV files. A good example can be found at https://github.com/ZhenfengLei/KGDatasets/tree/master/DBpedia50. There’s a base class called pykeen.datasets.base.UnpackedRemoteDataset that can be used to wrap it like the following:

from pykeen.datasets.base import UnpackedRemoteDataset

TEST_URL =  'https://raw.githubusercontent.com/ZhenfengLei/KGDatasets/master/DBpedia50/test.txt'
TRAIN_URL = 'https://raw.githubusercontent.com/ZhenfengLei/KGDatasets/master/DBpedia50/train.txt'
VALID_URL = 'https://raw.githubusercontent.com/ZhenfengLei/KGDatasets/master/DBpedia50/valid.txt'

class DBpedia50(UnpackedRemoteDataset):
    def __init__(self, **kwargs):
        super().__init__(
            training_url=TRAIN_URL,
            testing_url=TEST_URL,
            validation_url=VALID_URL,
            **kwargs,
        )

Unsplit Datasets

Use this tutorial if you have a single URL for a TSV dataset that needs to be automatically split into training, testing, and validation. A good example can be found at https://github.com/hetio/hetionet/raw/master/hetnet/tsv. There’s a base class called pykeen.datasets.base.SingleTabbedDataset that can be used to wrap it like the following:

from pykeen.datasets.base import SingleTabbedDataset

URL = 'https://github.com/hetio/hetionet/raw/master/hetnet/tsv/hetionet-v1.0-edges.sif.gz'

class Hetionet(SingleTabbedDataset):
    def __init__(self, **kwargs):
        super().__init__(url=URL, **kwargs)

The value for URL can be anything that can be read by pandas.read_csv(). Additional options can be passed through to the reading function, such as sep=',', with the keyword argument read_csv_kwargs=dict(sep=','). Note that the default separator for Pandas is a comma, but PyKEEN overrides it to be a tab, so you’ll have to explicitly set it if you want a comma. Since there’s a random aspect to this process, you can also set the seed used for splitting with the random_state keyword argument.
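For example, here is a sketch of a wrapper for a hypothetical comma-separated source (the URL is a placeholder):

from pykeen.datasets.base import SingleTabbedDataset

URL = 'https://example.org/triples.csv'  # hypothetical URL

class MyCommaSeparatedDataset(SingleTabbedDataset):
    def __init__(self, **kwargs):
        super().__init__(
            url=URL,
            read_csv_kwargs=dict(sep=','),  # override PyKEEN's tab default
            random_state=0,  # make the random split reproducible
            **kwargs,
        )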

Updating the setup.cfg

Whether you’re making a pull request against PyKEEN or implementing a dataset in your own package, you can use Python entrypoints to register your dataset with PyKEEN. Below is an example of the entrypoints that register pykeen.datasets.Hetionet, pykeen.datasets.DRKG, and others that appear in the PyKEEN setup.cfg. Under the pykeen.datasets header, you can pick whatever name you want for the dataset as the key (appearing on the left side of the equals, e.g. hetionet) and the path to the class (appearing on the right side of the equals, e.g., pykeen.datasets.hetionet:Hetionet). The right side is constructed by the path to the module, the colon :, then the name of the class.

# setup.cfg
...
[options.entry_points]
console_scripts =
    pykeen = pykeen.cli:main
pykeen.datasets =
    hetionet         = pykeen.datasets.hetionet:Hetionet
    conceptnet       = pykeen.datasets.conceptnet:ConceptNet
    drkg             = pykeen.datasets.drkg:DRKG
    ...

If you’re working on a development version of PyKEEN, you also need to run pykeen readme in the shell to update the README.md file.

Extending the Models

You should first read the tutorial on bringing your own interaction module. This tutorial is about how to wrap a custom interaction module with a model module for general reuse and application.

Implementing a model by subclassing pykeen.models.ERModel

The following code block demonstrates how an interaction module can be used to define a full KGEM using the pykeen.models.ERModel base class.

from pykeen.models import ERModel
from pykeen.nn import Embedding, Interaction


class DistMultInteraction(Interaction):
    def forward(self, h, r, t):
        return (h * r * t).sum(dim=-1)


class DistMult(ERModel):
    def __init__(
        self,
        # When defining your class, any hyper-parameters that can be configured should be
        # made as arguments to the __init__() function. When running the pipeline(), these
        # are passed via the ``model_kwargs``.
        embedding_dim: int = 50,
        # All remaining arguments are simply passed through to the parent constructor. If you
        # want access to them, you can name them explicitly. See the pykeen.models.ERModel
        # documentation for a full list
        **kwargs,
    ) -> None:
        # Since this is a Python class, you can feel free to get creative here. One example of
        # pre-processing is to derive the shape for the relation representation based on the
        # embedding dimension.
        super().__init__(
            # Pass your interaction module, either as a class (optionally together with
            # ``interaction_kwargs``) or as an instance. This is also a place where you can
            # pass hyper-parameters, such as the L_p norm, from the KGEM to the interaction function.
            interaction=DistMultInteraction,
            # interaction_kwargs=dict(...),
            # Define the entity representations by giving a representation class and its
            # keyword arguments. By default, each embedding is a vector. You can use the
            # ``shape`` kwarg to specify higher dimensional tensor shapes.
            entity_representations=Embedding,
            entity_representations_kwargs=dict(
                embedding_dim=embedding_dim,
            ),
            # Define the relation representations the same as the entities
            relation_representations=Embedding,
            relation_representations_kwargs=dict(
                embedding_dim=embedding_dim,
            ),
            # All other arguments are passed through, such as the ``triples_factory``, ``loss``,
            # ``preferred_device``, and others. These are all handled by the pipeline() function
            **kwargs,
        )

The actual implementation of DistMult can be found in pykeen.models.DistMult. Note that it additionally contains configuration for the initializers, constrainers, and regularizers for each of the embeddings as well as class-level defaults for hyper-parameters and hyper-parameter optimization. Modifying these is covered in other tutorials.

Specifying Defaults

If you have a preferred loss function for your model, you can add the loss_default class variable where the value is the loss class.

from typing import ClassVar, Type

from pykeen.models import ERModel
from pykeen.losses import Loss, NSSALoss

class DistMult(ERModel):
    loss_default: ClassVar[Type[Loss]] = NSSALoss
    ...

Now, when using the pipeline, the pykeen.losses.NSSALoss loss is used by default if none is given. The same kind of modification can be made to set a default regularizer with regularizer_default.
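For example, a sketch of setting a default regularizer (assuming pykeen.regularizers.LpRegularizer as the choice):

from typing import ClassVar, Type

from pykeen.models import ERModel
from pykeen.regularizers import LpRegularizer, Regularizer

class DistMult(ERModel):
    regularizer_default: ClassVar[Type[Regularizer]] = LpRegularizer
    ...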

Specifying Hyper-parameter Optimization Default Ranges

All subclasses of pykeen.models.Model can specify the default ranges or values used during hyper-parameter optimization (HPO). PyKEEN implements a simple dictionary-based configuration that is interpreted by pykeen.hpo.hpo.suggest_kwargs() in the HPO pipeline.

HPO default ranges can be applied to all keyword arguments appearing in the __init__() function of your model by setting a class-level variable called hpo_default.

For example, the embedding_dim can be specified as being on a range between 100 and 150 with the following:

class DistMult(ERModel):
    hpo_default = {
        'embedding_dim': dict(type=int, low=100, high=150)
    }
    ...

A step size can be imposed with q:

class DistMult(ERModel):
    hpo_default = {
        'embedding_dim': dict(type=int, low=100, high=150, q=5)
    }
    ...

An alternative scale can be imposed with scale. Right now, the default is linear, and scale can optionally be set to power_two for integers as in:

class DistMult(ERModel):
    hpo_default = {
        # will uniformly give 16, 32, 64, 128 (left inclusive, right exclusive)
        'hidden_dim': dict(type=int, low=4, high=8, scale='power_two')
    }
    ...

Warning

Alternative scales can not currently be used in combination with step size (q).

There are other possibilities for specifying the type as float, categorical, or as bool.

With float, you can neither use the q option nor set the scale to power_two, but the scale can be set to log (see optuna.distributions.LogUniformDistribution).

hpo_default = {
    # will uniformly give floats on the range [1.0, 2.0)
    'alpha': dict(type='float', low=1.0, high=2.0),

    # will give floats log-uniformly distributed on the range [1.0, 8.0)
    'beta': dict(type='float', low=1.0, high=8.0, scale='log'),
}

With categorical, you can form a dictionary like the following using type='categorical' and giving a choices entry that contains a sequence of either integers, floats, or strings.

hpo_default = {
    'similarity': dict(type='categorical', choices=[...])
}

With bool, you can simply use dict(type=bool) or dict(type='bool').
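For example, for a boolean hyper-parameter such as power_norm:

hpo_default = {
    # will suggest either True or False
    'power_norm': dict(type=bool),
}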

Note

The HPO rules are subject to change as they are tightly coupled to optuna, which since version 2.0.0 has introduced several new possibilities.

Implementing a model by instantiating pykeen.models.ERModel

Instead of creating a new class, you can also directly use pykeen.models.ERModel, e.g.:

from pykeen.models import ERModel
from pykeen.losses import BCEWithLogitsLoss

model = ERModel(
    triples_factory=...,
    loss="BCEWithLogits",
    interaction="transformer",
    entity_representations_kwargs=dict(embedding_dim=64),
    relation_representations_kwargs=dict(embedding_dim=64),
)

Using a Custom Model with the Pipeline

We can use this new model with all available losses, evaluators, training pipelines, and inverse triple modeling via the pykeen.pipeline.pipeline(), since in addition to the names of models (given as strings), it can also take model classes in the model argument.

from pykeen.pipeline import pipeline

pipeline(
    model=DistMult,
    dataset='Nations',
    loss='NSSA',
)

Pipeline

The PyKEEN pipeline and related wrapper functions.

Functions

pipeline_from_path(path, **kwargs)

Run the pipeline with configuration in a JSON/YAML file at the given path.

pipeline_from_config(config[, discard_seed])

Run the pipeline with a configuration dictionary.

replicate_pipeline_from_config(config, ...)

Run the same pipeline several times from a configuration dictionary.

replicate_pipeline_from_path(path, **kwargs)

Run the same pipeline several times from a configuration file by path.

pipeline(*[, dataset, dataset_kwargs, ...])

Train and evaluate a model.

plot_losses(pipeline_result, *[, ax])

Plot the losses per epoch.

plot_early_stopping(pipeline_result, *[, ...])

Plot the evaluations during early stopping.

plot_er(pipeline_result, *[, model, ...])

Plot the reduced entities and relation vectors in 2D.

plot(pipeline_result[, er_kwargs, figsize])

Plot all plots.

Classes

PipelineResult(random_seed, model, training, ...)

A dataclass containing the results of running pykeen.pipeline.pipeline().

Models

A knowledge graph embedding model is capable of computing real-valued scores representing the plausibility of a triple \((h,r,t) \in \mathbb{K}\), where a larger score indicates a higher plausibility. The interpretation of the score value is model-dependent, and usually it cannot be directly interpreted as a probability.

In PyKEEN, the API of a model is defined in Model, where the scoring function is exposed as Model.score_hrt(), which can be used to compute plausibility scores for (a batch of) triples. In addition, the Model class offers additional scoring methods, which can be used to (efficiently) compute scores for a large number of triples sharing some parts, e.g., to compute scores for triples (h, r, e) for a given (h, r) pair and all available entities \(e \in \mathcal{E}\).

Note

The implementations of the knowledge graph embedding models provided here all operate on entity / relation indices rather than string representations.

On top of these scoring methods, there are also corresponding prediction methods, e.g., Model.predict_hrt(). These methods extend the scoring ones, by ensuring the model is in evaluation mode, cf. torch.nn.Module.eval(), and optionally applying a sigmoid activation on the scores to ensure a value range of \([0, 1]\).
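For example, here is a small sketch of scoring a batch of index-based triples with a freshly initialized (untrained) model:

import torch
from pykeen.datasets import get_dataset
from pykeen.models import TransE

dataset = get_dataset(dataset="Nations")
model = TransE(triples_factory=dataset.training)
# a (batch_size, 3) LongTensor of (head, relation, tail) indices
hrt_batch = dataset.training.mapped_triples[:5]
scores = model.score_hrt(hrt_batch)   # raw scores, shape: (5, 1)
preds = model.predict_hrt(hrt_batch)  # the same scores, with the model in evaluation mode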

Warning

Depending on the model at hand, directly applying sigmoid might not always be sensible. For instance, distance-based interaction functions, such as pykeen.nn.modules.TransEInteraction, result in non-positive scores (since they use the negative distance as scoring function), and thus the output of the sigmoid only covers the interval \([0.5, 1]\).

Most models derive from ERModel, which is a generic implementation of a knowledge graph embedding model. It combines a variable number of representations for entities and relations, cf. pykeen.nn.representation.Representation, and an interaction function, cf. pykeen.nn.modules.Interaction. The representation modules convert integer entity or relation indices to numeric representations, e.g., vectors. The interaction function takes the representations of the head entities, relations, and tail entities as input and computes a scalar plausibility score for triples.

Note

An in-depth discussion of representation modules can be found in the corresponding tutorial.

Note

The specific models from this module, e.g., RESCAL, pair specific entity and relation representations with an interaction function. For more flexible combinations, consider using ERModel directly.

Functions

make_model(dimensions, interaction[, ...])

Build a model from an interaction class hint (name or class).

make_model_cls(dimensions, interaction[, ...])

Build a model class from an interaction class hint (name or class).

Classes

Model(*, triples_factory[, loss, ...])

A base module for KGE models.

ERModel(*, triples_factory, interaction[, ...])

A commonly useful base for KGEMs using embeddings and interaction modules.

InductiveERModel(*, triples_factory[, ...])

A base class for inductive models.

LiteralModel(triples_factory, interaction[, ...])

Base class for models with entity literals that uses combinations from pykeen.nn.combinations.

EvaluationOnlyModel(triples_factory)

A model which only implements the methods used for evaluation.

AutoSF([embedding_dim, num_components, ...])

An implementation of AutoSF from [zhang2020].

BoxE(*[, embedding_dim, tanh_map, p, ...])

An implementation of BoxE from [abboud2020].

CompGCN(*, triples_factory[, embedding_dim, ...])

An implementation of CompGCN from [vashishth2020].

ComplEx(*[, embedding_dim, ...])

An implementation of ComplEx [trouillon2016].

ComplExLiteral(triples_factory[, ...])

An implementation of the LiteralE model with the ComplEx interaction from [kristiadi2018].

ConvE(triples_factory[, input_channels, ...])

An implementation of ConvE from [dettmers2018].

ConvKB(*[, embedding_dim, ...])

An implementation of ConvKB from [nguyen2018].

CP([embedding_dim, rank, ...])

An implementation of CP as described in [lacroix2018] based on [hitchcock1927].

CrossE(*[, embedding_dim, ...])

An implementation of CrossE from [zhang2019b].

DistMA([embedding_dim, entity_initializer, ...])

An implementation of DistMA from [shi2019].

DistMult(*[, embedding_dim, ...])

An implementation of DistMult from [yang2014].

DistMultLiteral(triples_factory[, ...])

An implementation of the LiteralE model with the DistMult interaction from [kristiadi2018].

DistMultLiteralGated(triples_factory[, ...])

An implementation of the LiteralE model with the Gated DistMult interaction from [kristiadi2018].

ERMLP(*[, embedding_dim, hidden_dim, ...])

An implementation of ERMLP from [dong2014].

ERMLPE(*[, embedding_dim, hidden_dim, ...])

An extension of pykeen.models.ERMLP proposed by [sharifzadeh2019].

HolE(*[, embedding_dim, entity_initializer, ...])

An implementation of HolE [nickel2016].

KG2E(*[, embedding_dim, dist_similarity, ...])

An implementation of KG2E from [he2015].

FixedModel(*, triples_factory, **_kwargs)

A mock model returning fixed scores.

MuRE(*[, embedding_dim, p, power_norm, ...])

An implementation of MuRE from [balazevic2019b].

NodePiece(*, triples_factory[, num_tokens, ...])

A wrapper which combines an interaction function with NodePiece entity representations from [galkin2021].

NTN(*[, embedding_dim, num_slices, ...])

An implementation of NTN from [socher2013].

PairRE([embedding_dim, p, power_norm, ...])

An implementation of PairRE from [chao2020].

ProjE(*[, embedding_dim, ...])

An implementation of ProjE from [shi2017].

QuatE(*[, embedding_dim, ...])

An implementation of QuatE from [zhang2019].

RESCAL(*[, embedding_dim, ...])

An implementation of RESCAL from [nickel2011].

RGCN(*, triples_factory[, embedding_dim, ...])

An implementation of R-GCN from [schlichtkrull2018].

RotatE(*[, embedding_dim, ...])

An implementation of RotatE from [sun2019].

SimplE(*[, embedding_dim, clamp_score, ...])

An implementation of SimplE [kazemi2018].

SE(*[, embedding_dim, scoring_fct_norm, ...])

An implementation of the Structured Embedding (SE) published by [bordes2011].

TorusE([embedding_dim, p, power_norm, ...])

An implementation of TorusE from [ebisu2018].

TransD(*[, embedding_dim, relation_dim, ...])

An implementation of TransD from [ji2015].

TransE(*[, embedding_dim, scoring_fct_norm, ...])

An implementation of TransE [bordes2013].

TransF([embedding_dim, entity_initializer, ...])

An implementation of TransF from [feng2016].

TransH(*[, embedding_dim, scoring_fct_norm, ...])

An implementation of TransH [wang2014].

TransR(*[, embedding_dim, relation_dim, ...])

An implementation of TransR from [lin2015].

TuckER(*[, embedding_dim, relation_dim, ...])

An implementation of TuckER from [balazevic2019].

UM(*[, embedding_dim, scoring_fct_norm, ...])

An implementation of the Unstructured Model (UM) published by [bordes2014].

InductiveNodePiece(*, triples_factory, ...)

A wrapper which combines an interaction function with NodePiece entity representations from [galkin2021].

InductiveNodePieceGNN(*[, gnn_encoder])

Inductive NodePiece with a GNN encoder on top.

SoftInverseTripleBaseline(triples_factory[, ...])

Score based on relation similarity.

MarginalDistributionBaseline(triples_factory)

Score based on marginal distributions.

CooccurrenceFilteredModel(*, triples_factory)

A model which filters predictions by co-occurence.

Class Inheritance Diagram

(Graphviz source of the inheritance diagram omitted. In brief: Model derives from torch.nn.Module and ABC; most concrete models, e.g., TransE, RotatE, and DistMult, derive from ERModel; the baselines derive from EvaluationOnlyModel; and the inductive models derive from InductiveERModel.)

Datasets

pykeen.datasets Package

Built-in datasets for PyKEEN.

New datasets (inheriting from pykeen.datasets.Dataset) can be registered with PyKEEN using the pykeen.datasets group in Python entrypoints in your own setup.py or setup.cfg package configuration. They are loaded automatically with pkg_resources.iter_entry_points().

Functions

get_dataset(*[, dataset, dataset_kwargs, ...])

Get a dataset, cached based on the given kwargs.

has_dataset(key)

Return if the dataset is registered in PyKEEN.

Classes

Dataset()

The base dataset class.

AristoV4(**kwargs)

The Aristo-v4 dataset from [chen2021].

Hetionet([random_state])

The Hetionet dataset from [himmelstein2017].

Kinships(**kwargs)

The Kinships dataset.

Nations(**kwargs)

The Nations dataset.

OpenBioLink(**kwargs)

The OpenBioLink dataset.

OpenBioLinkLQ(**kwargs)

The low-quality variant of the OpenBioLink dataset.

CoDExSmall(**kwargs)

The CoDEx small dataset.

CoDExMedium(**kwargs)

The CoDEx medium dataset.

CoDExLarge(**kwargs)

The CoDEx large dataset.

CN3l([graph_pair])

The CN3l dataset family.

OGBBioKG([cache_root, create_inverse_triples])

The OGB BioKG dataset.

OGBWikiKG2([cache_root, create_inverse_triples])

The OGB WikiKG2 dataset.

UMLS(**kwargs)

The UMLS dataset.

FB15k(**kwargs)

The FB15k dataset.

FB15k237(**kwargs)

The FB15k-237 dataset.

WK3l15k([graph_pair])

The WK3l-15k dataset family.

WK3l120k([graph_pair])

The WK3l-120k dataset family.

WN18(**kwargs)

The WN18 dataset.

WN18RR(**kwargs)

The WN18-RR dataset.

YAGO310(**kwargs)

The YAGO3-10 dataset is a subset of YAGO3 that only contains entities with at least 10 relations.

DRKG([random_state])

The DRKG dataset.

BioKG([random_state])

The BioKG dataset from [walsh2020].

ConceptNet([random_state])

The ConceptNet dataset from [speer2017].

CKG([random_state])

The Clinical Knowledge Graph (CKG) dataset from [santos2020].

CSKG([random_state])

The CSKG dataset.

DBpedia50(**kwargs)

The DBpedia50 dataset.

DB100K(**kwargs)

The DB100K dataset from [ding2018].

OpenEA(*[, graph_pair, size, version])

The OpenEA dataset family.

Countries(**kwargs)

The Countries dataset.

WD50KT(**kwargs)

The triples-only version of WD50K.

Wikidata5M(**kwargs)

The Wikidata5M dataset from [wang2019].

PharmKG8k(**kwargs)

The PharmKG8k dataset from [zheng2020].

PharmKG([random_state])

The PharmKGFull dataset from [zheng2020].

PrimeKG([random_state])

The Precision Medicine Knowledge Graph (PrimeKG) dataset from [chandak2022].

Globi([random_state])

The Global Biotic Interactions (GloBI) dataset.

PharMeBINet([random_state])

The PharMeBINet dataset from [koenigs2022].

Class Inheritance Diagram

[Class inheritance diagram (Graphviz): Dataset builds on ExtraReprMixin and splits into EagerDataset and LazyDataset; LazyDataset branches into PathDataset, TabbedDataset, CompressedSingleDataset, PackedZipRemoteDataset, and OGBLoader; each concrete dataset above derives from one of these bases, e.g., PathDataset → Kinships, Nations, UMLS; UnpackedRemoteDataset → CoDExSmall, CoDExMedium, CoDExLarge, Countries, DB100K, DBpedia50, PharmKG8k, WD50KT; TarFileRemoteDataset → FB15k, WN18, WN18RR, Wikidata5M, YAGO310; PackedZipRemoteDataset → AristoV4, FB15k237, OpenBioLink, OpenBioLinkLQ; SingleTabbedDataset → ConceptNet, CSKG, Globi, Hetionet, PharmKG, PrimeKG; TarFileSingleDataset → DRKG, PharMeBINet; ZipSingleDataset → BioKG; TabbedDataset → CKG; EADataset → OpenEA and (via MTransEDataset) CN3l, WK3l15k, WK3l120k; OGBLoader → OGBBioKG, OGBWikiKG2.]

pykeen.datasets.base Module

Utility classes for constructing datasets.

Functions

dataset_similarity(a, b[, metric])

Calculate the similarity between two datasets.

Classes

Dataset()

The base dataset class.

EagerDataset(training, testing[, ...])

A dataset whose training, testing, and optional validation factories are pre-loaded.

LazyDataset()

A dataset whose training, testing, and optional validation factories are lazily loaded.

PathDataset(training_path, testing_path, ...)

Contains a lazy reference to a training, testing, and validation dataset.

RemoteDataset(url, relative_training_path, ...)

Contains a lazy reference to a remote dataset that is loaded if needed.

UnpackedRemoteDataset(training_url, ...[, ...])

A dataset with all three of train, test, and validation sets as URLs.

TarFileRemoteDataset(url, ...[, cache_root, ...])

A remote dataset stored as a tar file.

PackedZipRemoteDataset(...[, url, name, ...])

Contains a lazy reference to a remote dataset that is loaded if needed.

CompressedSingleDataset(url, relative_path)

Loads a dataset that's a single file inside an archive.

TarFileSingleDataset(url, relative_path[, ...])

Loads a dataset that's a single file inside a tar.gz archive.

ZipSingleDataset(url, relative_path[, name, ...])

Loads a dataset that's a single file inside a zip archive.

TabbedDataset([cache_root, eager, ...])

This class is for when you've got a single TSV of edges and want them to get auto-split.

SingleTabbedDataset(url[, name, cache_root, ...])

This class is for when you've got a single TSV of edges and want them to get auto-split.
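
As an illustrative sketch, a PathDataset wraps pre-split triples files; the TSV paths below are hypothetical, and the keyword names follow the signature shown above:

>>> from pykeen.datasets.base import PathDataset
>>> dataset = PathDataset(
...     training_path='train.tsv',
...     testing_path='test.tsv',
...     validation_path='valid.tsv',
... )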

Class Inheritance Diagram

[Class inheritance diagram (Graphviz): ExtraReprMixin → Dataset; Dataset → EagerDataset, LazyDataset; LazyDataset → CompressedSingleDataset, PackedZipRemoteDataset, PathDataset, TabbedDataset; PathDataset → RemoteDataset, UnpackedRemoteDataset; RemoteDataset → TarFileRemoteDataset; CompressedSingleDataset → TarFileSingleDataset, ZipSingleDataset; TabbedDataset → SingleTabbedDataset.]

pykeen.datasets.analysis Module

Dataset analysis utilities.

Functions

get_relation_count_df(dataset[, ...])

Create a dataframe with relation counts.

get_entity_count_df(dataset[, merge_sides, ...])

Create a dataframe with entity counts.

get_entity_relation_co_occurrence_df(dataset)

Create a dataframe of entity/relation co-occurrence.

get_relation_functionality_df(*, dataset[, ...])

Calculate the functionality and inverse functionality score per relation.

get_relation_pattern_types_df(dataset, *[, ...])

Categorize relations based on patterns from RotatE [sun2019].

get_relation_cardinality_types_df(*, dataset)

Determine the relation cardinality types.
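
A minimal sketch of how these utilities are typically used (assuming the Nations dataset; the exact dataframe columns depend on the keyword arguments):

>>> from pykeen.datasets import get_dataset
>>> from pykeen.datasets.analysis import get_relation_count_df
>>> dataset = get_dataset(dataset='Nations')
>>> df = get_relation_count_df(dataset=dataset)  # one count per relation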

Inductive Datasets

pykeen.datasets.inductive Package

Inductive models in PyKEEN.

Classes

InductiveDataset()

Contains transductive train and inductive inference/validation/test datasets.

EagerInductiveDataset(transductive_training, ...)

An eager inductive dataset.

LazyInductiveDataset()

An inductive dataset that has lazy loading.

DisjointInductivePathDataset(...[, eager, ...])

A disjoint inductive dataset specified by paths.

UnpackedRemoteDisjointInductiveDataset(...)

A dataset with all four of train, inductive_inference, inductive test, and inductive validation sets as URLs.

InductiveFB15k237([version])

The inductive FB15k-237 dataset in 4 versions.

InductiveWN18RR([version])

The inductive WN18RR dataset in 4 versions.

InductiveNELL([version])

The inductive NELL dataset in 4 versions.

ILPC2022Large(**kwargs)

An inductive link prediction dataset for the ILPC 2022 Challenge.

ILPC2022Small(**kwargs)

An inductive link prediction dataset for the ILPC 2022 Challenge.
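
For example, an inductive benchmark can be loaded by version; this sketch assumes the version strings 'v1' through 'v4' and the attribute names implied by InductiveDataset above:

>>> from pykeen.datasets.inductive import InductiveFB15k237
>>> dataset = InductiveFB15k237(version='v1')
>>> dataset.transductive_training  # the training graph
>>> dataset.inductive_inference  # the disjoint inference graph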

Class Inheritance Diagram

[Class inheritance diagram (Graphviz): InductiveDataset → EagerInductiveDataset, LazyInductiveDataset; LazyInductiveDataset → DisjointInductivePathDataset → UnpackedRemoteDisjointInductiveDataset → ILPC2022Large, ILPC2022Small, InductiveFB15k237, InductiveNELL, InductiveWN18RR.]

Entity Alignment

pykeen.datasets.ea.combination Module

Combination strategies for entity alignment datasets.

Classes

GraphPairCombinator()

A base class for combining a graph pair into a single graph.

DisjointGraphPairCombinator()

This combinator keeps both graphs as disconnected components.

SwapGraphPairCombinator()

Add extra triples by swapping aligned entities.

ExtraRelationGraphPairCombinator()

This combinator keeps all entities, but introduces a novel alignment relation.

CollapseGraphPairCombinator()

This combinator merges all matching entity pairs into a single ID.

ProcessedTuple(mapped_triples, alignment, ...)

The result of processing a pair of triples factories.

Class Inheritance Diagram

[Class inheritance diagram (Graphviz): ABC → GraphPairCombinator → CollapseGraphPairCombinator, DisjointGraphPairCombinator, ExtraRelationGraphPairCombinator, SwapGraphPairCombinator; ProcessedTuple has no combinator parent.]

Triples

Classes for creating and storing training data from triples.

class CoreTriplesFactory(mapped_triples, num_entities, num_relations, create_inverse_triples=False, metadata=None)[source]

Create instances from ID-based triples.

Create the triples factory.

Parameters:
  • mapped_triples (Union[LongTensor, ndarray]) – shape: (n, 3) A three-column matrix where each row contains the head identifier, relation identifier, and tail identifier.

  • num_entities (int) – The number of entities.

  • num_relations (int) – The number of relations.

  • create_inverse_triples (bool) – Whether to create inverse triples.

  • metadata (Optional[Mapping[str, Any]]) – Arbitrary metadata to go with the graph

Raises:
  • TypeError – if the mapped_triples are of non-integer dtype

  • ValueError – if the mapped_triples are of invalid shape

clone_and_exchange_triples(mapped_triples, extra_metadata=None, keep_metadata=True, create_inverse_triples=None)[source]

Create a new triples factory sharing everything except the triples.

Note

We use shallow copies.

Parameters:
  • mapped_triples (LongTensor) – The new mapped triples.

  • extra_metadata (Optional[Dict[str, Any]]) – Extra metadata to include in the new triples factory. If keep_metadata is true, the dictionaries will be unioned with precedence taken on keys from extra_metadata.

  • keep_metadata (bool) – Pass the current factory’s metadata to the new triples factory

  • create_inverse_triples (Optional[bool]) – Change inverse triple creation flag. If None, use flag from this factory.

Return type:

CoreTriplesFactory

Returns:

The new factory.

classmethod create(mapped_triples, num_entities=None, num_relations=None, create_inverse_triples=False, metadata=None)[source]

Create a triples factory without any label information.

Parameters:
  • mapped_triples (LongTensor) – shape: (n, 3) The ID-based triples.

  • num_entities (Optional[int]) – The number of entities. If not given, inferred from mapped_triples.

  • num_relations (Optional[int]) – The number of relations. If not given, inferred from mapped_triples.

  • create_inverse_triples (bool) – Whether to create inverse triples.

  • metadata (Optional[Mapping[str, Any]]) – Additional metadata to store in the factory.

Return type:

CoreTriplesFactory

Returns:

A new triples factory.
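
A minimal, self-contained sketch of create() with a made-up toy graph:

>>> import torch
>>> from pykeen.triples import CoreTriplesFactory
>>> mapped_triples = torch.as_tensor([[0, 0, 1], [1, 1, 2]], dtype=torch.long)
>>> factory = CoreTriplesFactory.create(mapped_triples)
>>> factory.num_entities, factory.num_relations  # inferred from the triples
(3, 2)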

create_lcwa_instances(use_tqdm=None, target=None)[source]

Create LCWA instances for this factory’s triples.

Return type:

Dataset

Parameters:
  • use_tqdm (bool | None) –

  • target (int | None) –

create_slcwa_instances(*, sampler=None, **kwargs)[source]

Create sLCWA instances for this factory’s triples.

Return type:

Dataset

Parameters:

sampler (str | None) –

entities_to_ids(entities)[source]

Normalize entities to IDs.

Parameters:

entities (Union[Collection[int], Collection[str]]) – A collection of either integer identifiers for entities or string labels for entities (that will get auto-converted)

Return type:

Collection[int]

Returns:

Integer identifiers for entities

Raises:

ValueError – If the entities passed are string labels and this triples factory does not have an entity label to identifier mapping (e.g., it’s just a base CoreTriplesFactory instance)
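
For instance (a sketch assuming a label-aware TriplesFactory, here taken from the Nations dataset):

>>> from pykeen.datasets import Nations
>>> factory = Nations().training
>>> factory.entities_to_ids(['brazil', 'uk'])  # labels are converted to IDs
>>> factory.entities_to_ids([0, 1])  # integer IDs pass through unchanged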

classmethod from_path_binary(path)[source]

Load triples factory from a binary file.

Parameters:

path (Union[str, Path, TextIO]) – The path, pointing to an existing PyTorch .pt file.

Return type:

CoreTriplesFactory

Returns:

The loaded triples factory.

get_inverse_relation_id(relation)[source]

Get the inverse relation identifier for the given relation.

Return type:

int

Parameters:

relation (int) –

get_mask_for_relations(relations, invert=False)[source]

Get a boolean mask for triples with the given relations.

Return type:

BoolTensor

Parameters:
  • relations (Collection[int]) –

  • invert (bool) –

get_most_frequent_relations(n)[source]

Get the IDs of the n most frequent relations.

Parameters:

n (Union[int, float]) – Either the (integer) number of top relations to keep or the (float) percentage of top relations to keep.

Return type:

Set[int]

Returns:

A set of IDs for the n most frequent relations

Raises:

TypeError – If the n is the wrong type

iter_extra_repr()[source]

Iterate over extra_repr components.

Return type:

Iterable[str]

new_with_restriction(entities=None, relations=None, invert_entity_selection=False, invert_relation_selection=False)[source]

Make a new triples factory only keeping the given entities and relations, but keeping the ID mapping.

Parameters:
  • entities (Union[None, Collection[int], Collection[str]]) – The entities of interest. If None, defaults to all entities.

  • relations (Union[None, Collection[int], Collection[str]]) – The relations of interest. If None, defaults to all relations.

  • invert_entity_selection (bool) – Whether to invert the entity selection, i.e. select those triples without the provided entities.

  • invert_relation_selection (bool) – Whether to invert the relation selection, i.e. select those triples without the provided relations.

Return type:

CoreTriplesFactory

Returns:

A new triples factory, which has only a subset of the triples containing the entities and relations of interest. The label-to-ID mapping is not modified.

property num_triples: int

The number of triples.

Return type:

int

relations_to_ids(relations)[source]

Normalize relations to IDs.

Parameters:

relations (Union[Collection[int], Collection[str]]) – A collection of either integer identifiers for relations or string labels for relations (that will get auto-converted)

Return type:

Collection[int]

Returns:

Integer identifiers for relations

Raises:

ValueError – If the relations passed are string labels and this triples factory does not have a relation label to identifier mapping (e.g., it’s just a base CoreTriplesFactory instance)

split(ratios=0.8, *, random_state=None, randomize_cleanup=False, method=None)[source]

Split a triples factory into a train/test.

Parameters:
  • ratios (Union[float, Sequence[float]]) –

    There are three options for this argument:

    1. A single float strictly between 0 and 1.0. The first set of triples gets this ratio and the second gets the rest.

    2. A list of ratios that assigns, in order, a ratio to each split, as in [0.8, 0.1]. The final ratio can be omitted because it can be calculated from the rest.

    3. All ratios set explicitly and in order, as in [0.8, 0.1, 0.1], where the ratios sum to 1.0.

  • random_state (Union[None, int, Generator]) – The random state used to shuffle and split the triples.

  • randomize_cleanup (bool) – If true, uses the non-deterministic method for moving triples to the training set. This has the advantage that it does not necessarily have to move all of them, but it might be significantly slower since it moves one triple at a time.

  • method (Optional[str]) – The name of the method to use, from SPLIT_METHODS. Defaults to “coverage”.

Return type:

List[CoreTriplesFactory]

Returns:

A partition of the triples, split (approximately) according to the ratios and stored in new TriplesFactory objects that share everything else with this root triples factory.

from pykeen.datasets import Nations

factory = Nations().training  # any triples factory can be split

ratio = 0.8  # makes a [0.8, 0.2] split
training_factory, testing_factory = factory.split(ratio)

ratios = [0.8, 0.1]  # makes a [0.8, 0.1, 0.1] split
training_factory, testing_factory, validation_factory = factory.split(ratios)

ratios = [0.8, 0.1, 0.1]  # also makes a [0.8, 0.1, 0.1] split
training_factory, testing_factory, validation_factory = factory.split(ratios)

tensor_to_df(tensor, **kwargs)[source]

Take a tensor of triples and make a pandas dataframe with labels.

Parameters:
  • tensor (LongTensor) – shape: (n, 3) The triples, ID-based and in format (head_id, relation_id, tail_id).

  • kwargs (Union[Tensor, ndarray, Sequence]) – Any additional number of columns. Each column needs to be of shape (n,). Reserved column names: {“head_id”, “head_label”, “relation_id”, “relation_label”, “tail_id”, “tail_label”}.

Return type:

DataFrame

Returns:

A dataframe with n rows, and 6 + len(kwargs) columns.

to_path_binary(path)[source]

Save triples factory to path in (PyTorch’s .pt) binary format.

Parameters:

path (Union[str, Path, TextIO]) – The path to store the triples factory to.

Return type:

Path

Returns:

The path to the file that got dumped
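
A sketch of a save/load round trip, continuing from the factory above (the file name is hypothetical):

>>> from pykeen.triples import CoreTriplesFactory
>>> saved_path = factory.to_path_binary('nations_train.pt')
>>> restored = CoreTriplesFactory.from_path_binary(saved_path)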

with_labels(entity_to_id, relation_to_id)[source]

Add labeling to the TriplesFactory.

Return type:

TriplesFactory

Parameters:
  • entity_to_id (Mapping[str, int]) –

  • relation_to_id (Mapping[str, int]) –

class Instances[source]

Base class for training instances.

classmethod from_triples(mapped_triples, *, num_entities, num_relations, **kwargs)[source]

Create instances from mapped triples.

Parameters:
  • mapped_triples (LongTensor) – shape: (num_triples, 3) The ID-based triples.

  • num_entities (int) – >0 The number of entities.

  • num_relations (int) – >0 The number of relations.

  • kwargs – additional keyword-based parameters.

Return type:

Instances

Returns:

The instances.

get_collator()[source]

Get a collator.

Return type:

Optional[Callable[[List[~SampleType]], ~BatchType]]

class KGInfo(num_entities, num_relations, create_inverse_triples)[source]

An object storing information about the number of entities and relations.

Initialize the information object.

Parameters:
  • num_entities (int) – the number of entities.

  • num_relations (int) – the number of relations, excluding artificial inverse relations.

  • create_inverse_triples (bool) – whether to create inverse triples

create_inverse_triples: bool

whether to create inverse triples

iter_extra_repr()[source]

Iterate over extra_repr components.

Return type:

Iterable[str]

num_entities: int

the number of unique entities

num_relations: int

the number of relations (possibly including “artificial” inverse relations)

real_num_relations: int

the number of real relations, i.e., without artificial inverses
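
A small sketch of how these counts relate (assuming KGInfo is importable from pykeen.triples and that creating inverse triples doubles the relation count, as the attributes above suggest):

>>> from pykeen.triples import KGInfo
>>> info = KGInfo(num_entities=14, num_relations=55, create_inverse_triples=True)
>>> info.real_num_relations  # the relations actually present in the graph
>>> info.num_relations  # includes one artificial inverse per relation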

class LCWAInstances(*, pairs, compressed)[source]

Triples and mappings to their indices for LCWA.

Initialize the LCWA instances.

Parameters:
  • pairs (ndarray) – The unique pairs

  • compressed (csr_matrix) – The compressed triples in CSR format

classmethod from_triples(mapped_triples, *, num_entities, num_relations, target=None, **kwargs)[source]

Create LCWA instances from triples.

Parameters:
  • mapped_triples (LongTensor) – shape: (num_triples, 3) The ID-based triples.

  • num_entities (int) – The number of entities.

  • num_relations (int) – The number of relations.

  • target (Optional[int]) – The column to predict

  • kwargs – Keyword arguments (thrown out)

Return type:

Instances

Returns:

The instances.

class SLCWAInstances(*, mapped_triples, num_entities=None, num_relations=None, negative_sampler=None, negative_sampler_kwargs=None)[source]

Training instances for the sLCWA.

Initialize the sLCWA instances.

Parameters:
  • mapped_triples (LongTensor) – shape: (num_triples, 3) the ID-based triples, passed to the negative sampler

  • num_entities (Optional[int]) – >0 the number of entities, passed to the negative sampler

  • num_relations (Optional[int]) – >0 the number of relations, passed to the negative sampler

  • negative_sampler (Union[str, NegativeSampler, Type[NegativeSampler], None]) – the negative sampler, or a hint thereof

  • negative_sampler_kwargs (Optional[Mapping[str, Any]]) – additional keyword-based arguments passed to the negative sampler

static collate(samples)[source]

Collate samples.

Return type:

SLCWABatch

Parameters:

samples (Iterable[Tuple[LongTensor, LongTensor, BoolTensor | None]]) –

classmethod from_triples(mapped_triples, *, num_entities, num_relations, **kwargs)[source]

Create instances from mapped triples.

Parameters:
  • mapped_triples (LongTensor) – shape: (num_triples, 3) The ID-based triples.

  • num_entities (int) – >0 The number of entities.

  • num_relations (int) – >0 The number of relations.

  • kwargs – additional keyword-based parameters.

Return type:

Instances

Returns:

The instances.

get_collator()[source]

Get a collator.

Return type:

Optional[Callable[[List[Tuple[LongTensor, LongTensor, Optional[BoolTensor]]]], SLCWABatch]]

class TriplesFactory(mapped_triples, entity_to_id, relation_to_id, create_inverse_triples=False, metadata=None, num_entities=None, num_relations=None)[source]

Create instances given the path to triples.

Create the triples factory.

Parameters:
  • mapped_triples (Union[LongTensor, ndarray]) – shape: (n, 3) A three-column matrix where each row contains the head identifier, relation identifier, and tail identifier.

  • entity_to_id (Mapping[str, int]) – The mapping from entities’ labels to their indices.

  • relation_to_id (Mapping[str, int]) – The mapping from relations’ labels to their indices.

  • create_inverse_triples (bool) – Whether to create inverse triples.

  • metadata (Optional[Mapping[str, Any]]) – Arbitrary metadata to go with the graph

  • num_entities (Optional[int]) – the number of entities. May be None, in which case it is inferred from the label mapping.

  • num_relations (Optional[int]) – the number of relations. May be None, in which case it is inferred from the label mapping.

Raises:

ValueError – if the explicitly provided number of entities or relations does not match the one given by the label mapping

clone_and_exchange_triples(mapped_triples, extra_metadata=None, keep_metadata=True, create_inverse_triples=None)[source]

Create a new triples factory sharing everything except the triples.

Note

We use shallow copies.

Parameters:
  • mapped_triples (LongTensor) – The new mapped triples.

  • extra_metadata (Optional[Dict[str, Any]]) – Extra metadata to include in the new triples factory. If keep_metadata is true, the dictionaries will be unioned with precedence taken on keys from extra_metadata.

  • keep_metadata (bool) – Pass the current factory’s metadata to the new triples factory

  • create_inverse_triples (Optional[bool]) – Change inverse triple creation flag. If None, use flag from this factory.

Return type:

TriplesFactory

Returns:

The new factory.

entities_to_ids(entities)[source]

Normalize entities to IDs.

Parameters:

entities (Union[Collection[int], Collection[str]]) – A collection of either integer identifiers for entities or string labels for entities (that will get auto-converted)

Return type:

Collection[int]

Returns:

Integer identifiers for entities

Raises:

ValueError – If the entities passed are string labels and this triples factory does not have an entity label to identifier mapping (e.g., it’s just a base CoreTriplesFactory instance)

property entity_id_to_label: Mapping[int, str]

Return the mapping from entity IDs to labels.

Return type:

Mapping[int, str]

property entity_to_id: Mapping[str, int]

Return the mapping from entity labels to IDs.

Return type:

Mapping[str, int]

entity_word_cloud(top=None)[source]

Make a word cloud, for display in a Jupyter notebook, based on the frequency of occurrence of each entity.

Parameters:

top (Optional[int]) – The number of top entities to show. Defaults to 100.

Returns:

A word cloud object for a Jupyter notebook

Warning

This function requires the wordcloud package. Use pip install pykeen[wordcloud] to install it.

classmethod from_labeled_triples(triples, *, create_inverse_triples=False, entity_to_id=None, relation_to_id=None, compact_id=True, filter_out_candidate_inverse_relations=True, metadata=None)[source]

Create a new triples factory from label-based triples.

Parameters:
  • triples (ndarray) – shape: (n, 3), dtype: str The label-based triples.

  • create_inverse_triples (bool) – Whether to create inverse triples.

  • entity_to_id (Optional[Mapping[str, int]]) – The mapping from entity labels to ID. If None, create a new one from the triples.

  • relation_to_id (Optional[Mapping[str, int]]) – The mapping from relations labels to ID. If None, create a new one from the triples.

  • compact_id (bool) – Whether to compact IDs such that the IDs are consecutive.

  • filter_out_candidate_inverse_relations (bool) – Whether to remove triples with relations with the inverse suffix.

  • metadata (Optional[Dict[str, Any]]) – Arbitrary key/value pairs to store as metadata

Return type:

TriplesFactory

Returns:

A new triples factory.
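
A minimal sketch with a made-up label-based array:

>>> import numpy as np
>>> from pykeen.triples import TriplesFactory
>>> triples = np.array([
...     ['brazil', 'accuses', 'uk'],
...     ['uk', 'accuses', 'brazil'],
... ])
>>> factory = TriplesFactory.from_labeled_triples(triples)
>>> factory.num_entities, factory.num_relations
(2, 1)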

classmethod from_path(path, *, create_inverse_triples=False, entity_to_id=None, relation_to_id=None, compact_id=True, metadata=None, load_triples_kwargs=None, **kwargs)[source]

Create a new triples factory from triples stored in a file.

Parameters:
  • path (Union[str, Path, TextIO]) – The path where the label-based triples are stored.

  • create_inverse_triples (bool) – Whether to create inverse triples.

  • entity_to_id (Optional[Mapping[str, int]]) – The mapping from entity labels to ID. If None, create a new one from the triples.

  • relation_to_id (Optional[Mapping[str, int]]) – The mapping from relations labels to ID. If None, create a new one from the triples.

  • compact_id (bool) – Whether to compact IDs such that the IDs are consecutive.

  • metadata (Optional[Dict[str, Any]]) – Arbitrary key/value pairs to store as metadata with the triples factory. Do not include path as a key because it is automatically taken from the path kwarg to this function.

  • load_triples_kwargs (Optional[Mapping[str, Any]]) – Optional keyword arguments to pass to load_triples(). Could include the delimiter or a column_remapping.

  • kwargs – additional keyword-based parameters, which are ignored.

Return type:

TriplesFactory

Returns:

A new triples factory.
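
A sketch loading from a (hypothetical) tab-separated file of head, relation, and tail labels:

>>> from pykeen.triples import TriplesFactory
>>> factory = TriplesFactory.from_path('triples.tsv')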

get_inverse_relation_id(relation)[source]

Get the inverse relation identifier for the given relation.

Return type:

int

Parameters:

relation (str | int) –

get_mask_for_relations(relations, invert=False)[source]

Get a boolean mask for triples with the given relations.

Return type:

BoolTensor

Parameters:
  • relations (Union[Collection[int], Collection[str]]) –

  • invert (bool) –

label_triples(triples, unknown_entity_label='[UNKNOWN]', unknown_relation_label=None)[source]

Convert ID-based triples to label-based ones.

Parameters:
  • triples (LongTensor) – The ID-based triples.

  • unknown_entity_label (str) – The label to use for unknown entity IDs.

  • unknown_relation_label (Optional[str]) – The label to use for unknown relation IDs.

Return type:

ndarray

Returns:

The same triples, but labeled.

map_triples(triples)[source]

Convert label-based triples to ID-based triples.

Return type:

LongTensor

Parameters:

triples (ndarray) –
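
A round-trip sketch, reusing the factory from the from_labeled_triples example above:

>>> import numpy as np
>>> mapped = factory.map_triples(np.array([['brazil', 'accuses', 'uk']]))
>>> factory.label_triples(mapped)  # recovers the original labels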

new_with_restriction(entities=None, relations=None, invert_entity_selection=False, invert_relation_selection=False)[source]

Make a new triples factory only keeping the given entities and relations, but keeping the ID mapping.

Parameters:
  • entities (Union[None, Collection[int], Collection[str]]) – The entities of interest. If None, defaults to all entities.

  • relations (Union[None, Collection[int], Collection[str]]) – The relations of interest. If None, defaults to all relations.

  • invert_entity_selection (bool) – Whether to invert the entity selection, i.e. select those triples without the provided entities.

  • invert_relation_selection (bool) – Whether to invert the relation selection, i.e. select those triples without the provided relations.

Return type:

TriplesFactory

Returns:

A new triples factory, which has only a subset of the triples containing the entities and relations of interest. The label-to-ID mapping is not modified.
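A minimal sketch restricting the Nations training factory to two relations, selected here by (arbitrarily chosen) integer IDs; string labels are accepted as well:

from pykeen.datasets import Nations

tf = Nations().training
# keep only triples whose relation has ID 0 or 1; the ID mapping is unchanged
restricted = tf.new_with_restriction(relations=[0, 1])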

property relation_id_to_label: Mapping[int, str]

Return the mapping from relation IDs to labels.

Return type:

Mapping[int, str]

property relation_to_id: Mapping[str, int]

Return the mapping from relation labels to IDs.

Return type:

Mapping[str, int]

relation_word_cloud(top=None)[source]

Make a word cloud of relation frequencies for display in a Jupyter notebook.

Parameters:

top (Optional[int]) – The number of top relations to show. Defaults to 100.

Returns:

A word cloud object for display in a Jupyter notebook

Warning

This function requires the wordcloud package. Use pip install pykeen[wordcloud] to install it.

relations_to_ids(relations)[source]

Normalize relations to IDs.

Parameters:

relations (Union[Collection[int], Collection[str]]) – A collection of either integer identifiers for relations or string labels for relations (that will get auto-converted)

Return type:

Collection[int]

Returns:

Integer identifiers for relations

Raises:

ValueError – If the relations passed are string labels and this triples factory does not have a relation label to identifier mapping (e.g., it’s just a base CoreTriplesFactory instance)

tensor_to_df(tensor, **kwargs)[source]

Take a tensor of triples and make a pandas dataframe with labels.

Parameters:
  • tensor (LongTensor) – shape: (n, 3) The triples, ID-based and in format (head_id, relation_id, tail_id).

  • kwargs (Union[Tensor, ndarray, Sequence]) – Any additional number of columns. Each column needs to be of shape (n,). Reserved column names: {“head_id”, “head_label”, “relation_id”, “relation_label”, “tail_id”, “tail_label”}.

Return type:

DataFrame

Returns:

A dataframe with n rows, and 6 + len(kwargs) columns.
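A short sketch labeling the first few ID-based triples of the Nations training factory as a dataframe:

from pykeen.datasets import Nations

tf = Nations().training
# columns: head_id, head_label, relation_id, relation_label, tail_id, tail_label
df = tf.tensor_to_df(tf.mapped_triples[:5])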

to_core_triples_factory()[source]

Return this factory as a core factory.

Return type:

CoreTriplesFactory

to_path_binary(path)[source]

Save the triples factory to a path in PyTorch's .pt binary format.

Parameters:

path (Union[str, Path, TextIO]) – The path to store the triples factory to.

Return type:

Path

Returns:

The path to the file that got dumped

property triples: ndarray

The labeled triples: a 3-column matrix in which each row contains the head label, relation label, and tail label.

Return type:

ndarray

class TriplesNumericLiteralsFactory(*, numeric_literals, literals_to_id, **kwargs)[source]

Create multi-modal instances given the path to triples.

Initialize the multi-modal triples factory.

Parameters:
  • numeric_literals (ndarray) – shape: (num_entities, num_literals) the numeric literals as a dense matrix.

  • literals_to_id (Mapping[str, int]) – a mapping from literal names to their IDs, i.e., the columns in the numeric_literals matrix.

  • kwargs – additional keyword-based parameters passed to TriplesFactory.__init__().

clone_and_exchange_triples(mapped_triples, extra_metadata=None, keep_metadata=True, create_inverse_triples=None)[source]

Create a new triples factory sharing everything except the triples.

Note

We use shallow copies.

Parameters:
  • mapped_triples (LongTensor) – The new mapped triples.

  • extra_metadata (Optional[Dict[str, Any]]) – Extra metadata to include in the new triples factory. If keep_metadata is true, the dictionaries will be unioned with precedence taken on keys from extra_metadata.

  • keep_metadata (bool) – Pass the current factory’s metadata to the new triples factory

  • create_inverse_triples (Optional[bool]) – Change inverse triple creation flag. If None, use flag from this factory.

Return type:

TriplesNumericLiteralsFactory

Returns:

The new factory.

classmethod from_labeled_triples(triples, *, numeric_triples=None, **kwargs)[source]

Create a new triples factory from label-based triples.

Parameters:
  • triples (ndarray) – shape: (n, 3), dtype: str The label-based triples.

  • create_inverse_triples – Whether to create inverse triples.

  • entity_to_id – The mapping from entity labels to IDs. If None, create a new one from the triples.

  • relation_to_id – The mapping from relation labels to IDs. If None, create a new one from the triples.

  • compact_id – Whether to compact IDs such that the IDs are consecutive.

  • filter_out_candidate_inverse_relations – Whether to remove triples whose relations bear the inverse suffix.

  • metadata – Arbitrary key/value pairs to store as metadata

  • numeric_triples (ndarray) –

Return type:

TriplesNumericLiteralsFactory

Returns:

A new triples factory.

classmethod from_path(path, *, path_to_numeric_triples=None, **kwargs)[source]

Create a new triples factory from triples stored in a file.

Parameters:
  • path (Union[str, Path, TextIO]) – The path where the label-based triples are stored.

  • create_inverse_triples – Whether to create inverse triples.

  • entity_to_id – The mapping from entity labels to IDs. If None, create a new one from the triples.

  • relation_to_id – The mapping from relation labels to IDs. If None, create a new one from the triples.

  • compact_id – Whether to compact IDs such that the IDs are consecutive.

  • metadata – Arbitrary key/value pairs to store as metadata with the triples factory. Do not include path as a key because it is automatically taken from the path kwarg to this function.

  • load_triples_kwargs – Optional keyword arguments to pass to load_triples(). Could include the delimiter or a column_remapping.

  • kwargs – additional keyword-based parameters, which are ignored.

  • path_to_numeric_triples (None | str | Path | TextIO) –

Return type:

TriplesNumericLiteralsFactory

Returns:

A new triples factory.
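A minimal sketch in which both file names are hypothetical placeholders; the numeric triples file is assumed to contain (entity, attribute, value) rows:

from pykeen.triples import TriplesNumericLiteralsFactory

tf = TriplesNumericLiteralsFactory.from_path(
    "triples.tsv",
    path_to_numeric_triples="numeric_triples.tsv",
)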

get_numeric_literals_tensor()[source]

Return the numeric literals as a tensor.

Return type:

FloatTensor

iter_extra_repr()[source]

Iterate over extra_repr components.

Return type:

Iterable[str]

property literal_shape: Tuple[int, ...]

Return the shape of the literals.

Return type:

Tuple[int, …]

to_path_binary(path)[source]

Save the triples factory to a path in PyTorch's .pt binary format.

Parameters:

path (Union[str, Path, TextIO]) – The path to store the triples factory to.

Return type:

Path

Returns:

The path to the file that got dumped

get_mapped_triples(x=None, *, mapped_triples=None, triples=None, factory=None)[source]

Get ID-based triples either directly, or from a factory.

Preference order:

  1. mapped_triples

  2. triples (converted using factory)

  3. x

  4. factory.mapped_triples

Parameters:
  • x – A single positional input, interpreted as one of the keyword-based inputs below.

  • mapped_triples – The ID-based triples.

  • triples – The label-based triples, converted using factory.

  • factory – A triples factory; its mapped_triples are used as the final fallback.

Raises:

ValueError – if all inputs are None, or provided inputs are invalid.

Return type:

LongTensor

Returns:

the ID-based triples
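A minimal sketch of the two ends of the preference order; the import location is an assumption:

from pykeen.datasets import Nations
from pykeen.triples import get_mapped_triples  # assumed export location

dataset = Nations()
# highest preference: explicitly given ID-based triples
t1 = get_mapped_triples(mapped_triples=dataset.training.mapped_triples)
# final fallback: the factory's own mapped triples
t2 = get_mapped_triples(factory=dataset.training)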

Instance creation utilities.

compute_compressed_adjacency_list(mapped_triples, num_entities=None)[source]

Compute compressed undirected adjacency list representation for efficient sampling.

The compressed adjacency list format is inspired by CSR sparse matrix format.

Parameters:
  • mapped_triples (LongTensor) – the ID-based triples

  • num_entities (Optional[int]) – the number of entities.

Return type:

Tuple[LongTensor, LongTensor, LongTensor]

Returns:

a tuple (degrees, offsets, compressed_adj_list) where

  • degrees: shape: (num_entities,)

  • offsets: shape: (num_entities,)

  • compressed_adj_list: shape: (2 * num_triples, 2)

with

adj_list[i] = compressed_adj_list[offsets[i]:offsets[i+1]]
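A small sketch on a toy graph; the import location (the instance creation utilities module) is an assumption:

import torch

from pykeen.triples.utils import compute_compressed_adjacency_list  # assumed module path

# a tiny hypothetical graph over three entities
mapped_triples = torch.as_tensor([[0, 0, 1], [1, 0, 2]])
degrees, offsets, compressed_adj_list = compute_compressed_adjacency_list(
    mapped_triples,
    num_entities=3,
)
# the undirected neighbors of entity i are compressed_adj_list[offsets[i]:offsets[i + 1]]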

get_entities(triples)[source]

Get all entities from the triples.

Return type:

Set[int]

Parameters:

triples (LongTensor) –

get_relations(triples)[source]

Get all relations from the triples.

Return type:

Set[int]

Parameters:

triples (LongTensor) –

load_triples(path, delimiter='\t', encoding=None, column_remapping=None)[source]

Load triples saved as tab separated values.

Parameters:
  • path (Union[str, Path, TextIO]) – The key for the data to be loaded. Typically, this will be a file path ending in .tsv that points to a file with three columns - the head, relation, and tail. This can also be used to invoke PyKEEN data importer entrypoints (see below).

  • delimiter (str) – The delimiter between the columns in the file

  • encoding (Optional[str]) – The encoding for the file. Defaults to utf-8.

  • column_remapping (Optional[Sequence[int]]) – A remapping if the three columns do not follow the order head-relation-tail. For example, if the order is head-tail-relation, pass (0, 2, 1)

Return type:

ndarray

Returns:

A numpy array representing “labeled” triples.

Raises:

ValueError – if a column remapping was passed but it was not a length 3 sequence
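For example, a file whose columns are ordered head-tail-relation (the file name is a hypothetical placeholder) can be loaded by remapping the columns back to head-relation-tail:

from pykeen.triples.utils import load_triples

triples = load_triples("triples.tsv", column_remapping=(0, 2, 1))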

Besides TSV handling, PyKEEN does not come with any importers pre-installed; a few can be found in external packages.

tensor_to_df(tensor, **kwargs)[source]

Take a tensor of triples and make a pandas dataframe with labels.

Parameters:
  • tensor (LongTensor) – shape: (n, 3) The triples, ID-based and in format (head_id, relation_id, tail_id).

  • kwargs (Union[Tensor, ndarray, Sequence]) – Any additional number of columns. Each column needs to be of shape (n,). Reserved column names: {“head_id”, “head_label”, “relation_id”, “relation_label”, “tail_id”, “tail_label”}.

Return type:

DataFrame

Returns:

A dataframe with n rows, and 3 + len(kwargs) columns.

Raises:

ValueError – If a reserved column name appears in kwargs.

Remixing and dataset distance utilities.

Most datasets are given with a pre-defined split, but it is often not discussed how this split was created. This module contains utilities for investigating the effects of remixing pre-split datasets like pykeen.datasets.Nations.

Further, it defines a metric for the “distance” between two splits of a given dataset. Later, this will be used to map the landscape and see if there is a smooth, continuous relationship between datasets’ splits’ distances and their maximum performance.

remix(*triples_factories, **kwargs)[source]

Remix the triples from the training, testing, and validation set.

Parameters:
  • triples_factories (CoreTriplesFactory) – A sequence of triples factories

  • kwargs – Keyword arguments to be passed to split()

Return type:

List[CoreTriplesFactory]

Returns:

A sequence of triples factories of the same sizes but randomly re-assigned triples

Raises:

NotImplementedError – if any of the triples factories have create_inverse_triples
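A minimal sketch remixing the Nations split; the module path and the assumption that random_state is forwarded to split() are both unverified here:

from pykeen.datasets import Nations
from pykeen.triples.remix import remix  # assumed module path

dataset = Nations()
training, testing, validation = remix(
    dataset.training,
    dataset.testing,
    dataset.validation,
    random_state=1234,  # assumed to be forwarded to split()
)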

Deterioration algorithm.

deteriorate(reference, *others, n, random_state=None)[source]

Remove n triples from the reference set.

Parameters:
  • reference (TriplesFactory) – The reference triples factory

  • others (TriplesFactory) – Other triples factories to deteriorate

  • n (Union[int, float]) – The ratio to deteriorate. If given as a float, should be between 0 and 1. If an integer, deteriorates that many triples

  • random_state (Union[None, int, Generator]) – The random state

Return type:

List[TriplesFactory]

Returns:

A concatenated list of the processed reference and other triples factories

Raises:
  • NotImplementedError – if the reference triples factory has inverse triples

  • ValueError – If a float is given for n that isn’t between 0 and 1
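A minimal sketch removing 10% of the Nations training triples; the module path is an assumption:

from pykeen.datasets import Nations
from pykeen.triples.deteriorate import deteriorate  # assumed module path

dataset = Nations()
training, testing, validation = deteriorate(
    dataset.training,
    dataset.testing,
    dataset.validation,
    n=0.1,  # as a float, the ratio of reference triples to remove
    random_state=42,
)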

Training

Training loops for KGE models using multi-modal information.

Throughout the following explanations of training loops, we will assume the set of entities \(\mathcal{E}\), set of relations \(\mathcal{R}\), set of possible triples \(\mathcal{T} = \mathcal{E} \times \mathcal{R} \times \mathcal{E}\). We stratify \(\mathcal{T}\) into the disjoint union of positive triples \(\mathcal{T^{+}} \subseteq \mathcal{T}\) and negative triples \(\mathcal{T^{-}} \subseteq \mathcal{T}\) such that \(\mathcal{T^{+}} \cap \mathcal{T^{-}} = \emptyset\) and \(\mathcal{T^{+}} \cup \mathcal{T^{-}} = \mathcal{T}\).

A knowledge graph \(\mathcal{K}\) constructed under the open world assumption contains a subset of all possible positive triples such that \(\mathcal{K} \subseteq \mathcal{T^{+}}\).

Assumptions

Open World Assumption

When training under the open world assumption (OWA), all triples that are not part of the knowledge graph are considered unknown (i.e., neither positive nor negative). This leads to under-fitting (i.e., over-generalization) and is therefore usually a poor choice for training knowledge graph embedding models [nickel2016review]. PyKEEN does not implement a training loop with the OWA.

Warning

Many publications and software packages use OWA to incorrectly refer to the stochastic local closed world assumption (sLCWA). See below for an explanation.

Closed World Assumption

When training under the closed world assumption (CWA), all triples that are not part of the knowledge graph are considered as negative. As most knowledge graphs are inherently incomplete, this leads to over-fitting and is therefore usually a poor choice for training knowledge graph embedding models. PyKEEN does not implement a training loop with the CWA.

Local Closed World Assumption

When training under the local closed world assumption (LCWA; introduced in [dong2014]), a particular subset of triples that are not part of the knowledge graph are considered as negative.

Each strategy defines a local generator of negative triples for a single (partial) triple, and a global generator as its union over the knowledge graph:

  • Head: local generator \(\mathcal{T}_h^-(r,t)=\{(h,r,t) \mid h \in \mathcal{E} \land (h,r,t) \notin \mathcal{K} \}\); global generator \(\bigcup\limits_{(\_,r,t) \in \mathcal{K}} \mathcal{T}_h^-(r,t)\)

  • Relation: local generator \(\mathcal{T}_r^-(h,t)=\{(h,r,t) \mid r \in \mathcal{R} \land (h,r,t) \notin \mathcal{K} \}\); global generator \(\bigcup\limits_{(h,\_,t) \in \mathcal{K}} \mathcal{T}_r^-(h,t)\)

  • Tail: local generator \(\mathcal{T}_t^-(h,r)=\{(h,r,t) \mid t \in \mathcal{E} \land (h,r,t) \notin \mathcal{K} \}\); global generator \(\bigcup\limits_{(h,r,\_) \in \mathcal{K}} \mathcal{T}_t^-(h,r)\)

Most articles refer exclusively to the tail generation strategy when discussing LCWA, though the relation generation strategy is a popular choice in the visual relation detection domain (see [zhang2017] and [sharifzadeh2019vrd]). PyKEEN additionally implements head generation since PR #602. An example of selecting the LCWA training loop is shown below.
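For example, the LCWA training loop can be selected by name in the pipeline; that training_loop_kwargs is forwarded to LCWATrainingLoop, and that its target parameter accepts 'tail', are assumptions here:

from pykeen.pipeline import pipeline

result = pipeline(
    dataset='Nations',
    model='TransE',
    training_loop='lcwa',
    training_loop_kwargs=dict(target='tail'),  # assumed: selects tail generation
)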

Stochastic Local Closed World Assumption

When training under the stochastic local closed world assumption (SLCWA), a random subset of the union of the head and tail generation strategies from LCWA is considered as negative triples. This has a few benefits:

  1. Reduced computational workload

  2. Sparse updates (i.e., only a few rows of the embedding are affected)

  3. Ability to integrate new negative sampling strategies

There are two other major considerations when randomly sampling negative triples: the random sampling strategy and the filtering of positive triples. A full guide on negative sampling with the SLCWA can be found in pykeen.sampling. The following chart from [ali2020a] demonstrates the different potential triples considered in LCWA vs. sLCWA based on the given true triples (in red):

[Figure: potential triples considered under the LCWA vs. the sLCWA, from [ali2020a]]

Classes

TrainingLoop(model, triples_factory[, ...])

A training loop.

SLCWATrainingLoop([negative_sampler, ...])

A training loop that uses the stochastic local closed world assumption training approach.

LCWATrainingLoop(*[, target])

A training loop that is based upon the local closed world assumption (LCWA).

SymmetricLCWATrainingLoop(model, triples_factory)

A "symmetric" LCWA scoring heads and tails at once.

NonFiniteLossError

An exception raised for non-finite loss values.

Callbacks

Training callbacks.

Training callbacks allow for arbitrary extension of the functionality of the pykeen.training.TrainingLoop without subclassing it. Each callback instance has a loop attribute that allows access to the parent training loop and all of its attributes, including the model. The interaction points are similar to those of Keras.

Examples

The following are vignettes showing how PyKEEN's training loop can be arbitrarily extended using callbacks. If you find that none of the hooks in TrainingCallback do what you want, feel free to open an issue.

Reporting Batch Loss

It was suggested in Issue #333 that it might be useful to log all batch losses. This could be accomplished with the following:

from pykeen.training import TrainingCallback

class BatchLossReportCallback(TrainingCallback):
    def on_batch(self, epoch: int, batch, batch_loss: float):
        print(epoch, batch_loss)
Implementing Gradient Clipping

Gradient clipping is one technique used to avoid the exploding gradient problem. Despite being very simple, it has several theoretical implications.

In order to reproduce the reference experiments on R-GCN performed by [schlichtkrull2018], gradient clipping must be used before each step of the optimizer. The following example shows how to implement a gradient clipping callback:

from typing import Any

from torch.nn.utils import clip_grad_value_

from pykeen.training import TrainingCallback

class GradientClippingCallback(TrainingCallback):
    def __init__(self, clip_value: float = 1.0):
        super().__init__()
        self.clip_value = clip_value

    def pre_step(self, **kwargs: Any):
        # clip every gradient entry to [-clip_value, +clip_value] before the optimizer steps
        clip_grad_value_(self.model.parameters(), clip_value=self.clip_value)
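A sketch of wiring this callback into training, assuming that the callbacks entry of training_kwargs is forwarded to the training loop and accepts a callback instance:

from pykeen.pipeline import pipeline

result = pipeline(
    dataset='Nations',
    model='TransE',
    training_kwargs=dict(
        num_epochs=5,
        callbacks=GradientClippingCallback(clip_value=1.0),  # defined above
    ),
)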

Classes

TrainingCallback()

An interface for training callbacks.

StopperTrainingCallback(stopper, *, ...[, ...])

An adapter for the pykeen.stopper.Stopper.

TrackerTrainingCallback()

An adapter for the pykeen.trackers.ResultTracker.

EvaluationLoopTrainingCallback(factory[, ...])

A callback for regular evaluation using new-style evaluation loops.

EvaluationTrainingCallback(*, evaluation_triples)

A callback for regular evaluation.

MultiTrainingCallback([callbacks, ...])

A wrapper for calling multiple training callbacks together.

GradientNormClippingTrainingCallback(max_norm)

A callback for gradient clipping before stepping the optimizer with torch.nn.utils.clip_grad_norm_().

GradientAbsClippingTrainingCallback(clip_value)

A callback for gradient clipping before stepping the optimizer with torch.nn.utils.clip_grad_value_().

Class Inheritance Diagram

TrainingCallback
  • EvaluationLoopTrainingCallback
  • EvaluationTrainingCallback
  • GradientAbsClippingTrainingCallback
  • GradientNormClippingTrainingCallback
  • MultiTrainingCallback
  • StopperTrainingCallback
  • TrackerTrainingCallback

Learning Rate Schedulers

Learning Rate Schedulers available in PyKEEN.

Stoppers

Early stoppers.

The following code will create a scenario in which training will stop (quite) early when training pykeen.models.TransE on the pykeen.datasets.Nations dataset.

>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
...     dataset='nations',
...     model='transe',
...     model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
...     optimizer='SGD',
...     optimizer_kwargs=dict(lr=0.01),
...     loss='marginranking',
...     loss_kwargs=dict(margin=1),
...     training_loop='slcwa',
...     training_kwargs=dict(num_epochs=100, batch_size=128),
...     negative_sampler='basic',
...     negative_sampler_kwargs=dict(num_negs_per_pos=1),
...     evaluator_kwargs=dict(filtered=True),
...     evaluation_kwargs=dict(batch_size=128),
...     stopper='early',
...     stopper_kwargs=dict(frequency=5, patience=2, relative_delta=0.002),
... )
class NopStopper(*args, **kwargs)[source]

A stopper that does nothing.

Initialize the stopper.

Parameters:
  • args – ignored positional parameters

  • kwargs – ignored keyword-based parameters

get_summary_dict()[source]

Return an empty mapping; this stopper has no attributes to summarize.

Return type:

Mapping[str, Any]

should_evaluate(epoch)[source]

Return false; should never evaluate.

Return type:

bool

Parameters:

epoch (int) –

should_stop(epoch)[source]

Return false; should never stop.

Return type:

bool

Parameters:

epoch (int) –

class EarlyStopper(model, evaluator, training_triples_factory, evaluation_triples_factory, evaluation_batch_size=None, evaluation_slice_size=None, frequency=10, patience=2, metric='hits_at_k', relative_delta=0.01, results=<factory>, larger_is_better=True, result_tracker=None, result_callbacks=<factory>, continue_callbacks=<factory>, stopped_callbacks=<factory>, stopped=False, best_model_path=None, clean_up_checkpoint=True)[source]

A harness for early stopping.

Initialize the stopper.

Parameters:

Each constructor parameter is stored as, and documented under, the attribute of the same name below.
property best_epoch: int | None

Return the epoch at which the best result occurred.

Return type:

Optional[int]

property best_metric: float

Return the best result so far.

Return type:

float

best_model_path: Path | None = None

the path to the weights of the best model

clean_up_checkpoint: bool = True

whether to delete the file with the best model weights after termination. Note: the weights are re-loaded into the model before the file is deleted.

continue_callbacks: List[Callable[[Stopper, int | float, int], None]]

Callbacks called when training is continued

evaluation_batch_size: int | None = None

Size of the evaluation batches

evaluation_slice_size: int | None = None

Slice size of the evaluation batches

evaluation_triples_factory: CoreTriplesFactory

The triples to use for evaluation

evaluator: Evaluator

The evaluator

frequency: int = 10

The number of epochs after which the model is evaluated on the validation set

get_summary_dict()[source]

Get a summary dict.

Return type:

Mapping[str, Any]

larger_is_better: bool = True

Whether a larger metric value is better (as opposed to a smaller one)

metric: str = 'hits_at_k'

The name of the metric to use

model: Model

The model

property number_results: int

Count the number of results stored in the early stopper.

Return type:

int

patience: int = 2

The number of iterations (one iteration can correspond to various epochs) with no improvement after which training will be stopped.

relative_delta: float = 0.01

The minimum relative improvement necessary to consider it an improved result

property remaining_patience: int

Return the remaining patience.

Return type:

int

result_callbacks: List[Callable[[Stopper, int | float, int], None]]

Callbacks called after results are calculated

result_tracker: ResultTracker | None = None

The result tracker

results: List[float]

The metric results from all evaluations

should_evaluate(epoch)[source]

Decide if evaluation should be done based on the current epoch and the internal frequency.

Return type:

bool

Parameters:

epoch (int) –

should_stop(epoch)[source]

Evaluate on a metric and compare to past evaluations to decide if training should stop.

Return type:

bool

Parameters:

epoch (int) –

stopped: bool = False

Did the stopper ever decide to stop?

stopped_callbacks: List[Callable[[Stopper, int | float, int], None]]

Callbacks called when training is stopped early

training_triples_factory: CoreTriplesFactory

The triples to use for training (to be used during filtered evaluation)

Base Classes

class Stopper(*args, **kwargs)[source]

A harness for stopping training.

Initialize the stopper.

Parameters:
  • args – ignored positional parameters

  • kwargs – ignored keyword-based parameters

abstract get_summary_dict()[source]

Get a summary dict.

Return type:

Mapping[str, Any]

static load_summary_dict_from_training_loop_checkpoint(path)[source]

Load the summary dict from a training loop checkpoint.

Parameters:

path (Union[str, Path]) – Path of the file where to store the state in.

Return type:

Mapping[str, Any]

Returns:

The summary dict of the stopper at the time of saving the checkpoint.

should_evaluate(epoch)[source]

Check if the stopper should be evaluated on the given epoch.

Return type:

bool

Parameters:

epoch (int) –

abstract should_stop(epoch)[source]

Validate on the validation set and check the termination condition.

Return type:

bool

Parameters:

epoch (int) –

Loss Functions

Loss functions integrated in PyKEEN.

Rather than re-using the built-in loss functions in PyTorch, we have elected to re-implement some of the code from torch.nn.modules.loss in order to encode the three different kinds of loss functions accepted by PyKEEN in a class hierarchy. This allows PyKEEN to handle different kinds of loss functions more dynamically as well as share code. Further, it gives more insight to potential users.

Throughout the following explanations of pointwise loss functions, pairwise loss functions, and setwise loss functions, we will assume the set of entities \(\mathcal{E}\), set of relations \(\mathcal{R}\), set of possible triples \(\mathcal{T} = \mathcal{E} \times \mathcal{R} \times \mathcal{E}\), set of possible subsets of possible triples \(2^{\mathcal{T}}\) (i.e., the power set of \(\mathcal{T}\)), set of positive triples \(\mathcal{K}\), set of negative triples \(\mathcal{\bar{K}}\), scoring function (e.g., TransE) \(f: \mathcal{T} \rightarrow \mathbb{R}\) and labeling function \(l:\mathcal{T} \rightarrow \{0,1\}\) where a value of 1 denotes the triple is positive (i.e., \((h,r,t) \in \mathcal{K}\)) and a value of 0 denotes the triple is negative (i.e., \((h,r,t) \notin \mathcal{K}\)).

Note

In most realistic use cases of knowledge graph embedding models, you will have observed a subset of positive triples \(\mathcal{T_{obs}} \subset \mathcal{K}\) and no observations over negative triples. Depending on the training assumption (sLCWA or LCWA), this will mean negative triples are generated in a variety of patterns.

Note

Following the open world assumption (OWA), triples \(\mathcal{\bar{K}}\) are better named “not positive” rather than negative. This is most relevant for pointwise loss functions. For pairwise and setwise loss functions, triples are compared as being more/less positive and the binary classification is not relevant.

Pointwise Loss Functions

A pointwise loss is applied to a single triple. It takes the form of \(L: \mathcal{T} \rightarrow \mathbb{R}\) and computes a real-value for the triple given its labeling. Typically, a pointwise loss function takes the form of \(g: \mathbb{R} \times \{0,1\} \rightarrow \mathbb{R}\) based on the scoring function and labeling function.

\[L(k) = g(f(k), l(k))\]

Examples

  • Square Error: \(g(s, l) = \frac{1}{2}(s - l)^2\)

  • Binary Cross Entropy: \(g(s, l) = -(l \cdot \log(\sigma(s)) + (1 - l) \cdot \log(1 - \sigma(s)))\)

  • Pointwise Hinge: \(g(s, l) = \max(0, \lambda - \hat{l} \cdot s)\)

  • Soft Pointwise Hinge: \(g(s, l) = \log(1 + \exp(\lambda - \hat{l} \cdot s))\)

  • Pointwise Logistic (softplus): \(g(s, l) = \log(1 + \exp(-\hat{l} \cdot s))\)

For the pointwise hinge and pointwise logistic losses, \(\hat{l}\) denotes the label rescaled from \(\{0,1\}\) to \(\{-1,1\}\). The logistic sigmoid function is defined as \(\sigma(z) = \frac{1}{1 + e^{-z}}\).

Note

The pointwise logistic loss can be considered as a special case of the pointwise soft hinge loss where \(\lambda = 0\).

Batching

The pointwise loss of a set of triples (i.e., a batch) \(\mathcal{L}_L: 2^{\mathcal{T}} \rightarrow \mathbb{R}\) is defined as the arithmetic mean of the pointwise losses over each triple in the subset \(\mathcal{B} \in 2^{\mathcal{T}}\):

\[\mathcal{L}_L(\mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum \limits_{k \in \mathcal{B}} L(k)\]

Pairwise Loss Functions

A pairwise loss is applied to a pair of triples - a positive and a negative one. It is defined as \(L: \mathcal{K} \times \mathcal{\bar{K}} \rightarrow \mathbb{R}\) and computes a real value for the pair.

All loss functions implemented in PyKEEN induce an auxiliary loss function based on the chosen interaction function \(L^{*}: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\) that simply passes the scores through. Note that \(L\) is often used interchangeably with \(L^{*}\).

\[L(k, \bar{k}) = L^{*}(f(k), f(\bar{k}))\]

Delta Pairwise Loss Functions

Delta pairwise losses are computed on the differences between the scores of the negative and positive triples (e.g., \(\Delta := f(\bar{k}) - f(k)\)) with a transfer function \(g: \mathbb{R} \rightarrow \mathbb{R}\) that takes the form of:

\[L^{*}(f(k), f(\bar{k})) = g(f(\bar{k}) - f(k)) := g(\Delta)\]

The following table shows delta pairwise loss functions:

  • Pairwise Hinge (margin ranking): ReLU activation, margin \(\lambda \neq 0\), \(g(\Delta) = \max(0, \Delta + \lambda)\)

  • Soft Pairwise Hinge (soft margin ranking): softplus activation, margin \(\lambda \neq 0\), \(g(\Delta) = \log(1 + \exp(\Delta + \lambda))\)

  • Pairwise Logistic: softplus activation, margin \(\lambda = 0\), \(g(\Delta) = \log(1 + \exp(\Delta))\)

Note

The pairwise logistic loss can be considered as a special case of the pairwise soft hinge loss where \(\lambda = 0\).
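The delta formulation can be reproduced in a few lines of PyTorch; this worked sketch computes the pairwise hinge \(g(\Delta) = \max(0, \Delta + \lambda)\), averaged as in the Batching section below:

import torch

pos_scores = torch.tensor([0.9, 0.5])     # f(k) for two positive triples
neg_scores = torch.tensor([0.2, 0.7])     # f(k_bar) for the paired negatives
margin = 1.0                              # the margin lambda

delta = neg_scores - pos_scores           # delta = f(k_bar) - f(k)
loss = torch.relu(delta + margin).mean()  # g(delta) = max(0, delta + lambda), batch mean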

Inseparable Pairwise Loss Functions

The following pairwise loss functions use the fully generalized form \(L(k, \bar{k})\) for their definitions:

  • Double Loss: \(h(\bar{\lambda} + f(\bar{k})) + h(\lambda - f(k))\)

Batching

The pairwise loss for a set of pairs of positive/negative triples \(\mathcal{L}_L: 2^{\mathcal{K} \times \mathcal{\bar{K}}} \rightarrow \mathbb{R}\) is defined as the arithmetic mean of the pairwise losses for each pair of positive and negative triples in the subset \(\mathcal{B} \in 2^{\mathcal{K} \times \mathcal{\bar{K}}}\).

\[\mathcal{L}_L(\mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum \limits_{(k, \bar{k}) \in \mathcal{B}} L(k, \bar{k})\]

Setwise Loss Functions

A setwise loss is applied to a set of triples which can be either positive or negative. It is defined as \(L: 2^{\mathcal{T}} \rightarrow \mathbb{R}\). The two setwise loss functions implemented in PyKEEN, pykeen.losses.NSSALoss and pykeen.losses.CrossEntropyLoss, differ widely in their paradigms, but both share the notion that triples are not strictly positive or negative.

\[L(k_1, ... k_n) = g(f(k_1), ..., f(k_n))\]

Batching

The setwise loss for a set of sets of triples \(\mathcal{L}_L: 2^{2^{\mathcal{T}}} \rightarrow \mathbb{R}\) is defined as the arithmetic mean of the setwise losses for each set of triples \(\mathcal{b}\) in the subset \(\mathcal{B} \in 2^{2^{\mathcal{T}}}\).

\[\mathcal{L}_L(\mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum \limits_{\mathcal{b} \in \mathcal{B}} L(\mathcal{b})\]

Classes

PointwiseLoss([reduction])

Pointwise loss functions compute an independent loss term for each triple-label pair.

DeltaPointwiseLoss([margin, ...])

A generic class for delta-pointwise losses.

MarginPairwiseLoss([margin, ...])

The generalized margin ranking loss.

PairwiseLoss([reduction])

Pairwise loss functions compare the scores of a positive triple and a negative triple.

SetwiseLoss([reduction])

Setwise loss functions compare the scores of several triples.

AdversarialLoss([...])

A loss with adversarial weighting of negative samples.

AdversarialBCEWithLogitsLoss([...])

An adversarially weighted BCE loss.

BCEAfterSigmoidLoss([reduction])

The numerically unstable version of explicit Sigmoid + BCE loss.

BCEWithLogitsLoss([reduction])

The binary cross entropy loss.

CrossEntropyLoss([reduction])

The cross entropy loss that evaluates the cross entropy after softmax output.

FocalLoss(*[, gamma, alpha])

The focal loss proposed by [lin2018].

InfoNCELoss([margin, ...])

The InfoNCE loss with additive margin proposed by [wang2022].

MarginRankingLoss([margin, reduction])

The pairwise hinge loss (i.e., margin ranking loss).

MSELoss([reduction])

The mean squared error loss.

NSSALoss([margin, adversarial_temperature, ...])

The self-adversarial negative sampling loss function proposed by [sun2019].

SoftplusLoss([reduction])

The pointwise logistic loss (i.e., softplus loss).

SoftPointwiseHingeLoss([margin, reduction])

The soft pointwise hinge loss.

PointwiseHingeLoss([margin, reduction])

The pointwise hinge loss.

DoubleMarginLoss(*[, positive_margin, ...])

A limit-based scoring loss, with separate margins for positive and negative elements from [sun2018].

SoftMarginRankingLoss([margin, reduction])

The soft pairwise hinge loss (i.e., soft margin ranking loss).

PairwiseLogisticLoss([reduction])

The pairwise logistic loss.

Class Inheritance Diagram

Loss (a torch.nn.Module via torch.nn.modules.loss._Loss)
  • PointwiseLoss
      • BCEAfterSigmoidLoss
      • BCEWithLogitsLoss
      • DeltaPointwiseLoss
          • PointwiseHingeLoss
          • SoftPointwiseHingeLoss
              • SoftplusLoss
      • DoubleMarginLoss
      • FocalLoss
      • MSELoss
  • PairwiseLoss
      • MarginPairwiseLoss
          • MarginRankingLoss
          • SoftMarginRankingLoss
              • PairwiseLogisticLoss
  • SetwiseLoss
      • AdversarialLoss
          • AdversarialBCEWithLogitsLoss
          • NSSALoss
      • CrossEntropyLoss
          • InfoNCELoss

Regularizers

Regularization in PyKEEN.

Classes

LpRegularizer(*[, weight, apply_only_once, ...])

A simple L_p norm based regularizer.

NoRegularizer([weight, apply_only_once, ...])

A regularizer which does not perform any regularization.

CombinedRegularizer(regularizers[, total_weight])

A convex combination of regularizers.

PowerSumRegularizer(*[, weight, ...])

A simple x^p based regularizer.

OrthogonalityRegularizer(*[, weight, ...])

A regularizer for the soft orthogonality constraints from [wang2014].

NormLimitRegularizer(*[, weight, ...])

A regularizer which formulates a soft constraint on a maximum norm.

Class Inheritance Diagram

Regularizer (a torch.nn.Module and abc.ABC)
  • CombinedRegularizer
  • LpRegularizer
  • NoRegularizer
  • NormLimitRegularizer
  • OrthogonalityRegularizer
  • PowerSumRegularizer

Base Classes

class Regularizer(weight=1.0, apply_only_once=False, parameters=None)[source]

A base class for all regularizers.

Instantiate the regularizer.

Parameters:
  • weight (float) – The relative weight of the regularization

  • apply_only_once (bool) – Whether the regularization should be applied only once after each reset.

  • parameters (Optional[Iterable[Parameter]]) – Specific parameters to track. If none are given, it is expected that your model automatically delegates to the update() function.

add_parameter(parameter)[source]

Add a parameter for regularization.

Return type:

None

Parameters:

parameter (Parameter) –

apply_only_once: bool

Should the regularization only be applied once? This was used for ConvKB and defaults to False.

abstract forward(x)[source]

Compute the regularization term for one tensor.

Return type:

FloatTensor

Parameters:

x (FloatTensor) –

classmethod get_normalized_name()[source]

Get the normalized name of the regularizer class.

Return type:

str

hpo_default: ClassVar[Mapping[str, Any]] = {'weight': {'high': 1.0, 'low': 0.01, 'scale': 'log', 'type': <class 'float'>}}

The default strategy for optimizing the regularizer’s hyper-parameters

pop_regularization_term()[source]

Return the weighted regularization term, and reset the regularizer afterwards.

Return type:

FloatTensor

post_parameter_update()[source]

Reset the regularizer’s term.

Warning

Typically, you want to use the regularization term exactly once to calculate gradients via pop_regularization_term(). In this case, there should be no need to manually call this method.

regularization_term: torch.FloatTensor

The current regularization term (a scalar)

reset()[source]

Reset the regularization term to zero.

Return type:

None

property term: FloatTensor

Return the weighted regularization term.

Return type:

FloatTensor

update(*tensors)[source]

Update the regularization term based on passed tensors.

Return type:

None

Parameters:

tensors (FloatTensor) –

updated: bool

Has this regularizer been updated since last being reset?

weight: torch.FloatTensor

The overall regularization weight
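A minimal sketch of the update/pop cycle documented above, using LpRegularizer with a random tensor standing in for real model parameters:

import torch

from pykeen.regularizers import LpRegularizer

reg = LpRegularizer(weight=0.1)
reg.update(torch.randn(3, 5))         # accumulate the term over a (dummy) parameter tensor
term = reg.pop_regularization_term()  # weighted scalar term; the regularizer resets afterwards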

Result Trackers

Result trackers in PyKEEN.

Functions

resolve_result_trackers([result_tracker, ...])

Resolve and compose result trackers.

Classes

ResultTracker()

A class that tracks the results from a pipeline run.

FileResultTracker([path, name])

Tracking results to a file.

MultiResultTracker([trackers])

A result tracker which delegates to multiple different result trackers.

MLFlowResultTracker([tracking_uri, ...])

A tracker for MLflow.

NeptuneResultTracker([...])

A tracker for Neptune.ai.

WANDBResultTracker(project[, offline])

A tracker for Weights and Biases.

JSONResultTracker([path, name])

Tracking results to a JSON lines file.

CSVResultTracker([path, name])

Tracking results to a CSV file.

PythonResultTracker([store_metrics])

A tracker which stores everything in Python dictionaries.

TensorBoardResultTracker([experiment_path, ...])

A tracker for TensorBoard.

ConsoleResultTracker(*[, track_parameters, ...])

A class that directly prints to console.
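As with other components, a tracker can be selected by name in the pipeline; that 'csv' resolves to CSVResultTracker and that name is its run-name parameter are assumptions here:

from pykeen.pipeline import pipeline

result = pipeline(
    dataset='Nations',
    model='TransE',
    result_tracker='csv',
    result_tracker_kwargs=dict(name='my-run'),  # hypothetical run name
)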

Class Inheritance Diagram

ResultTracker
  • ConsoleResultTracker
  • FileResultTracker
      • CSVResultTracker
      • JSONResultTracker
  • MLFlowResultTracker
  • MultiResultTracker
  • NeptuneResultTracker
  • PythonResultTracker
  • TensorBoardResultTracker
  • WANDBResultTracker

Negative Sampling

For entities \(\mathcal{E}\) and relations \(\mathcal{R}\), the set of all possible triples \(\mathcal{T}\) is constructed through their cartesian product \(\mathcal{T} = \mathcal{E} \times \mathcal{R} \times \mathcal{E}\). A given knowledge graph \(\mathcal{K}\) is a subset of all possible triples \(\mathcal{K} \subseteq \mathcal{T}\).

Construction of Knowledge Graphs

When constructing a knowledge graph \(\mathcal{K}_{\text{closed}}\) under the closed world assumption, the labels of the remaining triples \((h,r,t) \in \mathcal{T} \setminus \mathcal{K}_{\text{closed}}\) are defined as negative. When constructing a knowledge graph \(\mathcal{K}_{\text{open}}\) under the open world assumption, the labels of the remaining triples \((h,r,t) \in \mathcal{T} \setminus \mathcal{K}_{\text{open}}\) are unknown.

Because most knowledge graphs are generated under the open world assumption, negative sampling techniques must be employed during the training of knowledge graph embedding models to avoid over-generalization.

Corruption

Negative sampling techniques often generate negative triples by corrupting a known positive triple \((h,r,t) \in \mathcal{K}\), replacing either \(h\), \(r\), or \(t\) according to one of the following operations:

Corrupt heads

\(\mathcal{H}(h, r, t) = \{(h', r, t) \mid h' \in \mathcal{E} \land h' \neq h\}\)

Corrupt relations

\(\mathcal{R}(h, r, t) = \{(h, r', t) \mid r' \in \mathcal{R} \land r' \neq r\}\)

Corrupt tails

\(\mathcal{T}(h, r, t) = \{(h, r, t') \mid t' \in \mathcal{E} \land t' \neq t\}\)

Typically, the corrupt relations operation \(\mathcal{R}(h, r, t)\) is omitted because the evaluation of knowledge graph embedding models on the link prediction task only considers the quality of head prediction and tail prediction, but not relation prediction. Therefore, the set of candidate negative triples \(\mathcal{N}(h, r, t)\) for a given known positive triple \((h,r,t) \in \mathcal{K}\) is given by:

\[\mathcal{N}(h, r, t) = \mathcal{T}(h, r, t) \cup \mathcal{H}(h, r, t)\]

Generally, the set of potential negative triples \(\mathcal{N}\) over all positive triples \((h,r,t) \in \mathcal{K}\) is defined as:

\[\mathcal{N} = \bigcup_{(h,r,t) \in \mathcal{K}} \mathcal{N}(h, r, t)\]
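To make these operations concrete, the following minimal sketch enumerates the candidate negatives for a single positive triple over a toy entity set. It is plain Python for illustration only; the entity and relation names are hypothetical, and this is not the PyKEEN implementation.

# Toy entity set and a known positive triple (hypothetical example)
entities = {"uk", "usa", "china"}
h, r, t = "uk", "conferences", "usa"

# Corrupt heads: H(h, r, t) = {(h', r, t) | h' in E, h' != h}
corrupt_heads = {(h2, r, t) for h2 in entities if h2 != h}

# Corrupt tails: T(h, r, t) = {(h, r, t') | t' in E, t' != t}
corrupt_tails = {(h, r, t2) for t2 in entities if t2 != t}

# Candidate negatives: N(h, r, t) = T(h, r, t) ∪ H(h, r, t)
candidate_negatives = corrupt_heads | corrupt_tails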

Uniform Negative Sampling

The default negative sampler pykeen.sampling.BasicNegativeSampler generates corrupted triples from a known positive triple \((h,r,t) \in \mathcal{K}\) by uniformly at random applying either the corrupt heads operation or the corrupt tails operation. The default negative sampler is automatically used in the following code:

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
)

It can be set explicitly with:

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
    negative_sampler='basic',
)

In general, the behavior of the negative sampler can be modified when using pykeen.pipeline.pipeline() by passing the negative_sampler_kwargs argument. In order to explicitly specify which of the head, relation, and tail corruption methods are used, the corruption_scheme argument can be used. For example, to use all three, the collection ('h', 'r', 't') can be passed as in the following:

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
    negative_sampler='basic',
    negative_sampler_kwargs=dict(
        corruption_scheme=('h', 'r', 't'),
    ),
)

Bernoulli Negative Sampling

The Bernoulli negative sampler pykeen.sampling.BernoulliNegativeSampler generates corrupted triples from a known positive triple \((h,r,t) \in \mathcal{K}\) similarly to the uniform negative sampler, but it pre-computes a probability \(p_r\) for each relation \(r\): head corruption is applied with probability \(p_r\) and tail corruption with probability \(1 - p_r\).

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
    negative_sampler='bernoulli',
)
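Following [wang2014], \(p_r\) is derived from the average number of tails per head (tph) and heads per tail (hpt) for relation \(r\), i.e., \(p_r = \frac{tph}{tph + hpt}\). The following self-contained sketch illustrates this pre-computation on label-based triples; it is an illustration of the idea, not PyKEEN's actual implementation.

from collections import defaultdict

def bernoulli_head_probabilities(triples):
    """Compute p_r = tph / (tph + hpt) per relation, cf. [wang2014].

    tph: average number of distinct tails per (head, relation) pair
    hpt: average number of distinct heads per (relation, tail) pair
    """
    tails = defaultdict(set)  # (h, r) -> set of tails
    heads = defaultdict(set)  # (r, t) -> set of heads
    for h, r, t in triples:
        tails[h, r].add(t)
        heads[r, t].add(h)
    probabilities = {}
    for r in {r for _, r, _ in triples}:
        tph_counts = [len(ts) for (_, r2), ts in tails.items() if r2 == r]
        hpt_counts = [len(hs) for (r2, _), hs in heads.items() if r2 == r]
        tph = sum(tph_counts) / len(tph_counts)
        hpt = sum(hpt_counts) / len(hpt_counts)
        probabilities[r] = tph / (tph + hpt)
    return probabilities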

Classes

NegativeSampler(*, mapped_triples[, ...])

A negative sampler.

BasicNegativeSampler(*[, corruption_scheme])

A basic negative sampler.

BernoulliNegativeSampler(*, mapped_triples, ...)

An implementation of the Bernoulli negative sampling approach proposed by [wang2014].

PseudoTypedNegativeSampler(*, ...)

A sampler that accounts for which entities co-occur with a relation.

Class Inheritance Diagram

[Class inheritance diagram: torch.nn.Module → NegativeSampler, from which BasicNegativeSampler, BernoulliNegativeSampler, and PseudoTypedNegativeSampler derive.]

Filtering

Because the corruption operations (see Corruption) are applied independently of the known triples, the resulting candidate corrupt triples could overlap with known positive triples in \(\mathcal{K}\). Consider the following properties of a relation \(r\):

Property of \(r\) | Example pair of triples | Implication
one-to-many | \((h,r,t_1), (h,r,t_2) \in \mathcal{K}\) | \((h,r,t_2) \in \mathcal{T}(h,r,t_1) \land (h,r,t_1) \in \mathcal{T}(h,r,t_2)\)
multiple | \((h,r_1,t), (h,r_2,t) \in \mathcal{K}\) | \((h,r_2,t) \in \mathcal{R}(h,r_1,t) \land (h,r_1,t) \in \mathcal{R}(h,r_2,t)\)
many-to-one | \((h_1,r,t), (h_2,r,t) \in \mathcal{K}\) | \((h_2,r,t) \in \mathcal{H}(h_1,r,t) \land (h_1,r,t) \in \mathcal{H}(h_2,r,t)\)

If no relations in \(\mathcal{K}\) satisfy any of the relevant properties for the corruption scheme chosen in negative sampling, then there is guaranteed to be no overlap between \(\mathcal{N}\) and \(\mathcal{K}\), i.e., \(\mathcal{N} \cap \mathcal{K} = \emptyset\). However, this scenario is very unlikely for real-world knowledge graphs.

The known positive triples that appear in \(\mathcal{N}\) are known false negatives. Hence, we know that these are incorrect (negative) training examples, and might want to exclude them to reduce the training noise.

Warning

Note that a corrupted triple that is not part of the knowledge graph can still represent a true fact. These “unknown” false negatives cannot be removed a priori in the filtered setting. The methodology relies on the number of unknown false negatives being small enough that learning can still take place.

However, in practice, \(|\mathcal{N}| \gg |\mathcal{K}|\), so the likelihood of generating a false negative is rather low. Therefore, the additional filter step is often omitted to lower computational cost. This general observation might not hold for all entities; e.g., for a hub entity connected to many other entities, there may be a considerable number of false negatives without filtering.
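Conceptually, exact filtering simply removes every sampled candidate that is a known positive triple. The following set-based sketch (with hypothetical triples) mirrors the idea behind exact filtering; it is not PyKEEN's internal code:

# Known positive triples and sampled candidate negatives (hypothetical)
known_positives = {("uk", "conferences", "usa"), ("uk", "conferences", "china")}
candidates = [("uk", "conferences", "china"), ("uk", "conferences", "cuba")]

# Keep only candidates that are not known positives, i.e., drop known false negatives
filtered = [triple for triple in candidates if triple not in known_positives]
# -> [("uk", "conferences", "cuba")]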

Identifying False Negatives During Training

By default, PyKEEN does not filter false negatives from \(\mathcal{N}\) during training. To enable filtering of negative examples during training, the filtered keyword can be given to negative_sampler_kwargs like in:

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
    negative_sampler='basic',
    negative_sampler_kwargs=dict(
        filtered=True,
    ),
)

PyKEEN implements several algorithms for filtering with different properties, which can be chosen using the filterer keyword argument in negative_sampler_kwargs. By default, a fast and approximate algorithm is used, pykeen.sampling.filtering.BloomFilterer, which is based on Bloom filters. The Bloom filterer also has a configurable desired error rate, which can be lowered further at the cost of increased memory and computation costs.

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
    negative_sampler='basic',
    negative_sampler_kwargs=dict(
        filtered=True,
        filterer='bloom',
        filterer_kwargs=dict(
            error_rate=0.0001,
        ),
    ),
)

If you want to have a guarantee that all known false negatives are filtered, you can use a slower implementation based on Python’s built-in sets, the pykeen.sampling.filtering.PythonSetFilterer. It can be activated with:

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    training_loop='sLCWA',
    negative_sampler='basic',
    negative_sampler_kwargs=dict(
        filtered=True,
        filterer='python-set',    
    ),
)

Identifying False Negatives During Evaluation

In contrast to training, PyKEEN does filter false negatives from \(\mathcal{N}\) during evaluation by default. To disable the “filtered setting” during evaluation, the filtered keyword can be given to evaluator_kwargs like in:

from pykeen.pipeline import pipeline

results = pipeline(
    dataset='YAGO3-10',
    model='PairRE',
    evaluator_kwargs=dict(
        filtered=False,
    ),
)

Filtering during evaluation is implemented differently than in negative sampling:

First, there is no choice between an exact or approximate algorithm via a pykeen.sampling.filtering.Filterer. Instead, the evaluation filtering can modify the scores in-place, and it does so instead of selecting only the non-filtered entries. The reason is mainly that evaluation is always done in 1:n scoring, and thus we gain some efficiency by keeping the tensor in “dense” shape (batch_size, num_entities).
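As a rough illustration of this dense, in-place approach (a sketch, not PyKEEN's actual evaluation code), the scores of all known positives other than the currently evaluated target can be set to \(-\infty\) so that they cannot outrank it:

import torch

batch_size, num_entities = 2, 5
scores = torch.randn(batch_size, num_entities)  # 1:n scores, one row per query

# Hypothetical (row, entity) indices of *other* known true tails per query
other_true = torch.tensor([[0, 3], [1, 1], [1, 4]])

# Mask in-place; the tensor keeps its dense (batch_size, num_entities) shape
scores[other_true[:, 0], other_true[:, 1]] = float("-inf")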

Second, filtering during evaluation has to be correct, as it is crucial for reproducing results in the filtered setting. For evaluation, it makes sense to use all available information to obtain evaluation results that are as solid as possible.

Classes

Filterer(*args, **kwargs)

An interface for filtering methods for negative triples.

BloomFilterer(mapped_triples[, error_rate])

A filterer for negative triples based on the Bloom filter.

PythonSetFilterer(mapped_triples)

A filterer using Python sets for filtering.

Class Inheritance Diagram

digraph inheritance55a750c54a { bgcolor=transparent; rankdir=LR; size="8.0, 12.0"; "BloomFilterer" [URL="index.html#pykeen.sampling.filtering.BloomFilterer",fillcolor=white,fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5),filled",target="_top",tooltip="A filterer for negative triples based on the Bloom filter."]; "Filterer" -> "BloomFilterer" [arrowsize=0.5,style="setlinewidth(0.5)"]; "Filterer" [URL="index.html#pykeen.sampling.filtering.Filterer",fillcolor=white,fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5),filled",target="_top",tooltip="An interface for filtering methods for negative triples."]; "Module" -> "Filterer" [arrowsize=0.5,style="setlinewidth(0.5)"]; "Module" [fillcolor=white,fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5),filled",tooltip="Base class for all neural network modules."]; "PythonSetFilterer" [URL="index.html#pykeen.sampling.filtering.PythonSetFilterer",fillcolor=white,fontname="Vera Sans, DejaVu Sans, Liberation Sans, Arial, Helvetica, sans",fontsize=10,height=0.25,shape=box,style="setlinewidth(0.5),filled",target="_top",tooltip="A filterer using Python sets for filtering."]; "Filterer" -> "PythonSetFilterer" [arrowsize=0.5,style="setlinewidth(0.5)"]; }

Evaluation

Evaluation.

Classes

Evaluator([filtered, ...])

An abstract evaluator for KGE models.

MetricResults(data)

Results from computing metrics.

RankBasedEvaluator([filtered, metrics, ...])

A rank-based evaluator for KGE models.

RankBasedMetricResults(data)

Results from computing metrics.

MacroRankBasedEvaluator(**kwargs)

Macro-average rank-based evaluation.

LCWAEvaluationLoop(triples_factory[, ...])

Evaluation loop using 1:n scoring.

SampledRankBasedEvaluator(evaluation_factory, *)

A rank-based evaluator using sampled negatives instead of all negatives.

OGBEvaluator([filtered])

A sampled, rank-based evaluator that applies a custom OGB evaluation.

ClassificationEvaluator(**kwargs)

An evaluator that uses classification metrics.

ClassificationMetricResults(data)

Results from computing metrics.
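As a usage sketch, a rank-based evaluator can also be instantiated and invoked directly; the additional filter triples passed below reproduce the filtered setting. The dataset and the short training run are arbitrary choices for illustration.

from pykeen.datasets import get_dataset
from pykeen.evaluation import RankBasedEvaluator
from pykeen.pipeline import pipeline

dataset = get_dataset(dataset="nations")
result = pipeline(dataset=dataset, model="TransE", training_kwargs=dict(num_epochs=5))

# Evaluate on the test triples, filtering out known positives from the other splits
evaluator = RankBasedEvaluator(filtered=True)
metrics = evaluator.evaluate(
    model=result.model,
    mapped_triples=dataset.testing.mapped_triples,
    additional_filter_triples=[
        dataset.training.mapped_triples,
        dataset.validation.mapped_triples,
    ],
)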

Class Inheritance Diagram

[Class inheritance diagram: Evaluator (deriving from ABC and Generic) → RankBasedEvaluator → {MacroRankBasedEvaluator, SampledRankBasedEvaluator → OGBEvaluator}, and Evaluator → ClassificationEvaluator; MetricResults (Generic) → {RankBasedMetricResults, ClassificationMetricResults}; EvaluationLoop (Generic) → LCWAEvaluationLoop.]

Metrics

A module for PyKEEN ranking and classification metrics.

Classes

Metric()

A base class for metrics.

ValueRange([lower, lower_inclusive, upper, ...])

A value range description.

RankBasedMetric()

A base class for rank-based metrics.

ClassificationMetric()

A base class for classification metrics.

Class Inheritance Diagram

[Class inheritance diagram: ExtraReprMixin → Metric → {RankBasedMetric, ClassificationMetric (also deriving from ABC)}; ValueRange is a standalone class.]

Ranking metrics.

This module comprises various rank-based metrics, which take an array of individual ranks as input and summarize them into a single-figure metric measuring different aspects of ranking performance.

We can generally distinguish:

Base Metrics

These metrics directly operate on the ranks. Some summarize the central tendency of the ranks. The Hits at K metric is closely related to information retrieval and measures the fraction of ranking tasks in which the correct result appears among the top-\(k\) ranked entries, i.e., the rank is at most \(k\). Further metrics summarize the dispersion of the ranks, and, finally, there is a simple metric that stores the number of ranks that were aggregated.

Inverse Metrics

The inverse metrics are reciprocals of the central tendency measures. They offer the advantage of having a fixed value range of \((0, 1]\), with a known optimal value of \(1\):
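For instance, the inverse harmonic mean rank is the well-known mean reciprocal rank (MRR): given individual ranks \(r_1, \ldots, r_n\),

\[\text{IHMR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i} \in (0, 1]\]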

Adjusted Metrics

Adjusted metrics build upon base metrics, but adjust them for chance, cf. [berrendorf2020] and [hoyt2022]. All adjusted metrics derive from pykeen.metrics.ranking.DerivedRankBasedMetric and, for a given evaluation set, are affine transformations of the base metric with dataset-dependent, but fixed transformation constants. Thus, they can also be computed when the model predictions are not available anymore, but the evaluation set is known.

Expectation-Normalized Metrics

These metrics divide the base metric by its expected value under random ordering. Thus, their expected value is always 1, irrespective of the evaluation set. They derive from pykeen.metrics.ranking.ExpectationNormalizedMetric, and there is currently only a single implementation, pykeen.metrics.ranking.AdjustedArithmeticMeanRank.

Re-indexed Metrics

Re-indexed metrics subtract the expected value and then normalize the optimal value to be 1. Thus, their expected value under random ordering is 0, their optimal value is 1, and larger values indicate better results. The classes derive from pykeen.metrics.ranking.ReindexedMetric; the available implementations are AdjustedArithmeticMeanRankIndex, AdjustedGeometricMeanRankIndex, AdjustedHitsAtK, and AdjustedInverseHarmonicMeanRank.

z-Adjusted Metrics

The final type of adjusted metrics uses the expected value as well as the variance of the metric under random ordering to normalize the metrics, similar to z-score normalization. The z-score normalized metrics have an expected value of 0 and a variance of 1, and larger positive values indicate better results. While their value range is unbounded, it can be interpreted through the lens of the inverse cumulative distribution function of the standard Gaussian distribution to retrieve a p-value. The classes derive from pykeen.metrics.ranking.ZMetric; the available implementations are ZArithmeticMeanRank, ZGeometricMeanRank, ZHitsAtK, and ZInverseHarmonicMeanRank.
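In summary, for a base metric \(M\) with expected value \(\mathbb{E}[M]\) and standard deviation \(\sigma[M]\) under random ordering, and optimal value \(M^{*}\), the three adjustment schemes can be written as (cf. [berrendorf2020] and [hoyt2022]):

\[M_{\text{norm}} = \frac{M}{\mathbb{E}[M]}, \qquad M_{\text{re-indexed}} = \frac{M - \mathbb{E}[M]}{M^{*} - \mathbb{E}[M]}, \qquad M_{z} = \frac{M - \mathbb{E}[M]}{\sigma[M]}\]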

Functions

generate_ranks(num_candidates[, ...])

Generate random ranks from a given array of the number of candidates for each ranking task.

generate_num_candidates_and_ranks(num_ranks, ...)

Generate a random number of candidates, and coherent ranks.

generalized_harmonic_numbers(n[, p])

Calculate the generalized harmonic numbers from 1 to n (both inclusive).

harmonic_variances(n)

Pre-calculate variances of inverse rank distributions.

Classes

RankBasedMetric()

A base class for rank-based metrics.

DerivedRankBasedMetric([base_cls])

A derived rank-based metric.

ExpectationNormalizedMetric([base_cls])

An adjustment to create an expectation-normalized metric.

ReindexedMetric([base_cls])

A mixin to create an expectation normalized metric with max of 1 and expectation of 0.

ZMetric([base_cls])

A z-score adjusted metric.

ArithmeticMeanRank()

The (arithmetic) mean rank.

AdjustedArithmeticMeanRank([base_cls])

The adjusted arithmetic mean rank (AMR).

AdjustedArithmeticMeanRankIndex([base_cls])

The adjusted arithmetic mean rank index (AMRI).

ZArithmeticMeanRank([base_cls])

The z-scored arithmetic mean rank.

InverseArithmeticMeanRank()

The inverse arithmetic mean rank.

GeometricMeanRank()

The (weighted) geometric mean rank.

AdjustedGeometricMeanRankIndex([base_cls])

The adjusted geometric mean rank index (AGMRI).

ZGeometricMeanRank([base_cls])

The z geometric mean rank (zGMR).

InverseGeometricMeanRank()

The inverse geometric mean rank.

HarmonicMeanRank()

The harmonic mean rank.

InverseHarmonicMeanRank()

The inverse harmonic mean rank.

AdjustedInverseHarmonicMeanRank([base_cls])

The adjusted MRR index.

ZInverseHarmonicMeanRank([base_cls])

The z-inverse harmonic mean rank (ZIHMR).

MedianRank()

The median rank.

InverseMedianRank()

The inverse median rank.

HitsAtK([k])

The Hits @ k.

AdjustedHitsAtK([base_cls])

The adjusted Hits at K (\(AH_k\)).

ZHitsAtK([base_cls])

The z-scored hits at k (\(ZAH_k\)).

StandardDeviation()

The ranks' standard deviation.

Variance()

The ranks' variance.

Count()

The ranks' count.

NoClosedFormError

The metric does not provide a closed-form implementation for the requested operation.

AffineTransformationParameters([scale, offset])

The parameters of an affine transformation.

Class Inheritance Diagram

[Class inheritance diagram: Metric → RankBasedMetric, from which the base metrics (ArithmeticMeanRank, GeometricMeanRank, HarmonicMeanRank, MedianRank, their inverse variants, HitsAtK, StandardDeviation, Variance, and Count) derive; RankBasedMetric → DerivedRankBasedMetric (also deriving from ABC) → {ExpectationNormalizedMetric → AdjustedArithmeticMeanRank; ReindexedMetric → {AdjustedArithmeticMeanRankIndex, AdjustedGeometricMeanRankIndex, AdjustedHitsAtK, AdjustedInverseHarmonicMeanRank}; ZMetric → {ZArithmeticMeanRank, ZGeometricMeanRank, ZHitsAtK, ZInverseHarmonicMeanRank}}; NoClosedFormError and AffineTransformationParameters are standalone.]

Hyper-parameter Optimization

class HpoPipelineResult(study, objective)[source]

A container for the results of the HPO pipeline.

Parameters:
  • study (Study) –

  • objective (Objective) –

objective: Objective

The objective class, containing information on preset hyper-parameters and those to optimize

replicate_best_pipeline(*, directory, replicates, move_to_cpu=False, save_replicates=True, save_training=False)[source]

Run the pipeline on the best configuration, but this time on the “test” set instead of the “evaluation” set.

Parameters:
  • directory (Union[str, Path]) – Output directory

  • replicates (int) – The number of times to retrain the model

  • move_to_cpu (bool) – Should the model be moved back to the CPU? Only relevant if training on GPU.

  • save_replicates (bool) – Should the artifacts of the replicates be saved?

  • save_training (bool) – Should the training triples be saved?

Raises:

ValueError – if "use_testing_data" is provided in the best pipeline’s config.

Return type:

None

save_to_directory(directory, **kwargs)[source]

Dump the results of a study to the given directory.

Return type:

None

Parameters:

directory (str | Path) –

save_to_ftp(directory, ftp)[source]

Save the results to the directory in an FTP server.

Parameters:
  • directory (str) – The directory in the FTP server to save to

  • ftp (FTP) – A connection to the FTP server

save_to_s3(directory, bucket, s3=None)[source]

Save all artifacts to the given directory in an S3 Bucket.

Parameters:
  • directory (str) – The directory in the S3 bucket

  • bucket (str) – The name of the S3 bucket

  • s3 – A client from boto3.client(), if already instantiated

Return type:

None

study: Study

The optuna study object

hpo_pipeline_from_path(path, **kwargs)[source]

Run a HPO study from the configuration at the given path.

Return type:

HpoPipelineResult

Parameters:

path (str | Path) –

hpo_pipeline_from_config(config, **kwargs)[source]

Run the HPO pipeline using a properly formatted configuration dictionary.

Return type:

HpoPipelineResult

Parameters:

config (Mapping[str, Any]) –

hpo_pipeline(*, dataset=None, dataset_kwargs=None, training=None, testing=None, validation=None, evaluation_entity_whitelist=None, evaluation_relation_whitelist=None, model, model_kwargs=None, model_kwargs_ranges=None, loss=None, loss_kwargs=None, loss_kwargs_ranges=None, regularizer=None, regularizer_kwargs=None, regularizer_kwargs_ranges=None, optimizer=None, optimizer_kwargs=None, optimizer_kwargs_ranges=None, lr_scheduler=None, lr_scheduler_kwargs=None, lr_scheduler_kwargs_ranges=None, training_loop=None, training_loop_kwargs=None, negative_sampler=None, negative_sampler_kwargs=None, negative_sampler_kwargs_ranges=None, epochs=None, training_kwargs=None, training_kwargs_ranges=None, stopper=None, stopper_kwargs=None, evaluator=None, evaluator_kwargs=None, evaluation_kwargs=None, metric=None, filter_validation_when_testing=True, result_tracker=None, result_tracker_kwargs=None, device=None, storage=None, sampler=None, sampler_kwargs=None, pruner=None, pruner_kwargs=None, study_name=None, direction=None, load_if_exists=False, n_trials=None, timeout=None, gc_after_trial=None, n_jobs=None, save_model_directory=None)[source]

Train a model on the given dataset.

Return type:

HpoPipelineResult

Returns:

the optimization result

Raises:

ValueError – if early stopping is enabled, but the number of epochs is to be optimized, too.
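A minimal invocation could look as follows; the dataset, model, trial budget, and output directory are illustrative choices only:

from pykeen.hpo import hpo_pipeline

# Run 30 HPO trials with default search spaces for TransE on Nations
hpo_result = hpo_pipeline(
    dataset='Nations',
    model='TransE',
    n_trials=30,
)
hpo_result.save_to_directory('nations_transe_hpo')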

Ablation

ablation_pipeline(datasets, directory, models, losses, optimizers, training_loops, *, epochs=None, create_inverse_triples=False, regularizers=None, negative_sampler=None, evaluator=None, stopper='NopStopper', model_to_model_kwargs=None, model_to_model_kwargs_ranges=None, model_to_loss_to_loss_kwargs=None, model_to_loss_to_loss_kwargs_ranges=None, model_to_optimizer_to_optimizer_kwargs=None, model_to_optimizer_to_optimizer_kwargs_ranges=None, model_to_negative_sampler_to_negative_sampler_kwargs=None, model_to_negative_sampler_to_negative_sampler_kwargs_ranges=None, model_to_training_loop_to_training_loop_kwargs=None, model_to_training_loop_to_training_kwargs=None, model_to_training_loop_to_training_kwargs_ranges=None, model_to_regularizer_to_regularizer_kwargs=None, model_to_regularizer_to_regularizer_kwargs_ranges=None, evaluator_kwargs=None, evaluation_kwargs=None, stopper_kwargs=None, n_trials=5, timeout=3600, metric='hits@10', direction='maximize', sampler='random', pruner='nop', metadata=None, save_artifacts=True, move_to_cpu=True, dry_run=False, best_replicates=None, discard_replicates=False, create_unique_subdir=False)[source]

Run ablation study.

Parameters:
  • datasets (Union[str, List[str]]) – A dataset name or list of dataset names.

  • directory (Union[str, Path]) – The directory in which the experimental artifacts will be saved.

  • models (Union[str, List[str]]) – A model name or list of model names.

  • losses (Union[str, List[str]]) – A loss function name or list of loss function names.

  • optimizers (Union[str, List[str]]) – An optimizer name or list of optimizer names.

  • training_loops (Union[str, List[str]]) – A training loop name or list of training loop names.

  • epochs (Optional[int]) – A quick way to set the num_epochs in the training kwargs.

  • create_inverse_triples (Union[bool, List[bool]]) – Either a boolean for a single entry or a list of booleans.

  • regularizers (Union[None, str, List[str]]) – A regularizer name, list of regularizer names, or None if no regularizer is desired.

  • negative_sampler (Optional[str]) – A negative sampler name, a list of negative sampler names, or None if no negative sampler is desired. Negative sampling is used only in combination with pykeen.training.SLCWATrainingLoop.

  • evaluator (Optional[str]) – The name of the evaluator to be used. Defaults to rank-based evaluator.

  • stopper (Optional[str]) – The name of the stopper to be used. Defaults to NopStopper which doesn’t define a stopping criterion.

  • model_to_model_kwargs (Optional[Mapping[str, Mapping[str, Any]]]) – A mapping from model name to dictionaries of default keyword arguments for the instantiation of that model.

  • model_to_model_kwargs_ranges (Optional[Mapping[str, Mapping[str, Any]]]) – A mapping from model name to dictionaries of keyword argument ranges for that model to be used in HPO.

  • model_to_loss_to_loss_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of loss name to a mapping of default keyword arguments for the instantiation of that loss function. This is useful because some losses have hyper-parameters, such as the margin in pykeen.losses.MarginRankingLoss.

  • model_to_loss_to_loss_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of loss name to a mapping of keyword argument ranges for that loss to be used in HPO.

  • model_to_optimizer_to_optimizer_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of optimizer name to a mapping of default keyword arguments for the instantiation of that optimizer. This is useful because optimizers have hyper-parameters such as the learning rate.

  • model_to_optimizer_to_optimizer_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of optimizer name to a mapping of keyword argument ranges for that optimizer to be used in HPO.

  • model_to_regularizer_to_regularizer_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of regularizer name to a mapping of default keyword arguments for the instantiation of that regularizer. This is useful because regularizers have hyper-parameters such as the regularization weight.

  • model_to_regularizer_to_regularizer_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of regularizer name to a mapping of keyword argument ranges for that regularizer to be used in HPO.

  • model_to_negative_sampler_to_negative_sampler_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of negative sampler name to a mapping of default keyword arguments for the instantiation of that negative sampler. This is useful because negative samplers have hyper-parameters such as the number of negatives that should be generated for each positive training example.

  • model_to_negative_sampler_to_negative_sampler_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of negative sampler name to a mapping of keyword argument ranges for that negative sampler to be used in HPO.

  • model_to_training_loop_to_training_loop_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of training loop name to a mapping of default keyword arguments for the training loop.

  • model_to_training_loop_to_training_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of trainer name to a mapping of default keyword arguments for the training procedure. This is useful because you can set hyper-parameters such as the number of training epochs and the batch size.

  • model_to_training_loop_to_training_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of trainer name to a mapping of keyword argument ranges for that trainer to be used in HPO.

  • evaluator_kwargs (Optional[Mapping[str, Any]]) – The keyword arguments passed to the evaluator.

  • evaluation_kwargs (Optional[Mapping[str, Any]]) – The keyword arguments passed during evaluation.

  • stopper_kwargs (Optional[Mapping[str, Any]]) – The keyword arguments passed to the stopper.

  • n_trials (Optional[int]) – Number of HPO trials.

  • timeout (Optional[int]) – The time (seconds) after which the ablation study will be terminated.

  • metric (Optional[str]) – The metric to optimize during HPO.

  • direction (Optional[str]) – Defines whether to ‘maximize’ or ‘minimize’ the metric during HPO.

  • sampler (Optional[str]) – The HPO sampler; it defaults to random search.

  • pruner (Optional[str]) – Defines the approach for pruning trials. By default, no pruning is used, i.e., the pruner is set to ‘nop’.

  • metadata (Optional[Mapping]) – A mapping of metadata arguments, such as the name of the ablation study.

  • save_artifacts (bool) – Defines whether each trained model sampled during HPO should be saved.

  • move_to_cpu (bool) – Defines whether a replicate of the best model should be moved to CPU.

  • dry_run (bool) – Defines whether only the configurations for the single experiments should be created, without running them.

  • best_replicates (Optional[int]) – Defines how often the final model should be re-trained and evaluated based on the best hyper-parameters, enabling measurement of the variance in performance.

  • discard_replicates (bool) – Defines whether the best model should be discarded after training and evaluation.

  • create_unique_subdir (bool) – Defines whether a unique sub-directory for the experimental artifacts should be created. The sub-directory name is composed of the current date plus a unique id.
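A minimal invocation could look as follows; the concrete datasets, models, losses, optimizers, and output directory are illustrative choices only:

from pykeen.ablation import ablation_pipeline

# Compare two models on one dataset, with a small HPO budget per configuration
ablation_pipeline(
    datasets='Nations',
    models=['TransE', 'PairRE'],
    losses='MarginRankingLoss',
    optimizers='Adam',
    training_loops='sLCWA',
    directory='ablation_results',
    epochs=50,
    n_trials=2,
)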

prepare_ablation_from_config(config, directory, save_artifacts)[source]

Prepare a set of ablation study directories.

Parameters:
  • config (Mapping[str, Any]) – Dictionary defining the ablation studies.

  • directory (Union[str, Path]) – The directory in which the experimental artifacts (including the ablation configurations) will be saved.

  • save_artifacts (bool) – Defines whether the output directories for the trained models sampled during HPO should be created.

Return type:

List[Tuple[Path, Path]]

Returns:

pairs of output directories and HPO config paths inside those directories

prepare_ablation(datasets, models, losses, optimizers, training_loops, directory, *, epochs=None, create_inverse_triples=False, regularizers=None, negative_sampler=None, evaluator=None, model_to_model_kwargs=None, model_to_model_kwargs_ranges=None, model_to_loss_to_loss_kwargs=None, model_to_loss_to_loss_kwargs_ranges=None, model_to_optimizer_to_optimizer_kwargs=None, model_to_optimizer_to_optimizer_kwargs_ranges=None, model_to_training_loop_to_training_loop_kwargs=None, model_to_neg_sampler_to_neg_sampler_kwargs=None, model_to_neg_sampler_to_neg_sampler_kwargs_ranges=None, model_to_training_loop_to_training_kwargs=None, model_to_training_loop_to_training_kwargs_ranges=None, model_to_regularizer_to_regularizer_kwargs=None, model_to_regularizer_to_regularizer_kwargs_ranges=None, n_trials=5, timeout=3600, metric='hits@10', direction='maximize', sampler='random', pruner='nop', evaluator_kwargs=None, evaluation_kwargs=None, stopper='NopStopper', stopper_kwargs=None, metadata=None, save_artifacts=True)[source]

Prepare an ablation directory.

Parameters:
  • datasets (Union[str, List[str]]) – A dataset name or list of dataset names.

  • models (Union[str, List[str]]) – A model name or list of model names.

  • losses (Union[str, List[str]]) – A loss function name or list of loss function names.

  • optimizers (Union[str, List[str]]) – An optimizer name or list of optimizer names.

  • training_loops (Union[str, List[str]]) – A training loop name or list of training loop names.

  • epochs (Optional[int]) – A quick way to set the num_epochs in the training kwargs.

  • create_inverse_triples (Union[bool, List[bool]]) – Either a boolean for a single entry or a list of booleans.

  • regularizers (Union[None, str, List[str], List[None]]) – A regularizer name, list of regularizer names, or None if no regularizer is desired.

  • negative_sampler (Optional[str]) – A negative sampler name, a list of negative sampler names, or None if no negative sampler is desired. Negative sampling is used only in combination with pykeen.training.SLCWATrainingLoop.

  • evaluator (Optional[str]) – The name of the evaluator to be used. Defaults to rank-based evaluator.

  • stopper (Optional[str]) – The name of the stopper to be used. Defaults to NopStopper which doesn’t define a stopping criterion.

  • model_to_model_kwargs (Optional[Mapping[str, Mapping[str, Any]]]) – A mapping from model name to dictionaries of default keyword arguments for the instantiation of that model.

  • model_to_model_kwargs_ranges (Optional[Mapping[str, Mapping[str, Any]]]) – A mapping from model name to dictionaries of keyword argument ranges for that model to be used in HPO.

  • model_to_loss_to_loss_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of loss name to a mapping of default keyword arguments for the instantiation of that loss function. This is useful because some losses have hyper-parameters, such as the margin in pykeen.losses.MarginRankingLoss.

  • model_to_loss_to_loss_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of loss name to a mapping of keyword argument ranges for that loss to be used in HPO.

  • model_to_optimizer_to_optimizer_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of optimizer name to a mapping of default keyword arguments for the instantiation of that optimizer. This is useful because optimizers have hyper-parameters such as the learning rate.

  • model_to_optimizer_to_optimizer_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of optimizer name to a mapping of keyword argument ranges for that optimizer to be used in HPO.

  • model_to_regularizer_to_regularizer_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of regularizer name to a mapping of default keyword arguments for the instantiation of that regularizer. This is useful because regularizers have hyper-parameters such as the regularization weight.

  • model_to_regularizer_to_regularizer_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of regularizer name to a mapping of keyword argument ranges for that regularizer to be used in HPO.

  • model_to_neg_sampler_to_neg_sampler_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of negative sampler name to a mapping of default keyword arguments for the instantiation of that negative sampler. This is useful because negative samplers have hyper-parameters such as the number of negatives that should be generated for each positive training example.

  • model_to_neg_sampler_to_neg_sampler_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of negative sampler name to a mapping of keyword argument ranges for that negative sampler to be used in HPO.

  • model_to_training_loop_to_training_loop_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of training loop name to a mapping of default keyword arguments for the training loop.

  • model_to_training_loop_to_training_kwargs (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of trainer name to a mapping of default keyword arguments for the training procedure. This is useful because you can set hyper-parameters such as the number of training epochs and the batch size.

  • model_to_training_loop_to_training_kwargs_ranges (Optional[Mapping[str, Mapping[str, Mapping[str, Any]]]]) – A mapping from model name to a mapping of trainer name to a mapping of keyword argument ranges for that trainer to be used in HPO.

  • evaluator_kwargs (Optional[Mapping[str, Any]]) – The keyword arguments passed to the evaluator.

  • evaluation_kwargs (Optional[Mapping[str, Any]]) – The keyword arguments passed during evaluation.

  • stopper_kwargs (Optional[Mapping[str, Any]]) – The keyword arguments passed to the stopper.

  • n_trials (Optional[int]) – Number of HPO trials.

  • timeout (Optional[int]) – The time (seconds) after which the ablation study will be terminated.

  • metric (Optional[str]) – The metric to optimize during HPO.

  • direction (Optional[str]) – Defines whether to ‘maximize’ or ‘minimize’ the metric during HPO.

  • sampler (Optional[str]) – The HPO sampler; it defaults to random search.

  • pruner (Optional[str]) – Defines the approach for pruning trials. By default, no pruning is used, i.e., the pruner is set to ‘nop’.

  • metadata (Optional[Mapping]) – A mapping of metadata arguments, such as the name of the ablation study.

  • directory (Union[str, Path]) – The directory in which the experimental artifacts will be saved.

  • save_artifacts (bool) – Defines whether each trained model sampled during HPO should be saved.

Return type:

List[Tuple[Path, Path]]

Returns:

pairs of output directories and HPO config paths inside those directories.

Raises:

ValueError – If the dataset is not specified correctly, i.e., dataset is not of type str, or a dictionary containing the paths to the training, testing, and validation data.

Lookup

model_resolver: ClassResolver[Model] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

loss_resolver: ClassResolver[Loss] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

optimizer_resolver = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

regularizer_resolver: ClassResolver[Regularizer] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

stopper_resolver: ClassResolver[Stopper] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

negative_sampler_resolver: ClassResolver[NegativeSampler] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

dataset_resolver: ClassResolver[Dataset] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

training_loop_resolver: ClassResolver[TrainingLoop] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

evaluator_resolver: ClassResolver[Evaluator] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.

metric_resolver: ClassResolver[MetricResults] = <class_resolver.api.ClassResolver object>

Resolve from a list of classes.
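These resolvers implement the string-based lookup used throughout functions such as pykeen.pipeline.pipeline(). For example (a sketch; model_resolver can be imported from pykeen.models, and lookups normalize casing):

from pykeen.models import model_resolver

# Resolve a model class from its (case-insensitive) name
model_cls = model_resolver.lookup('transe')  # -> pykeen.models.TransE
assert model_cls.__name__ == 'TransE'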

Prediction

Prediction workflows.

After training, the interaction model (e.g., TransE, ConvE, RotatE) can assign a score to an arbitrary triple, whether it appeared during training, during testing, or not at all. In PyKEEN, each model is implemented such that the higher (or less negative) the score, the more likely a triple is to be true.

However, for most models, these scores do not have obvious statistical interpretations. This has two main consequences:

  1. The score for a triple from one model cannot be compared to the score for that triple from another model.

  2. There is no a priori minimum score for a triple to be labeled as true, so predictions must be given as a prioritization by sorting a set of triples by their respective scores.

For the remainder of this part of the documentation, we assume that we have trained a model, e.g. via

>>> from pykeen.pipeline import pipeline
>>> result = pipeline(dataset="nations", model="pairre", training_kwargs=dict(num_epochs=0))

High-Level

The prediction workflow offers three high-level methods to perform predictions.

Warning

Please note that not all models automatically have interpretable scores, and their calibration may be poor. Thus, exercise caution when interpreting the results.

Triple Scoring

When scoring triples with pykeen.predict.predict_triples(), we obtain a score for each of the given triples. As an example, we will calculate scores for all validation triples from the dataset we trained the model upon.

>>> from pykeen.datasets import get_dataset
>>> from pykeen.predict import predict_triples
>>> dataset = get_dataset(dataset="nations")
>>> pack = predict_triples(model=result.model, triples=dataset.validation)

The variable pack now contains a pykeen.predict.ScorePack, which is essentially a pair of ID-based triples with their predicted scores. For interpretation, it can be helpful to add the corresponding labels, which the “nations” dataset offers, and to convert the result to a pandas dataframe:

>>> df = pack.process(factory=result.training).df

Since we now have a dataframe, we can utilize the full power of pandas for our subsequent analysis, e.g., showing the triples which received the highest score

>>> df.nlargest(n=5, columns="score")

or investigate whether certain entities generally receive larger scores

>>> df.groupby(by=["head_id", "head_label"]).agg({"score": ["mean", "std", "count"]})

Target Scoring

pykeen.predict.predict_target()’s primary use case is link prediction or relation prediction. For instance, we could use our model to score all possible tail entities for the query (“uk”, “conferences”, ?) via

>>> from pykeen.datasets import get_dataset
>>> from pykeen.predict import predict_target
>>> dataset = get_dataset(dataset="nations")
>>> pred = predict_target(
...     model=result.model,
...     head="uk",
...     relation="conferences",
...     triples_factory=result.training,
... )

Notice that the result stored in pred is a pykeen.predict.Predictions object, which offers some post-processing options. For instance, we can remove all targets which are already known from the training set

>>> pred_filtered = pred.filter_triples(dataset.training)

or add columns to the dataframe indicating whether the target is contained in another set, e.g., the validation or testing set.

>>> pred_annotated = pred_filtered.add_membership_columns(validation=dataset.validation, testing=dataset.testing)

The predictions object also exposes the filtered/annotated dataframe through its df attribute

>>> pred_annotated.df

Full Scoring

Finally, we can use pykeen.predict.predict_all() to calculate scores for all possible triples. Notice that this operation can be prohibitively expensive for reasonably sized knowledge graphs, and that the model may produce ill-calibrated scores for entity/relation combinations it has never seen paired during training. The next lines calculate and store all triples and their scores

>>> from pykeen.predict import predict_all
>>> pack = predict_all(model=result.model)

In addition to the expensive computation, this requires sufficient memory to store all scores. A computationally equally expensive option with a reduced, fixed memory requirement is to store only the triples with the top \(k\) scores. This can be done through the optional parameter k

>>> pack = predict_all(model=result.model, k=10)

We can again convert the score pack to a predictions object for further post-processing, e.g., adding a column indicating whether the triple has been seen during training

>>> pred = pack.process(factory=result.training)
>>> pred_annotated = pred.add_membership_columns(training=result.training)
>>> pred_annotated.df

Low-Level

The following section outlines some details about the implementation of operations which require calculating scores for all triples. The algorithm works as follows:

# schematic: iterate over the prediction tasks in batches
for batch in DataLoader(dataset, batch_size=batch_size):
  # calculate the scores for the current batch only once
  scores = model.predict(batch)
  # let every registered consumer process the batch and its scores
  for consumer in consumers:
    consumer(batch, scores)

Here, dataset is a pykeen.predict.PredictionDataset, which breaks the score calculation down into individual target predictions (e.g., tail predictions). Implementations include pykeen.predict.AllPredictionDataset and pykeen.predict.PartiallyRestrictedPredictionDataset. Notice that the prediction tasks are built lazily, i.e., a prediction task is only instantiated when it is accessed. Moreover, the torch_max_mem package is used to automatically tune the batch size to maximize the memory utilization of the hardware at hand.

For each batch, the scores of the prediction task are calculated once. Afterwards, multiple consumers can process these scores. A consumer extends pykeen.predict.ScoreConsumer and receives the batch, i.e., the input to the predict method, as well as the tensor of predicted scores. Examples include the following (a combined usage sketch follows the list):

  • pykeen.predict.CountScoreConsumer: a simple consumer which only counts how many scores it has seen; mostly used for debugging or testing purposes.

  • pykeen.predict.AllScoreConsumer: accumulates all scores into a single huge tensor. This incurs massive memory requirements for reasonably sized datasets, and can often be avoided by interleaving the processing of the scores with the calculation of individual batches.

  • pykeen.predict.TopKScoreConsumer: keeps only the top \(k\) scores as well as the inputs leading to them. This is a memory-efficient variant of first accumulating all scores, then sorting by score and keeping only the top entries.
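
To tie these pieces together, the following minimal sketch feeds every batch of scores to two consumers at once. Note that the construction of AllPredictionDataset from entity and relation counts is an assumption based on its signature shown below; consult the class documentation for the exact arguments.

from pykeen.datasets import get_dataset
from pykeen.pipeline import pipeline
from pykeen.predict import (
    AllPredictionDataset,
    CountScoreConsumer,
    TopKScoreConsumer,
    consume_scores,
)

# train a toy model (zero epochs keeps the sketch fast)
result = pipeline(dataset="nations", model="pairre", training_kwargs=dict(num_epochs=0))
dataset = get_dataset(dataset="nations")

# assumed construction: a prediction dataset over all possible triples
prediction_dataset = AllPredictionDataset(
    num_entities=dataset.num_entities,
    num_relations=dataset.num_relations,
)

# each consumer processes every batch of scores as it is calculated
count_consumer = CountScoreConsumer()
top_k_consumer = TopKScoreConsumer(k=10)
consume_scores(result.model, prediction_dataset, count_consumer, top_k_consumer)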

Potential Caveats

The model is trained on a particular link prediction task, e.g., to predict the appropriate tail for a given head/relation pair. This means that while the model can technically also predict other links, e.g., relations between a given head/tail pair, this must be done with the caveat that it was not trained for this task, and thus its scores may behave unexpectedly.

Migration Guide

Until version 1.9, the model itself provided wrappers which would delegate to the corresponding methods in pykeen.models.predict:

  • model.get_all_prediction_df

  • model.get_prediction_df

  • model.get_head_prediction_df

  • model.get_relation_prediction_df

  • model.get_tail_prediction_df

These methods were already deprecated and could be replaced by providing the model as an explicit parameter to the stand-alone functions from the prediction module. Thus, we will focus on migrating the stand-alone functions.

In the pykeen.models.predict module, the prediction methods were organized differently. There were:

  • get_prediction_df

  • get_head_prediction_df

  • get_relation_prediction_df

  • get_tail_prediction_df

  • get_all_prediction_df

  • predict_triples_df

where get_head_prediction_df, get_relation_prediction_df, and get_tail_prediction_df were deprecated in favour of directly using get_prediction_df with all but the prediction target provided. For example, the call

>>> from pykeen.models import predict
>>> predict.get_tail_prediction_df(
...     model=model,
...     head_label="brazil",
...     relation_label="intergovorgs",
...     triples_factory=result.training,
... )

was deprecated in favour of

>>> from pykeen.models import predict
>>> predict.get_prediction_df(
...     model=model,
...     head_label="brazil",
...     relation_label="intergovorgs",
...     triples_factory=result.training,
... )

get_prediction_df

The old use of

>>> from pykeen.models import predict
>>> predict.get_prediction_df(
...     model=model,
...     head_label="brazil",
...     relation_label="intergovorgs",
...     triples_factory=result.training,
... )

can be replaced by

>>> from pykeen import predict
>>> predict.predict_target(
...     model=model,
...     head="brazil",
...     relation="intergovorgs",
...     triples_factory=result.training,
... ).df

Notice the trailing .df.

get_all_prediction_df

The old use of

>>> from pykeen.models import predict
>>> predictions_df = predict.get_all_prediction_df(model, triples_factory=result.training)

can be replaced by

>>> from pykeen import predict
>>> predict.predict_all(model=model).process(factory=result.training).df

predict_triples_df

The old use of

>>> from pykeen.models import predict
>>> score_df = predict.predict_triples_df(
...     model=model,
...     triples=[("brazil", "conferences", "uk"), ("brazil", "intergovorgs", "uk")],
...     triples_factory=result.training,
... )

can be replaced by

>>> from pykeen import predict
>>> score_df = predict.predict_triples(
...     model=model,
...     triples=[("brazil", "conferences", "uk"), ("brazil", "intergovorgs", "uk")],
...     triples_factory=result.training,
... )

Functions

predict_all(model, *[, k, batch_size, mode, ...])

Calculate scores for all triples, and either keep all of them or only the top k triples.

predict_triples(model, *, triples[, ...])

Predict on labeled or mapped triples.

predict_target(model, *[, head, relation, ...])

Get predictions for the head, relation, and/or tail combination.

consume_scores(model, dataset, *consumers[, ...])

Batch-wise calculation of all triple scores and consumption.

Classes

ScoreConsumer()

A consumer of scores for the visitor pattern.

CountScoreConsumer()

A simple consumer which counts the number of batches and scores.

TopKScoreConsumer([k, device])

Collect top-k triples & scores.

AllScoreConsumer(num_entities, num_relations)

Collect scores for all triples.

ScorePack(result, scores)

A pair of result triples and scores.

Predictions(df, factory)

Base class for predictions.

TriplePredictions(df, factory)

Triples with their predicted scores.

TargetPredictions(df, factory, target, ...)

Targets with their predicted scores.

PredictionDataset([target])

A base class for prediction datasets.

AllPredictionDataset(num_entities, ...)

A dataset for predicting all possible triples.

PartiallyRestrictedPredictionDataset(*[, ...])

A dataset for scoring some links.

Class Inheritance Diagram

(Inheritance diagram: Dataset → PredictionDataset → {AllPredictionDataset, PartiallyRestrictedPredictionDataset}; ScoreConsumer → {CountScoreConsumer, TopKScoreConsumer, AllScoreConsumer}; Predictions → {TriplePredictions, TargetPredictions}.)

Uncertainty

Analyze uncertainty.

Currently, all implemented approaches are based on Monte-Carlo dropout [gal2016]. Monte-Carlo dropout relies on the model having dropout layers. While dropout is usually turned off in inference / evaluation mode, MC dropout leaves dropout enabled. Thus, if we run the same prediction method \(k\) times, we get \(k\) different predictions. The variance of these predictions can be used as an approximation of uncertainty, where larger variance indicates higher uncertainty in the predicted score.

The absolute variance is usually hard to interpret, but comparing the variances with each other can help to identify which scores are more uncertain than others.

The following code block sketches an example use case, where we train a model with a classification loss, i.e., on the triple classification task.

from pykeen.pipeline import pipeline
from pykeen.models.uncertainty import predict_hrt_uncertain

# train model
# note: as this is an example, the model is only trained for a few epochs,
#       not until convergence. In practice, you would usually first verify that
#       the model predicts sufficiently well before looking at uncertainty scores
result = pipeline(dataset="nations", model="ERMLPE", loss="bcewithlogits")

# predict triple scores with uncertainty
prediction_with_uncertainty = predict_hrt_uncertain(
    model=result.model,
    hrt_batch=result.training.mapped_triples[0:8],
)

# use a larger number of samples to increase the quality of the uncertainty estimate
prediction_with_uncertainty = predict_hrt_uncertain(
    model=result.model,
    hrt_batch=result.training.mapped_triples[0:8],
    num_samples=100,
)

# get the most and least uncertain predictions on the training set
prediction_with_uncertainty = predict_hrt_uncertain(
    model=result.model,
    hrt_batch=result.training.mapped_triples,
    num_samples=100,
)
df = result.training.tensor_to_df(
    result.training.mapped_triples,
    logits=prediction_with_uncertainty.score[:, 0],
    probability=prediction_with_uncertainty.score[:, 0].sigmoid(),
    uncertainty=prediction_with_uncertainty.uncertainty[:, 0],
)
print(df.nlargest(5, columns="uncertainty"))
print(df.nsmallest(5, columns="uncertainty"))

A collection of related work on uncertainty quantification can be found here: https://github.com/uncertainty-toolbox/uncertainty-toolbox/blob/master/docs/paper_list.md

Functions

predict_hrt_uncertain(model, hrt_batch[, ...])

Calculate the scores with uncertainty quantification via Monte-Carlo dropout.

predict_h_uncertain(model, rt_batch[, ...])

Forward pass using left side (head) prediction for obtaining scores of all possible heads.

predict_t_uncertain(model, hr_batch[, ...])

Forward pass using right side (tail) prediction for obtaining scores of all possible tails.

predict_r_uncertain(model, ht_batch[, ...])

Forward pass using middle (relation) prediction for obtaining scores of all possible relations.

predict_uncertain_helper(model, batch, ...)

Predict with uncertainty estimates via Monte-Carlo dropout.

Classes

MissingDropoutError

Raised during uncertainty analysis if no dropout modules are present.

UncertainPrediction(score, uncertainty)

A pair of predicted scores and corresponding uncertainty.

Sealant

Tools for removing leakage from datasets.

Leakage occurs when the inverse of a given training triple appears in either the testing or validation set. This scenario generally leads to inflated and misleading evaluation results, because predicting an inverse triple is usually very easy and not a sign of a model’s ability to generalize to novel triples.

class Sealant(triples_factory, minimum_frequency=None, symmetric=True)[source]

Stores inverse frequencies and inverse mappings in a given triples factory.

Index the inverse frequencies and the inverse relations in the triples factory.

Parameters:
  • triples_factory (CoreTriplesFactory) – The triples factory to index.

  • minimum_frequency (Optional[float]) – The minimum overlap between two relations’ triples to consider them as inverses. The default value, 0.97, is taken from Toutanova and Chen (2015), who originally described the generation of FB15k-237.

  • symmetric (bool) – Whether the similarities are computed symmetrically.

Raises:

NotImplementedError – If symmetric is False

apply(triples_factory)[source]

Make a new triples factory containing neither duplicate nor inverse relationships.

Return type:

CoreTriplesFactory

Parameters:

triples_factory (CoreTriplesFactory) –

reindex(*triples_factories)[source]

Reindex a set of triples factories.

Return type:

List[CoreTriplesFactory]

Parameters:

triples_factories (CoreTriplesFactory) –

unleak(train, *triples_factories, n=None, minimum_frequency=None)[source]

Unleak a train, test, and validate triples factory.

Parameters:
  • train (CoreTriplesFactory) – The target triples factory

  • triples_factories (CoreTriplesFactory) – All other triples factories (test, validate, etc.)

  • n (Union[None, int, float]) – Either the (integer) number of top relations to keep or the (float) percentage of top relations to keep. If None, frequent relations are not removed.

  • minimum_frequency (Optional[float]) –

    The minimum overlap between two relations’ triples to consider them as inverses or duplicates. The default value, 0.97, is taken from Toutanova and Chen (2015), who originally described the generation of FB15k-237.

Return type:

Iterable[CoreTriplesFactory]

Returns:

A sequence of reindexed triples factories
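
A minimal sketch of de-leaking a dataset’s splits; the module path pykeen.triples.leakage is an assumption, so check your installed version:

from pykeen.datasets import get_dataset
from pykeen.triples.leakage import unleak  # assumed module path

dataset = get_dataset(dataset="nations")
# returns reindexed factories with inverse/duplicate relations removed
train, test, validate = unleak(dataset.training, dataset.testing, dataset.validation)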

Constants

Constants for PyKEEN.

PYKEEN_BENCHMARKS: Path

A subdirectory of the PyKEEN data folder for benchmarks, defaults to ~/.data/pykeen/benchmarks

PYKEEN_CHECKPOINTS: Path

A subdirectory of the PyKEEN data folder for checkpoints, defaults to ~/.data/pykeen/checkpoints

PYKEEN_DATASETS: Path

A subdirectory of the PyKEEN data folder for datasets, defaults to ~/.data/pykeen/datasets

PYKEEN_EXPERIMENTS: Path

A subdirectory of the PyKEEN data folder for experiments, defaults to ~/.data/pykeen/experiments

PYKEEN_HOME: Path

A path representing the PyKEEN data folder, defaults to ~/.data/pykeen

PYKEEN_LOGS: Path

A subdirectory for PyKEEN logs
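
These paths can be imported and inspected directly, e.g., to find where PyKEEN caches datasets on your machine:

>>> from pykeen.constants import PYKEEN_DATASETS, PYKEEN_HOME
>>> PYKEEN_HOME  # the PyKEEN data folder, by default under ~/.data/pykeen
>>> PYKEEN_DATASETS  # where downloaded datasets are cached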

Type Hints

Type hints for PyKEEN.

Constrainer

A function that can be applied to a tensor to constrain it

alias of Callable[[FloatTensor], FloatTensor]

DeviceHint

A hint for a torch.device

alias of Optional[Union[str, device]]

class GaussianDistribution(mean: torch.FloatTensor, diagonal_covariance: torch.FloatTensor)[source]

A Gaussian distribution with diagonal covariance matrix.

Create new instance of GaussianDistribution(mean, diagonal_covariance)

Parameters:
  • mean (FloatTensor) –

  • diagonal_covariance (FloatTensor) –

diagonal_covariance: FloatTensor

Alias for field number 1

mean: FloatTensor

Alias for field number 0
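
For example, a diagonal Gaussian over three-dimensional embeddings can be constructed as follows (a minimal sketch, assuming the class is exposed from pykeen.typing):

>>> import torch
>>> from pykeen.typing import GaussianDistribution
>>> dist = GaussianDistribution(mean=torch.zeros(3), diagonal_covariance=torch.ones(3))
>>> dist.mean.shape
torch.Size([3])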

class HeadRepresentation

A type variable for head representations used in pykeen.models.Model, pykeen.nn.modules.Interaction, etc.

alias of TypeVar(‘HeadRepresentation’, bound=Union[FloatTensor, Sequence[FloatTensor]])

InductiveMode

The inductive prediction and training mode.

alias of Literal[‘training’, ‘validation’, ‘testing’]

Initializer

A function that can be applied to a tensor to initialize it

alias of Callable[[FloatTensor], FloatTensor]

LabeledTriples

alias of ndarray

Mutation

A function that mutates the input and returns a new object of the same type as output

alias of Callable[[X], X]

Normalizer

A function that can be applied to a tensor to normalize it

alias of Callable[[FloatTensor], FloatTensor]
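
Any callable that maps a tensor to a tensor of the same shape satisfies the Constrainer, Initializer, and Normalizer aliases; a toy, hypothetical example:

>>> import torch
>>> def unit_norm(x: torch.FloatTensor) -> torch.FloatTensor:
...     """Rescale each row to unit L2 norm (a toy constrainer/normalizer)."""
...     return torch.nn.functional.normalize(x, p=2, dim=-1)
>>> x = unit_norm(torch.rand(5, 3))  # every row of x now has norm 1.0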

class RelationRepresentation

A type variable for relation representations used in pykeen.models.Model, pykeen.nn.modules.Interaction, etc.

alias of TypeVar(‘RelationRepresentation’, bound=Union[FloatTensor, Sequence[FloatTensor]])

class TailRepresentation

A type variable for tail representations used in pykeen.models.Model, pykeen.nn.modules.Interaction, etc.

alias of TypeVar(‘TailRepresentation’, bound=Union[FloatTensor, Sequence[FloatTensor]])

Target

The prediction target.

alias of Literal[‘head’, ‘relation’, ‘tail’]

TargetColumn

The prediction target index.

alias of Literal[0, 1, 2]

TorchRandomHint

A hint for a torch.Generator

alias of Union[None, int, Generator]

cast_constrainer(f)[source]

Cast a constrainer function with typing.cast().

Return type:

Callable[[FloatTensor], FloatTensor]

normalize_rank_type(rank)[source]

Normalize a rank type.

Return type:

Literal[‘optimistic’, ‘realistic’, ‘pessimistic’]

Parameters:

rank (str | None) –

normalize_target(target)[source]

Normalize a prediction target side.

Return type:

Union[Literal[‘head’, ‘relation’, ‘tail’], Literal[‘both’]]

Parameters:

target (str | None) –

pykeen.nn

PyKEEN internal “nn” module.

Functional

Functional forms of interaction methods.

These implementations allow for an arbitrary number of batch dimensions as well as broadcasting, and thus naturally support slicing and 1:n scoring.
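
As a quick illustration, the stateless TransE interaction can be applied directly to batches of representations (a minimal sketch; see the function listing below for the full signatures):

>>> import torch
>>> from pykeen.nn.functional import transe_interaction
>>> h, r, t = torch.rand(5, 32), torch.rand(5, 32), torch.rand(5, 32)
>>> scores = transe_interaction(h, r, t, p=2)  # one score per triple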

Functions

conve_interaction(h, r, t, t_bias, ...)

Evaluate the ConvE interaction function.

convkb_interaction(h, r, t, conv, ...)

Evaluate the ConvKB interaction function.

cp_interaction(h, r, t)

Evaluate the Canonical Tensor Decomposition interaction function.

cross_e_interaction(h, r, c_r, t, bias, ...)

Evaluate the interaction function of CrossE for the given representations from [zhang2019b].

dist_ma_interaction(h, r, t)

Evaluate the DistMA interaction function from [shi2019].

distmult_interaction(h, r, t)

Evaluate the DistMult interaction function.

ermlp_interaction(h, r, t, hidden, ...)

Evaluate the ER-MLP interaction function.

ermlpe_interaction(h, r, t, mlp)

Evaluate the ER-MLPE interaction function.

hole_interaction(h, r, t)

Evaluate the HolE interaction function.

kg2e_interaction(h_mean, h_var, r_mean, ...)

Evaluate the KG2E interaction function.

multilinear_tucker_interaction(h, r, t, ...)

Evaluate the (original) multi-linear TuckEr interaction function.

mure_interaction(h, b_h, r_vec, r_mat, t, b_t)

Evaluate the MuRE interaction function from [balazevic2019b].

ntn_interaction(h, t, w, vh, vt, b, u, ...)

Evaluate the NTN interaction function.

pair_re_interaction(h, t, r_h, r_t[, p, ...])

Evaluate the PairRE interaction function.

proje_interaction(h, r, t, d_e, d_r, b_c, ...)

Evaluate the ProjE interaction function.

rescal_interaction(h, r, t)

Evaluate the RESCAL interaction function.

simple_interaction(h, r, t, h_inv, r_inv, t_inv)

Evaluate the SimplE interaction function.

se_interaction(h, r_h, r_t, t, p[, power_norm])

Evaluate the Structured Embedding interaction function.

transd_interaction(h, r, t, h_p, r_p, t_p, p)

Evaluate the TransD interaction function.

transe_interaction(h, r, t[, p, power_norm])

Evaluate the TransE interaction function.

transf_interaction(h, r, t)

Evaluate the TransF interaction function.

transh_interaction(h, w_r, d_r, t, p[, ...])

Evaluate the TransH interaction function.

transr_interaction(h, r, t, m_r, p[, power_norm])

Evaluate the TransR interaction function.

transformer_interaction(h, r, t, ...)

Evaluate the Transformer interaction function, as described in [galkin2020].

triple_re_interaction(h, r_head, r_mid, ...)

Evaluate the TripleRE interaction function.

tucker_interaction(h, r, t, core_tensor, ...)

Evaluate the TuckEr interaction function.

um_interaction(h, t, p[, power_norm])

Evaluate the Unstructured Model (UM) interaction function.

linea_re_interaction(h, r_head, r_mid, r_tail, t)

Evaluate the LineaRE interaction function.

Stateful Interaction Modules

Stateful interaction functions.
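
For example, an interaction module is a torch.nn.Module and can be instantiated and called like one (a minimal sketch; output shapes may vary across versions):

>>> import torch
>>> from pykeen.nn.modules import TransEInteraction
>>> interaction = TransEInteraction(p=2)
>>> h, r, t = torch.rand(5, 32), torch.rand(5, 32), torch.rand(5, 32)
>>> scores = interaction(h, r, t)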

Classes

Interaction(*args, **kwargs)

Base class for interaction functions.

FunctionalInteraction(*args, **kwargs)

Base class for interaction functions.

NormBasedInteraction(p[, power_norm])

Norm-based interactions use a (powered) \(p\)-norm in their scoring function.

MonotonicAffineTransformationInteraction(base)

An adapter of interaction functions which adds a scalar (trainable) monotonic affine transformation of the score.

AutoSFInteraction(coefficients, *[, ...])

The AutoSF interaction as described by [zhang2020].

BoxEInteraction([tanh_map, p, power_norm])

The BoxE interaction from [abboud2020].

ComplExInteraction(*args, **kwargs)

The ComplEx interaction proposed by [trouillon2016].

ConvEInteraction([input_channels, ...])

A stateful module for the ConvE interaction function.

ConvKBInteraction([hidden_dropout_rate, ...])

A stateful module for the ConvKB interaction function.

CPInteraction(*args, **kwargs)

An implementation of the CP interaction as described in [lacroix2018] (originally from [hitchcock1927]).

CrossEInteraction([embedding_dim, ...])

A module wrapper for the CrossE interaction function.

DistMAInteraction(*args, **kwargs)

A module wrapper for the stateless DistMA interaction function.

DistMultInteraction(*args, **kwargs)

A module wrapper for the stateless DistMult interaction function.

ERMLPEInteraction([embedding_dim, ...])

A stateful module for the ER-MLP (E) interaction function.

ERMLPInteraction(embedding_dim[, hidden_dim])

A stateful module for the ER-MLP interaction.

HolEInteraction(*args, **kwargs)

A module wrapper for the stateless HolE interaction function.

KG2EInteraction([similarity, exact])

A stateful module for the KG2E interaction function.

LineaREInteraction(p[, power_norm])

The LineaRE interaction described by [peng2020].

MultiLinearTuckerInteraction([head_dim, ...])

An implementation of the original (multi-linear) TuckER interaction as described in [tucker1966].

MuREInteraction(p[, power_norm])

A stateful module for the MuRE interaction function from [balazevic2019b].

NTNInteraction([activation, activation_kwargs])

A stateful module for the NTN interaction function.

PairREInteraction(p[, power_norm])

A stateful module for the PairRE interaction function.

ProjEInteraction([embedding_dim, ...])

A stateful module for the ProjE interaction function.

QuatEInteraction()

A module wrapper for the QuatE interaction function.

RESCALInteraction(*args, **kwargs)

A module wrapper for the stateless RESCAL interaction function.

RotatEInteraction(*args, **kwargs)

The RotatE interaction function proposed by [sun2019].

SEInteraction(p[, power_norm])

A stateful module for the Structured Embedding (SE) interaction function.

SimplEInteraction([clamp_score])

A module wrapper for the SimplE interaction function.

TorusEInteraction([p, power_norm])

A stateful module for the TorusE interaction function.

TransDInteraction([p, power_norm])

A stateful module for the TransD interaction function.

TransEInteraction(p[, power_norm])

A stateful module for the TransE interaction function.

TransFInteraction(*args, **kwargs)

A stateless module for the TransF interaction function.

TransformerInteraction([input_dim, ...])

Transformer-based interaction, as described in [galkin2020].

TransHInteraction(p[, power_norm])

A stateful module for the TransH interaction function.

TransRInteraction(p[, power_norm])

A stateful module for the TransR interaction function.

TripleREInteraction([u, p, power_norm])

A stateful module for the TripleRE interaction function from [yu2021].

TuckerInteraction([embedding_dim, ...])

A stateful module for the TuckEr interaction function.

UMInteraction(p[, power_norm])

A stateful module for the UnstructuredModel interaction function.

Class Inheritance Diagram

(Inheritance diagram: Interaction extends torch.nn.Module, Generic, and ABC; FunctionalInteraction extends Interaction and is the base of most concrete interaction modules; NormBasedInteraction extends FunctionalInteraction and is the base of the norm-based interactions, e.g., BoxE, LineaRE, MuRE, PairRE, SE, TorusE, TransD, TransE, TransH, TransR, TripleRE, and UM; MonotonicAffineTransformationInteraction extends Interaction directly.)

Similarity

pykeen.nn.sim Module

Similarity functions.

Functions

expected_likelihood(h, r, t[, exact])

Compute the similarity based on expected likelihood.

kullback_leibler_similarity(h, r, t[, exact])

Compute the negative KL divergence.

Representation

Representation modules.
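
For instance, a trainable embedding for 14 entities with 32-dimensional vectors can be created and queried by index (a minimal sketch):

>>> import torch
>>> from pykeen.nn.representation import Embedding
>>> entity_representations = Embedding(max_id=14, shape=(32,))
>>> x = entity_representations(indices=torch.arange(3))  # representations of entities 0-2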

Classes

Representation(max_id[, shape, normalizer, ...])

A base class for obtaining representations for entities/relations.

Embedding([max_id, num_embeddings, ...])

Trainable embeddings.

LowRankRepresentation(*, max_id, shape[, ...])

Low-rank embedding factorization.

CompGCNLayer(input_dim[, output_dim, ...])

A single layer of the CompGCN model.

CombinedCompGCNRepresentations(*, ...[, ...])

A sequence of CompGCN layers.

PartitionRepresentation(assignment[, shape, ...])

A partition of the indices into different representation modules.

BackfillRepresentation(max_id, base_ids[, ...])

A variant of a partition representation that is easily applicable to a single base representation.

SingleCompGCNRepresentation(combined[, ...])

A wrapper around the combined representation module.

SubsetRepresentation(max_id[, base, ...])

A representation module, which only exposes a subset of representations of its base.

CombinedRepresentation(max_id[, shape, ...])

A combined representation.

TensorTrainRepresentation([assignment, ...])

A tensor factorization of representations.

TransformedRepresentation(transformation[, ...])

A (learnable) transformation upon base representations.

TextRepresentation(labels[, max_id, shape, ...])

Textual representations using a text encoder on labels.

CachedTextRepresentation(identifiers[, cache])

Textual representations for datasets with identifiers that can be looked up with a TextCache.

WikidataTextRepresentation(identifiers[, cache])

Textual representations for datasets grounded in Wikidata.

BiomedicalCURIERepresentation(identifiers[, ...])

Textual representations for datasets grounded with biomedical CURIEs.

Class Inheritance Diagram

(Inheritance diagram: Representation extends torch.nn.Module, ExtraReprMixin, and ABC; its subclasses include Embedding, LowRankRepresentation, PartitionRepresentation (with subclass BackfillRepresentation), CombinedRepresentation, SingleCompGCNRepresentation, SubsetRepresentation, TensorTrainRepresentation, TransformedRepresentation, and TextRepresentation, whose subclass CachedTextRepresentation is the base of WikidataTextRepresentation and BiomedicalCURIERepresentation; CombinedCompGCNRepresentations and CompGCNLayer extend torch.nn.Module directly.)

Initialization

Embedding weight initialization routines.

Functions

xavier_uniform_(tensor[, gain])

Initialize weights of the tensor similarly to Glorot/Xavier initialization.

xavier_normal_(tensor[, gain])

Initialize weights of the tensor similarly to Glorot/Xavier initialization.

init_phases(x)

Generate random phases between 0 and \(2\pi\).

Classes

PretrainedInitializer(tensor)

Initialize tensor with pretrained weights.

LabelBasedInitializer(labels[, encoder, ...])

An initializer using pretrained models from the transformers library to encode labels.

WeisfeilerLehmanInitializer(*[, ...])

An initializer based on an encoding of categorical colors from the Weisfeiler-Lehman algorithm.

RandomWalkPositionalEncodingInitializer(*[, ...])

Initialize nodes via random-walk positional encoding.
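
For illustration, precomputed weights can be wrapped with PretrainedInitializer and passed to a model through the pipeline. The following is a minimal sketch, where the random tensor merely stands in for real pretrained weights (Nations has 14 entities, and the shapes must match the embedding dimension):

import torch
from pykeen.nn.init import PretrainedInitializer
from pykeen.pipeline import pipeline

# stand-in for real pretrained weights: one row per entity
pretrained = torch.rand(14, 50)
result = pipeline(
    dataset="Nations",
    model="TransE",
    model_kwargs=dict(
        embedding_dim=50,
        entity_initializer=PretrainedInitializer(tensor=pretrained),
    ),
)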

Class Inheritance Diagram

[Class inheritance diagram: PretrainedInitializer is the base class of LabelBasedInitializer, RandomWalkPositionalEncodingInitializer, and WeisfeilerLehmanInitializer.]

Message Passing

Various decompositions for R-GCN.

Classes

RGCNRepresentation(triples_factory[, ...])

Entity representations enriched by R-GCN.

RGCNLayer(num_relations[, input_dim, ...])

An RGCN layer from [schlichtkrull2018] updated to match the official implementation.

Decomposition(num_relations[, input_dim, ...])

Base module for relation-specific message passing.

BasesDecomposition([num_bases])

Represent relation-weights as a linear combination of base transformation matrices.

BlockDecomposition([num_blocks])

Represent relation-specific weight matrices via block-diagonal matrices.

Class Inheritance Diagram

[Class inheritance diagram: Decomposition (extending torch.nn.Module, ExtraReprMixin, and abc.ABC) is the base class of BasesDecomposition and BlockDecomposition. RGCNLayer derives from torch.nn.Module, and RGCNRepresentation from Representation.]

PyG Message Passing

PyTorch Geometric based representation modules.

These modules enable entity representations that are enriched by their graph neighbors’ representations, similar to CompGCN or R-GCN. However, this module offers generic components to combine many of the numerous message passing layers from PyTorch Geometric with base representations. A summary of available message passing layers can be found at torch_geometric.nn.conv.

The three classes (listed below) differ in how they make use of the relation type information.

We can also easily utilize these representations with pykeen.models.ERModel. Here, we showcase how to combine static label-based entity features with a trainable GCN encoder for entity representations, with learned embeddings for relation representations and a DistMult interaction function.

from pykeen.datasets import get_dataset
from pykeen.models import ERModel
from pykeen.nn.init import LabelBasedInitializer
from pykeen.pipeline import pipeline

# create inverse triples, so the graph contains an edge in each direction
dataset = get_dataset(dataset="nations", dataset_kwargs=dict(create_inverse_triples=True))
# encode the entity labels once with a pretrained text encoder; used as static features
entity_initializer = LabelBasedInitializer.from_triples_factory(
    triples_factory=dataset.training,
    for_entities=True,
)
# infer the embedding dimension from the encoded labels
(embedding_dim,) = entity_initializer.tensor.shape[1:]
r = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="distmult",
        entity_representations="SimpleMessagePassing",
        entity_representations_kwargs=dict(
            triples_factory=dataset.training,
            base_kwargs=dict(
                shape=embedding_dim,
                initializer=entity_initializer,
                # keep the label-based features fixed; only the GCN layers are trained
                trainable=False,
            ),
            # two GCN message passing layers on top of the static base representation
            layers=["GCN"] * 2,
            layers_kwargs=dict(in_channels=embedding_dim, out_channels=embedding_dim),
        ),
        # relations use plain trainable embeddings of matching dimension
        relation_representations_kwargs=dict(
            shape=embedding_dim,
        ),
    ),
)

Classes

MessagePassingRepresentation(...[, ...])

An abstract representation class utilizing PyTorch Geometric message passing layers.

SimpleMessagePassingRepresentation(...[, ...])

A representation with message passing not making use of the relation type.

FeaturizedMessagePassingRepresentation(...)

A representation with message passing which uses edge features obtained from relation representations.

TypedMessagePassingRepresentation(...)

A representation with message passing which uses categorical relation type information.

Class Inheritance Diagram

[Class inheritance diagram: MessagePassingRepresentation (extending Representation and abc.ABC) is the base class of SimpleMessagePassingRepresentation and TypedMessagePassingRepresentation; the latter is extended by FeaturizedMessagePassingRepresentation.]

Weighting

Various edge weighting implementations for R-GCN.

Classes

EdgeWeighting(**kwargs)

Base class for edge weightings.

InverseInDegreeEdgeWeighting(**kwargs)

Normalize messages by inverse in-degree.

InverseOutDegreeEdgeWeighting(**kwargs)

Normalize messages by inverse out-degree.

SymmetricEdgeWeighting(**kwargs)

Normalize messages by product of inverse sqrt of in-degree and out-degree.

AttentionEdgeWeighting(message_dim[, ...])

Message weighting by attention.
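
To make the normalization concrete, the following is a conceptual sketch in plain PyTorch (not the PyKEEN API) of one plausible reading of SymmetricEdgeWeighting, i.e., w = 1 / sqrt(out_degree(source) * in_degree(target)):

import torch

edge_index = torch.tensor([[0, 0, 1, 2], [1, 2, 2, 0]])  # shape: (2, num_edges)
num_nodes = 3
ones = torch.ones(edge_index.shape[1])
out_degree = torch.zeros(num_nodes).scatter_add_(0, edge_index[0], ones)
in_degree = torch.zeros(num_nodes).scatter_add_(0, edge_index[1], ones)
# per-edge weight: inverse square root of the product of the endpoint degrees
weights = (out_degree[edge_index[0]] * in_degree[edge_index[1]]).rsqrt()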

Class Inheritance Diagram

[Class inheritance diagram: EdgeWeighting (extending torch.nn.Module) is the base class of AttentionEdgeWeighting, InverseInDegreeEdgeWeighting, InverseOutDegreeEdgeWeighting, and SymmetricEdgeWeighting.]

Combinations

Implementation of combinations for the pykeen.models.LiteralModel.

Classes

Combination(*args, **kwargs)

Base class for combinations.

ComplexSeparatedCombination([combination, ...])

A combination for mixed complex & real representations.

ConcatCombination([dim])

Combine representation by concatenation.

ConcatAggregationCombination([aggregation, dim])

Combine representation by concatenation followed by an aggregation along the same axis.

ConcatProjectionCombination(input_dims[, ...])

Combine representations by concatenation followed by a linear projection and activation.

GatedCombination([entity_dim, literal_dim, ...])

A module that implements a gated linear transformation for the combination of entities and literals.
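
For illustration, the following is a minimal sketch of concatenating entity representations with literal features, assuming that a combination is called with a sequence of tensors:

import torch
from pykeen.nn.combination import ConcatCombination

entity = torch.rand(5, 32)  # entity representations
literal = torch.rand(5, 8)  # numeric literal features
combination = ConcatCombination(dim=-1)
combined = combination([entity, literal])  # assumed call convention: a sequence of tensors
assert combined.shape == (5, 40)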

Class Inheritance Diagram

[Class inheritance diagram: Combination (extending torch.nn.Module, ExtraReprMixin, and abc.ABC) is the base class of ComplexSeparatedCombination, ConcatCombination (extended by ConcatAggregationCombination and ConcatProjectionCombination), and GatedCombination.]

Perceptron

Perceptron-like modules.

class ConcatMLP(input_dim, output_dim=None, dropout=0.1, ratio=2, flatten_dims=2)[source]

A 2-layer MLP with ReLU activation and dropout applied to the flattened token representations.

This offers a convenient way of choosing a configuration similar to the one used in the original paper. For more complex aggregation mechanisms, pass an arbitrary callable instead.

Initialize the module.

Parameters:
  • input_dim (int) – the input dimension

  • output_dim (Optional[int]) – the output dimension. defaults to input dim

  • dropout (float) – the dropout value on the hidden layer

  • ratio (Union[int, float]) – the ratio of the output dimension to the hidden layer size.

  • flatten_dims (int) – the number of trailing dimensions to flatten

forward(xs, dim)[source]

Forward the MLP on the given dimension.

Parameters:
  • xs (FloatTensor) – The tensor to forward

  • dim (int) – Only a parameter to match the signature of torch.mean / torch.sum; this class is not intended to be used from outside

Return type:

FloatTensor

Returns:

The tensor after applying this MLP
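
For illustration, the following is a minimal sketch, assuming three token representations of dimension 64 per entity, which are flattened to a 192-dimensional input before the MLP:

import torch
from pykeen.nn.perceptron import ConcatMLP

mlp = ConcatMLP(input_dim=3 * 64, output_dim=64)
xs = torch.rand(8, 3, 64)  # (batch, num_tokens, token_dim)
pooled = mlp(xs, dim=-2)  # dim only mirrors the signature of torch.mean / torch.sum
assert pooled.shape == (8, 64)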

Utilities

Utilities for neural network components.

class PyOBOCache(*args, **kwargs)[source]

A cache that looks up labels of biomedical entities based on their CURIEs.

Instantiate the PyOBO cache, ensuring PyOBO is installed.

get_texts(identifiers)[source]

Get text for the given CURIEs.

Parameters:

identifiers (Sequence[str]) – The compact URIs for each entity (e.g., ['doid:1234', ...])

Return type:

Sequence[Optional[str]]

Returns:

the label for each entity, looked up via pyobo.get_name(); might be None if no label is available.

exception ShapeError(shape, reference)[source]

An error for a mismatch in shapes.

Initialize the error.

Parameters:
Return type:

None

classmethod verify(shape, reference)[source]

Raise an exception if the shape does not match the reference.

This method normalizes the shapes first.

Parameters:
Raises:

ShapeError – if the two shapes do not match.

Return type:

Sequence[int]

Returns:

the normalized shape

class TextCache[source]

An interface for looking up text for various flavors of entity identifiers.

abstract get_texts(identifiers)[source]

Get text for the given identifiers for the cache.

Return type:

Sequence[Optional[str]]

Parameters:

identifiers (Sequence[str]) –

class WikidataCache[source]

A cache for requests against Wikidata’s SPARQL endpoint.

Initialize the cache.

WIKIDATA_ENDPOINT = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'

Wikidata SPARQL endpoint. See https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service#Interfacing

get_descriptions(wikidata_identifiers)[source]

Get entity descriptions for the given IDs.

Parameters:

wikidata_identifiers (Sequence[str]) – the Wikidata identifiers, each starting with Q (e.g., ['Q42'])

Return type:

Sequence[str]

Returns:

the description for each Wikidata entity

get_image_paths(ids, extensions=('jpeg', 'jpg', 'gif', 'png', 'svg', 'tif'), progress=False)[source]

Get paths to images for the given IDs.

Parameters:
  • ids (Sequence[str]) – the Wikidata IDs.

  • extensions (Collection[str]) – the allowed file extensions

  • progress (bool) – whether to display a progress bar

Return type:

Sequence[Optional[Path]]

Returns:

the paths to images for the given IDs.

get_labels(wikidata_identifiers)[source]

Get entity labels for the given IDs.

Parameters:

wikidata_identifiers (Sequence[str]) – the Wikidata identifiers, each starting with Q (e.g., ['Q42'])

Return type:

Sequence[str]

Returns:

the label for each Wikidata entity

get_texts(identifiers)[source]

Get a concatenation of the title and description for each Wikidata identifier.

Parameters:

identifiers (Sequence[str]) – the Wikidata identifiers, each starting with Q (e.g., ['Q42'])

Return type:

Sequence[str]

Returns:

the label and description for each Wikidata entity concatenated
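
For illustration, the following is a minimal usage sketch (it requires network access; the import path pykeen.nn.utils is assumed from this page):

from pykeen.nn.utils import WikidataCache

cache = WikidataCache()
# Q42 is Douglas Adams, Q937 is Albert Einstein
texts = cache.get_texts(["Q42", "Q937"])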

classmethod query(sparql, wikidata_ids, batch_size=256)[source]

Batched SPARQL query execution for the given IDs.

Parameters:
  • sparql (Union[str, Callable[…, str]]) – the SPARQL query with a placeholder ids

  • wikidata_ids (Sequence[str]) – the Wikidata IDs

  • batch_size (int) – the batch size, i.e., maximum number of IDs per query

Return type:

Iterable[Mapping[str, Any]]

Returns:

an iterable over JSON results, where the keys correspond to query variables, and the values to the corresponding binding

classmethod query_text(wikidata_ids, language='en', batch_size=256)[source]

Query the SPARQL endpoints about information for the given IDs.

Parameters:
  • wikidata_ids (Sequence[str]) – the Wikidata IDs

  • language (str) – the label language

  • batch_size (int) – the batch size; if more ids are provided, break the big request into multiple smaller ones

Return type:

Mapping[str, Mapping[str, str]]

Returns:

a mapping from Wikidata IDs to dictionaries with the label and description of the entities

static verify_ids(ids)[source]

Raise error if invalid IDs are encountered.

Parameters:

ids (Sequence[str]) – the ids to verify

Raises:

ValueError – if any invalid ID is encountered

adjacency_tensor_to_stacked_matrix(num_relations, num_entities, source, target, edge_type, edge_weights=None, horizontal=True)[source]

Stack adjacency matrices as described in [thanapalasingam2021].

This method re-arranges the (sparse) adjacency tensor of shape (num_entities, num_relations, num_entities) to a sparse adjacency matrix of shape (num_entities, num_relations * num_entities) (horizontal stacking) or (num_entities * num_relations, num_entities) (vertical stacking). Thereby, we can perform the relation-specific message passing of R-GCN by a single sparse matrix multiplication (and some additional pre- and/or post-processing) of the inputs.

Parameters:
  • num_relations (int) – the number of relations

  • num_entities (int) – the number of entities

  • source (LongTensor) – shape: (num_triples,) the source entity indices

  • target (LongTensor) – shape: (num_triples,) the target entity indices

  • edge_type (LongTensor) – shape: (num_triples,) the edge type, i.e., relation ID

  • edge_weights (Optional[FloatTensor]) – shape: (num_triples,) scalar edge weights

  • horizontal (bool) – whether to use horizontal or vertical stacking

Return type:

Tensor

Returns:

shape: (num_entities * num_relations, num_entities) or (num_entities, num_entities * num_relations) the stacked adjacency matrix
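
For illustration, the following is a minimal sketch on a toy graph (the import path pykeen.nn.utils is assumed from this page):

import torch
from pykeen.nn.utils import adjacency_tensor_to_stacked_matrix

# toy graph: 3 entities, 2 relations, 3 triples
source = torch.tensor([0, 1, 2])
target = torch.tensor([1, 2, 0])
edge_type = torch.tensor([0, 1, 0])
adj = adjacency_tensor_to_stacked_matrix(
    num_relations=2,
    num_entities=3,
    source=source,
    target=target,
    edge_type=edge_type,
    horizontal=True,
)
# horizontal stacking: shape (num_entities, num_relations * num_entities) = (3, 6)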

safe_diagonal(matrix)[source]

Extract diagonal from a potentially sparse matrix.

Note

This is a work-around as long as torch.diagonal() does not work for sparse tensors.

Parameters:

matrix (Tensor) – shape: (n, n) the matrix

Return type:

Tensor

Returns:

shape: (n,) the diagonal values.

use_horizontal_stacking(input_dim, output_dim)[source]

Determine a stacking direction based on the input and output dimension.

The vertical stacking approach is suitable for low-dimensional input and high-dimensional output, because the projection to low dimensions is done first, while the horizontal stacking approach is good for high-dimensional input and low-dimensional output, as the projection to high dimensions is done last.

Parameters:
  • input_dim (int) – the layer’s input dimension

  • output_dim (int) – the layer’s output dimension

Return type:

bool

Returns:

whether to use horizontal (True) or vertical stacking

NodePiece

pykeen.nn.node_piece Package

NodePiece modules.

A NodePieceRepresentation contains a collection of TokenizationRepresentation. A TokenizationRepresentation is defined as a Representation module that maps token indices to representations (also called the vocabulary, in resemblance to token representations known from NLP applications), together with an assignment from entities to (multiple) tokens.

In order to obtain the vocabulary and assignment, multiple options are available, which often follow a two-step approach of first selecting a vocabulary, and afterwards assigning the entities to the set of tokens, usually using the graph structure of the KG.

One way of tokenization is tokenization by the AnchorTokenizer, which selects some anchor entities from the graph as the vocabulary. The anchor selection process is controlled by an AnchorSelection instance. In order to obtain the assignment, some measure of graph distance is used. To this end, an AnchorSearcher instance calculates the closest anchor entities from the vocabulary for each of the entities in the graph.

Since some tokenizations are expensive to compute, we offer a mechanism to use precomputed tokenizations via PrecomputedPoolTokenizer. To enable loading from different formats, a loader subclassing from PrecomputedTokenizerLoader can be selected accordingly. To precompute anchor-based tokenizations, you can use the command

pykeen tokenize

Its usage is explained by passing the --help flag.
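
For illustration, a NodePiece-based model can be trained via the pipeline. The following is a minimal sketch; the tokenizer choice and token count are assumptions for demonstration:

from pykeen.pipeline import pipeline

result = pipeline(
    dataset="Nations",
    # create inverse triples, so entities can be tokenized by their (inverse) relational context
    dataset_kwargs=dict(create_inverse_triples=True),
    model="NodePiece",
    model_kwargs=dict(
        tokenizers="RelationTokenizer",
        num_tokens=12,
    ),
)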

Classes

AnchorSearcher()

A method for finding the closest anchors.

ScipySparseAnchorSearcher([max_iter])

Find closest anchors using scipy.sparse.

SparseBFSSearcher([max_iter, device])

Find closest anchors using torch_sparse on a GPU.

CSGraphAnchorSearcher()

Find closest anchors using scipy.sparse.csgraph.

PersonalizedPageRankAnchorSearcher([...])

Select closest anchors as the nodes with the largest personalized page rank.

AnchorSelection([num_anchors])

Anchor entity selection strategy.

SingleSelection([num_anchors])

Single-step selection.

DegreeAnchorSelection([num_anchors])

Select entities according to their (undirected) degree.

MixtureAnchorSelection(selections[, ratios, ...])

A weighted mixture of different anchor selection strategies.

PageRankAnchorSelection([num_anchors])

Select entities according to their page rank.

RandomAnchorSelection([num_anchors, random_seed])

Random node selection.

Tokenizer()

A base class for tokenizers for NodePiece representations.

RelationTokenizer()

Tokenize entities by representing them as a bag of relations.

AnchorTokenizer([selection, ...])

Tokenize entities by representing them as a bag of anchor entities.

MetisAnchorTokenizer([num_partitions, device])

An anchor tokenizer, which first partitions the graph using METIS.

PrecomputedPoolTokenizer(*[, path, url, ...])

A tokenizer using externally precomputed tokenization.

PrecomputedTokenizerLoader()

A loader for precomputed tokenization.

GalkinPrecomputedTokenizerLoader()

A loader for pickle files provided by Galkin et al.

TorchPrecomputedTokenizerLoader()

A loader via torch.load.

TokenizationRepresentation(assignment[, ...])

A module holding the result of tokenization.

NodePieceRepresentation(*, triples_factory)

Basic implementation of node piece decomposition [galkin2021].

HashDiversityInfo(...)

A ratio information object.

Class Inheritance Diagram

[Class inheritance diagram: AnchorSearcher and AnchorSelection (both extending ExtraReprMixin and abc.ABC) are the base classes of the searchers (CSGraphAnchorSearcher, PersonalizedPageRankAnchorSearcher, ScipySparseAnchorSearcher, SparseBFSSearcher) and selections (MixtureAnchorSelection, and SingleSelection, which is extended by DegreeAnchorSelection, PageRankAnchorSelection, and RandomAnchorSelection), respectively. Tokenizer is the base class of RelationTokenizer, PrecomputedPoolTokenizer, and AnchorTokenizer (extended by MetisAnchorTokenizer); PrecomputedTokenizerLoader (abc.ABC) is the base class of GalkinPrecomputedTokenizerLoader and TorchPrecomputedTokenizerLoader. TokenizationRepresentation derives from Representation, and NodePieceRepresentation from CombinedRepresentation.]

Utilities

Utilities for PyKEEN.

class Bias(dim)[source]

A module wrapper for adding a bias.

Initialize the module.

Parameters:

dim (int) – The dimension of the input; must be positive (>0).

forward(x)[source]

Add the learned bias to the input.

Parameters:

x (FloatTensor) – shape: (n, d) The input.

Return type:

FloatTensor

Returns:

x + b[None, :]

reset_parameters()[source]

Reset the layer’s parameters.
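
For illustration, a minimal usage sketch:

import torch
from pykeen.utils import Bias

bias = Bias(dim=4)
x = torch.rand(2, 4)
y = bias(x)  # y = x + b[None, :], with a learnable bias vector b of shape (4,)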

class ExtraReprMixin[source]

A mixin for modules with hierarchical extra_repr.

It takes up the torch.nn.Module.extra_repr() idea, and additionally provides a simple composable way to generate the components of extra_repr() via iter_extra_repr().

If combined with torch.nn.Module, make sure to put ExtraReprMixin behind torch.nn.Module to prefer the latter’s __repr__() implementation.

extra_repr()[source]

Generate the extra repr, cf. torch.nn.Module.extra_repr().

Return type:

str

Returns:

the extra part of the repr()

iter_extra_repr()[source]

Iterate over the components of the extra_repr().

This method is typically overridden. A common pattern would be

def iter_extra_repr(self) -> Iterable[str]:
    yield from super().iter_extra_repr()
    yield "<key1>=<value1>"
    yield "<key2>=<value2>"
Return type:

Iterable[str]

Returns:

an iterable over individual components of the extra_repr()
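
For illustration, the following is a minimal sketch with a hypothetical class, assuming that extra_repr() joins the components yielded by iter_extra_repr():

from typing import Iterable

from pykeen.utils import ExtraReprMixin

class Window(ExtraReprMixin):
    """A hypothetical example class."""

    def __init__(self, size: int):
        self.size = size

    def iter_extra_repr(self) -> Iterable[str]:
        yield from super().iter_extra_repr()
        yield f"size={self.size}"

print(Window(size=3).extra_repr())  # size=3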

class NoRandomSeedNecessary[source]

Used in pipeline when random seed is set automatically.

class Result[source]

A superclass of results that can be saved to a directory.

abstract save_to_directory(directory, **kwargs)[source]

Save the results to the directory.

Return type:

None

Parameters:

directory (str) –

abstract save_to_ftp(directory, ftp)[source]

Save the results to the directory in an FTP server.

Return type:

None

Parameters:
  • directory (str) –

  • ftp (FTP) –

abstract save_to_s3(directory, bucket, s3=None)[source]

Save all artifacts to the given directory in an S3 Bucket.

Parameters:
  • directory (str) – The directory in the S3 bucket

  • bucket (str) – The name of the S3 bucket

  • s3 – A client from boto3.client(), if already instantiated

Return type:

None

all_in_bounds(x, low=None, high=None, a_tol=0.0)[source]

Check if tensor values respect lower and upper bound.

Parameters:
Return type:

bool

Returns:

If all values are within the given bounds

at_least_eps(x)[source]

Make sure a tensor is greater than zero.

Return type:

FloatTensor

Parameters:

x (FloatTensor) –

broadcast_upgrade_to_sequences(*xs)[source]

Apply upgrade_to_sequence to each input, and afterwards repeat singletons to match the maximum length.

Parameters:

xs (Union[~X, Sequence[~X]]) – length: m the inputs.

Return type:

Sequence[Sequence[~X]]

Returns:

a sequence of length m, where each element is a sequence and all elements have the same length.

Raises:

ValueError – if there is a non-singleton sequence input with length different from the maximum sequence length.

>>> broadcast_upgrade_to_sequences(1)
((1,),)
>>> broadcast_upgrade_to_sequences(1, 2)
((1,), (2,))
>>> broadcast_upgrade_to_sequences(1, (2, 3))
((1, 1), (2, 3))
calculate_broadcasted_elementwise_result_shape(first, second)[source]

Determine the return shape of a broadcasted elementwise operation.

Return type:

Tuple[int, …]

Parameters:
check_shapes(*x, raise_on_errors=True)[source]

Verify that a sequence of tensors are of matching shapes.

Parameters:
  • x (Tuple[Union[Tensor, Tuple[int, …]], str]) – A tuple (t, s), where t is a tensor, or an actual shape of a tensor (a tuple of integers), and s is a string, where each character corresponds to a (named) dimension. If the shapes of different tensors share a character, the corresponding dimensions are expected to be of equal size.

  • raise_on_errors (bool) – Whether to raise an exception in case of a mismatch.

Return type:

bool

Returns:

Whether the shapes matched.

Raises:

ValueError – If the shapes mismatch and raise_on_error is True.

Examples:

>>> check_shapes(((10, 20), "bd"), ((10, 20, 20), "bdd"))
True
>>> check_shapes(((10, 20), "bd"), ((10, 30, 20), "bdd"), raise_on_errors=False)
False

clamp_norm(x, maxnorm, p='fro', dim=None)[source]

Ensure that a tensor’s norm does not exceed some threshold.

Parameters:
Return type:

Tensor

Returns:

A vector with \(\|x\| \leq \textrm{maxnorm}\).

combine_complex(x_re, x_im)[source]

Combine a complex tensor from real and imaginary part.

Return type:

FloatTensor

Parameters:
  • x_re (FloatTensor) –

  • x_im (FloatTensor) –

compact_mapping(mapping)[source]

Update a mapping (key -> id) such that the IDs range from 0 to len(mappings) - 1.

Parameters:

mapping (Mapping[~X, int]) – The mapping to compact.

Return type:

Tuple[Mapping[~X, int], Mapping[int, int]]

Returns:

A pair (translated, translation) where translated is the updated mapping, and translation a dictionary from old to new ids.
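
For illustration, a sketch of the expected behavior (the exact order of the new IDs is an assumption):

>>> compact_mapping({"a": 3, "b": 7})
({'a': 0, 'b': 1}, {3: 0, 7: 1})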

complex_normalize(x)[source]

Normalize a vector of complex numbers such that each element is of unit-length.

Let \(x \in \mathbb{C}^d\) denote a complex vector. Then, the operation computes

\[x_i' = \frac{x_i}{|x_i|}\]

where \(|x_i| = \sqrt{Re(x_i)^2 + Im(x_i)^2}\) is the modulus of the complex number \(x_i\).

Parameters:

x (Tensor) – A tensor formulating complex numbers

Return type:

Tensor

Returns:

An elementwise normalized vector.
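
For illustration, a conceptual sketch of the formula in plain PyTorch (not the PyKEEN API):

import torch

x = torch.randn(4, dtype=torch.cfloat)
x_unit = x / x.abs().clamp_min(1e-12)  # divide each element by its modulus
assert torch.allclose(x_unit.abs(), torch.ones(4))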

class compose(*operations, name)[source]

A class representing the composition of several functions.

Initialize the composition with a sequence of operations.

Parameters:
  • operations (Callable[[~X], ~X]) – unary operations that will be applied in succession

  • name (str) – The name of the composed function.

convert_to_canonical_shape(x, dim, num=None, batch_size=1, suffix_shape=-1)[source]

Convert a tensor to canonical shape.

Parameters:
  • x (FloatTensor) – The tensor in compatible shape.

  • dim (Union[int, str]) – The “num” dimension.

  • batch_size (int) – The batch size.

  • num (Optional[int]) – The number.

  • suffix_shape (Union[int, Sequence[int]]) – The suffix shape.

Return type:

FloatTensor

Returns:

shape: (batch_size, num_heads, num_relations, num_tails, *) A tensor in canonical shape.

create_relation_to_entity_set_mapping(triples)[source]

Create mappings from relation IDs to the set of their head / tail entities.

Parameters:

triples (Iterable[Tuple[int, int, int]]) – The triples.

Return type:

Tuple[Mapping[int, Set[int]], Mapping[int, Set[int]]]

Returns:

A pair of dictionaries, each mapping relation IDs to entity ID sets.

einsum(*args)[source]

Sums the product of the elements of the input operands along dimensions specified using a notation based on the Einstein summation convention.

Einsum allows computing many common multi-dimensional linear algebraic array operations by representing them in a short-hand format based on the Einstein summation convention, given by equation. The details of this format are described below, but the general idea is to label every dimension of the input operands with some subscript and define which subscripts are part of the output. The output is then computed by summing the product of the elements of the operands along the dimensions whose subscripts are not part of the output. For example, matrix multiplication can be computed using einsum as torch.einsum(“ij,jk->ik”, A, B). Here, j is the summation subscript and i and k the output subscripts (see section below for more details on why).

Equation:

The equation string specifies the subscripts (letters in [a-zA-Z]) for each dimension of the input operands in the same order as the dimensions, separating subscripts for each operand by a comma (‘,’), e.g. ‘ij,jk’ specify subscripts for two 2D operands. The dimensions labeled with the same subscript must be broadcastable, that is, their size must either match or be 1. The exception is if a subscript is repeated for the same input operand, in which case the dimensions labeled with this subscript for this operand must match in size and the operand will be replaced by its diagonal along these dimensions. The subscripts that appear exactly once in the equation will be part of the output, sorted in increasing alphabetical order. The output is computed by multiplying the input operands element-wise, with their dimensions aligned based on the subscripts, and then summing out the dimensions whose subscripts are not part of the output.

Optionally, the output subscripts can be explicitly defined by adding an arrow (‘->’) at the end of the equation followed by the subscripts for the output. For instance, the following equation computes the transpose of a matrix multiplication: ‘ij,jk->ki’. The output subscripts must appear at least once for some input operand and at most once for the output.

Ellipsis ('...') can be used in place of subscripts to broadcast the dimensions covered by the ellipsis. Each input operand may contain at most one ellipsis which will cover the dimensions not covered by subscripts, e.g. for an input operand with 5 dimensions, the ellipsis in the equation 'ab...c' covers the third and fourth dimensions. The ellipsis does not need to cover the same number of dimensions across the operands but the 'shape' of the ellipsis (the size of the dimensions covered by it) must broadcast together. If the output is not explicitly defined with the arrow ('->') notation, the ellipsis will come first in the output (left-most dimensions), before the subscript labels that appear exactly once for the input operands, e.g. the following equation implements batch matrix multiplication '...ij,...jk'.

A few final notes: the equation may contain whitespace between the different elements (subscripts, ellipsis, arrow, and comma), but something like '. . .' (an ellipsis written with spaces) is not valid. An empty string '' is valid for scalar operands.

Note

torch.einsum handles ellipsis (’…’) differently from NumPy in that it allows dimensions covered by the ellipsis to be summed over, that is, ellipsis are not required to be part of the output.

Note

This function uses opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/) to speed up computation or to consume less memory by optimizing contraction order. This optimization occurs when there are at least three inputs, since the order does not matter otherwise. Note that finding _the_ optimal path is an NP-hard problem, thus, opt_einsum relies on different heuristics to achieve near-optimal results. If opt_einsum is not available, the default order is to contract from left to right.

To bypass this default behavior, add the following line to disable the usage of opt_einsum and skip path calculation: torch.backends.opt_einsum.enabled = False

To specify which strategy you’d like for opt_einsum to compute the contraction path, add the following line: torch.backends.opt_einsum.strategy = ‘auto’. The default strategy is ‘auto’, and we also support ‘greedy’ and ‘optimal’. Disclaimer that the runtime of ‘optimal’ is factorial in the number of inputs! See more details in the opt_einsum documentation (https://optimized-einsum.readthedocs.io/en/stable/path_finding.html).

Note

As of PyTorch 1.10, torch.einsum() also supports the sublist format (see examples below). In this format, subscripts for each operand are specified by sublists, lists of integers in the range [0, 52). These sublists follow their operands, and an extra sublist can appear at the end of the input to specify the output’s subscripts, e.g. torch.einsum(op1, sublist1, op2, sublist2, …, [sublist_out]). Python’s Ellipsis object may be provided in a sublist to enable broadcasting as described in the Equation section above.

Args:

equation (str): The subscripts for the Einstein summation. operands (List[Tensor]): The tensors to compute the Einstein summation of.

Examples:

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> # trace
>>> torch.einsum('ii', torch.randn(4, 4))
tensor(-1.2104)

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> # diagonal
>>> torch.einsum('ii->i', torch.randn(4, 4))
tensor([-0.1034,  0.7952, -0.2433,  0.4545])

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> # outer product
>>> x = torch.randn(5)
>>> y = torch.randn(4)
>>> torch.einsum('i,j->ij', x, y)
tensor([[ 0.1156, -0.2897, -0.3918,  0.4963],
        [-0.3744,  0.9381,  1.2685, -1.6070],
        [ 0.7208, -1.8058, -2.4419,  3.0936],
        [ 0.1713, -0.4291, -0.5802,  0.7350],
        [ 0.5704, -1.4290, -1.9323,  2.4480]])

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> # batch matrix multiplication
>>> As = torch.randn(3, 2, 5)
>>> Bs = torch.randn(3, 5, 4)
>>> torch.einsum('bij,bjk->bik', As, Bs)
tensor([[[-1.0564, -1.5904,  3.2023,  3.1271],
        [-1.6706, -0.8097, -0.8025, -2.1183]],

        [[ 4.2239,  0.3107, -0.5756, -0.2354],
        [-1.4558, -0.3460,  1.5087, -0.8530]],

        [[ 2.8153,  1.8787, -4.3839, -1.2112],
        [ 0.3728, -2.1131,  0.0921,  0.8305]]])

>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> # with sublist format and ellipsis
>>> torch.einsum(As, [..., 0, 1], Bs, [..., 1, 2], [..., 0, 2])
tensor([[[-1.0564, -1.5904,  3.2023,  3.1271],
        [-1.6706, -0.8097, -0.8025, -2.1183]],

        [[ 4.2239,  0.3107, -0.5756, -0.2354],
        [-1.4558, -0.3460,  1.5087, -0.8530]],

        [[ 2.8153,  1.8787, -4.3839, -1.2112],
        [ 0.3728, -2.1131,  0.0921,  0.8305]]])

>>> # batch permute
>>> A = torch.randn(2, 3, 4, 5)
>>> torch.einsum('...ij->...ji', A).shape
torch.Size([2, 3, 5, 4])

>>> # equivalent to torch.nn.functional.bilinear
>>> A = torch.randn(3, 5, 4)
>>> l = torch.randn(2, 5)
>>> r = torch.randn(2, 4)
>>> torch.einsum('bn,anm,bm->ba', l, A, r)
tensor([[-0.3430, -5.2405,  0.4494],
        [ 0.3311,  5.5201, -3.0356]])
Return type:

Tensor

Parameters:

args (Any) –

ensure_complex(*xs)[source]

Ensure that all tensors are of complex dtype.

Reshape and convert if necessary.

Parameters:

xs (Tensor) – the tensors

Yields:

complex tensors.

Return type:

Iterable[Tensor]

ensure_ftp_directory(*, ftp, directory)[source]

Ensure the directory exists on the FTP server.

Return type:

None

Parameters:
  • ftp (FTP) –

  • directory (str) –

ensure_torch_random_state(random_state)[source]

Prepare a random state for PyTorch.

Return type:

Generator

Parameters:

random_state (None | int | Generator) –

ensure_tuple(*x)[source]

Ensure that all elements in the sequence are upgraded to sequences.

Parameters:

x (Union[~X, Sequence[~X]]) – A sequence of sequences or literals

Return type:

Sequence[Sequence[~X]]

Returns:

An upgraded sequence of sequences

>>> ensure_tuple(1, (1,), (1, 2))
((1,), (1,), (1, 2))
estimate_cost_of_sequence(shape, *other_shapes)[source]

Cost of a sequence of broadcasted element-wise operations of tensors, given their shapes.

Return type:

int

Parameters:
extend_batch(batch, max_id, dim, ids=None)[source]

Extend batch for 1-to-all scoring by explicit enumeration.

Parameters:
  • batch (LongTensor) – shape: (batch_size, 2) The batch.

  • max_id (int) – The maximum ID to enumerate (exclusive), i.e., IDs 0, …, max_id - 1 are inserted.

  • ids (Optional[LongTensor]) – shape: (num_ids,) | (batch_size, num_ids) explicit IDs

  • dim (int) – in {0,1,2} The column along which to insert the enumerated IDs.

Return type:

LongTensor

Returns:

shape: (batch_size * num_choices, 3) A large batch, where every pair from the original batch is combined with every ID.
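
For example, under the documented shapes, a (head, relation) batch can be expanded to score against all tails by enumerating every candidate ID into the tail column (the tensor values below are illustrative):

>>> import torch
>>> from pykeen.utils import extend_batch
>>> hr_batch = torch.as_tensor([[0, 1], [2, 3]])  # shape: (batch_size=2, 2)
>>> extended = extend_batch(batch=hr_batch, max_id=4, dim=2)
>>> extended.shape  # (batch_size * num_choices, 3)
torch.Size([8, 3])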

fix_dataclass_init_docs(cls)[source]

Fix the __init__ documentation for a dataclasses.dataclass.

Parameters:

cls (Type) – The class whose docstring needs fixing

Return type:

Type

Returns:

The class that was passed so this function can be used as a decorator

flatten_dictionary(dictionary, prefix=None, sep='.')[source]

Flatten a nested dictionary.

Return type:

Dict[str, Any]

Parameters:
  • dictionary (Mapping[str, Any]) –

  • prefix (str | None) –

  • sep (str) –
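
For instance, assuming the standard dot-separated flattening implied by the defaults (a sketch of the documented semantics):

>>> from pykeen.utils import flatten_dictionary
>>> flatten_dictionary({"a": {"b": 1, "c": {"d": 2}}})
{'a.b': 1, 'a.c.d': 2}
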
format_relative_comparison(part, total)[source]

Format a relative comparison.

Return type:

str

Parameters:
  • part (int) –

  • total (int) –

get_batchnorm_modules(module)[source]

Return all submodules which are batch normalization layers.

Return type:

List[Module]

Parameters:

module (Module) –

get_benchmark(name)[source]

Get the benchmark directory for this version.

Return type:

Path

Parameters:

name (str) –

get_connected_components(pairs)[source]

Calculate the connected components for a graph given as edge list.

The implementation uses a union-find data structure with path compression.

Parameters:

pairs (Iterable[Tuple[~X, ~X]]) – the edge list, i.e., pairs of node ids.

Return type:

Collection[Collection[~X]]

Returns:

a collection of connected components, i.e., a collection of disjoint collections of node ids.
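
The union-find idea can be sketched in a few lines; this is an illustration of the technique, not PyKEEN’s exact implementation:

def connected_components_sketch(pairs):
    """Group nodes into connected components via union-find with path compression."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression (halving): skip a level
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union: attach one root to the other

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())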

get_devices(module)[source]

Return the device(s) of each component of the model.

Return type:

Collection[device]

Parameters:

module (Module) –

get_df_io(df)[source]

Get the dataframe as bytes.

Return type:

BytesIO

Parameters:

df (DataFrame) –

get_dropout_modules(module)[source]

Return all submodules which are dropout layers.

Return type:

List[Module]

Parameters:

module (Module) –

get_edge_index(*, triples_factory=None, mapped_triples=None, edge_index=None)[source]

Get the edge index from a number of different sources.

Parameters:
  • triples_factory (Optional[Any]) – the triples factory

  • mapped_triples (Optional[LongTensor]) – shape: (m, 3) ID-based triples

  • edge_index (Optional[LongTensor]) – shape: (2, m) the edge index

Raises:

ValueError – if none of the sources is provided, i.e., all of them are None

Return type:

LongTensor

Returns:

shape: (2, m) the edge index

get_expected_norm(p, d)[source]

Compute the expected value of the L_p norm.

\[E[\|x\|_p] = d^{1/p} E[|x_1|^p]^{1/p}\]

under the assumption that \(x_i \sim N(0, 1)\), i.e.

\[E[|x_1|^p] = 2^{p/2} \cdot \Gamma\left(\frac{p+1}{2}\right) \cdot \pi^{-1/2}\]
Parameters:
  • p (Union[int, float, str]) – The parameter p of the norm.

  • d (int) – The dimension of the vector.

Return type:

float

Returns:

The expected value.

Raises:
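
As a quick numerical sanity check of the closed form above (a sketch; the formula approximates the mean of the norm tightly for moderately large d):

>>> import math, torch
>>> p, d = 3.0, 64
>>> analytic = d ** (1 / p) * (2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)) ** (1 / p)
>>> empirical = torch.randn(100_000, d).norm(p=p, dim=-1).mean().item()
>>> # analytic and empirical agree to within roughly one percent
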
get_json_bytes_io(obj)[source]

Get the JSON as bytes.

Return type:

BytesIO

get_model_io(model)[source]

Get the model as bytes.

Return type:

BytesIO

get_optimal_sequence(*shapes)[source]

Find the optimal sequence in which to combine tensors elementwise based on the shapes.

Parameters:

shapes (Tuple[int, …]) – The shapes of the tensors to combine.

Return type:

Tuple[int, Tuple[int, …]]

Returns:

The optimal execution order (as indices), and the cost.

get_preferred_device(module, allow_ambiguity=True)[source]

Return the preferred device.

Return type:

device

Parameters:
  • module (Module) –

  • allow_ambiguity (bool) –

get_until_first_blank(s)[source]

Recapitulate all lines in the string until the first blank line.

Return type:

str

Parameters:

s (str) –

invert_mapping(mapping)[source]

Invert a mapping.

Parameters:

mapping (Mapping[~K, ~V]) – The mapping, key -> value.

Return type:

Mapping[~V, ~K]

Returns:

The inverse mapping, value -> key.

Raises:

ValueError – if the mapping is not bijective
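
For example, per the documented contract (a non-bijective input raises a ValueError):

>>> from pykeen.utils import invert_mapping
>>> invert_mapping({"a": 0, "b": 1})
{0: 'a', 1: 'b'}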

is_cuda_oom_error(runtime_error)[source]

Check whether the caught RuntimeError was due to CUDA being out of memory.

Return type:

bool

Parameters:

runtime_error (RuntimeError) –

is_cudnn_error(runtime_error)[source]

Check whether the caught RuntimeError was due to a CUDNN error.

Return type:

bool

Parameters:

runtime_error (RuntimeError) –

is_triple_tensor_subset(a, b)[source]

Check whether one tensor of triples is a subset of another one.

Return type:

bool

Parameters:
  • a (LongTensor) –

  • b (LongTensor) –

isin_many_dim(elements, test_elements, dim=0)[source]

Return whether elements are contained in test elements.

Return type:

BoolTensor

Parameters:
  • elements (Tensor) –

  • test_elements (Tensor) –

  • dim (int) –

logcumsumexp(a)[source]

Compute log(cumsum(exp(a))).

Parameters:

a (ndarray) – shape: s the array

Return type:

ndarray

Returns:

shape s the log-cumsum-exp of the array

See also

scipy.special.logsumexp() and torch.logcumsumexp()
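
The standard numerically stable approach factors out the maximum before exponentiating; a minimal NumPy sketch of the same semantics (not necessarily PyKEEN’s exact implementation):

import numpy as np

def logcumsumexp_sketch(a: np.ndarray) -> np.ndarray:
    """Compute log(cumsum(exp(a))) without overflow by shifting by the maximum."""
    m = np.max(a)
    return m + np.log(np.cumsum(np.exp(a - m)))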

lp_norm(x, p, dim, normalize)[source]

Return the \(L_p\) norm.

Return type:

FloatTensor

Parameters:
  • x (FloatTensor) –

  • p (float) –

  • dim (int | None) –

  • normalize (bool) –

negative_norm(x, p=2, power_norm=False)[source]

Evaluate negative norm of a vector.

Parameters:
  • x (FloatTensor) –

  • p (float) –

  • power_norm (bool) –

Return type:

FloatTensor

Returns:

shape: (batch_size, num_heads, num_relations, num_tails) The scores.

negative_norm_of_sum(*x, p=2, power_norm=False)[source]

Evaluate negative norm of a sum of vectors on already broadcasted representations.

Parameters:
  • x (FloatTensor) –

  • p (float) –

  • power_norm (bool) –

Return type:

FloatTensor

Returns:

shape: (batch_size, num_heads, num_relations, num_tails) The scores.
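
For example, a TransE-style score \(-\|h + r - t\|\) can be evaluated on broadcast representations; the shapes below are illustrative:

>>> import torch
>>> from pykeen.utils import negative_norm_of_sum
>>> h = torch.randn(2, 3, 1, 1, 8)  # (batch, num_heads, 1, 1, dim)
>>> r = torch.randn(2, 1, 5, 1, 8)  # (batch, 1, num_relations, 1, dim)
>>> t = torch.randn(2, 1, 1, 7, 8)  # (batch, 1, 1, num_tails, dim)
>>> negative_norm_of_sum(h, r, -t, p=2).shape
torch.Size([2, 3, 5, 7])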

nested_get(d, *key, default=None)[source]

Get from a nested dictionary.

Parameters:
  • d (Mapping[str, Any]) – the (nested) dictionary

  • key (str) – a sequence of keys

  • default – the default value

Return type:

Any

Returns:

the value or default
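
For example:

>>> nested_get({"a": {"b": 1}}, "a", "b")
1
>>> nested_get({"a": {"b": 1}}, "a", "c", default=0)
0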

normalize_path(path, *other, mkdir=False, is_file=False, default=None)[source]

Normalize a path.

Parameters:
  • path (Union[str, Path, TextIO, None]) – the path in either of the valid forms.

  • other (Union[str, Path]) – additional parts to join to the path

  • mkdir (bool) – whether to ensure that the path refers to an existing directory by creating it if necessary

  • is_file (bool) – whether the path is intended to be a file - only relevant for creating directories

  • default (Union[str, Path, TextIO, None]) – the default to use if path is None

Raises:

Return type:

Path

Returns:

the absolute and resolved path
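
For instance, a sketch following the documented behavior, where the returned path is absolute and the directory is created if missing:

>>> path = normalize_path("results", "experiment", mkdir=True)
>>> path.is_absolute()
True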

normalize_string(s, *, suffix=None)[source]

Normalize a string for lookup.

Return type:

str

Parameters:
  • s (str) –

  • suffix (str | None) –

powersum_norm(x, p, dim, normalize)[source]

Return the power sum norm.

Return type:

FloatTensor

Parameters:
  • x (FloatTensor) –

  • p (float) –

  • dim (int | None) –

  • normalize (bool) –

prepare_filter_triples(mapped_triples, additional_filter_triples=None, warn=True)[source]

Prepare the filter triples from the evaluation triples and additional filter triples.

Return type:

LongTensor

Parameters:
  • mapped_triples (LongTensor) –

  • additional_filter_triples (None | LongTensor | List[LongTensor]) –

  • warn (bool) –

project_entity(e, e_p, r_p)[source]

Project an entity into the relation-specific subspace.

\[e_{\bot} = M_{re} e = (r_p e_p^T + I^{d_r \times d_e}) e = r_p e_p^T e + I^{d_r \times d_e} e = r_p (e_p^T e) + e'\]

and additionally enforces

\[\|e_{\bot}\|_2 \leq 1\]
Parameters:
  • e (FloatTensor) – shape: (…, d_e) The entity embedding.

  • e_p (FloatTensor) – shape: (…, d_e) The entity projection.

  • r_p (FloatTensor) – shape: (…, d_r) The relation projection.

Return type:

FloatTensor

Returns:

shape: (…, d_r)
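
Read directly off the formula, the projection can be sketched as follows (illustrative, not PyKEEN’s exact code; the identity term amounts to truncating or zero-padding e to d_r, and the constraint only scales down vectors whose norm exceeds one):

import torch

def project_entity_sketch(e, e_p, r_p):
    """Sketch of the relation-specific projection defined above."""
    # r_p (e_p^T e): inner product along the embedding dimension, scaled onto r_p
    x = r_p * (e_p * e).sum(dim=-1, keepdim=True)
    # I^{d_r x d_e} e: truncate or zero-pad e to the relation dimension d_r
    d_r, d_e = r_p.shape[-1], e.shape[-1]
    x = x + (e[..., :d_r] if d_e >= d_r else torch.nn.functional.pad(e, (0, d_r - d_e)))
    # enforce ||e_bot||_2 <= 1 by scaling down vectors with norm > 1
    return x / x.norm(p=2, dim=-1, keepdim=True).clamp_min(1.0)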

random_non_negative_int()[source]

Generate a random non-negative integer.

Return type:

int

rate_limited(xs, min_avg_time=1.0)[source]

Iterate over an iterable with a rate limit.

Parameters:
  • xs (Iterable[~X]) – the iterable

  • min_avg_time (float) – the minimum average time per element

Yields:

elements of the iterable

Return type:

Iterable[~X]

resolve_device(device=None)[source]

Resolve a torch.device given a desired device (string).

Return type:

device

Parameters:

device (str | device | None) –

set_random_seed(seed)[source]

Set the random seed on NumPy, PyTorch, and Python.

Parameters:

seed (int) – The seed that will be used in np.random.seed(), torch.manual_seed(), and random.seed().

Return type:

Tuple[None, Generator, None]

Returns:

A three tuple with None, the torch generator, and None.
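
For example:

>>> from pykeen.utils import set_random_seed
>>> _, generator, _ = set_random_seed(42)  # keep the torch.Generator for later use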

split_complex(x)[source]

Split a complex tensor into real and imaginary parts.

Return type:

Tuple[FloatTensor, FloatTensor]

Parameters:

x (FloatTensor) –

tensor_product(*tensors)[source]

Compute element-wise product of tensors in broadcastable shape.

Return type:

FloatTensor

Parameters:

tensors (FloatTensor) –

tensor_sum(*tensors)[source]

Compute element-wise sum of tensors in broadcastable shape.

Return type:

FloatTensor

Parameters:

tensors (FloatTensor) –

triple_tensor_to_set(tensor)[source]

Convert a tensor of triples to a set of int-tuples.

Return type:

Set[Tuple[int, …]]

Parameters:

tensor (LongTensor) –

unpack_singletons(*xs)[source]

Unpack sequences of length one.

Parameters:

xs (Tuple[~X]) – A sequence of tuples of length 1 or more

Return type:

Sequence[Union[~X, Tuple[~X]]]

Returns:

An unpacked sequence of sequences

>>> unpack_singletons((1,), (1, 2), (1, 2, 3))
(1, (1, 2), (1, 2, 3))
upgrade_to_sequence(x)[source]

Ensure that the input is a sequence.

Note

While a string is technically also a sequence, i.e.,

isinstance("test", typing.Sequence) is True

this may lead to unexpected behaviour when calling upgrade_to_sequence("test"). We thus handle strings as non-sequences. To recover the other behaviour, the following may be used:

upgrade_to_sequence(tuple("test"))
Parameters:

x (Union[~X, Sequence[~X]]) – A literal or sequence of literals

Return type:

Sequence[~X]

Returns:

If a literal was given, a one-element tuple containing it; otherwise, the given value.

>>> upgrade_to_sequence(1)
(1,)
>>> upgrade_to_sequence((1, 2, 3))
(1, 2, 3)
>>> upgrade_to_sequence("test")
('test',)
>>> upgrade_to_sequence(tuple("test"))
('t', 'e', 's', 't')
view_complex(x)[source]

Convert a PyKEEN complex tensor representation into a torch one.

Return type:

Tensor

Parameters:

x (FloatTensor) –
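
A short sketch of both complex-tensor helpers, assuming the PyKEEN convention that a complex embedding of dimension d is stored as a real tensor whose last dimension concatenates the d real and the d imaginary components (an assumption for this sketch):

>>> import torch
>>> x = torch.randn(2, 8)          # 4 real + 4 imaginary components per row
>>> real, imag = split_complex(x)  # each of shape (2, 4)
>>> view_complex(x).shape
torch.Size([2, 4])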

env(file=None)[source]

Print the env or output as HTML if in Jupyter.

Parameters:

file – The file to print to if not in a Jupyter setting. Defaults to sys.stdout

Returns:

An IPython.display.HTML instance if in a Jupyter notebook setting, otherwise None.

Version information for PyKEEN.

get_git_branch()[source]

Get the PyKEEN branch, if installed from git in editable mode.

Return type:

Optional[str]

Returns:

Returns the name of the current branch, or None if not installed in development mode.

get_git_hash(terse=True)[source]

Get the PyKEEN git hash.

Parameters:

terse (bool) – Should the hash be clipped to 8 characters?

Return type:

str

Returns:

The git hash, or 'UNHASHED' if a CalledProcessError was encountered, signifying that the code is not installed in development mode.

get_version(with_git_hash=False)[source]

Get the PyKEEN version string, optionally including the git hash.

Parameters:

with_git_hash (bool) – If set to True, the git hash will be appended to the version.

Return type:

str

Returns:

The PyKEEN version, with the git hash appended if the parameter with_git_hash was set to True.

Analysis

Dataset Degree Distributions

This plot shows various statistics (mean, variance, skewness, and kurtosis) of the degree distributions of all contained datasets. Since the triples contained in knowledge graphs are directed, the degree is further distinguished into out-degree (head) and in-degree (tail). Note that the plots are in log-log space, i.e., both axes are logarithmically scaled.

In particular, we observe that the higher-order moments skewness and kurtosis increase with increasing graph size and are generally larger than one. This hints at a fat-tailed degree distribution, i.e., the existence of hub entities with a large degree, while the majority of entities has only a few neighbors.
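
Such statistics can be reproduced for a single dataset along the following lines (a sketch using the Nations dataset; scipy is assumed to be available):

>>> import scipy.stats
>>> import torch
>>> from pykeen.datasets import Nations
>>> triples = Nations().training.mapped_triples
>>> out_degree = torch.bincount(triples[:, 0]).float()  # head (out-)degree per entity
>>> skew = scipy.stats.skew(out_degree.numpy())
>>> kurt = scipy.stats.kurtosis(out_degree.numpy())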


Indices and Tables