First Steps
The easiest way to train and evaluate a model is with the pykeen.pipeline.pipeline() function.
It provides a high-level entry point into the extensible functionality of this package. The following example shows how to train and evaluate the TransE model on the Nations dataset.
>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... )
The results are returned in a pykeen.pipeline.PipelineResult instance, which has
attributes for the trained model, the training loop, and the evaluation.
In this example, the model was given as a string. A list of available models can be found in
pykeen.models. Alternatively, the class corresponding to the implementation of the model
could be used as in:
>>> from pykeen.pipeline import pipeline
>>> from pykeen.models import TransE
>>> result = pipeline(
... dataset='Nations',
... model=TransE,
... )
In this example, the data set was given as a string. A list of available data sets can be found in
pykeen.datasets. Alternatively, an instance of pykeen.datasets.DataSet could be used as in:
>>> from pykeen.pipeline import pipeline
>>> from pykeen.models import TransE
>>> from pykeen.datasets import nations
>>> result = pipeline(
... dataset=nations,
... model=TransE,
... )
In each of the previous three examples, the training approach, optimizer, and evaluation scheme were omitted. By default, the stochastic local closed world assumption (sLCWA) training approach is used in training. This can be explicitly given as a string:
>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... training_loop='sLCWA',
... )
Alternatively, the local closed world assumption (LCWA) training approach can be given with 'LCWA'.
No additional configuration is necessary, but it’s worth reading up on the differences between these training
approaches.
>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... training_loop='LCWA',
... )
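As a rough illustration of the difference, here is a toy sketch in plain Python. It is not PyKEEN's implementation; the function name `lcwa_targets` and the tiny triple list are made up for this example. Under the LCWA, every tail that was not observed with a given head and relation is treated as a negative, so each (head, relation) pair gets a dense label vector over all entities. The sLCWA instead samples only a few such negatives per positive triple.

```python
from collections import defaultdict

# Toy training triples as (head, relation, tail) integer IDs (made-up data).
triples = [(0, 0, 1), (0, 0, 2), (1, 0, 2)]
num_entities = 4


def lcwa_targets(triples, num_entities):
    """Build one multi-hot label vector per (head, relation) pair.

    Observed tails are labelled 1.0; all other entities are treated as negatives (0.0).
    """
    tails = defaultdict(set)
    for head, relation, tail in triples:
        tails[head, relation].add(tail)
    return {
        pair: [1.0 if entity in observed else 0.0 for entity in range(num_entities)]
        for pair, observed in tails.items()
    }


targets = lcwa_targets(triples, num_entities)
print(targets[0, 0])  # [0.0, 1.0, 1.0, 0.0]
```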
One of these differences is that the sLCWA relies on negative sampling. The type of negative sampling can be given as in:
>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... training_loop='sLCWA',
... negative_sampler='basic',
... )
In this example, the negative sampler was given as a string. A list of available negative samplers
can be found in pykeen.sampling. Alternatively, the class corresponding to the implementation
of the negative sampler could be used as in:
>>> from pykeen.pipeline import pipeline
>>> from pykeen.sampling import BasicNegativeSampler
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... training_loop='sLCWA',
... negative_sampler=BasicNegativeSampler,
... )
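To make the idea of negative sampling concrete, the following is a minimal sketch in plain Python. It is an assumption-laden illustration, not the code behind BasicNegativeSampler: the helper `corrupt` is hypothetical, and it does not check whether a corrupted triple happens to be a true one.

```python
import random


def corrupt(triple, num_entities, rng):
    """Return a negative triple by replacing the head or the tail with a different entity."""
    head, relation, tail = triple
    if rng.random() < 0.5:
        # Corrupt the head: pick any entity other than the original head.
        candidates = [e for e in range(num_entities) if e != head]
        return rng.choice(candidates), relation, tail
    # Corrupt the tail: pick any entity other than the original tail.
    candidates = [e for e in range(num_entities) if e != tail]
    return head, relation, rng.choice(candidates)


rng = random.Random(0)
positive = (0, 1, 2)
negatives = [corrupt(positive, num_entities=5, rng=rng) for _ in range(3)]
print(negatives)
```

Each generated triple keeps the relation and differs from the positive triple in exactly one position.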
Warning
The negative_sampler keyword argument should not be used if the LCWA is being used.
In general, all other options are available under either training approach.
The type of evaluation performed can be specified with the evaluator keyword. By default,
rank-based evaluation is used. It can be given explicitly as in:
>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... evaluator='RankBasedEvaluator',
... )
In this example, the evaluator was given as a string. A list of available evaluators can be found in
pykeen.evaluation. Alternatively, the class corresponding to the implementation
of the evaluator could be used as in:
>>> from pykeen.pipeline import pipeline
>>> from pykeen.evaluation import RankBasedEvaluator
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... evaluator=RankBasedEvaluator,
... )
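For readers unfamiliar with rank-based evaluation, the sketch below shows the general idea in plain Python. It assumes that a higher score means a more plausible candidate, and the helper names and scores are made up. It is not how RankBasedEvaluator is implemented and it ignores details such as filtering and tie handling.

```python
def rank_of(true_index, scores):
    """Return the 1-based rank of the true candidate (higher score is better)."""
    true_score = scores[true_index]
    return 1 + sum(score > true_score for score in scores)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true candidate is ranked in the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


# Made-up scores for three test triples over four candidate entities each.
all_scores = [
    [0.1, 0.9, 0.3, 0.2],  # true candidate is index 1
    [0.5, 0.4, 0.8, 0.6],  # true candidate is index 0
    [0.7, 0.2, 0.1, 0.3],  # true candidate is index 3
]
true_indices = [1, 0, 3]
ranks = [rank_of(i, s) for i, s in zip(true_indices, all_scores)]
print(ranks)  # [1, 3, 2]
print(mean_reciprocal_rank(ranks), hits_at_k(ranks, k=1))
```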
PyKEEN implements early stopping, which can be turned on with the stopper keyword
argument as in:
>>> from pykeen.pipeline import pipeline
>>> result = pipeline(
... dataset='Nations',
... model='TransE',
... stopper='early',
... )
Deeper Configuration
Arguments for the model can be given as a dictionary using model_kwargs.
There are several other options for passing keyword arguments to
the other parameters used by pykeen.pipeline.pipeline().
>>> from pykeen.pipeline import pipeline
>>> pipeline_result = pipeline(
... dataset='Nations',
... model='TransE',
... model_kwargs=dict(
... scoring_fct_norm=2,
... ),
... )
Because the pipeline takes care of looking up classes and instantiating them,
there are several other parameters to pykeen.pipeline.pipeline() that
can be used to specify the parameters during their respective instantiations.
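The look-up-and-instantiate pattern described here can be sketched in plain Python as follows. Everything in this block is hypothetical: the stand-in `TransE` class, the `MODELS` registry, and the helpers `resolve` and `instantiate` only illustrate the pattern and are not PyKEEN's code.

```python
class TransE:
    """Stand-in for a model class (hypothetical, not the PyKEEN class)."""

    def __init__(self, embedding_dim=50, scoring_fct_norm=1):
        self.embedding_dim = embedding_dim
        self.scoring_fct_norm = scoring_fct_norm


# Hypothetical registry mapping normalized names to classes.
MODELS = {"transe": TransE}


def resolve(query, registry):
    """Return a class, accepting either the class itself or its case-insensitive name."""
    if isinstance(query, str):
        return registry[query.lower()]
    return query


def instantiate(query, registry, kwargs=None):
    """Look up the class and pass the keyword arguments to its constructor."""
    cls = resolve(query, registry)
    return cls(**(kwargs or {}))


model = instantiate("TransE", MODELS, kwargs=dict(scoring_fct_norm=2))
print(type(model).__name__, model.scoring_fct_norm)  # TransE 2
```

Keeping the keyword arguments in a separate dictionary is what lets the caller name a component by string and still configure it.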
Bring Your Own Data
As an alternative to using a pre-packaged dataset, the training and testing triples can be set
explicitly with instances of pykeen.triples.TriplesFactory. For convenience,
the default data sets are also provided as subclasses of pykeen.triples.TriplesFactory.
Warning
Make sure the training and testing triples factories are mapped to the same entities and relations.
>>> from pykeen.datasets import NationsTestingTriplesFactory
>>> from pykeen.datasets import NationsTrainingTriplesFactory
>>> from pykeen.pipeline import pipeline
>>> training = NationsTrainingTriplesFactory()
>>> testing = NationsTestingTriplesFactory(
... entity_to_id=training.entity_to_id,
... relation_to_id=training.relation_to_id,
... )
>>> pipeline_result = pipeline(
... training_triples_factory=training,
... testing_triples_factory=testing,
... model='TransE',
... )
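Why the shared mappings matter can be shown with a small plain-Python sketch. The triples and the helpers `build_mapping` and `map_triples` are made up for illustration and are not part of PyKEEN. The point is that the label-to-ID dictionaries are built once from the training triples and reused for the testing triples, so the same label always receives the same ID.

```python
# Made-up labelled triples (head, relation, tail).
training_triples = [("uk", "treaties", "usa"), ("usa", "embassy", "china")]
testing_triples = [("china", "treaties", "uk")]


def build_mapping(labels):
    """Assign consecutive integer IDs to the sorted unique labels."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def map_triples(triples, entity_to_id, relation_to_id):
    """Translate labelled triples to ID triples using the given mappings."""
    return [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]


# Build the mappings once, from the training triples ...
entity_to_id = build_mapping([x for h, _, t in training_triples for x in (h, t)])
relation_to_id = build_mapping([r for _, r, _ in training_triples])

# ... and reuse them for the testing triples so that the IDs agree.
mapped_training = map_triples(training_triples, entity_to_id, relation_to_id)
mapped_testing = map_triples(testing_triples, entity_to_id, relation_to_id)
print(entity_to_id)    # {'china': 0, 'uk': 1, 'usa': 2}
print(mapped_testing)  # [(0, 1, 1)]
```

In this sketch, a testing triple that mentions an entity or relation unseen during training raises a KeyError.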
Beyond the Pipeline
While the pipeline provides a high-level interface, each aspect of the training process is encapsulated in classes that can be more finely tuned or subclassed. Below is an example of code that might have been executed with one of the previous examples.
# Get a training data set
from pykeen.datasets import Nations
dataset = Nations()
training_triples_factory = dataset.training
# Pick a model
from pykeen.models import TransE
model = TransE(triples_factory=training_triples_factory)
# Pick an optimizer from Torch
from torch.optim import Adam
optimizer = Adam(params=model.get_grad_params())
# Pick a training approach (sLCWA or LCWA)
from pykeen.training import SLCWATrainingLoop
training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)
# Train like Cristiano Ronaldo
training_loop.train(num_epochs=5, batch_size=256)
# Pick an evaluator
from pykeen.evaluation import RankBasedEvaluator
evaluator = RankBasedEvaluator(model)
# Get triples to test
mapped_triples = dataset.testing.mapped_triples
# Evaluate
results = evaluator.evaluate(mapped_triples, batch_size=1024)
print(results)
Optimizing a Model
The easiest way to optimize a model is with the pykeen.hpo.hpo_pipeline() function.
All of the following examples are about getting the best model
when training TransE on the Nations data set. Each gives a bit
of insight into usage of the hpo_pipeline() function.
The minimal usage of the hyper-parameter optimization is to specify the
dataset, the model, and how much to run. The following example shows how to
optimize the TransE model on the Nations dataset a given number of times using
the n_trials argument.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... dataset='Nations',
... model='TransE',
... )
Alternatively, the timeout can be set. In the following example,
as many trials as possible will be run in 60 seconds.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... timeout=60,
... dataset='Nations',
... model='TransE',
... )
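The two kinds of budget can be pictured with a small loop in plain Python. This is only a sketch of the idea, not how optuna or hpo_pipeline() schedule trials: the function `run_trials` and its `clock` parameter are hypothetical, and the time budget is only checked between trials, so a running trial is never interrupted.

```python
import time


def run_trials(objective, n_trials=None, timeout=None, clock=time.monotonic):
    """Call ``objective`` repeatedly until the trial budget or the time budget is used up."""
    if n_trials is None and timeout is None:
        raise ValueError("give n_trials, timeout, or both")
    start = clock()
    results = []
    while True:
        if n_trials is not None and len(results) >= n_trials:
            break
        if timeout is not None and clock() - start >= timeout:
            break
        # The objective receives the zero-based trial number.
        results.append(objective(len(results)))
    return results


# A trivial stand-in objective that returns the trial number squared.
results = run_trials(lambda trial_number: trial_number ** 2, n_trials=5)
print(results)  # [0, 1, 4, 9, 16]
```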
Every model in PyKEEN not only has default hyper-parameters, but default
strategies for optimizing these hyper-parameters. While the default values can
be found in the __init__() function of each model, the ranges/scales can be
found in the class variable pykeen.models.Model.hpo_default. For
example, the range for TransE’s embedding dimension is set to optimize
between 50 and 350 at increments of 25 in pykeen.models.TransE.hpo_default.
TransE also has a scoring function norm that will be optimized by a categorical
selection of {1, 2} by default.
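As a hedged illustration of what such a strategy describes, the sketch below draws one random configuration from a search space shaped like the TransE description above. The dictionary layout of the categorical entry and the helper `suggest` are assumptions made for this example. The actual sampling is delegated to optuna and may differ.

```python
import random

# Search space mirroring the description above. The layout of the
# categorical entry is an assumption made for this sketch.
search_space = dict(
    embedding_dim=dict(type=int, low=50, high=350, q=25),
    scoring_fct_norm=dict(type="categorical", choices=[1, 2]),
)


def suggest(spec, rng):
    """Draw one value from a single hyper-parameter specification."""
    if spec["type"] == "categorical":
        return rng.choice(spec["choices"])
    if spec["type"] is int:
        # randrange excludes the stop value, so add 1 to make `high` reachable.
        return rng.randrange(spec["low"], spec["high"] + 1, spec["q"])
    raise ValueError(f"unsupported type: {spec['type']!r}")


rng = random.Random(42)
trial = {name: suggest(spec, rng) for name, spec in search_space.items()}
print(trial)
```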
All hyper-parameters defined in the hpo_default of your chosen Model will be
optimized by default. If you already have a value that you’re happy with for
one of them, you can specify it with the model_kwargs argument. In the
following example, the embedding_dim for a TransE model is fixed at 200,
while the rest of the parameters will be optimized. For TransE, that means that
the scoring function norm will be optimized between 1 and 2.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... model='TransE',
... model_kwargs=dict(
... embedding_dim=200,
... ),
... dataset='Nations',
... n_trials=30,
... )
If you would like to set your own HPO strategy, you can do so with the
model_kwargs_ranges argument. In the example below, the embeddings are
searched over a larger range (low and high), but with a larger step
size (q), such that 100, 200, 300, and 400 are searched.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_result = hpo_pipeline(
... n_trials=30,
... dataset='Nations',
... model='TransE',
... model_kwargs_ranges=dict(
... embedding_dim=dict(type=int, low=100, high=400, q=100),
... ),
... )
If the given range is not divisible by the step size, then the upper bound will be omitted.
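The arithmetic behind this rule can be checked with a few lines of plain Python. The helper `candidate_values` is hypothetical and only enumerates the grid implied by low, high, and q. It does not call optuna.

```python
def candidate_values(low, high, q):
    """List the integers reachable from ``low`` in steps of ``q`` without exceeding ``high``."""
    return list(range(low, high + 1, q))


print(candidate_values(100, 400, 100))  # [100, 200, 300, 400]
print(candidate_values(100, 450, 100))  # [100, 200, 300, 400]; 450 is never reached
```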
Optimizing the Loss
While each model has its own default loss, you can explicitly specify a loss
the same way as in pykeen.pipeline.pipeline().
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... dataset='Nations',
... model='TransE',
... loss='MarginRankingLoss',
... )
As stated in the documentation for pykeen.pipeline.pipeline(), each model
specifies its own default loss function in pykeen.models.Model.loss_default.
For example, the TransE model defines the margin ranking loss as its default in
pykeen.models.TransE.loss_default.
Each model also specifies default hyper-parameters for the loss function in
pykeen.models.Model.loss_default_kwargs. For example, DistMultLiteral
explicitly sets the margin to 0.0 in pykeen.models.DistMultLiteral.loss_default_kwargs.
Unlike the model’s hyper-parameters, the models don’t store the strategies for
optimizing the loss functions’ hyper-parameters. The pre-configured strategies
are stored in pykeen.losses.losses_hpo_defaults. Currently, this
list only has a strategy for optimizing the margin ranking loss.
However, similarly to how you would specify model_kwargs_ranges, you can
specify the loss_kwargs_ranges explicitly, as in the following example.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... dataset='Nations',
... model='TransE',
... loss='MarginRankingLoss',
... loss_kwargs_ranges=dict(
... margin=dict(type=float, low=1.0, high=2.0),
... ),
... )
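To see what the margin hyper-parameter controls, here is a minimal plain-Python sketch of a margin ranking loss for a single positive/negative pair. It assumes that a higher score means a more plausible triple. The function and the scores are made up for illustration and this is not PyKEEN's loss class.

```python
def margin_ranking_loss(positive_score, negative_score, margin):
    """Penalize a negative triple that scores within ``margin`` of the positive one.

    Assumes that a higher score means a more plausible triple.
    """
    return max(0.0, margin + negative_score - positive_score)


# Made-up scores: the positive triple beats the negative one by 1.5.
for margin in (1.0, 2.0):
    print(margin, margin_ranking_loss(positive_score=2.0, negative_score=0.5, margin=margin))
```

With a margin of 1.0 the pair already satisfies the margin and contributes no loss, while a margin of 2.0 still yields a loss of 0.5, which is why the margin is a natural candidate for optimization.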
Warning
In the future, all losses will be re-implemented and the strategies will be stored the same as models.
Optimizing the Regularizer
Every model has a default regularizer (pykeen.models.Model.regularizer_default)
and default hyper-parameters for the regularizer (pykeen.models.Model.regularizer_default_kwargs).
Unlike losses, every regularizer class has a built-in hyper-parameter optimization
strategy, just like models, at pykeen.regularizers.Regularizer.hpo_default.
Therefore, the rules for specifying regularizer, regularizer_kwargs, and
regularizer_kwargs_ranges are the same as for models.
Optimizing the Optimizer
Yo dawg, I heard you liked optimization, so we put an optimizer around your
optimizer so you can optimize while you optimize. Since all optimizers used
in PyKEEN come from the PyTorch implementations, they obviously do not have
hpo_default class variables. Instead, every optimizer has a default
optimization strategy stored in pykeen.optimizers.optimizers_hpo_defaults,
the same way that the default strategies for losses are stored externally.
Optimizing the Negative Sampler
When the stochastic local closed world assumption (sLCWA) training approach is used for training, a negative sampler
(subclass of pykeen.sampling.NegativeSampler) is chosen.
Each has a strategy stored in pykeen.sampling.NegativeSampler.hpo_default.
Like models and regularizers, the rules are the same for specifying negative_sampler,
negative_sampler_kwargs, and negative_sampler_kwargs_ranges.
Optimizing Everything Else
Without loss of generality, the following arguments to pykeen.pipeline.pipeline()
have corresponding *_kwargs and *_kwargs_ranges:
training_loop (only kwargs, not kwargs_ranges)
evaluator
evaluation
Early Stopping
Early stopping can be baked directly into the optuna optimization.
The important keys are stopper='early' and stopper_kwargs.
When using early stopping, the hpo_pipeline() automatically takes
care of adding appropriate callbacks to interface with optuna.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... dataset='Nations',
... model='TransE',
... stopper='early',
... stopper_kwargs=dict(frequency=5, patience=2, delta=0.002),
... )
These stopper kwargs were chosen to make the example run faster. You will likely want to use different ones.
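How the three stopper keywords interact can be sketched in plain Python. The interpretation used here is an assumption for illustration: evaluate every `frequency` epochs and stop once the last `patience` evaluations fail to beat the earlier best by more than `delta`. The functions `should_stop` and `train` and the metric values are made up, and PyKEEN's stopper may define these quantities differently.

```python
def should_stop(metric_history, patience, delta):
    """Return True if the last ``patience`` evaluations did not beat the earlier best by more than ``delta``."""
    if len(metric_history) <= patience:
        return False
    best_before = max(metric_history[:-patience])
    recent = metric_history[-patience:]
    return all(value <= best_before + delta for value in recent)


def train(evaluate, num_epochs, frequency, patience, delta):
    """Run up to ``num_epochs`` epochs, evaluating every ``frequency`` epochs."""
    history = []
    for epoch in range(1, num_epochs + 1):
        # (one epoch of training would happen here)
        if epoch % frequency == 0:
            history.append(evaluate(epoch))
            if should_stop(history, patience, delta):
                return epoch, history
    return num_epochs, history


# Made-up validation metric (higher is better) that plateaus after epoch 10.
metric = {5: 0.30, 10: 0.40, 15: 0.401, 20: 0.4005, 25: 0.45}
stopped_at, history = train(metric.get, num_epochs=100, frequency=5, patience=2, delta=0.002)
print(stopped_at, history)  # 20 [0.3, 0.4, 0.401, 0.4005]
```

In this run, training stops at epoch 20 because the evaluations at epochs 15 and 20 improve on the value from epoch 10 by less than delta.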
Optimizing Optuna
By default, optuna uses the Tree-structured Parzen Estimator (TPE)
sampler (optuna.samplers.TPESampler), which is a probabilistic approach.
To emulate most hyper-parameter optimizations that have used random
sampling, use optuna.samplers.RandomSampler like in:
>>> from pykeen.hpo import hpo_pipeline
>>> from optuna.samplers import RandomSampler
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... sampler=RandomSampler,
... dataset='Nations',
... model='TransE',
... )
Alternatively, the strings "tpe" or "random" can be used so you
don’t have to import optuna in your script.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... sampler='random',
... dataset='Nations',
... model='TransE',
... )
While optuna.samplers.RandomSampler doesn’t (currently) take
any arguments, the sampler_kwargs parameter can be used to pass
arguments by keyword to the instantiation of
optuna.samplers.TPESampler like in:
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... sampler='tpe',
... sampler_kwargs=dict(prior_weight=1.1),
... dataset='Nations',
... model='TransE',
... )
Full Examples
The examples above have shown how to vary one setting at a time. This section has some more complete examples.
The following example sets the optimizer, loss, training, negative sampling, evaluation, and early stopping settings.
>>> from pykeen.hpo import hpo_pipeline
>>> hpo_pipeline_result = hpo_pipeline(
... n_trials=30,
... dataset='Nations',
... model='TransE',
... model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
... optimizer='SGD',
... optimizer_kwargs=dict(lr=0.01),
... loss='marginranking',
... loss_kwargs=dict(margin=1),
... training_loop='slcwa',
... training_kwargs=dict(num_epochs=100, batch_size=128),
... negative_sampler='basic',
... negative_sampler_kwargs=dict(num_negs_per_pos=1),
... evaluator_kwargs=dict(filtered=True),
... evaluation_kwargs=dict(batch_size=128),
... stopper='early',
... stopper_kwargs=dict(frequency=5, patience=2, delta=0.002),
... )
If you have the configuration as a dictionary:
>>> from pykeen.hpo import hpo_pipeline_from_config
>>> config = {
... 'optuna': dict(
... n_trials=30,
... ),
... 'pipeline': dict(
... dataset='Nations',
... model='TransE',
... model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
... optimizer='SGD',
... optimizer_kwargs=dict(lr=0.01),
... loss='marginranking',
... loss_kwargs=dict(margin=1),
... training_loop='slcwa',
... training_kwargs=dict(num_epochs=100, batch_size=128),
... negative_sampler='basic',
... negative_sampler_kwargs=dict(num_negs_per_pos=1),
... evaluator_kwargs=dict(filtered=True),
... evaluation_kwargs=dict(batch_size=128),
... stopper='early',
... stopper_kwargs=dict(frequency=5, patience=2, delta=0.002),
... )
... }
>>> hpo_pipeline_result = hpo_pipeline_from_config(config)
If you have a configuration (in the same format) in a JSON file:
>>> import json
>>> from pykeen.hpo import hpo_pipeline_from_path
>>> config = {
... 'optuna': dict(
... n_trials=30,
... ),
... 'pipeline': dict(
... dataset='Nations',
... model='TransE',
... model_kwargs=dict(embedding_dim=20, scoring_fct_norm=1),
... optimizer='SGD',
... optimizer_kwargs=dict(lr=0.01),
... loss='marginranking',
... loss_kwargs=dict(margin=1),
... training_loop='slcwa',
... training_kwargs=dict(num_epochs=100, batch_size=128),
... negative_sampler='basic',
... negative_sampler_kwargs=dict(num_negs_per_pos=1),
... evaluator_kwargs=dict(filtered=True),
... evaluation_kwargs=dict(batch_size=128),
... stopper='early',
... stopper_kwargs=dict(frequency=5, patience=2, delta=0.002),
... )
... }
>>> with open('config.json', 'w') as file:
...     json.dump(config, file, indent=2)
>>> hpo_pipeline_result = hpo_pipeline_from_path('config.json')