Bring Your Own Data

As an alternative to using a pre-packaged dataset, the training and testing data can be set explicitly by file path or with instances of pykeen.triples.TriplesFactory. Throughout this tutorial, the paths to the training, testing, and validation sets for the built-in pykeen.datasets.Nations dataset will be used as examples.

Pre-stratified Dataset

You’ve got training and testing files as 3-column TSV files, all ready to go. You’re sure that no entities or relations appear in the testing set that don’t appear in the training set. Load them into the pipeline like this:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

PyKEEN will take care of mapping the entities and relations from their labels to appropriate integer (technically, 0-dimensional torch.LongTensor) indexes and of making sure that the different sets of triples share the same mappings.
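
If you’re curious, these mappings are exposed on every triples factory; here’s a quick sketch for inspecting them:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> entity_ids = tf.entity_to_id        # dict mapping entity labels to integer indexes
>>> relation_ids = tf.relation_to_id    # dict mapping relation labels to integer indexes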

The same workflow applies to pykeen.hpo.hpo_pipeline(), which has a similar interface to pykeen.pipeline.pipeline(), as in:

>>> from pykeen.hpo import hpo_pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH, NATIONS_VALIDATE_PATH
>>> result = hpo_pipeline(
...     n_trials=3,  # you probably want more than this
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     validation=NATIONS_VALIDATE_PATH,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_hpo_pre_stratified_transe')

The remainder of the examples will use pykeen.pipeline.pipeline(), but they all work exactly the same with pykeen.hpo.hpo_pipeline().

If you want to add dataset-wide arguments, you can use the dataset_kwargs argument to pykeen.pipeline.pipeline() to enable options like create_inverse_triples=True.

>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     dataset_kwargs={'create_inverse_triples': True},
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

If you want finer control over how the triples are created, for example, if they are not all coming from TSV files, you can use the pykeen.triples.TriplesFactory interface.

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> testing = TriplesFactory.from_path(
...     NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
... )
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

Warning

In the instantiation of the testing factory, we used the entity_to_id and relation_to_id keyword arguments. This is because PyKEEN automatically assigns numeric identifiers to all entities and relations for each triples factory. However, we want the identifiers to be exactly the same for the testing set as for the training set, so we reuse the training set’s mappings. If the identifiers didn’t match, the testing triples would be interpreted with the wrong identifiers during evaluation, and we’d get nonsense results.
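
If you want to convince yourself that the two factories agree, you can compare the mappings directly; this sanity check is just an illustration, not a required step:

>>> assert training.entity_to_id == testing.entity_to_id
>>> assert training.relation_to_id == testing.relation_to_id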

The dataset_kwargs argument is ignored when passing your own pykeen.triples.TriplesFactory, so be sure to include create_inverse_triples=True in the instantiation of those factories if that’s your desired behavior, as in:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(
...     NATIONS_TRAIN_PATH,
...     create_inverse_triples=True,
... )
>>> testing = TriplesFactory.from_path(
...     NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
...     create_inverse_triples=True,
... )
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

If you already have triples loaded in a numpy.ndarray, triples factories can also be instantiated from them with the triples keyword argument of pykeen.triples.TriplesFactory.from_labeled_triples() instead of from_path().
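
For example, a minimal sketch; the labels below are invented purely for illustration:

>>> import numpy as np
>>> from pykeen.triples import TriplesFactory
>>> # hypothetical labeled triples as (head, relation, tail) rows
>>> triples = np.array([
...     ['a', 'likes', 'b'],
...     ['b', 'likes', 'c'],
... ])
>>> tf = TriplesFactory.from_labeled_triples(triples=triples)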

Unstratified Dataset

It’s more realistic that your real-world dataset is not already stratified into training and testing sets. PyKEEN has you covered with pykeen.triples.TriplesFactory.split(), which will allow you to create a stratified dataset.

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing = tf.split()
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_unstratified_transe')

By default, this is an 80/20 split. If you want to use early stopping, you’ll also need a validation set, so you should specify the splits:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing, validation = tf.split([.8, .1, .1])
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     validation=validation,
...     model='TransE',
...     stopper='early',
...     epochs=5,  # short epochs for testing - you should go
...                # higher, especially with early stopper enabled
... )
>>> result.save_to_directory('doctests/test_unstratified_stopped_transe')
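
The split is randomized, so if you need it to be reproducible across runs, pykeen.triples.TriplesFactory.split() also accepts a random_state argument; a brief sketch with an arbitrary fixed seed:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing, validation = tf.split([.8, .1, .1], random_state=42)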

Bring Your Own Data with Checkpoints

For a tutorial on how to use your own data together with checkpoints, see Checkpoints When Bringing Your Own Data and Loading Models Manually.