Data Sets

Sample datasets for use with PyKEEN, borrowed from https://github.com/ZhenfengLei/KGDatasets.

Name	Reference
fb15k	`pykeen.datasets.FB15k`
fb15k237	`pykeen.datasets.FB15k237`
hetionet	`pykeen.datasets.Hetionet`
kinships	`pykeen.datasets.Kinships`
nations	`pykeen.datasets.Nations`
openbiolink	`pykeen.datasets.OpenBioLink`
openbiolinkf1	`pykeen.datasets.OpenBioLinkF1`
openbiolinkf2	`pykeen.datasets.OpenBioLinkF2`
openbiolinklq	`pykeen.datasets.OpenBioLinkLQ`
umls	`pykeen.datasets.Umls`
wn18	`pykeen.datasets.WN18`
wn18rr	`pykeen.datasets.WN18RR`
yago310	`pykeen.datasets.YAGO310`

Note

This table can be re-generated with `pykeen ls datasets -f rst | pbcopy`.

class pykeen.datasets.DataSet

Contains a lazy reference to a training, testing, and validation data set.

create_inverse_triples: bool: All data sets should take care of inverse triple creation.

property entity_to_id: The mapping of entity labels to IDs.

property factories

Return a tuple of three factories in order (training, testing, validation).

Return type: Tuple[TriplesFactory, TriplesFactory, TriplesFactory]

property num_entities: The number of entities.

property num_relations: The number of relations.

property relation_to_id: The mapping of relation labels to IDs.

summarize()

Print a summary of the dataset.

Return type: None

summary_str()

Make a summary string of all of the factories.

Return type: str

testing: pykeen.triples.triples_factory.TriplesFactory: A factory wrapping the testing triples, which share indices with the training triples.

training: pykeen.triples.triples_factory.TriplesFactory: A factory wrapping the training triples.

validation: pykeen.triples.triples_factory.TriplesFactory: A factory wrapping the validation triples, which share indices with the training triples.
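
The idea that the testing and validation triples share indices with the training triples can be illustrated with a small standalone sketch. It does not use PyKEEN; the helper names `build_mappings` and `encode` and the example triples are hypothetical, and sorting the labels before numbering them is an assumption made only to keep the output deterministic.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_mappings(triples: List[Triple]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Assign consecutive integer IDs to entity and relation labels."""
    # Sorting is an assumption of this sketch; it only makes the IDs deterministic.
    entities = sorted({label for h, _, t in triples for label in (h, t)})
    relations = sorted({r for _, r, _ in triples})
    entity_to_id = {label: i for i, label in enumerate(entities)}
    relation_to_id = {label: i for i, label in enumerate(relations)}
    return entity_to_id, relation_to_id


def encode(
    triples: List[Triple],
    entity_to_id: Dict[str, int],
    relation_to_id: Dict[str, int],
) -> List[Tuple[int, int, int]]:
    """Translate labeled triples to ID triples using existing mappings."""
    return [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]


training = [("brazil", "exports_to", "usa"), ("usa", "allies_with", "uk")]
testing = [("uk", "exports_to", "brazil")]

# The mappings are built once, from the training triples only ...
entity_to_id, relation_to_id = build_mappings(training)
training_ids = encode(training, entity_to_id, relation_to_id)
# ... and reused for the testing triples, so both splits share indices.
testing_ids = encode(testing, entity_to_id, relation_to_id)
```

Because the same dictionaries encode both splits, an ID seen at test time refers to the same entity or relation as during training.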

class pykeen.datasets.FB15k(cache_root=None, **kwargs)

The FB15k data set.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.
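
The fallback order described for cache_root can be sketched as follows. This is a minimal illustration rather than PyKEEN's own code: the function name `resolve_cache_root` and its `environ` parameter (added so the lookup can be exercised without touching the real environment) are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_cache_root(
    cache_root: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the cache directory: explicit argument, then PYKEEN_HOME, then ~/.pykeen."""
    env = os.environ if environ is None else environ
    if cache_root is not None:
        # An explicitly given directory always wins.
        return cache_root
    # Otherwise use PYKEEN_HOME if it is set and non-empty, else ~/.pykeen.
    return env.get("PYKEEN_HOME") or os.path.expanduser(os.path.join("~", ".pykeen"))


explicit = resolve_cache_root("/tmp/my-cache")
from_env = resolve_cache_root(environ={"PYKEEN_HOME": "/data/pykeen"})
fallback = resolve_cache_root(environ={})
```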

class pykeen.datasets.FB15k237(cache_root=None, **kwargs)

The FB15k-237 data set.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.Hetionet(create_inverse_triples=False, eager=False, random_state=0)

The Hetionet dataset is a large biological network.

In its publication [himmelstein2017], it is demonstrated to be useful for link prediction in drug repositioning and made publicly available through its GitHub repository in several formats. The link prediction algorithm showcased does not rely on embeddings, which leaves room for interesting comparison. One such comparison was made during the master’s thesis of Lingling Xu [xu2019].

For reproducibility, the random_state argument is set by default to 0. For permutation studies, you can change this.

himmelstein2017: Himmelstein, D. S., et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6.
xu2019: Xu, L. (2019). A Comparison of Learned and Engineered Features in Network-Based Drug Repositioning. Master’s thesis.
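
Why a fixed random_state makes a split reproducible can be shown with a standalone sketch. It is not the splitting code used for Hetionet: the function `split_triples`, the 80/10/10 ratios, and the toy triples are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def split_triples(
    triples: Sequence[Triple],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    random_state: int = 0,
) -> Tuple[List[Triple], List[Triple], List[Triple]]:
    """Shuffle with a seeded generator, then cut into training, testing, validation."""
    rng = random.Random(random_state)  # private generator; global state is untouched
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_test = int(ratios[1] * len(shuffled))
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, testing, validation


toy = [(f"compound{i}", "treats", f"disease{i}") for i in range(10)]
first = split_triples(toy, random_state=0)
second = split_triples(toy, random_state=0)  # same seed, same split
permuted = split_triples(toy, random_state=1)  # a different seed for permutation studies
```

Calling the function twice with the same seed yields identical splits, while a different seed still partitions the same triples.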

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
name – The name of the file. If not given, the name is taken from the end of the URL.
cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.Kinships(**kwargs)

The Kinships data set.

Initialize the data set.

Parameters

training_path – Path to the training triples file, or the training triples file itself.
testing_path – Path to the testing triples file, or the testing triples file itself.
validation_path – Path to the validation triples file, or the validation triples file itself.
eager – Should the data be loaded eagerly? Defaults to False.
create_inverse_triples – Should inverse triples be created? Defaults to False.
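
What creating inverse triples means can be sketched without PyKEEN: for every triple (head, relation, tail), a second triple (tail, inverse relation, head) is added. The function name `with_inverse_triples` and the `_inverse` label suffix are assumptions of this sketch, not a statement about how the library names inverse relations.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Assumed naming scheme for inverse relations, for illustration only.
INVERSE_SUFFIX = "_inverse"


def with_inverse_triples(triples: List[Triple]) -> List[Triple]:
    """Return the original triples followed by one inverse triple per original."""
    inverses = [(tail, relation + INVERSE_SUFFIX, head) for head, relation, tail in triples]
    return list(triples) + inverses


original = [("person1", "aunt_of", "person2")]
augmented = with_inverse_triples(original)
```

The augmented list is twice as long as the input, and the number of distinct relations doubles accordingly.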

class pykeen.datasets.KinshipsTestingTriplesFactory

A factory for the testing portion of the Kinships data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
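
The 3-column TSV layout expected by the path parameter can be illustrated with a small reader. This is a sketch under stated assumptions, not the factory's loading code: the function `read_triples` is hypothetical, and the column order head, relation, tail is assumed.

```python
import io
from typing import List, TextIO, Tuple

Triple = Tuple[str, str, str]


def read_triples(handle: TextIO) -> List[Triple]:
    """Read tab-separated (head, relation, tail) rows from an open text file."""
    triples: List[Triple] = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError(f"line {line_number}: expected 3 columns, got {len(columns)}")
        head, relation, tail = columns
        triples.append((head, relation, tail))
    return triples


# An in-memory file stands in for a TSV file on disk.
example = io.StringIO("person1\tterm0\tperson2\nperson2\tterm1\tperson3\n")
triples = read_triples(example)
```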

class pykeen.datasets.KinshipsTrainingTriplesFactory

A factory for the training portion of the Kinships data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
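
Compacting IDs so that they range from 0 to n-1 can be sketched as a simple renumbering. The helper `compact_ids` is hypothetical, and preserving the relative order of the old IDs is an assumption of this sketch.

```python
from typing import Dict, Iterable


def compact_ids(ids: Iterable[int]) -> Dict[int, int]:
    """Map each distinct ID to a new ID in 0..n-1, keeping the original order."""
    return {old: new for new, old in enumerate(sorted(set(ids)))}


entity_ids = [0, 3, 3, 7, 42]  # gaps: 1, 2, 4, ... are unused
mapping = compact_ids(entity_ids)
compacted = [mapping[i] for i in entity_ids]
```

After compaction, the largest ID equals the number of distinct IDs minus one, so no embedding rows would be left unused.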

class pykeen.datasets.KinshipsValidationTriplesFactory

A factory for the validation portion of the Kinships data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.Nations(**kwargs)

The Nations data set.

Initialize the data set.

Parameters

training_path – Path to the training triples file, or the training triples file itself.
testing_path – Path to the testing triples file, or the testing triples file itself.
validation_path – Path to the validation triples file, or the validation triples file itself.
eager – Should the data be loaded eagerly? Defaults to False.
create_inverse_triples – Should inverse triples be created? Defaults to False.
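
The difference between eager and lazy loading can be shown with a small standalone class. It is only a sketch of the general pattern: `LazyTriples`, `load_example`, and the example triple are hypothetical and do not mirror the internals of the data set classes.

```python
from typing import Callable, List, Optional, Tuple

Triple = Tuple[str, str, str]


class LazyTriples:
    """Hold triples that are loaded on first access unless eager=True."""

    def __init__(self, loader: Callable[[], List[Triple]], eager: bool = False) -> None:
        self._loader = loader
        self._triples: Optional[List[Triple]] = None
        if eager:
            self._load()  # eager: pay the loading cost at construction time

    def _load(self) -> None:
        self._triples = self._loader()

    @property
    def loaded(self) -> bool:
        return self._triples is not None

    @property
    def triples(self) -> List[Triple]:
        if self._triples is None:
            self._load()  # lazy: load on first access, then reuse the result
        return self._triples


calls: List[str] = []


def load_example() -> List[Triple]:
    calls.append("load")  # record each time the loader actually runs
    return [("usa", "treaties", "uk")]


lazy = LazyTriples(load_example)
eager = LazyTriples(load_example, eager=True)
```

With the default, constructing the object is cheap and the data is read only when it is first needed; with eager=True, loading problems surface immediately at construction.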

class pykeen.datasets.NationsTestingTriplesFactory(**kwargs)

A factory for the testing portion of the Nations data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.NationsTrainingTriplesFactory(**kwargs)

A factory for the training portion of the Nations data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.NationsValidationTriplesFactory(**kwargs)

A factory for the validation portion of the Nations data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.OpenBioLink(create_inverse_triples=False, eager=False)

The OpenBioLink dataset.

OpenBioLink is an open-source, reproducible framework for generating biological knowledge graphs for benchmarking link prediction. It is available on GitHub at https://github.com/openbiolink/openbiolink and published in [breit2020]. Four data sets are available; this class represents the high-quality, directed set.

breit2020: Breit, A. (2020). OpenBioLink: A benchmarking framework for large-scale biomedical link prediction. Bioinformatics.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
name – The name of the file. If not given, the name is taken from the end of the URL.
cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.OpenBioLinkF1(create_inverse_triples=False, eager=False)

The PyKEEN First Filtered OpenBioLink 2020 Dataset.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
name – The name of the file. If not given, the name is taken from the end of the URL.
cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.OpenBioLinkF2(create_inverse_triples=False, eager=False)

The PyKEEN Second Filtered OpenBioLink 2020 Dataset.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
name – The name of the file. If not given, the name is taken from the end of the URL.
cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.OpenBioLinkLQ(create_inverse_triples=False, eager=False)

The low-quality variant of the OpenBioLink dataset.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
name – The name of the file. If not given, the name is taken from the end of the URL.
cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.Umls(**kwargs)

The UMLS data set.

Initialize the data set.

Parameters

training_path – Path to the training triples file, or the training triples file itself.
testing_path – Path to the testing triples file, or the testing triples file itself.
validation_path – Path to the validation triples file, or the validation triples file itself.
eager – Should the data be loaded eagerly? Defaults to False.
create_inverse_triples – Should inverse triples be created? Defaults to False.

class pykeen.datasets.UmlsTestingTriplesFactory

A factory for the testing portion of the UMLS data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.UmlsTrainingTriplesFactory

A factory for the training portion of the UMLS data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.UmlsValidationTriplesFactory

A factory for the validation portion of the UMLS data set.

Initialize the triples factory.

Parameters

path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1

class pykeen.datasets.WN18(cache_root=None, **kwargs)

The WN18 data set.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.WN18RR(cache_root=None, **kwargs)

The WN18-RR data set.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

class pykeen.datasets.YAGO310(cache_root=None, **kwargs)

The YAGO3-10 data set is a subset of YAGO3 that only contains entities with at least 10 relations.

Initialize dataset.

Parameters

url – The URL from which to download the dataset.
cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. The default is taken from the environment variable PYKEEN_HOME, falling back to ~/.pykeen.

pykeen.datasets.datasets: Mapping[str, Type[pykeen.datasets.dataset.DataSet]] = {'fb15k': <class 'pykeen.datasets.freebase.FB15k'>, 'fb15k237': <class 'pykeen.datasets.freebase.FB15k237'>, 'hetionet': <class 'pykeen.datasets.hetionet.Hetionet'>, 'kinships': <class 'pykeen.datasets.kinships.Kinships'>, 'nations': <class 'pykeen.datasets.nations.Nations'>, 'openbiolink': <class 'pykeen.datasets.openbiolink.OpenBioLink'>, 'openbiolinkf1': <class 'pykeen.datasets.openbiolink.OpenBioLinkF1'>, 'openbiolinkf2': <class 'pykeen.datasets.openbiolink.OpenBioLinkF2'>, 'openbiolinklq': <class 'pykeen.datasets.openbiolink.OpenBioLinkLQ'>, 'umls': <class 'pykeen.datasets.umls.Umls'>, 'wn18': <class 'pykeen.datasets.wordnet.WN18'>, 'wn18rr': <class 'pykeen.datasets.wordnet.WN18RR'>, 'yago310': <class 'pykeen.datasets.yago.YAGO310'>}: A mapping of data set names to their classes.
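
How a name-to-class mapping of this shape is typically used can be sketched with stand-in classes. Nothing below imports PyKEEN: `ExampleDataSet`, `ExampleNations`, `ExampleKinships`, `example_datasets`, and `resolve_dataset` are hypothetical names that only mirror the structure of the mapping above.

```python
from typing import Dict, Type


class ExampleDataSet:
    """Stand-in base class; not the PyKEEN DataSet."""


class ExampleNations(ExampleDataSet):
    """Stand-in for a concrete data set class."""


class ExampleKinships(ExampleDataSet):
    """Stand-in for a second concrete data set class."""


# Same shape as the mapping above: lowercase name -> class (not instance).
example_datasets: Dict[str, Type[ExampleDataSet]] = {
    "kinships": ExampleKinships,
    "nations": ExampleNations,
}


def resolve_dataset(name: str) -> Type[ExampleDataSet]:
    """Look up a data set class by name, with a readable error for unknown names."""
    try:
        return example_datasets[name]
    except KeyError:
        known = ", ".join(sorted(example_datasets))
        raise ValueError(f"unknown data set {name!r}; choose one of: {known}") from None


dataset_cls = resolve_dataset("nations")
dataset = dataset_cls()  # the mapping stores classes, so instantiate after lookup
```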

pykeen.datasets.get_dataset(*, dataset=None, dataset_kwargs=None, training_triples_factory=None, testing_triples_factory=None, validation_triples_factory=None)

Get the dataset.

Return type: Tuple[TriplesFactory, TriplesFactory, TriplesFactory]