Data Sets

Sample datasets for use with PyKEEN, borrowed from https://github.com/ZhenfengLei/KGDatasets.

=============  ==============================
Name           Reference
=============  ==============================
fb15k          pykeen.datasets.FB15k
fb15k237       pykeen.datasets.FB15k237
hetionet       pykeen.datasets.Hetionet
kinships       pykeen.datasets.Kinships
nations        pykeen.datasets.Nations
openbiolink    pykeen.datasets.OpenBioLink
openbiolinkf1  pykeen.datasets.OpenBioLinkF1
openbiolinkf2  pykeen.datasets.OpenBioLinkF2
openbiolinklq  pykeen.datasets.OpenBioLinkLQ
umls           pykeen.datasets.Umls
wn18           pykeen.datasets.WN18
wn18rr         pykeen.datasets.WN18RR
yago310        pykeen.datasets.YAGO310
=============  ==============================

Note

This table can be re-generated with pykeen ls datasets -f rst | pbcopy

class pykeen.datasets.DataSet[source]

Contains a lazy reference to a training, testing, and validation data set.

create_inverse_triples: bool

All data sets should take care of inverse triple creation.

property entity_to_id

The mapping of entity labels to IDs.

property factories

Return a tuple of three factories in order (training, testing, validation).

Return type

Tuple[TriplesFactory, TriplesFactory, TriplesFactory]

property num_entities

The number of entities.

property num_relations

The number of relations.

property relation_to_id

The mapping of relation labels to IDs.

summarize()[source]

Print a summary of the dataset.

Return type

None

summary_str()[source]

Make a summary string of all of the factories.

Return type

str

testing: pykeen.triples.triples_factory.TriplesFactory

A factory wrapping the testing triples, which share their indexes with the training triples.

training: pykeen.triples.triples_factory.TriplesFactory

A factory wrapping the training triples.

validation: pykeen.triples.triples_factory.TriplesFactory

A factory wrapping the validation triples, which share their indexes with the training triples.
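The idea of sharing indexes between the factories can be sketched in plain Python. This is only an illustration under stated assumptions, not PyKEEN's implementation: the helpers `build_index` and `encode` and the example triples are hypothetical. The label-to-ID mappings are derived from the training triples and then reused unchanged for the testing triples.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_index(labels: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integer IDs to the sorted, unique labels."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def encode(
    triples: List[Triple],
    entity_to_id: Dict[str, int],
    relation_to_id: Dict[str, int],
) -> List[Tuple[int, int, int]]:
    """Translate labelled triples to ID triples using existing mappings."""
    return [
        (entity_to_id[h], relation_to_id[r], entity_to_id[t])
        for h, r, t in triples
    ]


# Hypothetical example data.
training = [("brazil", "exports_to", "usa"), ("usa", "allied_with", "uk")]
testing = [("uk", "exports_to", "brazil")]

# The mappings are derived from the training triples only ...
entity_to_id = build_index(e for h, _, t in training for e in (h, t))
relation_to_id = build_index(r for _, r, _ in training)

# ... and then reused unchanged for the testing triples.
training_ids = encode(training, entity_to_id, relation_to_id)
testing_ids = encode(testing, entity_to_id, relation_to_id)
```

Because both splits go through the same mappings, an ID in the testing triples refers to the same entity or relation as it does in the training triples.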

class pykeen.datasets.FB15k(cache_root=None, **kwargs)[source]

The FB15k data set.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.
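The fallback order described for cache_root could look roughly like the following. This is a minimal sketch, not PyKEEN's own code: `resolve_cache_root` is a hypothetical helper, and treating an empty PYKEEN_HOME as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_cache_root(cache_root: Optional[str] = None) -> Path:
    """Return the directory in which extracted files would be stored.

    Hypothetical helper for illustration; not part of PyKEEN's API.
    """
    if cache_root is not None:
        # An explicit directory always wins.
        return Path(cache_root)
    # Otherwise fall back to PYKEEN_HOME (assumed: empty counts as unset) ...
    default = os.environ.get("PYKEEN_HOME")
    if default:
        return Path(default)
    # ... and finally to ~/.pykeen.
    return Path.home() / ".pykeen"
```

For example, `resolve_cache_root("/tmp/kg")` returns `/tmp/kg` regardless of the environment, while `resolve_cache_root()` depends on whether PYKEEN_HOME is set.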

class pykeen.datasets.FB15k237(cache_root=None, **kwargs)[source]

The FB15k-237 data set.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.Hetionet(create_inverse_triples=False, eager=False, random_state=0)[source]

The Hetionet dataset is a large biological network.

In its publication [himmelstein2017], it is demonstrated to be useful for link prediction in drug repositioning and made publicly available through its GitHub repository in several formats. The link prediction algorithm showcased does not rely on embeddings, which leaves room for interesting comparison. One such comparison was made during the master’s thesis of Lingling Xu [xu2019].

For reproducibility, the random_state argument is set by default to 0. For permutation studies, you can change this.
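Why a fixed random_state gives reproducible results can be illustrated with a seeded split. The sketch below assumes the seed controls how triples are shuffled before they are cut into training, testing, and validation parts. The function `split_triples`, the 80/10/10 ratios, and the example triples are hypothetical and are not taken from PyKEEN.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def split_triples(
    triples: Sequence[Triple],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    random_state: int = 0,
) -> Tuple[List[Triple], List[Triple], List[Triple]]:
    """Shuffle with a seeded generator, then cut into three parts."""
    shuffled = list(triples)
    random.Random(random_state).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_test = int(ratios[1] * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # the remainder
    )


# Hypothetical example data: a chain of 20 triples.
triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(20)]
first = split_triples(triples)
second = split_triples(triples)
```

Calling the function twice with the same seed yields identical splits; passing a different random_state generally yields a different permutation, which is what a permutation study would vary.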

himmelstein2017

Himmelstein, D. S., et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6.

xu2019

Xu, L. (2019). A Comparison of Learned and Engineered Features in Network-Based Drug Repositioning. Master’s Thesis.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • name – The name of the file. If not given, the name is taken from the end of the URL.

  • cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.
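Taking the file name "from the end of the URL" can be sketched with the standard library. The helper `name_from_url` and the example URL are hypothetical. Ignoring the query string is an assumption of this sketch, not documented behavior.

```python
import posixpath
from urllib.parse import urlparse


def name_from_url(url: str) -> str:
    """Return the last path component of a URL (hypothetical helper)."""
    # urlparse separates the path from any query string or fragment.
    return posixpath.basename(urlparse(url).path)


file_name = name_from_url("https://example.org/data/edges.tsv.gz")
```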

class pykeen.datasets.Kinships(**kwargs)[source]

The Kinships data set.

Initialize the data set.

Parameters
  • training_path – Path to the training triples file, or the file itself.

  • testing_path – Path to the testing triples file, or the file itself.

  • validation_path – Path to the validation triples file, or the file itself.

  • eager – Should the data be loaded eagerly? Defaults to False.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.
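As a rough illustration of what creating inverse triples means, each (head, relation, tail) triple gets a companion (tail, inverse relation, head). The sketch below is not PyKEEN's implementation. The function `with_inverse_triples` and the `_inverse` suffix are illustrative assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

INVERSE_SUFFIX = "_inverse"  # illustrative naming choice, an assumption


def with_inverse_triples(triples: List[Triple]) -> List[Triple]:
    """Append (tail, relation + suffix, head) for every (head, relation, tail)."""
    inverses = [(t, r + INVERSE_SUFFIX, h) for h, r, t in triples]
    return triples + inverses


augmented = with_inverse_triples([("a", "likes", "b")])
```

The result contains twice as many triples and twice as many distinct relations as the input.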

class pykeen.datasets.KinshipsTestingTriplesFactory[source]

A factory for the testing portion of the Kinships data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).
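What compacting IDs amounts to can be shown with a small sketch: a set of sparse IDs is remapped onto the consecutive range starting at 0. The helper `compact_ids` and the sample IDs are hypothetical. Preserving the sort order of the original IDs is an assumption.

```python
from typing import Dict, Iterable


def compact_ids(ids: Iterable[int]) -> Dict[int, int]:
    """Map each distinct ID to a consecutive ID, preserving sort order."""
    return {old: new for new, old in enumerate(sorted(set(ids)))}


# Hypothetical sparse entity IDs with gaps and one repeat.
sparse_entity_ids = [3, 17, 3, 42, 8]
mapping = compact_ids(sparse_entity_ids)
compacted = [mapping[i] for i in sparse_entity_ids]
```

After compaction the largest ID equals the number of distinct IDs minus one, so the IDs can index an embedding table without unused rows.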

class pykeen.datasets.KinshipsTrainingTriplesFactory[source]

A factory for the training portion of the Kinships data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).
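The 3-column TSV layout expected by path can be illustrated with a small reader. This is a sketch using only the standard library, not the loader PyKEEN uses. The function `read_triples` is hypothetical, and the column order (head, relation, tail) is an assumption.

```python
import csv
import io
from typing import List, TextIO, Tuple

Triple = Tuple[str, str, str]


def read_triples(handle: TextIO) -> List[Triple]:
    """Read (head, relation, tail) rows from a tab-separated file."""
    triples = []
    # QUOTE_NONE keeps quote characters in labels as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(row)}"
            )
        triples.append((row[0], row[1], row[2]))
    return triples


# Hypothetical file content, held in memory for the example.
example = io.StringIO("brazil\texports_to\tusa\nusa\tallied_with\tuk\n")
triples = read_triples(example)
```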

class pykeen.datasets.KinshipsValidationTriplesFactory[source]

A factory for the validation portion of the Kinships data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.Nations(**kwargs)[source]

The Nations data set.

Initialize the data set.

Parameters
  • training_path – Path to the training triples file, or the file itself.

  • testing_path – Path to the testing triples file, or the file itself.

  • validation_path – Path to the validation triples file, or the file itself.

  • eager – Should the data be loaded eagerly? Defaults to False.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.
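The difference between lazy and eager loading can be sketched as follows. The class `LazyTriples` and its `load_count` attribute are hypothetical and exist only to make the behavior observable; this is not how PyKEEN implements it. The sketch needs Python 3.8 or later for `functools.cached_property`.

```python
from functools import cached_property
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


class LazyTriples:
    """Hold a loader and call it on first access unless eager=True."""

    def __init__(self, loader: Callable[[], List[Triple]], eager: bool = False):
        self._loader = loader
        self.load_count = 0  # counts how often the loader has run
        if eager:
            self.triples  # touch the property to load immediately

    @cached_property
    def triples(self) -> List[Triple]:
        # Runs once; the result is cached on the instance afterwards.
        self.load_count += 1
        return self._loader()


lazy = LazyTriples(lambda: [("a", "r", "b")])
eager = LazyTriples(lambda: [("a", "r", "b")], eager=True)
```

With the default, nothing is read until `triples` is first accessed. With `eager=True`, the cost is paid at construction time.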

class pykeen.datasets.NationsTestingTriplesFactory(**kwargs)[source]

A factory for the testing portion of the Nations data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.NationsTrainingTriplesFactory(**kwargs)[source]

A factory for the training portion of the Nations data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.NationsValidationTriplesFactory(**kwargs)[source]

A factory for the validation portion of the Nations data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.OpenBioLink

The OpenBioLink dataset.

OpenBioLink is an open-source, reproducible framework for generating biological knowledge graphs for benchmarking link prediction. It is available on GitHub at https://github.com/openbiolink/openbiolink and published in [breit2020]. Four data sets are available; this class represents the high-quality, directed set.

breit2020

Breit, A., et al. (2020). OpenBioLink: A benchmarking framework for large-scale biomedical link prediction. Bioinformatics.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • name – The name of the file. If not given, the name is taken from the end of the URL.

  • cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.OpenBioLinkF1(create_inverse_triples=False, eager=False)[source]

The PyKEEN First Filtered OpenBioLink 2020 Dataset.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • name – The name of the file. If not given, the name is taken from the end of the URL.

  • cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.OpenBioLinkF2(create_inverse_triples=False, eager=False)[source]

The PyKEEN Second Filtered OpenBioLink 2020 Dataset.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • name – The name of the file. If not given, the name is taken from the end of the URL.

  • cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.OpenBioLinkLQ(create_inverse_triples=False, eager=False)[source]

The low-quality variant of the OpenBioLink dataset.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • name – The name of the file. If not given, the name is taken from the end of the URL.

  • cache_root – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.Umls(**kwargs)[source]

The UMLS data set.

Initialize the data set.

Parameters
  • training_path – Path to the training triples file, or the file itself.

  • testing_path – Path to the testing triples file, or the file itself.

  • validation_path – Path to the validation triples file, or the file itself.

  • eager – Should the data be loaded eagerly? Defaults to False.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

class pykeen.datasets.UmlsTestingTriplesFactory[source]

A factory for the testing portion of the UMLS data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.UmlsTrainingTriplesFactory[source]

A factory for the training portion of the UMLS data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.UmlsValidationTriplesFactory[source]

A factory for the validation portion of the UMLS data set.

Initialize the triples factory.

Parameters
  • path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.

  • triples – A 3-column numpy array with triples in it. If not specified, you should specify path.

  • create_inverse_triples – Should inverse triples be created? Defaults to False.

  • compact_id – Whether to compact the IDs so that they range from 0 to num_entities - 1 (for entities) or num_relations - 1 (for relations).

class pykeen.datasets.WN18(cache_root=None, **kwargs)[source]

The WN18 data set.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.WN18RR(cache_root=None, **kwargs)[source]

The WN18-RR data set.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

class pykeen.datasets.YAGO310(cache_root=None, **kwargs)[source]

The YAGO3-10 data set is a subset of YAGO3 that only contains entities with at least 10 relations.

Initialize dataset.

Parameters
  • url – The URL from which to download the dataset.

  • cache_root (Optional[str]) – An optional directory in which to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined by the environment variable PYKEEN_HOME, or defaults to ~/.pykeen.

pykeen.datasets.datasets: Mapping[str, Type[pykeen.datasets.dataset.DataSet]] = {
    'fb15k': <class 'pykeen.datasets.freebase.FB15k'>,
    'fb15k237': <class 'pykeen.datasets.freebase.FB15k237'>,
    'hetionet': <class 'pykeen.datasets.hetionet.Hetionet'>,
    'kinships': <class 'pykeen.datasets.kinships.Kinships'>,
    'nations': <class 'pykeen.datasets.nations.Nations'>,
    'openbiolink': <class 'pykeen.datasets.openbiolink.OpenBioLink'>,
    'openbiolinkf1': <class 'pykeen.datasets.openbiolink.OpenBioLinkF1'>,
    'openbiolinkf2': <class 'pykeen.datasets.openbiolink.OpenBioLinkF2'>,
    'openbiolinklq': <class 'pykeen.datasets.openbiolink.OpenBioLinkLQ'>,
    'umls': <class 'pykeen.datasets.umls.Umls'>,
    'wn18': <class 'pykeen.datasets.wordnet.WN18'>,
    'wn18rr': <class 'pykeen.datasets.wordnet.WN18RR'>,
    'yago310': <class 'pykeen.datasets.yago.YAGO310'>,
}

A mapping of data sets’ names to their classes.
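A name-based lookup over such a mapping might be used along these lines. To stay self-contained, the sketch maps names to the reference strings from the table at the top of this page rather than to the classes themselves. `DATASET_REFERENCES` and `lookup` are hypothetical names, not part of PyKEEN.

```python
from typing import Dict

# Names and references copied from the table at the top of this page.
DATASET_REFERENCES: Dict[str, str] = {
    "fb15k": "pykeen.datasets.FB15k",
    "fb15k237": "pykeen.datasets.FB15k237",
    "hetionet": "pykeen.datasets.Hetionet",
    "kinships": "pykeen.datasets.Kinships",
    "nations": "pykeen.datasets.Nations",
    "openbiolink": "pykeen.datasets.OpenBioLink",
    "openbiolinkf1": "pykeen.datasets.OpenBioLinkF1",
    "openbiolinkf2": "pykeen.datasets.OpenBioLinkF2",
    "openbiolinklq": "pykeen.datasets.OpenBioLinkLQ",
    "umls": "pykeen.datasets.Umls",
    "wn18": "pykeen.datasets.WN18",
    "wn18rr": "pykeen.datasets.WN18RR",
    "yago310": "pykeen.datasets.YAGO310",
}


def lookup(name: str) -> str:
    """Return the reference for a data set name, or fail with a clear message."""
    try:
        return DATASET_REFERENCES[name]
    except KeyError:
        known = ", ".join(sorted(DATASET_REFERENCES))
        raise KeyError(f"unknown data set {name!r}; choose one of: {known}") from None


reference = lookup("nations")
```

Listing the known names in the error message makes a mistyped data set name easy to correct.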

pykeen.datasets.get_dataset(*, dataset=None, dataset_kwargs=None, training_triples_factory=None, testing_triples_factory=None, validation_triples_factory=None)[source]

Get the dataset.

Return type

Tuple[TriplesFactory, TriplesFactory, TriplesFactory]