Data Sets
Sample datasets for use with PyKEEN, borrowed from https://github.com/ZhenfengLei/KGDatasets.
Name | Reference |
---|---|
fb15k | |
fb15k237 | |
hetionet | |
kinships | |
nations | |
openbiolink | |
openbiolinkf1 | |
openbiolinkf2 | |
openbiolinklq | |
umls | |
wn18 | |
wn18rr | |
yago310 | |
Note
This table can be re-generated with pykeen ls datasets -f rst | pbcopy
- class pykeen.datasets.DataSet
Contains a lazy reference to a training, testing, and validation data set.
- property entity_to_id
The mapping of entity labels to IDs.
- property factories
Return a tuple of three factories in order (training, testing, validation).
- Return type
Tuple[TriplesFactory, TriplesFactory, TriplesFactory]
- property num_entities
The number of entities.
- property num_relations
The number of relations.
- property relation_to_id
The mapping of relation labels to IDs.
- testing: pykeen.triples.triples_factory.TriplesFactory
A factory wrapping the testing triples, which share indices with the training triples.
- training: pykeen.triples.triples_factory.TriplesFactory
A factory wrapping the training triples.
- validation: pykeen.triples.triples_factory.TriplesFactory
A factory wrapping the validation triples, which share indices with the training triples.
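The idea of a lazy reference whose testing and validation triples share indices with the training triples can be illustrated with a small, library-free sketch. Everything below (the class `LazyDataSetSketch`, its `splits` property, and the `load_count` counter) is hypothetical and only mimics the behaviour described above; it is not PyKEEN's implementation, and plain lists of ID triples stand in for triples factories.

```python
from typing import Dict, List, Optional, Tuple

LabeledTriple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]


class LazyDataSetSketch:
    """Hypothetical stand-in for a lazily indexed data set (not PyKEEN's class)."""

    def __init__(
        self,
        training: List[LabeledTriple],
        testing: List[LabeledTriple],
        validation: List[LabeledTriple],
    ) -> None:
        # Only keep the raw triples; no index is built at construction time.
        self._raw = {"training": training, "testing": testing, "validation": validation}
        self._entity_to_id: Optional[Dict[str, int]] = None
        self._relation_to_id: Optional[Dict[str, int]] = None
        self.load_count = 0  # how many times the index was actually built

    def _ensure_loaded(self) -> None:
        # Build the label-to-ID mappings from the training triples on first use.
        if self._entity_to_id is not None:
            return
        self.load_count += 1
        training = self._raw["training"]
        entities = sorted({label for h, _, t in training for label in (h, t)})
        relations = sorted({r for _, r, _ in training})
        self._entity_to_id = {label: i for i, label in enumerate(entities)}
        self._relation_to_id = {label: i for i, label in enumerate(relations)}

    @property
    def entity_to_id(self) -> Dict[str, int]:
        self._ensure_loaded()
        assert self._entity_to_id is not None
        return self._entity_to_id

    @property
    def relation_to_id(self) -> Dict[str, int]:
        self._ensure_loaded()
        assert self._relation_to_id is not None
        return self._relation_to_id

    @property
    def num_entities(self) -> int:
        return len(self.entity_to_id)

    def _encode(self, part: str) -> List[IdTriple]:
        # Every split is encoded with the mappings derived from training.
        e, r = self.entity_to_id, self.relation_to_id
        return [(e[h], r[rel], e[t]) for h, rel, t in self._raw[part]]

    @property
    def splits(self) -> Tuple[List[IdTriple], List[IdTriple], List[IdTriple]]:
        # Same order as the documented tuple: training, testing, validation.
        return self._encode("training"), self._encode("testing"), self._encode("validation")
```

Accessing any property triggers indexing once; later accesses reuse the cached mappings, so all three splits refer to the same entity and relation IDs.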
- class pykeen.datasets.FB15k(cache_root=None, **kwargs)
The FB15k data set.
Initialize dataset.
- class pykeen.datasets.FB15k237(cache_root=None, **kwargs)
The FB15k-237 data set.
Initialize dataset.
- class pykeen.datasets.Hetionet(create_inverse_triples=False, eager=False, random_state=0)
The Hetionet dataset is a large biological network.
In its publication [himmelstein2017], it is demonstrated to be useful for link prediction in drug repositioning and made publicly available through its GitHub repository in several formats. The link prediction algorithm showcased does not rely on embeddings, which leaves room for interesting comparison. One such comparison was made during the master’s thesis of Lingling Xu [xu2019].
For reproducibility, the random_state argument is set by default to 0. For permutation studies, you can change this.
- himmelstein2017
Himmelstein, D. S., et al. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6.
- xu2019
Xu, L. (2019). A Comparison of Learned and Engineered Features in Network-Based Drug Repositioning. Master’s Thesis.
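The role of the random_state argument described above can be illustrated with a seeded split. This is a minimal sketch under stated assumptions: the function `split_triples` and the 80/10/10 ratios are hypothetical and are not taken from PyKEEN; the point is only that a fixed seed reproduces the same split, while a different seed can be used for permutation studies.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def split_triples(
    triples: Sequence[Triple],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    random_state: int = 0,
) -> Tuple[List[Triple], List[Triple], List[Triple]]:
    """Shuffle with a seeded generator, then cut into training/testing/validation."""
    rng = random.Random(random_state)  # local generator; global state is untouched
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_test = int(ratios[1] * len(shuffled))
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, testing, validation
```

Calling the function twice with the same random_state yields identical splits; passing another seed reshuffles the triples before cutting.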
Initialize dataset.
- Parameters
url – The url where to download the dataset from
name – The name of the file. If not given, tries to get the name from the end of the URL
cache_root – An optional directory to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined either by the environment variable PYKEEN_HOME or defaults to ~/.pykeen.
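The fallback rules described for name and cache_root can be sketched with the standard library alone. The helper names `resolve_cache_root` and `name_from_url` are hypothetical and this is not PyKEEN's code; only the PYKEEN_HOME variable and the ~/.pykeen default come from the parameter description above.

```python
import os
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def resolve_cache_root(cache_root: Optional[str] = None) -> Path:
    """Pick the cache directory: explicit argument, then PYKEEN_HOME, then ~/.pykeen."""
    if cache_root is not None:
        return Path(cache_root)
    env_value = os.environ.get("PYKEEN_HOME")
    if env_value:
        return Path(env_value)
    return Path.home() / ".pykeen"


def name_from_url(url: str, name: Optional[str] = None) -> str:
    """Use the given name, or fall back to the last path segment of the URL."""
    if name is not None:
        return name
    return os.path.basename(urlparse(url).path)
```

An explicit cache_root always wins in this sketch, which makes it easy to point a single experiment at a scratch directory without changing the environment.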
- class pykeen.datasets.Kinships(**kwargs)
The Kinships data set.
Initialize the data set.
- Parameters
training_path – Path to the training triples file, or the training triples file itself.
testing_path – Path to the testing triples file, or the testing triples file itself.
validation_path – Path to the validation triples file, or the validation triples file itself.
eager – Should the data be loaded eagerly? Defaults to false.
create_inverse_triples – Should inverse triples be created? Defaults to false.
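What creating inverse triples means can be shown on plain tuples: for each (head, relation, tail) triple, a reversed triple with a dedicated inverse relation is added. This is a sketch of the general technique only; the function name and the `_inverse` suffix are assumptions, not PyKEEN's naming.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def with_inverse_triples(triples: Sequence[Triple], suffix: str = "_inverse") -> List[Triple]:
    """Return the original triples followed by one reversed triple per original."""
    # (h, r, t) becomes (t, r + suffix, h); the suffix is a hypothetical convention.
    inverse = [(t, r + suffix, h) for h, r, t in triples]
    return list(triples) + inverse
```

The result contains twice as many triples and twice as many distinct relations as the input.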
- class pykeen.datasets.KinshipsTestingTriplesFactory
A factory for the testing portion of the Kinships data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
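The compact_id option above refers to remapping IDs so that they form a gap-free range starting at 0. A minimal sketch of that remapping, using a hypothetical helper `compact_ids` that is not part of PyKEEN:

```python
from typing import Dict, List, Sequence, Tuple


def compact_ids(ids: Sequence[int]) -> Tuple[List[int], Dict[int, int]]:
    """Map arbitrary integer IDs onto 0..n-1, preserving their relative order."""
    # Assign new IDs in ascending order of the old ones.
    mapping = {old: new for new, old in enumerate(sorted(set(ids)))}
    return [mapping[i] for i in ids], mapping
```

For example, the IDs 10, 3, 10, 42 use three distinct values, so they are remapped to values in the range 0 to 2.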
- class pykeen.datasets.KinshipsTrainingTriplesFactory
A factory for the training portion of the Kinships data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
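The 3-column TSV input mentioned in the path parameter holds one head, relation, and tail per line, separated by tabs. The reader below is a hedged, standard-library sketch of parsing such a file; `read_triples` is a hypothetical helper and does not reflect how PyKEEN loads files internally.

```python
import csv
import io
from typing import List, TextIO, Tuple

Triple = Tuple[str, str, str]


def read_triples(handle: TextIO) -> List[Triple]:
    """Parse tab-separated (head, relation, tail) rows from an open text file."""
    triples: List[Triple] = []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_number}: expected 3 columns, got {len(row)}")
        head, relation, tail = row
        triples.append((head, relation, tail))
    return triples


# Usage with an in-memory file; a real file would be opened with open(path, newline="").
example = io.StringIO("brazil\texports\tusa\nusa\tembassy\tbrazil\n")
parsed = read_triples(example)
```

Rows with a different number of columns raise an error instead of being silently dropped, which makes malformed files easier to spot.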
- class pykeen.datasets.KinshipsValidationTriplesFactory
A factory for the validation portion of the Kinships data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.Nations(**kwargs)
The Nations data set.
Initialize the data set.
- Parameters
training_path – Path to the training triples file, or the training triples file itself.
testing_path – Path to the testing triples file, or the testing triples file itself.
validation_path – Path to the validation triples file, or the validation triples file itself.
eager – Should the data be loaded eagerly? Defaults to false.
create_inverse_triples – Should inverse triples be created? Defaults to false.
- class pykeen.datasets.NationsTestingTriplesFactory(**kwargs)
A factory for the testing portion of the Nations data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.NationsTrainingTriplesFactory(**kwargs)
A factory for the training portion of the Nations data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.NationsValidationTriplesFactory(**kwargs)
A factory for the validation portion of the Nations data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.OpenBioLink(create_inverse_triples=False, eager=False)
The OpenBioLink dataset.
OpenBioLink is an open-source, reproducible framework for generating biological knowledge graphs for benchmarking link prediction. It is available on GitHub at https://github.com/openbiolink/openbiolink and published in [breit2020]. There are four available data sets; this class represents the high-quality, directed set.
- breit2020
Breit, A., et al. (2020). OpenBioLink: A benchmarking framework for large-scale biomedical link prediction. Bioinformatics.
Initialize dataset.
- Parameters
url – The url where to download the dataset from
name – The name of the file. If not given, tries to get the name from the end of the URL
cache_root – An optional directory to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined either by the environment variable PYKEEN_HOME or defaults to ~/.pykeen.
- class pykeen.datasets.OpenBioLinkF1(create_inverse_triples=False, eager=False)
The PyKEEN First Filtered OpenBioLink 2020 Dataset.
Initialize dataset.
- Parameters
url – The url where to download the dataset from
name – The name of the file. If not given, tries to get the name from the end of the URL
cache_root – An optional directory to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined either by the environment variable PYKEEN_HOME or defaults to ~/.pykeen.
- class pykeen.datasets.OpenBioLinkF2(create_inverse_triples=False, eager=False)
The PyKEEN Second Filtered OpenBioLink 2020 Dataset.
Initialize dataset.
- Parameters
url – The url where to download the dataset from
name – The name of the file. If not given, tries to get the name from the end of the URL
cache_root – An optional directory to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined either by the environment variable PYKEEN_HOME or defaults to ~/.pykeen.
- class pykeen.datasets.OpenBioLinkLQ(create_inverse_triples=False, eager=False)
The low-quality variant of the OpenBioLink dataset.
Initialize dataset.
- Parameters
url – The url where to download the dataset from
name – The name of the file. If not given, tries to get the name from the end of the URL
cache_root – An optional directory to store the extracted files. If none is given, the default PyKEEN directory is used. This is defined either by the environment variable PYKEEN_HOME or defaults to ~/.pykeen.
- class pykeen.datasets.Umls(**kwargs)
The UMLS data set.
Initialize the data set.
- Parameters
training_path – Path to the training triples file, or the training triples file itself.
testing_path – Path to the testing triples file, or the testing triples file itself.
validation_path – Path to the validation triples file, or the validation triples file itself.
eager – Should the data be loaded eagerly? Defaults to false.
create_inverse_triples – Should inverse triples be created? Defaults to false.
- class pykeen.datasets.UmlsTestingTriplesFactory
A factory for the testing portion of the UMLS data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.UmlsTrainingTriplesFactory
A factory for the training portion of the UMLS data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.UmlsValidationTriplesFactory
A factory for the validation portion of the UMLS data set.
Initialize the triples factory.
- Parameters
path – The path to a 3-column TSV file with triples in it. If not specified, you should specify triples.
triples – A 3-column numpy array with triples in it. If not specified, you should specify path.
create_inverse_triples – Should inverse triples be created? Defaults to False.
compact_id – Whether to compact the IDs such that they range from 0 to (num_entities or num_relations)-1
- class pykeen.datasets.WN18(cache_root=None, **kwargs)
The WN18 data set.
Initialize dataset.
- class pykeen.datasets.WN18RR(cache_root=None, **kwargs)
The WN18-RR data set.
Initialize dataset.
- class pykeen.datasets.YAGO310(cache_root=None, **kwargs)
The YAGO3-10 data set is a subset of YAGO3 that only contains entities with at least 10 relations.
Initialize dataset.
- pykeen.datasets.datasets: Mapping[str, Type[pykeen.datasets.dataset.DataSet]] = {'fb15k': <class 'pykeen.datasets.freebase.FB15k'>, 'fb15k237': <class 'pykeen.datasets.freebase.FB15k237'>, 'hetionet': <class 'pykeen.datasets.hetionet.Hetionet'>, 'kinships': <class 'pykeen.datasets.kinships.Kinships'>, 'nations': <class 'pykeen.datasets.nations.Nations'>, 'openbiolink': <class 'pykeen.datasets.openbiolink.OpenBioLink'>, 'openbiolinkf1': <class 'pykeen.datasets.openbiolink.OpenBioLinkF1'>, 'openbiolinkf2': <class 'pykeen.datasets.openbiolink.OpenBioLinkF2'>, 'openbiolinklq': <class 'pykeen.datasets.openbiolink.OpenBioLinkLQ'>, 'umls': <class 'pykeen.datasets.umls.Umls'>, 'wn18': <class 'pykeen.datasets.wordnet.WN18'>, 'wn18rr': <class 'pykeen.datasets.wordnet.WN18RR'>, 'yago310': <class 'pykeen.datasets.yago.YAGO310'>}
A mapping of data sets’ names to their classes.
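A name-to-class mapping like this one is typically used to resolve a data set from a string, for example from a configuration file. The sketch below illustrates that lookup pattern with stand-in classes so it runs without PyKEEN installed; `DataSetStub`, its subclasses, `datasets_stub`, and `get_dataset_cls` are all hypothetical names. With PyKEEN available, the mapping documented above would presumably take the place of `datasets_stub`.

```python
from typing import Mapping, Type


class DataSetStub:
    """Stand-in base class; the real mapping holds DataSet subclasses."""


class NationsStub(DataSetStub):
    pass


class KinshipsStub(DataSetStub):
    pass


# Hypothetical registry mirroring the shape of the documented mapping.
datasets_stub: Mapping[str, Type[DataSetStub]] = {
    "nations": NationsStub,
    "kinships": KinshipsStub,
}


def get_dataset_cls(name: str, registry: Mapping[str, Type[DataSetStub]]) -> Type[DataSetStub]:
    """Look up a data set class by its registered name, with a readable error."""
    try:
        return registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown data set {name!r}; known names: {known}") from None


dataset = get_dataset_cls("nations", datasets_stub)()  # instantiate the resolved class
```

Raising a ValueError that lists the known names tends to be more helpful to users than a bare KeyError.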