Dataset

Bases: ExtraReprMixin

The base dataset class.

Attributes Summary

`create_inverse_triples`	Return whether inverse triples are created for the training factory.
`entity_to_id`	The mapping of entity labels to IDs.
`factory_dict`	Return a dictionary of the three factories.
`metadata`	The dataset's name.
`metadata_file_name`
`num_entities`	The number of entities.
`num_relations`	The number of relations.
`relation_to_id`	The mapping of relation labels to IDs.

Methods Summary

`cli`()	Run the CLI.
`deteriorate`(n[, random_state])	Deteriorate n triples from the dataset's training triples with `pykeen.triples.deteriorate.deteriorate()`.
`docdata`(*parts)	Get docdata for this class.
`from_directory_binary`(path)	Load a dataset from a directory.
`from_path`(path[, ratios])	Create a dataset from a single triples factory by splitting it in 3.
`from_tf`(tf[, ratios])	Create a dataset from a single triples factory by splitting it in 3.
`get_normalized_name`()	Get the normalized name of the dataset.
`iter_extra_repr`()	Yield extra entries for the instance's string representation.
`remix`([random_state])	Remix a dataset using `pykeen.triples.remix.remix()`.
`restrict`([entities, relations, ...])	Restrict a dataset to the given entities/relations.
`similarity`(other[, metric])	Compute the similarity between two shuffles of the same dataset.
`summarize`([title, show_examples, file])	Print a summary of the dataset.
`summary_str`([title, show_examples, end])	Make a summary string of all of the factories.
`to_directory_binary`(path)	Store a dataset to a path in binary format.
`triples_pair_sort_key`(pair)	Get the number of triples for sorting in an iterator context.
`triples_sort_key`(cls)	Get the number of triples for sorting.

Attributes Documentation

create_inverse_triples: Return whether inverse triples are created for the training factory.

entity_to_id: The mapping of entity labels to IDs.

factory_dict: Return a dictionary of the three factories.

metadata: Mapping[str, Any] | None = None: The dataset’s name.

metadata_file_name: ClassVar[str] = 'metadata.pth'

num_entities: The number of entities.

num_relations: The number of relations.

relation_to_id: The mapping of relation labels to IDs.

Methods Documentation

classmethod cli() → None

Run the CLI.

Return type: None

deteriorate(n: int | float, random_state: None | int | Generator = None) → Dataset

Deteriorate n triples from the dataset’s training triples with pykeen.triples.deteriorate.deteriorate().

Parameters:

n (int | float)
random_state (None | int | Generator)

Return type:

Dataset

classmethod docdata(*parts: str) → Any

Get docdata for this class.

Parameters: parts (str)
Return type: Any

classmethod from_directory_binary(path: str | Path) → Dataset

Load a dataset from a directory.

Parameters: path (str | Path)
Return type: Dataset

classmethod from_path(path: str | Path, ratios: list[float] | None = None) → Dataset

Create a dataset from a single triples factory by splitting it in 3.

Parameters:

path (str | Path)
ratios (list[float] | None)

Return type:

Dataset

static from_tf(tf: TriplesFactory, ratios: list[float] | None = None) → Dataset

Create a dataset from a single triples factory by splitting it in 3.

Parameters:

tf (TriplesFactory)
ratios (list[float] | None)

Return type:

Dataset
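To make "splitting it in 3" concrete, here is a minimal, library-free sketch of a ratio-based three-way split. It is not PyKEEN's implementation: the helper name `split_in_three`, the ordering of the ratios (training, testing, validation), and the example ratios are all assumptions made for illustration. A real splitter may also need extra guarantees (for example, that every entity still appears in training), which this sketch ignores.

```python
import random
from collections.abc import Sequence

Triple = tuple[str, str, str]


def split_in_three(
    triples: Sequence[Triple],
    ratios: Sequence[float],
    seed: int = 0,
) -> tuple[list[Triple], list[Triple], list[Triple]]:
    """Shuffle triples and cut them into three parts (hypothetical helper)."""
    if len(ratios) != 3 or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("expected three ratios that sum to 1")
    shuffled = list(triples)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_first = int(ratios[0] * len(shuffled))
    n_second = int(ratios[1] * len(shuffled))
    # Assumed ordering: training, testing, validation.
    training = shuffled[:n_first]
    testing = shuffled[n_first:n_first + n_second]
    validation = shuffled[n_first + n_second:]  # takes whatever remains
    return training, testing, validation


# Toy triples, invented for this example.
triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(10)]
training, testing, validation = split_in_three(triples, ratios=[0.8, 0.1, 0.1])
```

Each input triple ends up in exactly one of the three parts, so the parts can be wrapped in separate factories without overlap.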

get_normalized_name() → str

Get the normalized name of the dataset.

Return type: str

iter_extra_repr() → Iterable[str]

Yield extra entries for the instance’s string representation.

Return type: Iterable[str]

remix(random_state: None | int | Generator = None, **kwargs) → Dataset

Remix a dataset using pykeen.triples.remix.remix().

Parameters: random_state (None | int | Generator)
Return type: Dataset

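restrict(entities: None | Collection[int] | Collection[str] = None, relations: None | Collection[int] | Collection[str] = None, invert_entity_selection: bool = False, invert_relation_selection: bool = False) → EagerDataset | Self

Restrict a dataset to the given entities/relations.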

Example:

>>> from pykeen.datasets import get_dataset
>>> full_dataset = get_dataset(dataset="nations")
>>> restricted_dataset = full_dataset.restrict(entities={"burma", "china", "india", "indonesia"})

Parameters:

entities (None | Collection[int] | Collection[str]) – The entities to keep (or discard, cf. invert_entity_selection). None corresponds to selecting all entities (but is handled more efficiently).
relations (None | Collection[int] | Collection[str]) – The relations to keep (or discard, cf. invert_relation_selection). None corresponds to selecting all relations (but is handled more efficiently).
invert_entity_selection (bool) – Whether to invert the entity selection, i.e., discard the selected entities rather than all remaining ones.
invert_relation_selection (bool) – Whether to invert the relation selection, i.e., discard the selected relations rather than all remaining ones.

Returns:

A new dataset with different entity and relation mappings and a restricted set of triples.

Return type:

EagerDataset | Self

Warning

This differs from pykeen.triples.triples_factory.CoreTriplesFactory.new_with_restriction() in that it does modify the label-to-ID mapping.
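The warning above is easier to see on a toy example. The following sketch is library-free and is not PyKEEN's implementation: the helpers `build_entity_to_id` and `restrict_by_entities`, the toy triples, and the rule "keep a triple only if both its head and tail are selected" are assumptions made for illustration. It shows how rebuilding a compact mapping after a restriction can give the same label a different ID.

```python
Triple = tuple[str, str, str]


def build_entity_to_id(triples: list[Triple]) -> dict[str, int]:
    """Assign consecutive IDs to the sorted entity labels (hypothetical helper)."""
    labels = sorted({label for h, _, t in triples for label in (h, t)})
    return {label: idx for idx, label in enumerate(labels)}


def restrict_by_entities(
    triples: list[Triple],
    keep: set[str],
    invert: bool = False,
) -> list[Triple]:
    """Keep triples whose head and tail are both selected (assumed semantics)."""

    def is_selected(label: str) -> bool:
        # With invert=True, the labels in `keep` are discarded instead.
        return (label in keep) != invert

    return [(h, r, t) for h, r, t in triples if is_selected(h) and is_selected(t)]


# Toy triples, invented for this example.
full_triples = [
    ("burma", "embassy", "china"),
    ("china", "embassy", "india"),
    ("usa", "embassy", "china"),
    ("india", "treaties", "indonesia"),
]
full_mapping = build_entity_to_id(full_triples)

restricted_triples = restrict_by_entities(
    full_triples, keep={"china", "india", "indonesia"}
)
restricted_mapping = build_entity_to_id(restricted_triples)

# The same label can map to different IDs before and after the restriction.
print(full_mapping["china"], restricted_mapping["china"])
```

Because IDs can shift like this, IDs taken from the full dataset should not be reused against the restricted one without translating them through the labels.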

similarity(other: Dataset, metric: str | None = None) → float

Compute the similarity between two shuffles of the same dataset.

Parameters:

other (Dataset) – The other shuffling of the dataset
metric (str | None) – The metric to use. Defaults to tanimoto.

Returns:

The similarity as a float.

Return type:

float
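For intuition about the default metric, the Tanimoto (Jaccard) similarity of two sets is the size of their intersection divided by the size of their union. The sketch below applies that definition to two sets of triples. It is library-free and only illustrative: the helper name `tanimoto`, the toy triples, and the choice of which triples to compare are assumptions, not a description of what the method does internally.

```python
Triple = tuple[str, str, str]


def tanimoto(a: set[Triple], b: set[Triple]) -> float:
    """Return |a & b| / |a | b| (hypothetical helper)."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Toy training sets from two imagined shuffles of the same data.
shuffle_a = {
    ("a", "r", "b"),
    ("b", "r", "c"),
    ("c", "r", "d"),
    ("d", "r", "e"),
}
shuffle_b = {
    ("a", "r", "b"),
    ("b", "r", "c"),
    ("e", "r", "f"),
    ("f", "r", "g"),
}

score = tanimoto(shuffle_a, shuffle_b)  # 2 shared triples out of 6 distinct ones
```

A score of 1.0 means the two sets are identical, and a score of 0.0 means they share no triples.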