Dataset

class Dataset[source]

Bases: ExtraReprMixin

The base dataset class.

Attributes Summary

create_inverse_triples

Return whether inverse triples are created for the training factory.

entity_to_id

The mapping of entity labels to IDs.

factory_dict

Return a dictionary of the three factories.

metadata

the dataset's name

metadata_file_name

num_entities

The number of entities.

num_relations

The number of relations.

relation_to_id

The mapping of relation labels to IDs.

Methods Summary

cli()

Run the CLI.

deteriorate(n[, random_state])

Deteriorate n triples from the dataset's training with pykeen.triples.deteriorate.deteriorate().

docdata(*parts)

Get docdata for this class.

from_directory_binary(path)

Load a dataset from a directory.

from_path(path[, ratios])

Create a dataset by loading a single triples factory from a path and splitting it into three.

from_tf(tf[, ratios])

Create a dataset from a single triples factory by splitting it into three.

get_normalized_name()

Get the normalized name of the dataset.

iter_extra_repr()

Yield extra entries for the instance's string representation.

remix([random_state])

Remix a dataset using pykeen.triples.remix.remix().

restrict([entities, relations, ...])

Restrict a dataset to the given entities/relations.

similarity(other[, metric])

Compute the similarity between two shuffles of the same dataset.

summarize([title, show_examples, file])

Print a summary of the dataset.

summary_str([title, show_examples, end])

Make a summary string of all of the factories.

to_directory_binary(path)

Store a dataset to a path in binary format.

triples_pair_sort_key(pair)

Get the number of triples for sorting in an iterator context.

triples_sort_key(cls)

Get the number of triples for sorting.

Attributes Documentation

create_inverse_triples

Return whether inverse triples are created for the training factory.

entity_to_id

The mapping of entity labels to IDs.

factory_dict

Return a dictionary of the three factories.

metadata: Mapping[str, Any] | None = None

the dataset’s name

metadata_file_name: ClassVar[str] = 'metadata.pth'

num_entities

The number of entities.

num_relations

The number of relations.

relation_to_id

The mapping of relation labels to IDs.
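
The relationship between the label-to-ID mappings and the counts above can be illustrated with a small standalone sketch. It does not call PyKEEN; the toy triples and the choice to assign IDs in sorted label order are assumptions made for illustration only.

```python
# Toy (head, relation, tail) triples; the labels are made up for illustration.
triples = [
    ("burma", "relA", "china"),
    ("china", "relB", "india"),
    ("india", "relA", "burma"),
]

# Collect the distinct labels. Sorting them before numbering is an assumption.
entity_labels = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
relation_labels = sorted({r for _, r, _ in triples})

# Label-to-ID mappings, analogous in spirit to entity_to_id / relation_to_id.
entity_to_id = {label: i for i, label in enumerate(entity_labels)}
relation_to_id = {label: i for i, label in enumerate(relation_labels)}

# The counts follow directly from the mappings.
num_entities = len(entity_to_id)
num_relations = len(relation_to_id)
```

In this sketch each count is the size of its mapping, so the two can never disagree.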

Methods Documentation

classmethod cli() None[source]

Run the CLI.

Return type:

None

deteriorate(n: int | float, random_state: None | int | Generator = None) Dataset[source]

Deteriorate n triples from the dataset’s training with pykeen.triples.deteriorate.deteriorate().

Parameters:
  • n (int | float)

  • random_state (None | int | Generator)

Return type:

Dataset

classmethod docdata(*parts: str) Any[source]

Get docdata for this class.

Parameters:

parts (str)

Return type:

Any

classmethod from_directory_binary(path: str | Path) Dataset[source]

Load a dataset from a directory.

Parameters:

path (str | Path)

Return type:

Dataset

classmethod from_path(path: str | Path, ratios: list[float] | None = None) Dataset[source]

Create a dataset by loading a single triples factory from a path and splitting it into three.

Parameters:
  • path (str | Path)

  • ratios (list[float] | None)

Return type:

Dataset

static from_tf(tf: TriplesFactory, ratios: list[float] | None = None) Dataset[source]

Create a dataset from a single triples factory by splitting it into three.

Parameters:
  • tf (TriplesFactory)

  • ratios (list[float] | None)

Return type:

Dataset
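
The idea of cutting one collection of triples into three parts by ratio can be sketched in plain Python. This is not PyKEEN's splitting routine, which may apply extra constraints. The function name `split_triples`, the part names, and the ratios `(0.8, 0.1, 0.1)` are assumptions for illustration.

```python
import random
from typing import Sequence

Triple = tuple[str, str, str]


def split_triples(
    triples: Sequence[Triple],
    ratios: Sequence[float] = (0.8, 0.1, 0.1),  # assumed values, not from the docs
    seed: int = 0,
) -> tuple[list[Triple], list[Triple], list[Triple]]:
    """Shuffle the triples and cut them into three consecutive parts."""
    if len(ratios) != 3 or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be three numbers that sum to 1")
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_first = int(ratios[0] * len(shuffled))
    n_second = int(ratios[1] * len(shuffled))
    training = shuffled[:n_first]
    testing = shuffled[n_first : n_first + n_second]
    # The last part takes whatever remains, so no triple is lost to rounding.
    validation = shuffled[n_first + n_second :]
    return training, testing, validation


triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(10)]
training, testing, validation = split_triples(triples)
```

With ten toy triples, the three parts contain 8, 1, and 1 triples, and together they cover the input exactly once.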

get_normalized_name() str[source]

Get the normalized name of the dataset.

Return type:

str

iter_extra_repr() Iterable[str][source]

Yield extra entries for the instance’s string representation.

Return type:

Iterable[str]

remix(random_state: None | int | Generator = None, **kwargs) Dataset[source]

Remix a dataset using pykeen.triples.remix.remix().

Parameters:

random_state (None | int | Generator)

Return type:

Dataset
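
One plausible reading of remixing is to pool the triples of all splits, shuffle them, and partition them again into parts of the original sizes. The sketch below illustrates that reading with plain lists. It is an assumption about the general idea, not a reproduction of pykeen.triples.remix.remix(), and `remix_splits` is a hypothetical helper.

```python
import random


def remix_splits(splits, seed=0):
    """Pool all triples, shuffle them, and re-cut into parts of the original sizes."""
    pooled = [triple for split in splits for triple in split]
    random.Random(seed).shuffle(pooled)
    remixed, start = [], 0
    for split in splits:
        remixed.append(pooled[start : start + len(split)])
        start += len(split)
    return remixed


original = [
    [("a", "r", "b"), ("b", "r", "c"), ("c", "r", "d")],  # training-like part
    [("d", "r", "e")],                                     # testing-like part
    [("e", "r", "f")],                                     # validation-like part
]
remixed = remix_splits(original, seed=42)
```

The sizes of the parts and the overall set of triples are unchanged; only the assignment of triples to parts can differ.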

restrict(entities: None | Collection[int] | Collection[str] = None, relations: None | Collection[int] | Collection[str] = None, invert_entity_selection: bool = False, invert_relation_selection: bool = False) EagerDataset | Self[source]

Restrict a dataset to the given entities/relations.

Example:

>>> from pykeen.datasets import get_dataset
>>> full_dataset = get_dataset(dataset="nations")
>>> restricted_dataset = full_dataset.restrict(entities={"burma", "china", "india", "indonesia"})

Parameters:
  • entities (None | Collection[int] | Collection[str]) – The entities to keep (or discard, cf. invert_entity_selection). None corresponds to selecting all entities (but is handled more efficiently).

  • relations (None | Collection[int] | Collection[str]) – The relations to keep (or discard, cf. invert_relation_selection). None corresponds to selecting all relations (but is handled more efficiently).

  • invert_entity_selection (bool) – Whether to invert the entity selection, i.e., discard the selected entities rather than all remaining ones.

  • invert_relation_selection (bool) – Whether to invert the relation selection, i.e., discard the selected relations rather than all remaining ones.

Returns:

A new dataset with different entity and relation mappings and a restricted set of triples.

Return type:

EagerDataset | Self

Warning

This differs from pykeen.triples.triples_factory.CoreTriplesFactory.new_with_restriction() in that it does modify the label-to-ID mapping.
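
The effect described in the warning, where the label-to-ID mapping changes after restriction, can be illustrated without PyKEEN. In the sketch below, `restrict_triples` is a hypothetical helper, the toy data are made up, and renumbering the kept labels in sorted order is an assumption.

```python
def restrict_triples(triples, entity_to_id, entities, invert=False):
    """Keep triples whose head and tail are both selected, then renumber the labels."""
    known = set(entity_to_id)
    # With invert=True, the given entities are discarded instead of kept.
    selected = known - set(entities) if invert else known & set(entities)
    kept = [(h, r, t) for h, r, t in triples if h in selected and t in selected]
    # Build a new, compact mapping. Sorted order is an assumption of this sketch.
    new_entity_to_id = {label: i for i, label in enumerate(sorted(selected))}
    return kept, new_entity_to_id


entity_to_id = {"brazil": 0, "burma": 1, "china": 2, "india": 3}
triples = [
    ("burma", "r", "china"),
    ("china", "r", "brazil"),
    ("india", "r", "burma"),
]
kept, new_entity_to_id = restrict_triples(
    triples, entity_to_id, entities={"burma", "china", "india"}
)
```

Here "burma" had ID 1 in the original mapping and ID 0 in the restricted one, which is the kind of change the warning refers to.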

similarity(other: Dataset, metric: str | None = None) float[source]

Compute the similarity between two shuffles of the same dataset.

Parameters:
  • other (Dataset) – The other shuffle of the dataset.

  • metric (str | None) – The metric to use. Defaults to tanimoto.

Returns:

The similarity as a float.

Return type:

float

See also

pykeen.triples.triples_factory.splits_similarity().
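
The Tanimoto coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to two toy sets of triples. Which triples of each dataset PyKEEN actually compares is not stated above, so the inputs here are an assumption, and returning 1.0 for two empty sets is a convention chosen for this sketch.

```python
def tanimoto_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


shuffle_one = {("a", "r", "b"), ("b", "r", "c"), ("c", "r", "d")}
shuffle_two = {("b", "r", "c"), ("c", "r", "d"), ("d", "r", "e")}

score = tanimoto_similarity(shuffle_one, shuffle_two)  # 2 shared / 4 total = 0.5
```

A score of 1.0 means the two sets are identical, and 0.0 means they share no triples.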

summarize(title: str | None = None, show_examples: int | None = 5, file=None) None[source]

Print a summary of the dataset.

Parameters:
  • title (str | None)

  • show_examples (int | None)

Return type:

None

summary_str(title: str | None = None, show_examples: int | None = 5, end='\n') str[source]

Make a summary string of all of the factories.

Parameters:
  • title (str | None)

  • show_examples (int | None)

Return type:

str

to_directory_binary(path: str | Path) None[source]

Store a dataset to a path in binary format.

Parameters:

path (str | Path)

Return type:

None

classmethod triples_pair_sort_key(pair: tuple[str, type[Dataset]]) int[source]

Get the number of triples for sorting in an iterator context.

Parameters:

pair (tuple[str, type[Dataset]])

Return type:

int

static triples_sort_key(cls: type[Dataset]) int[source]

Get the number of triples for sorting.

Parameters:

cls (type[Dataset])

Return type:

int
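
The two sort keys fit the usual `sorted(..., key=...)` pattern: one takes a dataset class, the other takes a (name, class) pair as produced by iterating over a registry. The sketch below mimics that pattern with toy classes. The `num_triples` class attribute and the registry dictionary are hypothetical; how PyKEEN obtains the triple count is not shown above.

```python
class ToyDataset:
    """Stand-in for a dataset class; num_triples is a made-up attribute."""

    num_triples: int = 0


class Small(ToyDataset):
    num_triples = 10


class Large(ToyDataset):
    num_triples = 1000


def triples_sort_key(cls: type[ToyDataset]) -> int:
    """Sort key for a dataset class: its number of triples."""
    return cls.num_triples


def triples_pair_sort_key(pair: tuple[str, type[ToyDataset]]) -> int:
    """Sort key for a (name, class) pair, delegating to the class key."""
    return triples_sort_key(pair[1])


registry = {"large": Large, "small": Small}
ordered = sorted(registry.items(), key=triples_pair_sort_key)
names = [name for name, _ in ordered]
```

Sorting the pairs this way orders the toy datasets from fewest to most triples.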