Sealant

Tools for removing leakage from datasets.

Leakage occurs when the inverse of a training triple appears in the testing or validation set. This generally inflates evaluation results and makes them misleading, because predicting an inverse triple is usually very easy and says little about how well a model generalizes to novel triples.
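To make the definition concrete, here is a minimal sketch in plain Python. It is not the library's implementation. It assumes triples are (head, relation, tail) string tuples and that a mapping of known inverse relations is already available; the names `find_leaked_triples` and `inverse_of` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def find_leaked_triples(
    training: List[Triple],
    evaluation: List[Triple],
    inverse_of: Dict[str, str],
) -> List[Triple]:
    """Return evaluation triples whose inverse appears in the training set.

    `inverse_of` is a hypothetical mapping from a relation to its known inverse.
    """
    training_set: Set[Triple] = set(training)
    leaked = []
    for head, relation, tail in evaluation:
        inverse_relation = inverse_of.get(relation)
        if inverse_relation is None:
            continue
        # (h, r, t) leaks if (t, r_inverse, h) was seen during training.
        if (tail, inverse_relation, head) in training_set:
            leaked.append((head, relation, tail))
    return leaked


training = [("paris", "capital_of", "france"), ("berlin", "capital_of", "germany")]
testing = [("france", "has_capital", "paris"), ("italy", "has_capital", "rome")]
inverse_of = {"has_capital": "capital_of", "capital_of": "has_capital"}

leaked = find_leaked_triples(training, testing, inverse_of)
print(leaked)
```

In this toy data, a model that has memorized the `capital_of` triple can answer the flagged `has_capital` test triple without generalizing at all.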

class Sealant(triples_factory, minimum_frequency=None, symmetric=True, use_tqdm=True, use_multiprocessing=False)

Stores inverse frequencies and inverse mappings in a given triples factory.

Index the inverse frequencies and the inverse relations in the triples factory.

Parameters
  • triples_factory (TriplesFactory) – The triples factory to index.

  • minimum_frequency (Optional[float]) – The minimum overlap between two relations’ triples to consider them as inverses. If None, the default value of 0.97 is used, which is taken from Toutanova and Chen (2015), who originally described the generation of FB15k-237.

apply(triples_factory)

Make a new triples factory containing neither duplicate nor inverse relationships.

Return type

TriplesFactory

get_duplicate_triples(triples_factory)

Get labeled duplicate triples.

Return type

ndarray

get_inverse_triples(triples_factory)

Get labeled inverse triples.

Return type

ndarray

new_without_duplicate_relations(triples_factory)

Make a new triples factory not containing duplicate relationships.

Return type

TriplesFactory

new_without_inverse_relations(triples_factory)

Make a new triples factory not containing inverse relationships.

Return type

TriplesFactory

property relations_to_delete: Set[str]

The relations to delete, combined from both duplicates and inverses.

Return type

Set[str]

get_candidate_duplicate_relations(triples_factory, *, minimum_frequency=None, skip_zeros=True, symmetric=True, use_tqdm=True, use_multiprocessing=False)

Count which relationships might be duplicates.

Parameters
  • symmetric (bool) – Should set similarity be calculated as the Jaccard index (symmetric) or as the set inclusion percentage (asymmetric)?

  • minimum_frequency (Optional[float]) – If set, pairs of relations and candidate duplicate relations with a similarity lower than this value will not be reported.

  • skip_zeros (bool) – Should similarities of 0.0 between a relation and a candidate duplicate be discarded?

  • use_tqdm (bool) – Should tqdm be used to track progress of the similarity calculations?

  • use_multiprocessing (bool) – Should multiprocessing be used to offload the similarity calculations across multiple cores?

Returns

A counter whose keys are pairs of relations and values are similarity scores
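The two similarity options named by the symmetric parameter can be sketched as follows. This is an illustration in plain Python, not the library's code: it assumes each relation is represented by its set of (head, tail) pairs, and the function names and example relations are hypothetical. Which relation serves as the denominator in the asymmetric case is also an assumption here.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def jaccard_index(a: Set[Pair], b: Set[Pair]) -> float:
    """Symmetric similarity: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def inclusion_fraction(a: Set[Pair], b: Set[Pair]) -> float:
    """Asymmetric similarity: the fraction of A's pairs that also occur in B."""
    return len(a & b) / len(a) if a else 0.0


# (head, tail) pairs for two hypothetical relations.
born_in = {("ada", "london"), ("alan", "london")}
birthplace = {
    ("ada", "london"),
    ("alan", "london"),
    ("grace", "nyc"),
    ("kurt", "brno"),
}

print(jaccard_index(born_in, birthplace))
print(inclusion_fraction(born_in, birthplace))
print(inclusion_fraction(birthplace, born_in))
```

The Jaccard index penalizes the size difference between the two relations, while the inclusion fraction can reach 1.0 when a small relation is fully contained in a larger one, so the two settings can flag different candidate pairs at the same threshold.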

get_candidate_inverse_relations(triples_factory, *, symmetric=True, minimum_frequency=None, skip_zeros=True, skip_self=True, use_tqdm=True, use_multiprocessing=False)

Count which relationships might be inverses of each other.

Parameters
  • symmetric (bool) – Should set similarity be calculated as the Jaccard index (symmetric) or as the set inclusion percentage (asymmetric)?

  • minimum_frequency (Optional[float]) – If set, pairs of relations and candidate inverse relations with a similarity lower than this value will not be reported.

  • skip_zeros (bool) – Should similarities between forward and candidate inverses of 0.0 be discarded?

  • skip_self (bool) – Should similarities between a relationship and its own candidate inverse be skipped? Defaults to True; setting it to False can be useful to identify relationships that aren’t directed.

  • use_tqdm (bool) – Should tqdm be used to track progress of the similarity calculations?

  • use_multiprocessing (bool) – Should multiprocessing be used to offload the similarity calculations across multiple cores?

Return type

Mapping[Tuple[str, str], float]

Returns

A counter whose keys are pairs of relations and values are similarity scores
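How the filtering parameters interact can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: triples are (head, relation, tail) string tuples, only the symmetric (Jaccard) similarity is computed, and `candidate_inverses` is a hypothetical name.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def candidate_inverses(
    triples: Iterable[Triple],
    minimum_frequency: Optional[float] = None,
    skip_zeros: bool = True,
    skip_self: bool = True,
) -> Dict[Tuple[str, str], float]:
    """Score every ordered pair of relations as a candidate inverse pair."""
    # Group (head, tail) pairs by relation.
    pairs: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for head, relation, tail in triples:
        pairs[relation].add((head, tail))

    scores = {}
    for r1, r2 in product(pairs, repeat=2):
        if skip_self and r1 == r2:
            continue
        # Flip r2's pairs so that an inverse relation lines up with r1.
        flipped = {(tail, head) for head, tail in pairs[r2]}
        union = pairs[r1] | flipped
        score = len(pairs[r1] & flipped) / len(union)
        if skip_zeros and score == 0.0:
            continue
        if minimum_frequency is not None and score < minimum_frequency:
            continue
        scores[r1, r2] = score
    return scores


triples = [
    ("a", "parent_of", "b"),
    ("c", "parent_of", "d"),
    ("b", "child_of", "a"),
    ("d", "child_of", "c"),
    ("a", "married_to", "e"),
    ("e", "married_to", "a"),
]

scores = candidate_inverses(triples)
undirected = candidate_inverses(triples, skip_self=False)
print(scores)
print(undirected[("married_to", "married_to")])
```

With skip_self=False, the undirected `married_to` relation scores 1.0 against itself, while the directed `parent_of` relation scores 0.0 against itself and is dropped by skip_zeros.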

reindex(*triples_factories)

Reindex a set of triples factories.

Return type

List[TriplesFactory]

summarize(training, testing, validation)

Summarize the dataset.

Return type

None

unleak(train, *triples_factories, n=None, minimum_frequency=None)

Unleak training, testing, and validation triples factories.

Parameters
  • train (TriplesFactory) – The target triples factory

  • triples_factories (TriplesFactory) – All other triples factories (test, validate, etc.)

  • n (Union[None, int, float]) – Either the (integer) number of top relations to keep or the (float) percentage of top relations to keep. If None, no relations are filtered by frequency.

  • minimum_frequency (Optional[float]) – The minimum overlap between two relations’ triples to consider them as inverses or duplicates. If None, the default value of 0.97 is used, which is taken from Toutanova and Chen (2015), who originally described the generation of FB15k-237.

Return type

Iterable[TriplesFactory]
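The integer and float forms of n can be illustrated with a small plain-Python sketch. It is not the library's implementation: the helper names are hypothetical, and the handling of a float (a fraction of the distinct relations, rounded down) is an assumption made for this example.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple, Union

Triple = Tuple[str, str, str]


def top_relations(triples: List[Triple], n: Union[int, float]) -> Set[str]:
    """Return the n most frequent relations."""
    counts = Counter(relation for _, relation, _ in triples)
    if isinstance(n, float):
        # Assumption: a float is a fraction of the distinct relations, rounded down.
        n = int(len(counts) * n)
    return {relation for relation, _ in counts.most_common(n)}


def keep_top_relations(
    triples: List[Triple],
    n: Optional[Union[int, float]] = None,
) -> List[Triple]:
    """Keep only triples whose relation is among the top n; None disables filtering."""
    if n is None:
        return list(triples)
    keep = top_relations(triples, n)
    return [triple for triple in triples if triple[1] in keep]


triples = [
    ("a", "likes", "b"),
    ("b", "likes", "c"),
    ("c", "likes", "a"),
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "owns", "x"),
]

print(keep_top_relations(triples, n=2))
print(keep_top_relations(triples, n=0.5))
```

Here n=2 keeps the triples of `likes` and `knows`, while n=0.5 keeps one of the three distinct relations, which is `likes`.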