Sealant

Tools for removing leakage from datasets.

Leakage occurs when the inverse of a training triple appears in the testing or validation set. It generally leads to inflated, misleading evaluation results, because predicting an inverse triple is usually very easy and says little about how well a model generalizes to novel triples.
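The definition above can be illustrated with a minimal, library-free sketch. The function `find_leaked_triples`, the `inverse_of` mapping, and the toy triples are hypothetical names introduced only for this example, and the sketch assumes the inverse relations are already known. It is not how Sealant itself detects them.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def find_leaked_triples(
    train: Iterable[Triple],
    evaluation: Iterable[Triple],
    inverse_of: Dict[str, str],
) -> Set[Triple]:
    """Return evaluation triples whose inverse appears in the training set."""
    train_set = set(train)
    leaked = set()
    for head, relation, tail in evaluation:
        inverse_relation = inverse_of.get(relation)
        # (h, r, t) is leaked if (t, r_inverse, h) was seen during training
        if inverse_relation is not None and (tail, inverse_relation, head) in train_set:
            leaked.add((head, relation, tail))
    return leaked


# Toy data (assumed for illustration only)
train = [("berlin", "capital_of", "germany"), ("paris", "capital_of", "france")]
test = [("germany", "has_capital", "berlin"), ("italy", "has_capital", "rome")]
inverse_of = {"has_capital": "capital_of", "capital_of": "has_capital"}

leaked = find_leaked_triples(train, test, inverse_of)
print(leaked)
```

Here only the first test triple is flagged, since a model that memorized the training triple for Berlin could answer it trivially.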

class Sealant(triples_factory, minimum_frequency=None, symmetric=True)

Stores inverse frequencies and inverse mappings in a given triples factory.

Index the inverse frequencies and the inverse relations in the triples factory.

Parameters:
  • triples_factory (CoreTriplesFactory) – The triples factory to index.

  • minimum_frequency (Optional[float]) – The minimum overlap between two relations’ triples to consider them as inverses. The default value, 0.97, is taken from Toutanova and Chen (2015), who originally described the generation of FB15k-237.

  • symmetric (bool) – Whether the similarities are computed symmetrically.

Raises:

NotImplementedError – If symmetric is False
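To make the role of minimum_frequency more concrete, the following is a rough sketch of how an overlap threshold could flag two relations as inverses. It assumes Jaccard similarity between one relation's (head, tail) pairs and the other relation's flipped pairs as the symmetric overlap measure. That choice, and the helper names `pairs_by_relation`, `jaccard`, and `candidate_inverses`, are assumptions for illustration and are not taken from the library.

```python
from collections import defaultdict
from itertools import combinations


def pairs_by_relation(triples):
    """Group (head, tail) pairs by relation."""
    pairs = defaultdict(set)
    for head, relation, tail in triples:
        pairs[relation].add((head, tail))
    return pairs


def jaccard(a, b):
    """Symmetric overlap between two sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_inverses(triples, minimum_frequency=0.97):
    """Return relation pairs whose flipped-pair overlap meets the threshold."""
    pairs = pairs_by_relation(triples)
    result = {}
    for r1, r2 in combinations(sorted(pairs), 2):
        # Flip r2's pairs so that (h, r1, t) lines up with (t, r2, h)
        flipped = {(tail, head) for head, tail in pairs[r2]}
        score = jaccard(pairs[r1], flipped)
        if score >= minimum_frequency:
            result[(r1, r2)] = score
    return result


# Toy data (assumed for illustration only)
triples = [
    ("a", "parent_of", "b"),
    ("c", "parent_of", "d"),
    ("b", "child_of", "a"),
    ("d", "child_of", "c"),
    ("a", "knows", "c"),
]
print(candidate_inverses(triples))
```

With the toy data, only the pair of child_of and parent_of reaches the 0.97 threshold, because every triple of one relation has its flipped counterpart in the other.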

apply(triples_factory)

Make a new triples factory containing neither duplicate nor inverse relationships.

Return type:

CoreTriplesFactory

Parameters:

triples_factory (CoreTriplesFactory) – The triples factory from which duplicate and inverse relationships are removed.

reindex(*triples_factories)

Reindex a set of triples factories.

Return type:

List[CoreTriplesFactory]

Parameters:

triples_factories (CoreTriplesFactory) – The triples factories to reindex.

unleak(train, *triples_factories, n=None, minimum_frequency=None)

Unleak a set of train, test, and validation triples factories.

Parameters:
  • train (CoreTriplesFactory) – The target triples factory

  • triples_factories (CoreTriplesFactory) – All other triples factories (test, validate, etc.)

  • n (Union[None, int, float]) – Either the (integer) number of top relations to keep or the (float) percentage of top relations to keep. If None, no relations are removed based on frequency.

  • minimum_frequency (Optional[float]) – The minimum overlap between two relations’ triples to consider them as inverses or duplicates. The default value, 0.97, is taken from Toutanova and Chen (2015), who originally described the generation of FB15k-237.

Return type:

Iterable[CoreTriplesFactory]

Returns:

A sequence of reindexed triples factories
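The two accepted forms of the n parameter can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper `top_relations` is hypothetical, and the sketch assumes that a float n is a fraction of the number of distinct relations, rounded down.

```python
from collections import Counter


def top_relations(triples, n):
    """Select the most frequent relations by count (int) or fraction (float)."""
    counts = Counter(relation for _, relation, _ in triples)
    if n is None:
        # No cutoff: every relation is kept
        return set(counts)
    if isinstance(n, float):
        if not 0 < n < 1:
            raise ValueError("a float n must lie strictly between 0 and 1")
        # Assumption: the fraction applies to the number of distinct relations
        n = int(len(counts) * n)
    elif not isinstance(n, int):
        raise TypeError("n must be None, an int, or a float")
    return {relation for relation, _ in counts.most_common(n)}


# Toy data (assumed for illustration only): likes x3, knows x2, owns x1
triples = [
    ("a", "likes", "b"),
    ("a", "likes", "c"),
    ("b", "likes", "c"),
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "owns", "x"),
]

keep = top_relations(triples, 2)
kept = [triple for triple in triples if triple[1] in keep]
print(keep, len(kept))
```

In this sketch the same set of kept relations would be applied to the training triples and to every other split, so that all splits share one relation vocabulary before inverse detection runs.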