NodePieceRepresentation

class NodePieceRepresentation(*, triples_factory, token_representations=None, token_representations_kwargs=None, tokenizers=None, tokenizers_kwargs=None, num_tokens=2, aggregation=None, max_id=None, **kwargs)[source]

Bases: CombinedRepresentation

Basic implementation of node piece decomposition [galkin2021].

\[x_e = agg(\{T[t] \mid t \in tokens(e) \})\]

where \(T\) are the token representations, \(tokens\) selects a fixed number \(k\) of tokens for each entity, and \(agg\) is an aggregation function which combines the individual token representations into a single entity representation.
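As a toy illustration of this formula (not part of the PyKEEN API; the tensor sizes and token indices below are made up), the mean aggregation over the selected token rows can be written directly in torch:

import torch

# hypothetical sizes: a vocabulary of 100 tokens of dimension 8, and k = 2 tokens per entity
T = torch.randn(100, 8)              # token representation matrix T
tokens_of_e = torch.tensor([3, 17])  # tokens(e): the k token indices selected for entity e
x_e = T[tokens_of_e].mean(dim=0)     # agg = mean; entity representation of shape (8,)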

Initialize the representation.

Parameters:
  • triples_factory (CoreTriplesFactory) – the triples factory

  • token_representations (Union[str, Representation, Type[Representation], None, Sequence[Union[str, Representation, Type[Representation], None]]]) – the token representation specification, or pre-instantiated representation module.

  • token_representations_kwargs (Union[Mapping[str, Any], None, Sequence[Optional[Mapping[str, Any]]]]) – additional keyword-based parameters

  • tokenizers (Union[str, Tokenizer, Type[Tokenizer], None, Sequence[Union[str, Tokenizer, Type[Tokenizer], None]]]) – the tokenizer to use, cf. pykeen.nn.node_piece.tokenizer_resolver.

  • tokenizers_kwargs (Union[Mapping[str, Any], None, Sequence[Optional[Mapping[str, Any]]]]) – additional keyword-based parameters passed to the tokenizer upon construction.

  • num_tokens (Union[int, Sequence[int]]) – the number of tokens for each entity.

  • aggregation (Union[None, str, Callable[[FloatTensor, int], FloatTensor]]) –

    aggregation of multiple token representations to a single entity representation. By default, this uses torch.mean(). If a string is provided, the module assumes that it refers to a top-level torch function, e.g., “mean” for torch.mean() or “sum” for torch.sum(). An aggregation can also have trainable parameters, e.g., MLP(mean(MLP(tokens))) (cf. DeepSets from [zaheer2017]). In this case, the module has to be created outside of this component.

    We could also have aggregations which result in differently shaped output, e.g., a concatenation of all token embeddings resulting in shape (num_tokens * d,). In this case, shape must be provided.

    The aggregation takes two arguments: the (batched) tensor of token representations, in shape (*, num_tokens, *dt), and the index along which to aggregate. See the sketch after this parameter list for an example of a custom aggregation callable.

  • max_id (Optional[int]) – only pass this to verify that it equals the number of entities in the triples factory.

  • kwargs – additional keyword-based parameters passed to CombinedRepresentation.__init__()
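
As a minimal sketch of passing a custom aggregation callable with the signature described above (this snippet is not taken from the PyKEEN documentation; the dataset choice, the token shape, and importing NodePieceRepresentation from pykeen.nn are assumptions that may differ across versions):

import torch

from pykeen.datasets import Nations
from pykeen.nn import NodePieceRepresentation

dataset = Nations()

def sum_aggregation(x: torch.FloatTensor, dim: int) -> torch.FloatTensor:
    # x has shape (*, num_tokens, *dt); dim is the token dimension to aggregate over
    return x.sum(dim=dim)

entity_representation = NodePieceRepresentation(
    triples_factory=dataset.training,
    tokenizers="RelationTokenizer",
    num_tokens=2,
    token_representations_kwargs=dict(shape=(64,)),
    aggregation=sum_aggregation,  # or a string such as "mean", resolved to torch.mean
)

Since summing collapses the token dimension without changing the representation shape, no explicit shape has to be passed here; an aggregation that changes the output shape (e.g., concatenation of all tokens) would additionally require shape.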

Methods Summary

estimate_diversity()

Estimate the diversity of the tokens via their hashes.

Methods Documentation

estimate_diversity()[source]

Estimate the diversity of the tokens via their hashes.

Return type:

HashDiversityInfo

Returns:

A ratio information tuple

Tokenization strategies might produce exactly the same hashes for several nodes, depending on the graph structure and tokenization parameters. Identical hashes result in identical node representations and, hence, might hurt downstream performance. This function comes in handy when you need to estimate the diversity of the node hashes produced by a certain tokenization strategy: ideally, you want every node to have a unique hash. The function computes how many node hashes are unique in each representation as well as overall (if all of them are concatenated into a single row). A value of 1.0 means that all nodes have unique hashes.

Example usage:

from pykeen.datasets import FB15k237
from pykeen.models import NodePiece

# any triples factory works here; FB15k237 is used only for illustration
dataset = FB15k237(create_inverse_triples=True)

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    embedding_dim=64,
    interaction="rotate",
    relation_constrainer="complex_normalize",
    entity_initializer="xavier_uniform_",
)
print(model.entity_representations[0].estimate_diversity())