NodePieceRepresentation

class NodePieceRepresentation(*, triples_factory, token_representations=None, token_representations_kwargs=None, tokenizers=None, tokenizers_kwargs=None, num_tokens=2, aggregation=None, max_id=None, **kwargs)[source]

Bases: CombinedRepresentation

Basic implementation of node piece decomposition [galkin2021].

\[x_e = agg(\{T[t] \mid t \in tokens(e) \})\]

where \(T\) are the token representations, \(tokens\) selects a fixed number \(k\) of tokens for each entity, and \(agg\) is an aggregation function which combines the individual token representations into a single entity representation.
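As a toy illustration of this formula (not part of the PyKEEN API; the tensor sizes and token indices below are made up), the mean aggregation over the selected token rows can be written directly in torch:

import torch

# hypothetical sizes: a vocabulary of 100 tokens of dimension 8, and k = 2 tokens per entity
T = torch.randn(100, 8)              # token representation matrix T
tokens_of_e = torch.tensor([3, 17])  # tokens(e): the k token indices selected for entity e
x_e = T[tokens_of_e].mean(dim=0)     # agg = mean; entity representation of shape (8,)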

Initialize the representation.

Parameters:
  • triples_factory (CoreTriplesFactory) – the triples factory

  • token_representations (Union[str, Representation, Type[Representation], None, Sequence[Union[str, Representation, Type[Representation], None]]]) – the token representation specification, or pre-instantiated representation module.

  • token_representations_kwargs (Union[Mapping[str, Any], None, Sequence[Optional[Mapping[str, Any]]]]) – additional keyword-based parameters

  • tokenizers (Union[str, Tokenizer, Type[Tokenizer], None, Sequence[Union[str, Tokenizer, Type[Tokenizer], None]]]) – the tokenizer to use, cf. pykeen.nn.node_piece.tokenizer_resolver.

  • tokenizers_kwargs (Union[Mapping[str, Any], None, Sequence[Optional[Mapping[str, Any]]]]) – additional keyword-based parameters passed to the tokenizer upon construction.

  • num_tokens (Union[int, Sequence[int]]) – the number of tokens for each entity.

  • aggregation (Union[None, str, Callable[[FloatTensor, int], FloatTensor]]) –

    aggregation of multiple token representations to a single entity representation. By default, this uses torch.mean(). If a string is provided, the module assumes that it refers to a top-level torch function, e.g., “mean” for torch.mean() or “sum” for torch.sum(). An aggregation can also have trainable parameters, e.g., MLP(mean(MLP(tokens))) (cf. DeepSets from [zaheer2017]). In this case, the module has to be created outside of this component.

    We could also have aggregations which result in differently shaped output, e.g., a concatenation of all token embeddings resulting in shape (num_tokens * d,). In this case, shape must be provided.

    The aggregation takes two arguments: the (batched) tensor of token representations, in shape (*, num_tokens, *dt), and the index along which to aggregate. See the sketch after this parameter list for an example of a custom aggregation callable.

  • max_id (Optional[int]) – only pass this to verify that it equals the number of entities in the triples factory.

  • kwargs – additional keyword-based parameters passed to CombinedRepresentation.__init__()
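
As a minimal sketch of passing a custom aggregation callable with the signature described above (this snippet is not taken from the PyKEEN documentation; the dataset choice, the token shape, and importing NodePieceRepresentation from pykeen.nn are assumptions that may differ across versions):

import torch

from pykeen.datasets import Nations
from pykeen.nn import NodePieceRepresentation

dataset = Nations()

def sum_aggregation(x: torch.FloatTensor, dim: int) -> torch.FloatTensor:
    # x has shape (*, num_tokens, *dt); dim is the token dimension to aggregate over
    return x.sum(dim=dim)

entity_representation = NodePieceRepresentation(
    triples_factory=dataset.training,
    tokenizers="RelationTokenizer",
    num_tokens=2,
    token_representations_kwargs=dict(shape=(64,)),
    aggregation=sum_aggregation,  # or a string such as "mean", resolved to torch.mean
)

Since summing collapses the token dimension without changing the representation shape, no explicit shape has to be passed here; an aggregation that changes the output shape (e.g., concatenation of all tokens) would additionally require shape.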

Methods Summary

estimate_diversity()

Estimate the diversity of the tokens via their hashes.

Methods Documentation

estimate_diversity()[source]

Estimate the diversity of the tokens via their hashes.

Return type:

HashDiversityInfo

Returns:

A ratio information tuple

Tokenization strategies might produce exactly the same hashes for several nodes, depending on the graph structure and tokenization parameters. Identical hashes result in identical node representations and, hence, might hurt downstream performance. This function comes in handy when you need to estimate the diversity of the node hashes produced by a certain tokenization strategy: ideally, you want every node to have a unique hash. The function computes how many node hashes are unique in each representation as well as overall (if all of them are concatenated into a single row). A value of 1.0 means that all nodes have unique hashes.

Example usage:

from pykeen.datasets import FB15k237
from pykeen.models import NodePiece

# any triples factory works here; FB15k237 is used only for illustration
dataset = FB15k237(create_inverse_triples=True)

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    embedding_dim=64,
    interaction="rotate",
    relation_constrainer="complex_normalize",
    entity_initializer="xavier_uniform_",
)
print(model.entity_representations[0].estimate_diversity())