NodePieceRepresentation

class NodePieceRepresentation(*, triples_factory, token_representations=None, token_representations_kwargs=None, tokenizers=None, tokenizers_kwargs=None, num_tokens=2, aggregation=None, max_id=None, shape=None, **kwargs)[source]

Bases: pykeen.nn.representation.Representation

Basic implementation of node piece decomposition [galkin2021].

\[x_e = agg(\{T[t] \mid t \in tokens(e) \})\]

where \(T\) are the token representations, \(tokens\) selects a fixed number \(k\) of tokens for each entity, and \(agg\) is an aggregation function, which combines the individual token representations into a single entity representation.
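As a minimal, self-contained sketch of this formula (the sizes are made up for illustration; this is not the module's actual internals), the default mean aggregation reduces over the token dimension:

import torch

num_tokens_total, d, k, num_entities = 100, 64, 2, 5
T = torch.randn(num_tokens_total, d)  # token representations T
tokens = torch.randint(num_tokens_total, (num_entities, k))  # tokens(e)
x = T[tokens].mean(dim=-2)  # agg = mean over the token dimension
assert x.shape == (num_entities, d)  # one representation per entity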

Note

This implementation currently only supports representation of entities by bag-of-relations.

Initialize the representation.

Parameters
  • triples_factory (CoreTriplesFactory) – the triples factory

  • token_representations (Union[str, Representation, Type[Representation], None, Sequence[Union[str, Representation, Type[Representation], None]]]) – the token representation specification, or pre-instantiated representation module.

  • token_representations_kwargs (Union[Mapping[str, Any], None, Sequence[Optional[Mapping[str, Any]]]]) – additional keyword-based parameters

  • tokenizers (Union[str, Tokenizer, Type[Tokenizer], None, Sequence[Union[str, Tokenizer, Type[Tokenizer], None]]]) – the tokenizer to use, cf. pykeen.nn.node_piece.tokenizer_resolver.

  • tokenizers_kwargs (Union[Mapping[str, Any], None, Sequence[Optional[Mapping[str, Any]]]]) – additional keyword-based parameters passed to the tokenizer upon construction.

  • num_tokens (Union[int, Sequence[int]]) – the number of tokens for each entity.

  • aggregation (Union[None, str, Callable[[FloatTensor, int], FloatTensor]]) –

aggregation of multiple token representations into a single entity representation. By default, this uses torch.mean(). If a string is provided, the module assumes that this refers to a top-level torch function, e.g., “mean” for torch.mean() or “sum” for torch.sum(). An aggregation can also have trainable parameters, e.g., MLP(mean(MLP(tokens))) (cf. DeepSets from [zaheer2017]). In this case, the module has to be created outside of this component (see the sketch below, after this parameter list).

Aggregations may also produce differently shaped output, e.g., a concatenation of all token embeddings resulting in shape (num_tokens * d,). In this case, shape must be provided.

    The aggregation takes two arguments: the (batched) tensor of token representations, in shape (*, num_tokens, *dt), and the index along which to aggregate.

shape (Optional[Sequence[int]]) – the shape of an individual representation. Only necessary if the aggregation results in a change of dimensions, which will only be the case for an ad hoc aggregation function.

max_id (Optional[int]) – only pass this to verify that it matches the number of entities in the triples factory.

kwargs – additional keyword-based parameters passed to Representation.__init__.

Raises

ValueError – if the shapes for any vocabulary entry in all token representations are inconsistent
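As a sketch of the trainable-aggregation case mentioned above (the Nations dataset, the 64-dimensional sizes, and the DeepSetAggregation class are illustrative choices, not part of the API):

import torch
from torch import nn

from pykeen.datasets import Nations
from pykeen.nn.representation import NodePieceRepresentation

class DeepSetAggregation(nn.Module):
    """MLP(mean(MLP(tokens))), cf. DeepSets; created outside of the component."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.inner = nn.Linear(dim, dim)
        self.outer = nn.Linear(dim, dim)

    def forward(self, xs: torch.FloatTensor, dim: int) -> torch.FloatTensor:
        # xs has shape (*, num_tokens, d); reduce over the token dimension
        return self.outer(self.inner(xs).mean(dim=dim))

dataset = Nations()
representation = NodePieceRepresentation(
    triples_factory=dataset.training,
    token_representations_kwargs=dict(shape=(64,)),
    num_tokens=2,
    aggregation=DeepSetAggregation(dim=64),
)

Since this aggregation preserves the token dimensionality, shape does not need to be passed; a concatenation-style aggregation returning (num_tokens * d,) would require an explicit shape.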

Methods Summary

estimate_diversity()

Estimate the diversity of the tokens via their hashes.

extra_repr()

Set the extra representation of the module.

Methods Documentation

estimate_diversity()[source]

Estimate the diversity of the tokens via their hashes.

Return type

HashDiversityInfo

Returns

A ratio information tuple

Tokenization strategies might produce exactly the same hashes for several nodes, depending on the graph structure and tokenization parameters. Identical hashes result in identical node representations and, hence, might hurt downstream performance. This function comes in handy when you need to estimate the diversity of the built node hashes under a certain tokenization strategy; ideally, you would want every node to have a unique hash. The function computes how many node hashes are unique in each representation and overall (if we concatenate all of them into a single row). A value of 1.0 means that all nodes have unique hashes.

Example usage:

from pykeen.datasets import CoDExSmall
from pykeen.models import NodePiece

# any dataset works here; CoDEx-small is used for illustration
dataset = CoDExSmall()

model = NodePiece(
    triples_factory=dataset.training,
    tokenizers=["AnchorTokenizer", "RelationTokenizer"],
    num_tokens=[20, 12],
    embedding_dim=64,
    interaction="rotate",
    relation_constrainer="complex_normalize",
    entity_initializer="xavier_uniform_",
)
print(model.entity_representations[0].estimate_diversity())

extra_repr()[source]

Set the extra representation of the module.

To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.

Return type

str
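For instance, a custom module might override it like this (a generic torch.nn.Module sketch, unrelated to the NodePiece internals):

from torch import nn

class TokenCount(nn.Module):
    def __init__(self, num_tokens: int = 2):
        super().__init__()
        self.num_tokens = num_tokens

    def extra_repr(self) -> str:
        # shown inside repr(module), e.g., TokenCount(num_tokens=2)
        return f"num_tokens={self.num_tokens}"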