NodePieceRepresentation
- class NodePieceRepresentation(*, triples_factory: CoreTriplesFactory, token_representations: str | Representation | type[Representation] | None | Sequence[str | Representation | type[Representation] | None] = None, token_representations_kwargs: Mapping[str, Any] | None | Sequence[Mapping[str, Any] | None] = None, tokenizers: str | Tokenizer | type[Tokenizer] | None | Sequence[str | Tokenizer | type[Tokenizer] | None] = None, tokenizers_kwargs: Mapping[str, Any] | None | Sequence[Mapping[str, Any] | None] = None, num_tokens: int | Sequence[int] = 2, aggregation: None | str | Callable[[Tensor, int], Tensor] = None, aggregation_kwargs: Mapping[str, Any] | None = None, max_id: int | None = None, **kwargs)[source]
Bases:
CombinedRepresentation
Basic implementation of NodePiece decomposition [galkin2021].
\[x_e = \textit{agg}(\{T[t] \mid t \in \textit{tok}(e)\})\]
where \(T\) are the token representations, \(\textit{tok}\) selects a fixed number \(k\) of tokens for each index, and \(\textit{agg}\) is an aggregation function which combines the individual token representations into a single representation.
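For illustration, a minimal sketch of this decomposition in plain torch (the tensor names and sizes are hypothetical), using mean aggregation:
import torch

# token representations T: one vector per token in the vocabulary
vocab_size, num_entities, k, dim = 10, 5, 2, 4
T = torch.randn(vocab_size, dim)
# tok: k token indices per entity
tok = torch.randint(vocab_size, size=(num_entities, k))
# aggregate the selected token representations -> shape (num_entities, dim)
x = T[tok].mean(dim=-2)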
Initialize the representation.
- Parameters:
triples_factory (CoreTriplesFactory) – The triples factory, required for tokenization.
token_representations (str | Representation | type[Representation] | None | Sequence[str | Representation | type[Representation] | None]) – The token representation specification, or pre-instantiated representation module.
token_representations_kwargs (Mapping[str, Any] | None | Sequence[Mapping[str, Any] | None]) – Additional keyword-based parameters.
tokenizers (str | Tokenizer | type[Tokenizer] | None | Sequence[str | Tokenizer | type[Tokenizer] | None]) – The tokenizer to use.
tokenizers_kwargs (Mapping[str, Any] | None | Sequence[Mapping[str, Any] | None]) – Additional keyword-based parameters passed to the tokenizer upon construction.
num_tokens (int | Sequence[int]) – The number of tokens for each entity.
aggregation (None | str | Callable[[Tensor, int], Tensor]) – Aggregation of multiple token representations to a single entity representation. By default, this uses torch.mean(). If a string is provided, the module assumes that it refers to a top-level torch function, e.g. “mean” for torch.mean() or “sum” for torch.sum(). An aggregation can also have trainable parameters, e.g., MLP(mean(MLP(tokens))) (cf. DeepSets from [zaheer2017]); in this case, the module has to be created outside of this component. We could also have aggregations which result in differently shaped outputs, e.g. a concatenation of all token embeddings resulting in shape (num_tokens * d,); in this case, shape must be provided. The aggregation takes two arguments: the (batched) tensor of token representations, of shape (*, num_tokens, *dt), and the dimension along which to aggregate. A sketch of a custom aggregation callable is shown after this parameter list.
aggregation_kwargs (Mapping[str, Any] | None) – Additional keyword-based parameters.
max_id (int | None) – Only pass this to check that the number of entities in the triples factory is the same.
kwargs – Additional keyword-based parameters passed to CombinedRepresentation.
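As a sketch of the callable form (the function name and weighting scheme below are hypothetical, not part of PyKEEN), an aggregation only needs to accept the token tensor and the dimension along which to aggregate:
import torch
from torch import Tensor

def softmax_weighted_mean(x: Tensor, dim: int) -> Tensor:
    """Aggregate token representations with a softmax-weighted mean along ``dim``."""
    # weight each token by the softmax of its L2 norm along the token dimension
    weights = torch.softmax(x.norm(dim=-1, keepdim=True), dim=dim)
    return (weights * x).sum(dim=dim)

# such a callable could then be passed as aggregation=softmax_weighted_mean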
Note
3 resolvers are used in this function.
- The parameter pair (token_representations, token_representations_kwargs) is used for pykeen.nn.representation_resolver
- The parameter pair (tokenizers, tokenizers_kwargs) is used for pykeen.nn.node_piece.tokenizer_resolver
- The parameter pair (aggregation, aggregation_kwargs) is used for class_resolver.contrib.torch.aggregation_resolver
An explanation of resolvers and how to use them is given in https://class-resolver.readthedocs.io/en/latest/.
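For instance, a minimal construction sketch passing one value to each resolver pair (the import path, dataset, and keyword values are illustrative assumptions, not prescribed by the API):
from pykeen.datasets import Nations
from pykeen.nn import NodePieceRepresentation

dataset = Nations()
entity_representation = NodePieceRepresentation(
    triples_factory=dataset.training,
    tokenizers="RelationTokenizer",  # resolved via tokenizer_resolver
    num_tokens=12,
    token_representations_kwargs=dict(shape=(32,)),  # passed to the resolved token representation
    aggregation="mean",  # resolved to torch.mean
)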
Methods Summary
estimate_diversity() – Estimate the diversity of the tokens via their hashes.
Methods Documentation
- estimate_diversity() → HashDiversityInfo [source]
Estimate the diversity of the tokens via their hashes.
- Returns:
A ratio information tuple
- Return type:
HashDiversityInfo
Tokenization strategies might produce exactly the same hashes for several nodes, depending on the graph structure and tokenization parameters. Identical hashes will result in identical node representations and, hence, might inhibit downstream performance. This function comes in handy when you need to estimate the diversity of the built node hashes under a certain tokenization strategy - ideally, you'd want every node to have a unique hash. The function computes how many node hashes are unique in each representation and overall (if we concatenate all of them into a single row). 1.0 means that all nodes have unique hashes.
Example usage:
"""Estimating token diversity for NodePiece.""" from pykeen.datasets import CoDExSmall from pykeen.models import NodePiece dataset = CoDExSmall(create_inverse_triples=True) model = NodePiece( triples_factory=dataset.training, tokenizers=["AnchorTokenizer", "RelationTokenizer"], num_tokens=[20, 12], embedding_dim=8, interaction="distmult", entity_initializer="xavier_uniform_", ) print(model.entity_representations[0].estimate_diversity())