In PyKEEN, a
pykeen.nn.representation.Representation is used to map
integer indices to numeric representations. A simple example is the
pykeen.nn.representation.Embedding class, where the mapping is a simple
lookup. However, more advanced representation modules are available, too.
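To make the index-to-vector mapping concrete, the following is a minimal pure-Python sketch of a lookup table. It does not use PyKEEN: the class name `LookupRepresentation` and its attributes are hypothetical and only illustrate the idea of a simple lookup.

```python
import random
from typing import List, Sequence


class LookupRepresentation:
    """Toy stand-in for an embedding: one stored vector per integer index."""

    def __init__(self, max_id: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.max_id = max_id
        self.dim = dim
        # one row of `dim` randomly initialized values per index
        self.weight = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(max_id)
        ]

    def __call__(self, indices: Sequence[int]) -> List[List[float]]:
        """Return the stored vector for each requested index."""
        return [self.weight[i] for i in indices]


representation = LookupRepresentation(max_id=5, dim=3)
# the same index always maps to the same vector
vectors = representation([0, 4, 0])
```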
Message passing representation modules enrich the representations of
entities by aggregating the information from their graph neighborhood.
Example implementations from PyKEEN include
pykeen.nn.representation.RGCNRepresentation, which uses RGCN layers for
enrichment, and representations which enrich via CompGCN layers.
Another way to utilize message passing is via modules which allow using
the message passing layers from PyTorch Geometric to enrich base
representations.
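As a rough illustration of neighborhood aggregation, the sketch below performs a single round of mean aggregation in pure Python. It is a simplification under stated assumptions: edges are treated as undirected, relation types are ignored, and a self-loop is added. RGCN and CompGCN additionally use relation-specific transformations, which are omitted here. The function name `mean_aggregate` is hypothetical.

```python
from typing import Dict, List, Tuple


def mean_aggregate(
    base: Dict[int, List[float]],
    edges: List[Tuple[int, int]],
) -> Dict[int, List[float]]:
    """Replace each node vector by the mean over itself and its neighbors."""
    # start with a self-loop so that a node keeps part of its own information
    neighbors = {node: [node] for node in base}
    for source, target in edges:
        # simplification: treat every edge as undirected
        neighbors[source].append(target)
        neighbors[target].append(source)
    enriched = {}
    for node, nbrs in neighbors.items():
        dim = len(base[node])
        enriched[node] = [
            sum(base[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
        ]
    return enriched


base = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = [(0, 1), (1, 2)]
enriched = mean_aggregate(base, edges)
```

After one round, the vector of node 0 already mixes in information from node 1; stacking several rounds would propagate information over longer paths.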
Since knowledge graphs may contain a large number of entities, having independent trainable embeddings for each of them may result in an excessive number of trainable parameters. Therefore, methods have been developed that do not learn independent representations, but instead maintain a set of base representations and create individual representations by combining them.
A simple method to reduce the number of parameters is to use a low-rank
decomposition of the embedding matrix, as implemented in
pykeen.nn.representation.LowRankEmbeddingRepresentation. Here, each
representation is a linear combination of shared base representations.
Typically, the number of bases is chosen smaller than the dimension of
each base representation.
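The parameter saving can be checked with a small pure-Python sketch. It is not PyKEEN's implementation: the helper `combine` and the sizes (1000 entities, 8 bases, dimension 64) are assumptions chosen for illustration.

```python
import random
from typing import List

Matrix = List[List[float]]


def combine(weights: Matrix, bases: Matrix) -> Matrix:
    """Build each row as a linear combination of the shared bases."""
    dim = len(bases[0])
    return [
        [sum(w * base[d] for w, base in zip(row, bases)) for d in range(dim)]
        for row in weights
    ]


rng = random.Random(0)
num_entities, num_bases, dim = 1000, 8, 64  # illustrative sizes
bases = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bases)]
weights = [
    [rng.gauss(0.0, 1.0) for _ in range(num_bases)] for _ in range(num_entities)
]
representations = combine(weights, bases)

# parameters of a full embedding matrix vs. the low-rank decomposition
full_parameters = num_entities * dim
low_rank_parameters = num_entities * num_bases + num_bases * dim
```

With these sizes, the full embedding matrix has 64,000 parameters, while the decomposition needs 8,512 (8,000 combination weights plus 512 base entries).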
Another example is NodePiece, which takes inspiration
from tokenization as encountered in, e.g., NLP, and represents each entity
as a set of tokens. The implementation in PyKEEN,
pykeen.nn.representation.NodePieceRepresentation, implements a simple yet
effective variant thereof, which uses a set of randomly chosen incident
relations (including inverse relations) as tokens.
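The relation-based tokenization can be sketched in pure Python as follows. This is an illustrative approximation, not the PyKEEN implementation: the function name `relation_tokens`, the `_inverse` suffix, and the toy triples are assumptions. Entities with fewer incident relations than requested simply receive fewer tokens in this sketch.

```python
import random
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def relation_tokens(
    triples: List[Triple], num_tokens: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Pick up to `num_tokens` random incident relations per entity."""
    incident: Dict[str, Set[str]] = {}
    for head, relation, tail in triples:
        # the head sees the relation itself, the tail sees its inverse
        incident.setdefault(head, set()).add(relation)
        incident.setdefault(tail, set()).add(relation + "_inverse")
    rng = random.Random(seed)
    tokens = {}
    for entity, relations in incident.items():
        candidates = sorted(relations)  # sort for reproducible sampling
        tokens[entity] = rng.sample(candidates, min(num_tokens, len(candidates)))
    return tokens


# toy triples for illustration only
triples = [
    ("brazil", "exports3", "uk"),
    ("uk", "embassy", "brazil"),
    ("brazil", "treaties", "usa"),
]
tokens = relation_tokens(triples, num_tokens=2)
```

Each entity's representation would then be computed from the representations of its tokens, so that trainable parameters are only needed for relations rather than for every entity.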
Text-based representations use the entities’ (or relations’) labels to
derive representations. To this end,
pykeen.nn.representation.TextRepresentation uses a
(pre-trained) transformer model from the
transformers library to encode
the labels. Since the transformer models have been trained on huge corpora
of text, their text encodings often contain semantic information, i.e.,
labels with similar semantic meaning get similar representations. While we
can also benefit from these strong features by just initializing a
pykeen.nn.representation.Embedding with the vectors,
pykeen.nn.representation.TextRepresentation includes the
transformer model as part of the KGE model, and thus allows fine-tuning
the language model for the KGE task. This is beneficial, e.g., since it
provides a simple way to obtain an inductive model, which can make
predictions for entities not seen during training.
```python
from pykeen.datasets import get_dataset
from pykeen.models import ERModel
from pykeen.nn import TextRepresentation
from pykeen.pipeline import pipeline

dataset = get_dataset(dataset="nations")
# encode the entity labels with a (pre-trained) transformer model
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
)
result = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="ermlpe",
        interaction_kwargs=dict(
            # shape is a tuple; the interaction expects an integer dimension
            embedding_dim=entity_representations.shape[0],
        ),
        entity_representations=entity_representations,
        relation_representations_kwargs=dict(
            shape=entity_representations.shape,
        ),
    ),
    training_kwargs=dict(
        num_epochs=1,
    ),
)
model = result.model
```
We can use the label-encoder part to generate representations for unknown entities with labels. For instance, “uk” is an entity in nations, but we can also put in “united kingdom” and get a roughly equivalent vector representation:
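```python
# the model stores its representations in a list; take the (only) entity representation
entity_representation = model.entity_representations[0]
label_encoder = entity_representation.encoder
uk, united_kingdom = label_encoder(labels=["uk", "united kingdom"])
```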
Thus, if we put the resulting representations into the interaction function, we get similar scores:
```python
import torch

# true triple from train: ['brazil', 'exports3', 'uk']
relation_representation = model.relation_representations[0]
h_repr = entity_representation.get_in_more_canonical_shape(
    dim="h",
    indices=torch.as_tensor(dataset.entity_to_id["brazil"]).view(1),
)
r_repr = relation_representation.get_in_more_canonical_shape(
    dim="r",
    indices=torch.as_tensor(dataset.relation_to_id["exports3"]).view(1),
)
# score the same head and relation against both tail encodings
scores = model.interaction(
    h=h_repr,
    r=r_repr,
    t=torch.stack([uk, united_kingdom]),
)
print(scores)
```
As a downside, including the transformer model in the KGE model will usually substantially increase the computational cost of computing triple scores.
If your dataset is labeled with compact uniform resource identifiers (CURIEs)
for biomedical entities like chemicals, proteins, diseases, and pathways, then
a CURIE-based text representation can make use of
pyobo to look up the name for each CURIE via the
pyobo.get_name() function, and then encode the names using the text encoder.
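For readers unfamiliar with CURIEs, the sketch below shows the general shape of the name lookup in pure Python. It deliberately does not call pyobo: `parse_curie`, `lookup_name`, and the in-memory table `TOY_NAMES` are hypothetical stand-ins, and lowercasing the prefix is an assumption made for this example.

```python
from typing import Dict, Optional, Tuple


def parse_curie(curie: str) -> Tuple[str, str]:
    """Split a CURIE of the form ``prefix:identifier`` into its two parts."""
    prefix, separator, identifier = curie.partition(":")
    if not separator or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    # assumption: normalize the prefix to lowercase
    return prefix.lower(), identifier


# toy lookup table standing in for a real name service
TOY_NAMES: Dict[Tuple[str, str], str] = {("chebi", "15377"): "water"}


def lookup_name(
    curie: str, names: Dict[Tuple[str, str], str] = TOY_NAMES
) -> Optional[str]:
    """Return the name for a CURIE, or None if it is unknown."""
    return names.get(parse_curie(curie))


name = lookup_name("CHEBI:15377")
```

The resulting names would then be passed to the text encoder in place of the raw identifiers, which carry little semantic information on their own.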
Unfortunately, at the time of adding this representation, none of the biomedical knowledge graphs in PyKEEN use CURIEs for referencing biomedical entities. We hope this will change in the future.
To learn more about CURIEs, please take a look at the Bioregistry and this blog post on CURIEs.