Representations
In PyKEEN, a pykeen.nn.representation.Representation is used to map integer indices to numeric representations. A simple example is the pykeen.nn.representation.Embedding class, where the mapping is a simple lookup. However, more advanced representation modules are available, too.
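For instance, a minimal lookup with pykeen.nn.representation.Embedding might look as follows (the sizes below are chosen only for illustration):

import torch
from pykeen.nn.representation import Embedding

# a trainable lookup table mapping 10 indices to 3-dimensional vectors
embedding = Embedding(max_id=10, shape=(3,))

# retrieve the representations for indices 0, 2, and 4
x = embedding(indices=torch.as_tensor([0, 2, 4]))
print(x.shape)  # torch.Size([3, 3])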
Message Passing
Message passing representation modules enrich the representations of entities by aggregating information from their graph neighborhood. Example implementations from PyKEEN include pykeen.nn.representation.RGCNRepresentation, which uses R-GCN layers for enrichment, and pykeen.nn.representation.SingleCompGCNRepresentation, which enriches via CompGCN layers. Another way to utilize message passing is via the modules provided in pykeen.nn.pyg, which make it possible to use message passing layers from PyTorch Geometric to enrich base representations.
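As a minimal sketch of the R-GCN variant (assuming the representation is constructed from a triples factory that provides the graph to propagate over, with defaults for all other settings):

from pykeen.datasets import get_dataset
from pykeen.nn.representation import RGCNRepresentation

dataset = get_dataset(dataset="nations")

# entity representations enriched by R-GCN message passing
# over the training triples
entities = RGCNRepresentation(triples_factory=dataset.training)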
Decomposition
Since knowledge graphs may contain a large number of entities, learning an independent trainable embedding for each of them can result in an excessive number of trainable parameters. Therefore, methods have been developed that do not learn independent representations, but rather maintain a set of base representations and create individual representations by combining them.
Low-Rank Factorization
A simple method to reduce the number of parameters is to use a low-rank decomposition of the embedding matrix, as implemented in pykeen.nn.representation.LowRankEmbeddingRepresentation. Here, each representation is a linear combination of shared base representations. Typically, the number of bases is chosen to be smaller than the dimension of each base representation.
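A minimal sketch might look as follows (the sizes are hypothetical, and we assume the constructor accepts max_id, shape, and num_bases; see the class documentation for the exact signature):

from pykeen.nn.representation import LowRankEmbeddingRepresentation

# each of the 1000 representations is a learned linear combination
# of 16 shared 64-dimensional base representations
representation = LowRankEmbeddingRepresentation(
    max_id=1000,
    shape=(64,),
    num_bases=16,
)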
NodePiece
Another example is NodePiece, which takes inspiration from tokenization as encountered in, e.g., NLP, and represents each entity as a set of tokens. The implementation in PyKEEN, pykeen.nn.representation.NodePieceRepresentation, implements a simple yet effective variant thereof, which uses a set of randomly chosen incident relations (including inverse relations) as tokens.
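As a minimal sketch (assuming construction from a triples factory; num_tokens below is a hypothetical choice for illustration):

from pykeen.datasets import get_dataset
from pykeen.nn.representation import NodePieceRepresentation

dataset = get_dataset(dataset="nations")

# tokenize each entity by a small sample of its incident relations
entities = NodePieceRepresentation(
    triples_factory=dataset.training,
    num_tokens=2,
)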
Text-based
Text-based representations use the entities’ (or relations’) labels to derive representations. To this end, pykeen.nn.representation.TextRepresentation uses a (pre-trained) transformer model from the transformers library to encode the labels. Since the transformer models have been trained on huge corpora of text, their text encodings often contain semantic information, i.e., labels with similar semantic meaning get similar representations. While we can also benefit from these strong features by just initializing a pykeen.nn.representation.Embedding with the vectors, e.g., using pykeen.nn.init.LabelBasedInitializer, the pykeen.nn.representation.TextRepresentation includes the transformer model as part of the KGE model, and thus allows fine-tuning the language model for the KGE task. This is beneficial, e.g., since it offers a simple way to obtain an inductive model, which can make predictions for entities not seen during training.
from pykeen.pipeline import pipeline
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation
from pykeen.models import ERModel

dataset = get_dataset(dataset="nations")
# encode entity labels with a (pre-trained) transformer
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
)
result = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="ermlpe",
        interaction_kwargs=dict(
            embedding_dim=entity_representations.shape[0],
        ),
        entity_representations=entity_representations,
        relation_representations_kwargs=dict(
            shape=entity_representations.shape,
        ),
    ),
    training_kwargs=dict(
        num_epochs=1,
    ),
)
model = result.model
model = result.model
We can use the label-encoder part to generate representations for unknown entities with labels. For instance, “uk” is an entity in Nations, but we can also put in “united kingdom” and get a roughly equivalent vector representation:
# the label encoder is the transformer-based part of the representation
entity_representation = model.entity_representations[0]
label_encoder = entity_representation.encoder
# encode both labels; the second was not seen during training
uk, united_kingdom = label_encoder(labels=["uk", "united kingdom"])
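To illustrate, we can compare the two encodings, e.g., via cosine similarity (assuming the encoder returns one vector per label):

import torch

# similar labels should yield similar encodings
similarity = torch.nn.functional.cosine_similarity(uk, united_kingdom, dim=0)
print(similarity)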
Thus, if we put the resulting representations into the interaction function, we get similar scores:
import torch

# true triple from train: ['brazil', 'exports3', 'uk']
relation_representation = model.relation_representations[0]
h_repr = entity_representation.get_in_more_canonical_shape(
    dim="h",
    indices=torch.as_tensor(dataset.entity_to_id["brazil"]).view(1),
)
r_repr = relation_representation.get_in_more_canonical_shape(
    dim="r",
    indices=torch.as_tensor(dataset.relation_to_id["exports3"]).view(1),
)
# score the known tail "uk" and the unseen label "united kingdom"
scores = model.interaction(
    h=h_repr,
    r=r_repr,
    t=torch.stack([uk, united_kingdom]),
)
print(scores)
As a downside, this will usually substantially increase the computational cost of computing triple scores.
Biomedical Entities
If your dataset is labeled with compact uniform resource identifiers (CURIEs) for biomedical entities like chemicals, proteins, diseases, and pathways, then the pykeen.nn.representation.BiomedicalCURIERepresentation representation can make use of pyobo to look up names by CURIE via the pyobo.get_name() function, and then encode them using the text encoder.
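As a hedged sketch (the CURIE labels and constructor arguments below are assumptions for illustration; see the class documentation for the actual signature):

from pykeen.nn.representation import BiomedicalCURIERepresentation

# hypothetical CURIEs; names are resolved via pyobo.get_name()
# and then encoded with the text encoder
entities = BiomedicalCURIERepresentation(
    labels=["hgnc:1097", "chebi:15365"],
    encoder="transformer",
)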
Unfortunately, at the time of adding this representation, none of the biomedical knowledge graphs in PyKEEN use CURIEs for referencing biomedical entities. We hope this will change in the future.
To learn more about CURIEs, please take a look at the Bioregistry and this blog post on CURIEs.