CharacterEmbeddingTextEncoder
- class CharacterEmbeddingTextEncoder(dim: int = 32, character_representation: str | Representation | type[Representation] | None = None, vocabulary: str = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', aggregation: str | Callable[..., Tensor] | None = None)[source]
Bases: TextEncoder
A simple character-based text encoder.
This encoder uses base representations for each character from a given alphabet, as well as two special tokens for unknown characters and padding. To encode a sentence, it converts the sentence to a sequence of characters, obtains the individual character representations, and aggregates these representations into a single one.
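For intuition, the following is a minimal sketch of this pipeline in plain PyTorch, not pykeen's actual implementation; the helper name encode and the choice of mean aggregation here are illustrative assumptions:

import string

import torch
from torch import nn

# Hypothetical sketch: map characters to indices (reserving extra slots for
# the unknown and padding tokens), look up embeddings, and aggregate.
vocabulary = string.printable  # same characters as the default vocabulary above
char_to_index = {c: i for i, c in enumerate(vocabulary)}
unknown_index = len(vocabulary)  # reserved index for out-of-vocabulary characters
embedding = nn.Embedding(num_embeddings=len(vocabulary) + 2, embedding_dim=32)

def encode(text: str) -> torch.Tensor:
    indices = torch.as_tensor([char_to_index.get(c, unknown_index) for c in text])
    # mean aggregation is one simple choice; see the bag-of-characters note below
    return embedding(indices).mean(dim=0)

assert encode("seal").shape == (32,)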
With pykeen.nn.representation.Embedding character representation and torch.mean() aggregation, this encoder is similar to a bag-of-characters model with trainable character embeddings. Therefore, it is invariant to the ordering of characters:

>>> from pykeen.nn.text import CharacterEmbeddingTextEncoder
>>> encoder = CharacterEmbeddingTextEncoder()
>>> import torch
>>> torch.allclose(encoder("seal"), encoder("sale"))
True
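This invariance is a direct consequence of mean pooling, which is permutation-invariant: the mean of the character embeddings does not depend on their order. An order-sensitive aggregation callable (e.g., one involving a recurrent or attention layer) would presumably not share this property.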
Initialize the encoder.
- Parameters:
dim (int) – the embedding dimension
character_representation (str | Representation | type[Representation] | None) – the character representation or a hint thereof
vocabulary (str) – the vocabulary, i.e., the allowed characters
aggregation (str | Callable[..., Tensor] | None) – the aggregation to use to pool the character embeddings
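As a hedged usage sketch, the encoder can be constructed with non-default parameters; the values below are arbitrary illustrations, and characters outside the restricted vocabulary would map to the unknown token:

from pykeen.nn.text import CharacterEmbeddingTextEncoder

# Illustrative construction; parameter values are arbitrary examples.
encoder = CharacterEmbeddingTextEncoder(
    dim=16,  # smaller embedding dimension than the default of 32
    vocabulary="abcdefghijklmnopqrstuvwxyz ",  # lowercase letters and space only
)
vector = encoder("hello world")  # encode a single string, as in the example above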
Methods Summary
forward_normalized(texts) – Encode a batch of text.
Methods Documentation