CharacterEmbeddingTextEncoder

class CharacterEmbeddingTextEncoder(dim: int = 32, character_representation: str | Representation | type[Representation] | None = None, vocabulary: str = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c', aggregation: str | Callable[[...], Tensor] | None = None)[source]

Bases: TextEncoder

A simple character-based text encoder.

This encoder uses base representations for each character from a given alphabet, as well as two special tokens for unknown characters and for padding. To encode a sentence, it converts it to a sequence of characters, obtains the individual character representations, and aggregates these representations into a single one.
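For intuition, this pipeline can be sketched in plain PyTorch. This is a minimal illustration, not PyKEEN's actual implementation; the names char_to_id, UNK_ID, PAD_ID, and encode are made up for this sketch:

>>> import torch
>>> from torch import nn
>>> vocabulary = "abcdefghijklmnopqrstuvwxyz"
>>> char_to_id = {c: i for i, c in enumerate(vocabulary)}
>>> UNK_ID = len(vocabulary)      # special token for out-of-vocabulary characters
>>> PAD_ID = len(vocabulary) + 1  # special token for batch padding (unused in this single-text sketch)
>>> embeddings = nn.Embedding(num_embeddings=len(vocabulary) + 2, embedding_dim=32)
>>> def encode(text: str) -> torch.Tensor:
...     # 1. convert the text to a sequence of character IDs (unknown characters map to UNK_ID)
...     ids = torch.as_tensor([char_to_id.get(c, UNK_ID) for c in text])
...     # 2. look up the individual character representations, shape: (len(text), 32)
...     x = embeddings(ids)
...     # 3. aggregate them into a single representation, shape: (32,)
...     return x.mean(dim=0)
>>> encode("seal").shape
torch.Size([32])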

With a pykeen.nn.representation.Embedding character representation and torch.mean() aggregation, this encoder resembles a bag-of-characters model with trainable character embeddings. It is therefore invariant to the ordering of characters:

>>> import torch
>>> from pykeen.nn.text import CharacterEmbeddingTextEncoder
>>> encoder = CharacterEmbeddingTextEncoder()
>>> torch.allclose(encoder("seal"), encoder("sale"))
True

Initialize the encoder.

Parameters:
  • dim (int) – the embedding dimension

  • character_representation (str | Representation | type[Representation] | None) – the character representation or a hint thereof

  • vocabulary (str) – the vocabulary, i.e., the allowed characters

  • aggregation (str | Callable[[...], Tensor] | None) – the aggregation to use to pool the character embeddings
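For example, one can change the embedding dimension and restrict the allowed characters. This is a usage sketch based only on the signature above; the chosen values are illustrative:

>>> from pykeen.nn.text import CharacterEmbeddingTextEncoder
>>> # use a larger embedding dimension and only lowercase letters as vocabulary
>>> encoder = CharacterEmbeddingTextEncoder(dim=64, vocabulary="abcdefghijklmnopqrstuvwxyz")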

Methods Summary

forward_normalized(texts)

Encode a batch of text.

Methods Documentation

forward_normalized(texts: Sequence[str]) → Tensor[source]

Encode a batch of text.

Parameters:
  • texts (Sequence[str]) – length: b, the texts to encode

Returns:
  shape: (b, dim), an encoding of the texts

Return type:
  Tensor
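A short usage sketch based on the documented shapes; the returned values depend on the randomly initialized character embeddings, so only the shape is checked here:

>>> from pykeen.nn.text import CharacterEmbeddingTextEncoder
>>> encoder = CharacterEmbeddingTextEncoder(dim=32)
>>> # a batch of b=3 texts yields a tensor of shape (b, dim)
>>> encoder.forward_normalized(["seal", "sale", "pykeen"]).shape
torch.Size([3, 32])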