PrecomputedPoolTokenizer
- class PrecomputedPoolTokenizer(*, path: Path | None = None, url: str | None = None, download_kwargs: Mapping[str, Any] | None = None, pool: Mapping[int, Collection[int]] | None = None, randomize_selection: bool = False, loader: str | PrecomputedTokenizerLoader | type[PrecomputedTokenizerLoader] | None = None)
Bases:
Tokenizer
A tokenizer using externally precomputed tokenization.
Initialize the tokenizer.
Note
The preference order for loading the precomputed pools is (1) the given pool, (2) the given path, and (3) a download from the given URL.
- Parameters:
path (Path | None) – a path for a file containing the precomputed pools
url (str | None) – a URL to download the file with precomputed pools from
download_kwargs (Mapping[str, Any] | None) – additional download parameters, passed to pystow.Module.ensure
pool (Mapping[int, Collection[int]] | None) – the precomputed pools.
randomize_selection (bool) – whether to choose randomly from the precomputed tokens, or always take the first num_tokens precomputed tokens
loader (str | PrecomputedTokenizerLoader | type[PrecomputedTokenizerLoader] | None) – the loader to use for loading the pool
- Raises:
ValueError – If the pool’s keys are not contiguous on \(0 \dots N-1\).
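The loading preference order and the contiguity requirement on the pool's keys can be illustrated with a small standalone sketch. It does not use or reproduce the library's code: `pick_pool_source` and `has_contiguous_keys` are hypothetical helper names, the URL is a placeholder, and raising `ValueError` when no source is given is an assumption, since that case is not described here.

```python
from __future__ import annotations

from collections.abc import Collection, Mapping
from pathlib import Path


def pick_pool_source(
    pool: Mapping[int, Collection[int]] | None = None,
    path: Path | None = None,
    url: str | None = None,
) -> str:
    """Return which source wins under the documented preference order."""
    if pool is not None:
        return "pool"
    if path is not None:
        return "path"
    if url is not None:
        return "url"
    # Assumption: the behavior without any source is not documented here.
    raise ValueError("no pool source given")


def has_contiguous_keys(pool: Mapping[int, Collection[int]]) -> bool:
    """Check that the keys are exactly 0, 1, ..., N-1."""
    return set(pool) == set(range(len(pool)))


# Entity ID -> precomputed token IDs (made-up values for illustration).
pool = {0: [3, 1], 1: [2], 2: [0, 4, 1]}

print(pick_pool_source(pool=pool, url="https://example.org/pools"))  # pool
print(has_contiguous_keys(pool))  # True
print(has_contiguous_keys({0: [1], 2: [3]}))  # False: key 1 is missing
```

A pool like the second one, with a gap in its keys, is the kind of input for which the constructor is documented to raise `ValueError`.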
Methods Summary
__call__(mapped_triples, num_tokens, ...) – Tokenize the entities contained in the given triples.
Methods Documentation
- __call__(mapped_triples: Tensor, num_tokens: int, num_entities: int, num_relations: int) -> tuple[int, Tensor]
Tokenize the entities contained in the given triples.
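- Parameters:
mapped_triples (Tensor) – shape: (n, 3), the ID-based triples
num_tokens (int) – the number of tokens to select for each entity
num_entities (int) – the number of entities
num_relations (int) – the number of relations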
- Returns:
shape: (num_entities, num_tokens), -1 <= res < vocabulary_size. The selected token IDs for each entity; -1 is used as the padding token.
- Return type:
tuple[int, Tensor]
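To make the selection semantics concrete, here is a minimal pure-Python sketch of picking `num_tokens` tokens per entity from a precomputed pool. It is an illustration under stated assumptions, not the library's implementation: `select_tokens` is a hypothetical name, it returns nested lists rather than a Tensor, and padding entities with fewer than `num_tokens` precomputed tokens with -1 is inferred from the return description above.

```python
from __future__ import annotations

import random
from collections.abc import Collection, Mapping

PADDING = -1  # padding token, as stated in the return description


def select_tokens(
    pool: Mapping[int, Collection[int]],
    num_tokens: int,
    randomize_selection: bool = False,
    seed: int | None = None,
) -> list[list[int]]:
    """Select num_tokens tokens for each entity 0..N-1, padding with -1."""
    rng = random.Random(seed)
    rows = []
    for entity_id in range(len(pool)):
        candidates = list(pool[entity_id])
        if randomize_selection:
            # Random choice without replacement among the precomputed tokens.
            chosen = rng.sample(candidates, k=min(num_tokens, len(candidates)))
        else:
            # Deterministic choice: the first num_tokens precomputed tokens.
            chosen = candidates[:num_tokens]
        # Assumption: short rows are filled up with the padding token.
        rows.append(chosen + [PADDING] * (num_tokens - len(chosen)))
    return rows


# Entity ID -> precomputed token IDs (made-up values for illustration).
pool = {0: [3, 1], 1: [2], 2: [0, 4, 1]}
print(select_tokens(pool, num_tokens=2))  # [[3, 1], [2, -1], [0, 4]]
```

With `randomize_selection=True`, the rows keep the same shape but the tokens of each entity are drawn at random from its pool, so the output varies with the seed.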