Utilities

Utilities for neural network components.

class PyOBOCache(*args, **kwargs)

A cache that looks up labels of biomedical entities based on their CURIEs.

Instantiate the PyOBO cache, ensuring PyOBO is installed.

get_texts(identifiers)

Get text for the given CURIEs.

Parameters:

identifiers (Sequence[str]) – The compact URIs for each entity (e.g., ['doid:1234', ...])

Return type:

Sequence[Optional[str]]

Returns:

the label for each entity, looked up via pyobo.get_name(). An entry is None if no label is available.

exception ShapeError(shape, reference)

An error for a mismatch in shapes.

Initialize the error.

Parameters:
Return type:

None

classmethod verify(shape, reference)

Raise an exception if the shape does not match the reference.

This method normalizes the shapes first.

Parameters:
Raises:

ShapeError – if the two shapes do not match.

Return type:

Sequence[int]

Returns:

the normalized shape
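The verify-and-return behavior described above can be sketched as follows. This is an illustrative re-implementation, not the library's code: the helper _normalize, the ValueError base class, and the choice of normalization (a bare int becomes a one-element tuple) are assumptions.

```python
from typing import Sequence, Tuple, Union

ShapeLike = Union[int, Sequence[int]]


def _normalize(shape: ShapeLike) -> Tuple[int, ...]:
    """Hypothetical normalization: wrap a bare int, convert sequences to tuples."""
    if isinstance(shape, int):
        return (shape,)
    return tuple(shape)


class ShapeError(ValueError):
    """An error for a mismatch in shapes (sketch; base class is an assumption)."""

    def __init__(self, shape: Sequence[int], reference: Sequence[int]) -> None:
        super().__init__(f"shape {tuple(shape)} does not match expected shape {tuple(reference)}")

    @classmethod
    def verify(cls, shape: ShapeLike, reference: ShapeLike) -> Tuple[int, ...]:
        # Normalize both sides first, so that 3 and (3,) compare equal.
        shape = _normalize(shape)
        reference = _normalize(reference)
        if shape != reference:
            raise cls(shape, reference)
        return shape
```

Returning the normalized shape lets a caller validate and canonicalize in one step, e.g. shape = ShapeError.verify(shape, reference).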

class TextCache

An interface for looking up text for various flavors of entity identifiers.

abstract get_texts(identifiers)

Get text for the given identifiers from the cache.

Return type:

Sequence[Optional[str]]

Parameters:

identifiers (Sequence[str]) – the identifiers to look up
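A minimal sketch of the interface and one possible implementation is given below. The abstract class mirrors the signature documented above; DictTextCache is a hypothetical, in-memory stand-in that is not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Mapping, Optional, Sequence


class TextCache(ABC):
    """Interface sketch: one text (or None) per identifier, in input order."""

    @abstractmethod
    def get_texts(self, identifiers: Sequence[str]) -> Sequence[Optional[str]]:
        """Get text for the given identifiers."""


class DictTextCache(TextCache):
    """Hypothetical cache backed by a plain dictionary."""

    def __init__(self, texts: Mapping[str, str]) -> None:
        self.texts = dict(texts)

    def get_texts(self, identifiers: Sequence[str]) -> Sequence[Optional[str]]:
        # Unknown identifiers map to None instead of raising.
        return [self.texts.get(identifier) for identifier in identifiers]
```

The output keeps the length and order of the input, so results can be zipped back onto the identifiers.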

class WikidataCache

A cache for requests against Wikidata’s SPARQL endpoint.

Initialize the cache.

WIKIDATA_ENDPOINT = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'

Wikidata SPARQL endpoint. See https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service#Interfacing

get_descriptions(wikidata_identifiers)

Get entity descriptions for the given IDs.

Parameters:

wikidata_identifiers (Sequence[str]) – the Wikidata identifiers, each starting with Q (e.g., ['Q42'])

Return type:

Sequence[str]

Returns:

the description for each Wikidata entity

get_image_paths(ids, extensions=('jpeg', 'jpg', 'gif', 'png', 'svg', 'tif'), progress=False)

Get paths to images for the given IDs.

Parameters:
  • ids (Sequence[str]) – the Wikidata IDs.

  • extensions (Collection[str]) – the allowed file extensions

  • progress (bool) – whether to display a progress bar

Return type:

Sequence[Optional[Path]]

Returns:

the paths to images for the given IDs.

get_labels(wikidata_identifiers)

Get entity labels for the given IDs.

Parameters:

wikidata_identifiers (Sequence[str]) – the Wikidata identifiers, each starting with Q (e.g., ['Q42'])

Return type:

Sequence[str]

Returns:

the label for each Wikidata entity

get_texts(identifiers)

Get a concatenation of the label and description for each Wikidata identifier.

Parameters:

identifiers (Sequence[str]) – the Wikidata identifiers, each starting with Q (e.g., ['Q42'])

Return type:

Sequence[str]

Returns:

the label and description for each Wikidata entity concatenated
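The concatenation can be sketched as below, assuming the label and description have already been fetched into a mapping keyed by Wikidata ID. The function name concatenate_texts, the mapping keys, and the ": " separator are assumptions made for illustration; the separator used by the library is not stated here.

```python
from typing import List, Mapping, Sequence


def concatenate_texts(
    identifiers: Sequence[str],
    entries: Mapping[str, Mapping[str, str]],
    separator: str = ": ",  # assumed separator
) -> List[str]:
    """Join label and description for each identifier, preserving input order."""
    return [
        f"{entries[identifier]['label']}{separator}{entries[identifier]['description']}"
        for identifier in identifiers
    ]
```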

classmethod query(sparql, wikidata_ids, batch_size=256)

Batched SPARQL query execution for the given IDs.

Parameters:
  • sparql (Union[str, Callable[…, str]]) – the SPARQL query with a placeholder named ids

  • wikidata_ids (Sequence[str]) – the Wikidata IDs

  • batch_size (int) – the batch size, i.e., maximum number of IDs per query

Return type:

Iterable[Mapping[str, Any]]

Returns:

an iterable over JSON results, where the keys correspond to query variables, and the values to the corresponding binding
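The batching step can be sketched without touching the network. The sketch below only renders one query string per batch; the helper names and the wd: prefix formatting of the ids placeholder are assumptions, and sending each query to the endpoint is left out.

```python
from typing import Callable, Iterator, Sequence, Union


def iter_batches(ids: Sequence[str], batch_size: int = 256) -> Iterator[Sequence[str]]:
    """Yield consecutive slices with at most batch_size IDs each."""
    for start in range(0, len(ids), batch_size):
        yield ids[start : start + batch_size]


def render_queries(
    sparql: Union[str, Callable[..., str]],
    wikidata_ids: Sequence[str],
    batch_size: int = 256,
) -> Iterator[str]:
    """Render one SPARQL query string per batch of IDs (hypothetical helper)."""
    # A string template is filled via str.format; a callable is called directly.
    render = sparql.format if isinstance(sparql, str) else sparql
    for batch in iter_batches(wikidata_ids, batch_size=batch_size):
        # Assumed formatting: space-separated entities with the wd: prefix.
        yield render(ids=" ".join(f"wd:{i}" for i in batch))
```

With a string template, literal SPARQL braces have to be doubled so that str.format leaves them intact, for example "SELECT ?item WHERE {{ VALUES ?item {{ {ids} }} }}".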

classmethod query_text(wikidata_ids, language='en', batch_size=256)

Query the SPARQL endpoint for information about the given IDs.

Parameters:
  • wikidata_ids (Sequence[str]) – the Wikidata IDs

  • language (str) – the label language

  • batch_size (int) – the batch size; if more IDs are provided, the request is split into multiple smaller ones

Return type:

Mapping[str, Mapping[str, str]]

Returns:

a mapping from Wikidata IDs to dictionaries with the label and description of the entities

static verify_ids(ids)

Raise an error if invalid IDs are encountered.

Parameters:

ids (Sequence[str]) – the ids to verify

Raises:

ValueError – if any invalid ID is encountered
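A validation of this kind might look as follows. The exact rule is an assumption: the text above only states that identifiers start with Q, and this sketch additionally requires that only digits follow.

```python
import re
from typing import Sequence

# Assumed pattern: the letter Q followed by one or more digits, e.g. "Q42".
_WIKIDATA_ID = re.compile(r"Q[0-9]+")


def verify_ids(ids: Sequence[str]) -> None:
    """Raise ValueError if any ID does not look like a Wikidata entity ID."""
    invalid = [i for i in ids if not _WIKIDATA_ID.fullmatch(i)]
    if invalid:
        raise ValueError(f"Invalid Wikidata IDs: {invalid}")
```

Collecting all invalid IDs before raising gives a single error message that names every offending entry.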

adjacency_tensor_to_stacked_matrix(num_relations, num_entities, source, target, edge_type, edge_weights=None, horizontal=True)

Stack adjacency matrices as described in [thanapalasingam2021].

This method re-arranges the (sparse) adjacency tensor of shape (num_entities, num_relations, num_entities) to a sparse adjacency matrix of shape (num_entities, num_relations * num_entities) (horizontal stacking) or (num_entities * num_relations, num_entities) (vertical stacking). Thereby, we can perform the relation-specific message passing of R-GCN by a single sparse matrix multiplication (and some additional pre- and/or post-processing) of the inputs.

Parameters:
  • num_relations (int) – the number of relations

  • num_entities (int) – the number of entities

  • source (LongTensor) – shape: (num_triples,) the source entity indices

  • target (LongTensor) – shape: (num_triples,) the target entity indices

  • edge_type (LongTensor) – shape: (num_triples,) the edge type, i.e., relation ID

  • edge_weights (Optional[FloatTensor]) – shape: (num_triples,) scalar edge weights

  • horizontal (bool) – whether to use horizontal or vertical stacking

Return type:

Tensor

Returns:

shape: (num_entities, num_relations * num_entities) for horizontal stacking, or (num_entities * num_relations, num_entities) for vertical stacking. The stacked adjacency matrix
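The index arithmetic behind the stacking can be illustrated in plain Python, with a dictionary of coordinates standing in for a sparse tensor. The function name, the relation-major block order (offset edge_type * num_entities), and the default weight of 1.0 are assumptions of this sketch.

```python
from typing import Dict, Optional, Sequence, Tuple


def stack_adjacency(
    num_relations: int,
    num_entities: int,
    source: Sequence[int],
    target: Sequence[int],
    edge_type: Sequence[int],
    edge_weights: Optional[Sequence[float]] = None,
    horizontal: bool = True,
) -> Tuple[Tuple[int, int], Dict[Tuple[int, int], float]]:
    """Return the stacked shape and a {(row, col): weight} map of non-zero entries."""
    if edge_weights is None:
        edge_weights = [1.0] * len(source)  # assumed default: unweighted edges
    if horizontal:
        shape = (num_entities, num_relations * num_entities)
    else:
        shape = (num_relations * num_entities, num_entities)
    entries: Dict[Tuple[int, int], float] = {}
    for s, t, r, w in zip(source, target, edge_type, edge_weights):
        offset = r * num_entities  # start of the block belonging to relation r
        # Horizontal stacking shifts the column, vertical stacking shifts the row.
        key = (s, offset + t) if horizontal else (offset + s, t)
        entries[key] = entries.get(key, 0.0) + w
    return shape, entries
```

Each relation occupies its own block of num_entities columns (horizontal) or rows (vertical), so one sparse matrix product covers all relations at once.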

safe_diagonal(matrix)

Extract diagonal from a potentially sparse matrix.

Note

This is a workaround for as long as torch.diagonal() does not support sparse tensors.

Parameters:

matrix (Tensor) – shape: (n, n) the matrix

Return type:

Tensor

Returns:

shape: (n,) the diagonal values.
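The idea of extracting a diagonal from a sparse matrix can be shown in plain Python on coordinate (COO) lists. This is only an illustration of the principle under that representation, not the torch-based implementation; the function name coo_diagonal is hypothetical.

```python
from typing import List, Sequence


def coo_diagonal(
    n: int,
    rows: Sequence[int],
    cols: Sequence[int],
    values: Sequence[float],
) -> List[float]:
    """Extract the diagonal of an (n, n) matrix given in coordinate format."""
    diagonal = [0.0] * n
    for row, col, value in zip(rows, cols, values):
        if row == col:
            # Duplicate coordinates are summed, as in an uncoalesced COO matrix.
            diagonal[row] += value
    return diagonal
```

Only the stored entries are visited, so the cost depends on the number of non-zeros rather than on n * n.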

use_horizontal_stacking(input_dim, output_dim)

Determine a stacking direction based on the input and output dimension.

The vertical stacking approach is suitable for low-dimensional input and high-dimensional output, because the projection to low dimensions is done first. The horizontal stacking approach, in contrast, is suitable for high-dimensional input and low-dimensional output, because the projection to high dimensions is done last.

Parameters:
  • input_dim (int) – the layer’s input dimension

  • output_dim (int) – the layer’s output dimension

Return type:

bool

Returns:

whether to use horizontal (True) or vertical stacking
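One decision rule consistent with the description above is a direct comparison of the two dimensions. This is a sketch, not necessarily the library's exact rule; in particular, the handling of equal dimensions is an assumption.

```python
def use_horizontal_stacking(input_dim: int, output_dim: int) -> bool:
    """Prefer horizontal stacking when the input is wider than the output."""
    # Assumption: ties (input_dim == output_dim) fall back to vertical stacking.
    return input_dim > output_dim
```

The result can be passed as the horizontal argument of adjacency_tensor_to_stacked_matrix.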