Welcome to the YASET (Yet Another Sequence Tagger) documentation!

yaset.nn

yaset.nn.cnn

class yaset.nn.cnn.CharCNN(char_embedding: torch.nn.modules.sparse.Embedding = None, filters: List[Tuple[int, int]] = None)

Bases: torch.nn.modules.module.Module

forward(input_matrix: torch.FloatTensor = None)
static init_weights(module)

yaset.nn.crf

Conditional random field

class yaset.nn.crf.ConditionalRandomField(num_tags: int, constraints: List[Tuple[int, int]] = None, include_start_end_transitions: bool = True)

Bases: torch.nn.modules.module.Module

This module uses the “forward-backward” algorithm to compute the log-likelihood of its inputs assuming a conditional random field model.

See, e.g. http://www.cs.columbia.edu/~mcollins/fb.pdf

num_tags : int, required
The number of tags.
constraints : List[Tuple[int, int]], optional (default: None)
An optional list of allowed transitions (from_tag_id, to_tag_id). These are applied to viterbi_tags() but do not affect forward(). These should be derived from allowed_transitions so that the start and end transitions are handled correctly for your tag type.
include_start_end_transitions : bool, optional (default: True)
Whether to include the start and end transition parameters.
forward(inputs: torch.Tensor, tags: torch.Tensor, mask: torch.ByteTensor = None) → torch.Tensor

Computes the log likelihood.

reset_parameters()
viterbi_tags(logits: torch.Tensor, mask: torch.Tensor) → List[Tuple[List[int], float]]

Uses the Viterbi algorithm to find the most likely tags for the given inputs. If constraints are applied, all other transitions are disallowed.
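A minimal usage sketch (the tag set and tensor shapes are illustrative assumptions; the forward pass expects per-tag logits, gold tag ids, and a padding mask, per the signatures above):

    import torch
    from yaset.nn.crf import ConditionalRandomField, allowed_transitions

    # Hypothetical tag inventory; any Dict[int, str] mapping works here.
    labels = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC", 4: "I-LOC"}
    constraints = allowed_transitions("BIO", labels)
    crf = ConditionalRandomField(num_tags=5, constraints=constraints)

    logits = torch.randn(2, 4, 5)               # (batch, seq_len, num_tags)
    tags = torch.randint(0, 5, (2, 4))          # gold tag ids
    mask = torch.ones(2, 4, dtype=torch.uint8)  # 1 = real token, 0 = padding

    log_likelihood = crf(logits, tags, mask)    # log-likelihood of the batch
    best = crf.viterbi_tags(logits, mask)       # [(tag_id_path, score), ...]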

yaset.nn.crf.allowed_transitions(constraint_type: str, labels: Dict[int, str]) → List[Tuple[int, int]]

Given labels and a constraint type, returns the allowed transitions. It will additionally include transitions for the start and end states, which are used by the conditional random field.

constraint_type : str, required
Indicates which constraint to apply. Current choices are “BIO”, “IOB1”, “BIOUL”, and “BMES”.
labels : Dict[int, str], required
A mapping {label_id -> label}. Most commonly this would be the value from Vocabulary.get_index_to_token_vocabulary().
List[Tuple[int, int]]
The allowed transitions (from_label_id, to_label_id).
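A small illustration (the ids of the synthetic start and end states are internal, so only the label-to-label pairs are sketched):

    from yaset.nn.crf import allowed_transitions

    labels = {0: "O", 1: "B-PER", 2: "I-PER"}
    pairs = allowed_transitions("BIO", labels)
    # pairs contains (from_label_id, to_label_id) tuples such as (1, 2),
    # i.e. B-PER -> I-PER, plus transitions for the start and end states.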
yaset.nn.crf.is_transition_allowed(constraint_type: str, from_tag: str, from_entity: str, to_tag: str, to_entity: str)

Given a constraint type and strings from_tag and to_tag that represent the origin and destination of the transition, return whether the transition is allowed under the given constraint type.

constraint_type : str, required
Indicates which constraint to apply. Current choices are “BIO”, “IOB1”, “BIOUL”, and “BMES”.
from_tag : str, required
The tag that the transition originates from. For example, if the label is I-PER, the from_tag is I.
from_entity: str, required
The entity corresponding to the from_tag. For example, if the label is I-PER, the from_entity is PER.
to_tag : str, required
The tag that the transition leads to. For example, if the label is I-PER, the to_tag is I.
to_entity: str, required
The entity corresponding to the to_tag. For example, if the label is I-PER, the to_entity is PER.
bool
Whether the transition is allowed under the given constraint_type.
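Two illustrative calls under the BIO scheme (passing an empty entity string for the entity-less "O" tag is an assumption):

    from yaset.nn.crf import is_transition_allowed

    is_transition_allowed("BIO", "B", "PER", "I", "PER")  # True: B-PER -> I-PER
    is_transition_allowed("BIO", "O", "", "I", "PER")     # False: I- cannot follow O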
yaset.nn.crf.logsumexp(tensor: torch.Tensor, dim: int = -1, keepdim: bool = False) → torch.Tensor

A numerically stable computation of logsumexp. This is mathematically equivalent to tensor.exp().sum(dim, keepdim=keepdim).log(). This function is typically used for summing log probabilities.

tensor : torch.FloatTensor, required
A tensor of arbitrary size.
dim : int, optional (default = -1)
The dimension of the tensor to apply the logsumexp to.
keepdim: bool, optional (default = False)
Whether to retain a dimension of size one at the dimension we reduce over.
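A quick sanity check of the equivalence stated above (the input values are arbitrary):

    import torch
    from yaset.nn.crf import logsumexp

    x = torch.tensor([[0.5, 1.0, 2.0]])
    assert torch.allclose(logsumexp(x, dim=-1), x.exp().sum(dim=-1).log())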
yaset.nn.crf.viterbi_decode(tag_sequence: torch.Tensor, transition_matrix: torch.Tensor, tag_observations: Optional[List[int]] = None)

Perform Viterbi decoding in log space over a sequence given a transition matrix specifying pairwise (transition) potentials between tags and a matrix of shape (sequence_length, num_tags) specifying unary potentials for possible tags per timestep.

tag_sequence : torch.Tensor, required
A tensor of shape (sequence_length, num_tags) representing scores for a set of tags over a given sequence.
transition_matrix : torch.Tensor, required.
A tensor of shape (num_tags, num_tags) representing the binary potentials for transitioning between a given pair of tags.
tag_observations : Optional[List[int]], optional, (default = None)
A list of length sequence_length containing the class ids of observed elements in the sequence, with unobserved elements being set to -1. Note that it is possible to provide evidence which results in degenerate labelings if the sequences of tags you provide as evidence cannot transition between each other, or those transitions are extremely unlikely. In this situation we log a warning, but the responsibility for providing self-consistent evidence ultimately lies with the user.
viterbi_path : List[int]
The tag indices of the maximum likelihood tag sequence.
viterbi_score : torch.Tensor
The score of the viterbi path.
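A toy decode (scores are random; only the shapes matter):

    import torch
    from yaset.nn.crf import viterbi_decode

    tag_sequence = torch.randn(3, 2)       # (sequence_length, num_tags)
    transition_matrix = torch.randn(2, 2)  # (num_tags, num_tags)
    path, score = viterbi_decode(tag_sequence, transition_matrix)
    # path is a List[int] of length 3; score is the score of that path.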

yaset.nn.embedding

class yaset.nn.embedding.BertEmbeddings(model_config_file: str = None, model_file: str = None, model_type: str = None, do_lower_case: bool = None, vocab_dir: str = None, fine_tune: bool = False, only_final_layer: bool = False)

Bases: torch.nn.modules.module.Module

compute_embeddings(batch, cuda)
forward(batch, cuda)
class yaset.nn.embedding.Embedder(embeddings_options: dict = None, pretrained_matrix: numpy.ndarray = None, pretrained_matrix_size: Tuple[int, int] = None, mappings: dict = None, embedding_root_dir: str = None)

Bases: torch.nn.modules.module.Module

forward(batch, cuda)

yaset.nn.ensemble

yaset.nn.lstm

class yaset.nn.lstm.LSTMAugmented(lstm_hidden_size: int = None, input_dropout_rate: float = None, input_size: int = None, use_highway: bool = False)

Bases: torch.nn.modules.module.Module

forward(batch_packed)

yaset.nn.lstmcrf

class yaset.nn.lstmcrf.AugmentedLSTMCRF(constraints: list = None, embedder: yaset.nn.embedding.Embedder = None, ffnn_hidden_layer_use: bool = None, ffnn_hidden_layer_size: int = None, ffnn_activation_function: str = None, ffnn_input_dropout_rate: float = None, embedding_input_size: int = None, lstm_hidden_size: int = None, lstm_input_dropout_rate: float = None, lstm_layer_dropout_rate: int = None, mappings: dict = None, lstm_nb_layers: int = None, num_labels: int = None, lstm_use_highway: bool = False)

Bases: torch.nn.modules.module.Module

create_final_layer()
create_lstm_stack()
forward(*args, **kwargs)
forward_ensemble_lstm_attention(batch, cuda)
get_labels(batch, cuda)
get_loss(batch, cuda: bool = False)
get_loss_ensemble(batch, cuda)
infer_labels(batch, cuda)

yaset.single

yaset.single.apply

yaset.single.apply.apply_model(model_dir: str = None, input_file: str = None, output_file: str = None, batch_size: int = 128, cuda: bool = False, n_jobs: int = None, debug: bool = False)
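A hedged invocation sketch (all file and directory names are placeholders):

    from yaset.single.apply import apply_model

    apply_model(
        model_dir="./model",       # directory produced by training
        input_file="input.conll",
        output_file="output.conll",
        batch_size=128,
        cuda=False,
    )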
yaset.single.apply.chunks(l, n)

yaset.single.inference

class yaset.single.inference.NERModel(mappings: dict = None, model: torch.nn.modules.module.Module = None, options: dict = None, model_dir: str = None)

Bases: object

collate_sentences(batch)
dev_predict(sentences, cuda, *arg, **kwargs)
sentence_to_ids(sentence)

yaset.single.train

yaset.single.train.create_dataloader(mappings: Dict = None, options: Dict = None, instance_json_file: str = None, test: bool = False, working_dir: str = None) → Tuple[torch.utils.data.dataloader.DataLoader, int, torch.utils.data.dataset.Dataset]
yaset.single.train.train_single_model(option_file: str = None, output_dir: str = None) → None

Train a NER model

Parameters:
  • option_file (str) – model configuration file (jsonnet format)
  • output_dir (str) – model output directory
Returns:

None
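A minimal invocation (paths are placeholders; the option file is expected in jsonnet format, as stated above):

    from yaset.single.train import train_single_model

    train_single_model(option_file="config.jsonnet", output_dir="./model")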

yaset.tools

yaset.tools.conll

yaset.tools.conll.check_bioul_labels(input_file: str = None)
yaset.tools.conll.check_labels(input_file: str = None, label_type: str = None)
yaset.tools.conll.convert_labels(input_file: str = None, output_file: str = None, input_label_type: str = None, output_label_type: str = None)

Convert NER tagging schemes

Args:
input_file (str): input CoNLL filepath
output_file (str): output CoNLL filepath
input_label_type (str): source NER tagging scheme
output_label_type (str): target NER tagging scheme
Returns:
None
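For instance, converting from IOB1 to BIOUL (file names are placeholders; the scheme names mirror those listed for the CRF constraints):

    from yaset.tools.conll import convert_labels

    convert_labels(
        input_file="train.iob1.conll",
        output_file="train.bioul.conll",
        input_label_type="IOB1",
        output_label_type="BIOUL",
    )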
yaset.tools.conll.convert_sequence(input_sequence: list = None, input_label_type: str = None, output_label_type: str = None)
yaset.tools.conll.convert_spaces_to_tabulations(input_file: str = None, output_file: str = None) → None

Convert a CoNLL file with spaces as column separators into a CoNLL file with tabs as column separators

Args:
input_file (str): input CoNLL filepath
output_file (str): output CoNLL filepath
Returns:
None
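For example (placeholder file names):

    from yaset.tools.conll import convert_spaces_to_tabulations

    convert_spaces_to_tabulations(input_file="spaces.conll", output_file="tabs.conll")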
yaset.tools.conll.extract_entities_iob1(input_labels: list = None)

Extract entity offsets from a label sequence encoded in the CoNLL-2003 IOB1 scheme

Args:
input_labels (list): source labels
Returns:
list: entity offsets
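An illustrative call (the exact structure of the returned offsets is not documented here, so treat this as a sketch):

    from yaset.tools.conll import extract_entities_iob1

    offsets = extract_entities_iob1(input_labels=["I-PER", "I-PER", "O", "I-LOC"])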
yaset.tools.conll.extract_sent_entities(sentence_buffer: list = None)
yaset.tools.conll.extract_tag_cat(label)

Separate tag from category

Args:
label (str): NER label to split
Returns:
(str, str): tag, category
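For example, assuming the usual hyphen-separated label format:

    from yaset.tools.conll import extract_tag_cat

    tag, category = extract_tag_cat("I-PER")  # ("I", "PER")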
yaset.tools.conll.load_sentences(input_file: str = None, debug: bool = False)
yaset.tools.conll.split_tag(tag: str = None)

yaset.utils

yaset.utils.config

yaset.utils.config.replace_auto(options: dict = None) → None

Replace the keyword ‘auto’ with ‘-1’ in configuration files

Parameters:
  • options (dict) – configuration parameters
Returns:

None
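A sketch of the behaviour (the function returns None, so mutating the dict in place is the assumed mechanism):

    from yaset.utils.config import replace_auto

    options = {"num_workers": "auto"}
    replace_auto(options=options)
    # options["num_workers"] is now '-1', per the description above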

yaset.utils.conlleval

class yaset.utils.conlleval.EvalCounts

Bases: object

class yaset.utils.conlleval.Metrics(tp, fp, fn, prec, rec, fscore)

Bases: tuple

fn

Alias for field number 2

fp

Alias for field number 1

fscore

Alias for field number 5

prec

Alias for field number 3

rec

Alias for field number 4

tp

Alias for field number 0
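Since Metrics is a namedtuple, fields can be read by name or by position:

    from yaset.utils.conlleval import Metrics

    m = Metrics(tp=10, fp=2, fn=3, prec=10 / 12, rec=10 / 13, fscore=0.8)
    assert m.prec == m[3]  # 'prec' is the alias for field number 3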

yaset.utils.conlleval.calculate_metrics(correct, guessed, total)
yaset.utils.conlleval.end_of_chunk(prev_tag, tag, prev_type, type_)
yaset.utils.conlleval.evaluate_ner(corr, pred)
yaset.utils.conlleval.metrics(counts)
yaset.utils.conlleval.parse_tag(t)
yaset.utils.conlleval.start_of_chunk(prev_tag, tag, prev_type, type_)
yaset.utils.conlleval.uniq(iterable)

yaset.utils.copy

yaset.utils.copy.copy_embedding_models(embeddings_options: dict = None, output_dir: str = None) → None

Copy pretrained embeddings specified in configuration file to model directory

Parameters:
  • embeddings_options (dict) – configuration file portion related to embeddings
  • output_dir (str) – directory where files will be copied
Returns:

None

yaset.utils.data

class yaset.utils.data.NERDataset(mappings: dict = None, instance_conll_file: str = None, debug: bool = None, singleton_replacement_ratio: float = 0.0, bert_use: bool = False, bert_voc_dir: str = None, bert_lowercase: bool = False, pretrained_use: bool = False, char_use: bool = False, elmo_use: bool = False)

Bases: torch.utils.data.dataset.Dataset

create_instance(sequence_buffer: list = None)
extract_singletons()
load_instances()
yaset.utils.data.collate_ner(batch, tok_pad_id: int = None, chr_pad_id_literal: int = None, chr_pad_id_utf8: int = None, bert_use: bool = False, char_use: bool = False, elmo_use: bool = False, pretrained_use: bool = False, options: dict = None)
yaset.utils.data.collate_ner_ensemble(batch, model_mappings: dict = None, model_options: dict = None, reference_id: str = None)

yaset.utils.eval

yaset.utils.eval.eval_ner(eval_payload: list = None)

yaset.utils.load

yaset.utils.load.load_model(model_dir: str = None)

Load a single NER model

Args:
model_dir (str): NER model directory
Returns:
NER model
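A minimal loading sketch (the directory path is a placeholder):

    from yaset.utils.load import load_model

    ner_model = load_model(model_dir="./model")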
yaset.utils.load.load_model_single(model_dir: str = None, cuda: bool = None)

yaset.utils.logging

class yaset.utils.logging.TrainLogger(tensorboard_path: str = None)

Bases: object

add_checkpoint(step: int = None, checkpoint_payload: dict = None)
add_dev_score(step: int = None, payload: dict = None)
add_histogram(name: str = None, value: str = None, global_step: int = None, bins: str = 'auto')
add_loss(loss_value: float = None, loss_name: str = None, global_step: int = None)
add_other_score_dev(idx_iteration: int = None, score_name: str = None, score_value: float = None)
add_scalar(name: str = None, value: float = None, global_step: int = None)
add_step_values(step: int = None, gs_values: list = None, pred_values: list = None)
close_writer()
do_early_stopping(nb_steps: int = None)
dump_to_disk(custom_log_file: str = None, tensorboard_log_file: str = None)
get_best_step(criterion: str = 'f1', reverse: bool = False)
get_dev_score(step: int = None)
get_last_checkpoint_string(step: int = None)
get_loss(loss_name: str = None, global_step: int = None)
get_step_values(step: int = None)
load_json_file(filepath: str = None)

yaset.utils.mapping

yaset.utils.mapping.extract_char_mapping(instance_file: str = None)
yaset.utils.mapping.extract_label_mapping(instance_file: str = None)
yaset.utils.mapping.extract_mappings_and_pretrained_matrix(options: dict = None, oov_symbol: str = '<unk>', pad_symbol: str = '<pad>', output_dir: str = None) → Tuple[dict, numpy.ndarray]

Extract pretrained embedding matrix, size and mapping.

Parameters:
  • output_dir (str) – model output directory
  • options (dict) – model parameters
  • oov_symbol (str) – symbol to use for OOV (vector will be created if necessary)
  • pad_symbol (str) – symbol to use for padding (vector will be created if necessary)
Returns:

pretrained matrix, pretrained matrix size and pretrained matrix mapping

Return type:

np.ndarray, int, dict

yaset.utils.misc

yaset.utils.misc.chunks(l, n)
yaset.utils.misc.flatten(list_of_lists)
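Assuming the conventional semantics of these helpers (they are not documented here, so the outputs below are assumptions):

    from yaset.utils.misc import chunks, flatten

    list(chunks([1, 2, 3, 4, 5], 2))  # -> [[1, 2], [3, 4], [5]] (assumed)
    flatten([[1, 2], [3]])            # -> [1, 2, 3] (assumed)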

yaset.utils.path

yaset.utils.path.ensure_dir(directory: str) → None

Create a directory

Args:
directory (str): path to create
Returns:
None
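For example:

    from yaset.utils.path import ensure_dir

    ensure_dir("./output/checkpoints")  # creates the path; returns None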

yaset.utils.training

class yaset.utils.training.Trainer(accumulation_steps: int = None, batch_size: int = None, clip_grad_norm: float = None, cuda: bool = False, dataloader_train: torch.utils.data.dataloader.DataLoader = None, dataloader_dev: torch.utils.data.dataloader.DataLoader = None, eval_function: Callable = None, eval_every_n_steps: int = None, fp16: bool = None, len_dataset_train: int = None, len_dataset_dev: int = None, log_to_stdout_every_n_step: int = None, lr_scheduler: object = None, max_steps: int = None, model: torch.nn.modules.module.Module = None, optimizer: torch.optim.optimizer.Optimizer = None, train_logger: yaset.utils.logging.TrainLogger = None, warmup_scheduler: torch.optim.lr_scheduler.LambdaLR = None, working_dir: str = None)

Bases: object

static clear_model_dir(model_dir)

Remove old model parameter files

Args:
model_dir (str): model parameter directory
Returns:
None
perform_training()
test_on_dev(step_counter: int = None)
yaset.utils.training.cycle(iterable)
