Welcome to YASET - Yet Another Sequence Tagger’s documentation!
yaset.nn
yaset.nn.cnn
yaset.nn.crf
Conditional random field
class yaset.nn.crf.ConditionalRandomField(num_tags: int, constraints: List[Tuple[int, int]] = None, include_start_end_transitions: bool = True)
Bases: torch.nn.modules.module.Module

This module uses the “forward-backward” algorithm to compute the log-likelihood of its inputs, assuming a conditional random field model. See, e.g., http://www.cs.columbia.edu/~mcollins/fb.pdf

Parameters:
- num_tags : int, required
  The number of tags.
- constraints : List[Tuple[int, int]], optional (default: None)
  An optional list of allowed transitions (from_tag_id, to_tag_id). These are applied to viterbi_tags() but do not affect forward(). They should be derived from allowed_transitions() so that the start and end transitions are handled correctly for your tag type.
- include_start_end_transitions : bool, optional (default: True)
  Whether to include the start and end transition parameters.
forward(inputs: torch.Tensor, tags: torch.Tensor, mask: torch.ByteTensor = None) → torch.Tensor
Computes the log likelihood.

reset_parameters()

viterbi_tags(logits: torch.Tensor, mask: torch.ByteTensor)
Uses the Viterbi algorithm to find the most likely tags for the given inputs. If constraints are applied, disallows all other transitions.
yaset.nn.crf.allowed_transitions(constraint_type: str, labels: Dict[int, str]) → List[Tuple[int, int]]
Given labels and a constraint type, returns the allowed transitions. It will additionally include transitions for the start and end states, which are used by the conditional random field.

Parameters:
- constraint_type : str, required
  Indicates which constraint to apply. Current choices are “BIO”, “IOB1”, “BIOUL”, and “BMES”.
- labels : Dict[int, str], required
  A mapping {label_id -> label}. Most commonly this would be the value from Vocabulary.get_index_to_token_vocabulary().

Returns:
- List[Tuple[int, int]]
  The allowed transitions (from_label_id, to_label_id).
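A minimal usage sketch (the label inventory below is made up; real ids would come from the data mappings):

    from yaset.nn.crf import ConditionalRandomField, allowed_transitions

    # Toy BIO label inventory: id -> label
    labels = {0: "O", 1: "B-PER", 2: "I-PER"}
    constraints = allowed_transitions("BIO", labels)  # List[Tuple[int, int]]
    crf = ConditionalRandomField(num_tags=len(labels), constraints=constraints)
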
yaset.nn.crf.is_transition_allowed(constraint_type: str, from_tag: str, from_entity: str, to_tag: str, to_entity: str)
Given a constraint type and strings from_tag and to_tag that represent the origin and destination of the transition, return whether the transition is allowed under the given constraint type.

Parameters:
- constraint_type : str, required
  Indicates which constraint to apply. Current choices are “BIO”, “IOB1”, “BIOUL”, and “BMES”.
- from_tag : str, required
  The tag that the transition originates from. For example, if the label is I-PER, the from_tag is I.
- from_entity : str, required
  The entity corresponding to the from_tag. For example, if the label is I-PER, the from_entity is PER.
- to_tag : str, required
  The tag that the transition leads to. For example, if the label is I-PER, the to_tag is I.
- to_entity : str, required
  The entity corresponding to the to_tag. For example, if the label is I-PER, the to_entity is PER.

Returns:
- bool
  Whether the transition is allowed under the given constraint_type.
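A short sketch, assuming the usual BIO semantics in which an I tag may only continue an entity of the same type:

    from yaset.nn.crf import is_transition_allowed

    # B-PER -> I-PER continues an entity of the same type
    is_transition_allowed("BIO", "B", "PER", "I", "PER")  # expected: True
    # O -> I-PER would start an entity with an inside tag
    is_transition_allowed("BIO", "O", "", "I", "PER")     # expected: False
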
yaset.nn.crf.logsumexp(tensor: torch.Tensor, dim: int = -1, keepdim: bool = False) → torch.Tensor
A numerically stable computation of logsumexp. This is mathematically equivalent to tensor.exp().sum(dim, keepdim=keepdim).log(). This function is typically used for summing log probabilities.

Parameters:
- tensor : torch.FloatTensor, required
  A tensor of arbitrary size.
- dim : int, optional (default = -1)
  The dimension of the tensor to apply the logsumexp to.
- keepdim : bool, optional (default = False)
  Whether to retain a dimension of size one at the dimension we reduce over.
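For reference, a sketch of the standard max-shift trick such a function typically relies on (an illustration of the technique, not necessarily the exact implementation):

    import torch

    def stable_logsumexp(tensor: torch.Tensor, dim: int = -1, keepdim: bool = False) -> torch.Tensor:
        # Subtracting the per-slice maximum before exponentiating avoids overflow.
        max_values, _ = tensor.max(dim=dim, keepdim=True)
        result = (tensor - max_values).exp().sum(dim=dim, keepdim=True).log() + max_values
        return result if keepdim else result.squeeze(dim)
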
yaset.nn.crf.viterbi_decode(tag_sequence: torch.Tensor, transition_matrix: torch.Tensor, tag_observations: Optional[List[int]] = None)
Perform Viterbi decoding in log space over a sequence, given a transition matrix specifying pairwise (transition) potentials between tags and a matrix of shape (sequence_length, num_tags) specifying unary potentials for possible tags per timestep.

Parameters:
- tag_sequence : torch.Tensor, required
  A tensor of shape (sequence_length, num_tags) representing scores for a set of tags over a given sequence.
- transition_matrix : torch.Tensor, required
  A tensor of shape (num_tags, num_tags) representing the binary potentials for transitioning between a given pair of tags.
- tag_observations : Optional[List[int]], optional (default = None)
  A list of length sequence_length containing the class ids of observed elements in the sequence, with unobserved elements being set to -1. Note that it is possible to provide evidence which results in degenerate labelings if the sequences of tags you provide as evidence cannot transition between each other, or those transitions are extremely unlikely. In this situation we log a warning, but the responsibility for providing self-consistent evidence ultimately lies with the user.

Returns:
- viterbi_path : List[int]
  The tag indices of the maximum likelihood tag sequence.
- viterbi_score : torch.Tensor
  The score of the Viterbi path.
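A minimal call sketch with random potentials (purely illustrative):

    import torch
    from yaset.nn.crf import viterbi_decode

    emissions = torch.randn(5, 3)    # (sequence_length, num_tags) unary potentials
    transitions = torch.randn(3, 3)  # (num_tags, num_tags) pairwise potentials
    viterbi_path, viterbi_score = viterbi_decode(emissions, transitions)
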
yaset.nn.embedding
class yaset.nn.embedding.BertEmbeddings(model_config_file: str = None, model_file: str = None, model_type: str = None, do_lower_case: bool = None, vocab_dir: str = None, fine_tune: bool = False, only_final_layer: bool = False)
Bases: torch.nn.modules.module.Module

compute_embeddings(batch, cuda)

forward(batch, cuda)
yaset.nn.ensemble
yaset.nn.lstm
yaset.nn.lstmcrf
class yaset.nn.lstmcrf.AugmentedLSTMCRF(constraints: list = None, embedder: yaset.nn.embedding.Embedder = None, ffnn_hidden_layer_use: bool = None, ffnn_hidden_layer_size: int = None, ffnn_activation_function: str = None, ffnn_input_dropout_rate: float = None, embedding_input_size: int = None, lstm_hidden_size: int = None, lstm_input_dropout_rate: float = None, lstm_layer_dropout_rate: int = None, mappings: dict = None, lstm_nb_layers: int = None, num_labels: int = None, lstm_use_highway: bool = False)
Bases: torch.nn.modules.module.Module

create_final_layer()

create_lstm_stack()

forward(*args, **kwargs)

forward_ensemble_lstm_attention(batch, cuda)

get_labels(batch, cuda)

get_loss(batch, cuda: bool = False)

get_loss_ensemble(batch, cuda)

infer_labels(batch, cuda)
yaset.single
yaset.single.apply
yaset.single.apply.apply_model(model_dir: str = None, input_file: str = None, output_file: str = None, batch_size: int = 128, cuda: bool = False, n_jobs: int = None, debug: bool = False)
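A hypothetical invocation (file and directory names are placeholders):

    from yaset.single.apply import apply_model

    apply_model(
        model_dir="models/run-01",        # directory produced by train_single_model
        input_file="test.conll",          # CoNLL file to tag
        output_file="test.tagged.conll",  # where predictions are written
        batch_size=128,
        cuda=False,
    )
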
yaset.single.apply.chunks(l, n)
yaset.single.inference
yaset.single.train
yaset.single.train.create_dataloader(mappings: Dict[KT, VT] = None, options: Dict[KT, VT] = None, instance_json_file: str = None, test: bool = False, working_dir: str = None) → Tuple[torch.utils.data.dataloader.DataLoader, int, torch.utils.data.dataset.Dataset]
yaset.single.train.train_single_model(option_file: str = None, output_dir: str = None) → None
Train a NER model.

Parameters:
- option_file (str) – model configuration file (jsonnet format)
- output_dir (str) – model output directory

Returns:
- None
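A hypothetical invocation (both paths are placeholders):

    from yaset.single.train import train_single_model

    # option_file points to a jsonnet configuration file, as documented above.
    train_single_model(option_file="options.jsonnet", output_dir="models/run-01")
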
yaset.tools
Submodules
yaset.tools.conll
yaset.tools.conll.check_bioul_labels(input_file: str = None)

yaset.tools.conll.check_labels(input_file: str = None, label_type: str = None)
yaset.tools.conll.convert_labels(input_file: str = None, output_file: str = None, input_label_type: str = None, output_label_type: str = None)
Convert NER tagging schemes.

Args:
- input_file (str): input CoNLL filepath
- output_file (str): output CoNLL filepath
- input_label_type (str): source NER tagging scheme
- output_label_type (str): target NER tagging scheme

Returns:
- None
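A hypothetical conversion (file names are placeholders; the scheme names are assumed to match those listed for allowed_transitions):

    from yaset.tools.conll import convert_labels

    convert_labels(
        input_file="train.iob1.conll",
        output_file="train.bioul.conll",
        input_label_type="IOB1",
        output_label_type="BIOUL",
    )
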
yaset.tools.conll.convert_sequence(input_sequence: list = None, input_label_type: str = None, output_label_type: str = None)

yaset.tools.conll.convert_spaces_to_tabulations(input_file: str = None, output_file: str = None) → None
Convert a CoNLL file with spaces as column separators into a CoNLL file with tabulations as column separators.

Args:
- input_file (str): input CoNLL filepath
- output_file (str): output CoNLL filepath

Returns:
- None
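For example (placeholder file names):

    from yaset.tools.conll import convert_spaces_to_tabulations

    convert_spaces_to_tabulations(input_file="corpus.space.conll", output_file="corpus.tab.conll")
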
yaset.tools.conll.extract_entities_iob1(input_labels: list = None)
Extract entity offsets from labels encoded in the CoNLL-2003 (IOB1) scheme.

Args:
- input_labels (list): source labels

Returns:
- list: entity offsets
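A sketch with made-up IOB1 labels (the exact offset format of the returned list is not specified here):

    from yaset.tools.conll import extract_entities_iob1

    labels = ["I-PER", "I-PER", "O", "I-LOC"]
    entities = extract_entities_iob1(input_labels=labels)  # offsets for the PER and LOC spans
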
yaset.tools.conll.extract_sent_entities(sentence_buffer: list = None)

yaset.tools.conll.extract_tag_cat(label)
Separate the tag from the category.

Args:
- label (str): NER label to split

Returns:
- (str, str): tag, category
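Assumed behavior, following the tag/category convention used throughout this module:

    from yaset.tools.conll import extract_tag_cat

    tag, category = extract_tag_cat("I-PER")  # expected: ("I", "PER")
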
yaset.tools.conll.load_sentences(input_file: str = None, debug: bool = False)

yaset.tools.conll.split_tag(tag: str = None)

yaset.utils
yaset.utils.config
yaset.utils.config.replace_auto(options: dict = None) → None
Replace the keyword ‘auto’ with ‘-1’ in configuration files.

Parameters:
- options (dict) – configuration parameters

Returns:
- None
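A sketch of the effect (the option key below is a placeholder; in-place mutation is assumed, given the None return type):

    from yaset.utils.config import replace_auto

    options = {"training": {"num_workers": "auto"}}
    replace_auto(options=options)
    # The 'auto' value has been replaced by -1, per the docstring.
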
yaset.utils.conlleval
class yaset.utils.conlleval.EvalCounts
Bases: object

class yaset.utils.conlleval.Metrics(tp, fp, fn, prec, rec, fscore)
Bases: tuple

- tp – Alias for field number 0
- fp – Alias for field number 1
- fn – Alias for field number 2
- prec – Alias for field number 3
- rec – Alias for field number 4
- fscore – Alias for field number 5
yaset.utils.conlleval.calculate_metrics(correct, guessed, total)

yaset.utils.conlleval.end_of_chunk(prev_tag, tag, prev_type, type_)

yaset.utils.conlleval.evaluate_ner(corr, pred)

yaset.utils.conlleval.metrics(counts)

yaset.utils.conlleval.parse_tag(t)

yaset.utils.conlleval.start_of_chunk(prev_tag, tag, prev_type, type_)

yaset.utils.conlleval.uniq(iterable)
yaset.utils.copy
yaset.utils.copy.copy_embedding_models(embeddings_options: dict = None, output_dir: str = None) → None
Copy the pretrained embeddings specified in the configuration file to the model directory.

Parameters:
- embeddings_options (dict) – portion of the configuration file related to embeddings
- output_dir (str) – directory where the files will be copied

Returns:
- None
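A hypothetical call (the option keys shown are placeholders, since the exact schema of the embeddings section is not documented here):

    from yaset.utils.copy import copy_embedding_models

    embeddings_options = {"pretrained": {"model_path": "embeddings/vectors.txt"}}
    copy_embedding_models(embeddings_options=embeddings_options, output_dir="models/run-01")
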
yaset.utils.data
class yaset.utils.data.NERDataset(mappings: dict = None, instance_conll_file: str = None, debug: bool = None, singleton_replacement_ratio: float = 0.0, bert_use: bool = False, bert_voc_dir: str = None, bert_lowercase: bool = False, pretrained_use: bool = False, char_use: bool = False, elmo_use: bool = False)
Bases: torch.utils.data.dataset.Dataset

create_instance(sequence_buffer: list = None)

extract_singletons()

load_instances()

yaset.utils.data.collate_ner(batch, tok_pad_id: int = None, chr_pad_id_literal: int = None, chr_pad_id_utf8: int = None, bert_use: bool = False, char_use: bool = False, elmo_use: bool = False, pretrained_use: bool = False, options: dict = None)

yaset.utils.data.collate_ner_ensemble(batch, model_mappings: dict = None, model_options: dict = None, reference_id: str = None)
yaset.utils.load
yaset.utils.load.load_model(model_dir: str = None)
Load a single NER model.

Args:
- model_dir (str): NER model directory

Returns:
- NER model

yaset.utils.load.load_model_single(model_dir: str = None, cuda: bool = None)
yaset.utils.logging
class yaset.utils.logging.TrainLogger(tensorboard_path: str = None)
Bases: object

add_checkpoint(step: int = None, checkpoint_payload: dict = None)

add_dev_score(step: int = None, payload: dict = None)

add_histogram(name: str = None, value: str = None, global_step: int = None, bins: str = 'auto')

add_loss(loss_value: float = None, loss_name: str = None, global_step: int = None)

add_other_score_dev(idx_iteration: int = None, score_name: str = None, score_value: float = None)

add_scalar(name: str = None, value: float = None, global_step: int = None)

add_step_values(step: int = None, gs_values: list = None, pred_values: list = None)

close_writer()

do_early_stopping(nb_steps: int = None)

dump_to_disk(custom_log_file: str = None, tensorboard_log_file: str = None)

get_best_step(criterion: str = 'f1', reverse: bool = False)

get_dev_score(step: int = None)

get_last_checkpoint_string(step: int = None)

get_loss(loss_name: str = None, global_step: int = None)

get_step_values(step: int = None)

load_json_file(filepath: str = None)
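A minimal usage sketch (the TensorBoard path is a placeholder; only scalar-style calls are shown, since the payload formats of the other methods are not documented here):

    from yaset.utils.logging import TrainLogger

    logger = TrainLogger(tensorboard_path="runs/exp-01")
    logger.add_loss(loss_value=0.42, loss_name="train-loss", global_step=100)
    logger.add_scalar(name="learning-rate", value=0.001, global_step=100)
    logger.close_writer()
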
yaset.utils.mapping
yaset.utils.mapping.extract_char_mapping(instance_file: str = None)

yaset.utils.mapping.extract_label_mapping(instance_file: str = None)
yaset.utils.mapping.extract_mappings_and_pretrained_matrix(options: dict = None, oov_symbol: str = '<unk>', pad_symbol: str = '<pad>', output_dir: str = None) → (dict, numpy.ndarray)
Extract the pretrained embedding matrix, its size, and the associated mapping.

Parameters:
- output_dir (str) – model output directory
- options (dict) – model parameters
- oov_symbol (str) – symbol to use for OOV tokens (a vector will be created if necessary)
- pad_symbol (str) – symbol to use for padding (a vector will be created if necessary)

Returns:
- pretrained matrix, pretrained matrix size and pretrained matrix mapping

Return type:
- np.ndarray, int, dict
yaset.utils.path
yaset.utils.path.ensure_dir(directory: str) → None
Creates a directory.

Args:
- directory (str): path to create

Returns:
- None
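For example (placeholder path):

    from yaset.utils.path import ensure_dir

    ensure_dir("models/run-01")
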
yaset.utils.training
class yaset.utils.training.Trainer(accumulation_steps: int = None, batch_size: int = None, clip_grad_norm: float = None, cuda: bool = False, dataloader_train: torch.utils.data.dataloader.DataLoader = None, dataloader_dev: torch.utils.data.dataloader.DataLoader = None, eval_function: Callable = None, eval_every_n_steps: int = None, fp16: bool = None, len_dataset_train: int = None, len_dataset_dev: int = None, log_to_stdout_every_n_step: int = None, lr_scheduler: object = None, max_steps: int = None, model: torch.nn.modules.module.Module = None, optimizer: torch.optim.optimizer.Optimizer = None, train_logger: yaset.utils.logging.TrainLogger = None, warmup_scheduler: torch.optim.lr_scheduler.LambdaLR = None, working_dir: str = None)
Bases: object

static clear_model_dir(model_dir)
Remove old model parameter files.

Args:
- model_dir (str): model parameter directory

Returns:
- None

perform_training()

test_on_dev(step_counter: int = None)
yaset.utils.training.cycle(iterable)