Data

This document describes the input file formats required for train and test instances and for word embeddings.

Train and Test Instances

YASET accepts CoNLL-like formatted data:

one token per line
sequences separated by blank lines

The first column *must* contain the tokens and the last column *must* contain the labels. You can add as many other columns as you need between them; the system ignores them. Columns *must* be separated by tabs.

The example below, extracted from the English part of the CoNLL-2003 Shared Task corpus (Tjong Kim Sang and De Meulder, 2003 [3]), illustrates this format.

...

EU  NNP     I-NP    I-ORG
rejects     VBZ     I-VP    O
German      JJ      I-NP    I-MISC
call        NN      I-NP    O
to  TO      I-VP    O
boycott     VB      I-VP    O
British     JJ      I-NP    I-MISC
lamb        NN      I-NP    O
.   .       O       O

...
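
As an illustration of the rules above, the following is a minimal sketch of a reader for this layout. It is not YASET's own loader; the function name `read_conll` and the error handling are assumptions made for this example. It keeps the first and last tab-separated columns of each line and starts a new sequence at every blank line.

```python
from typing import Iterable, List, Tuple

# One sequence is a list of (token, label) pairs.
Sequence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sequence]:
    """Parse CoNLL-like lines into sequences of (token, label) pairs.

    Hypothetical helper, not part of YASET: the token is taken from the
    first column, the label from the last column, and any columns in
    between are ignored.
    """
    sequences: List[Sequence] = []
    current: Sequence = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sequence, if there is one.
            if current:
                sequences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            raise ValueError(
                f"line {line_no}: expected at least 2 tab-separated columns"
            )
        current.append((columns[0], columns[-1]))
    # Keep the last sequence when the input does not end with a blank line.
    if current:
        sequences.append(current)
    return sequences


sample = (
    "EU\tNNP\tI-NP\tI-ORG\n"
    "rejects\tVBZ\tI-VP\tO\n"
    "\n"
    "German\tJJ\tI-NP\tI-MISC\n"
)
parsed = read_conll(sample.splitlines())
print(parsed)
```

Running a check like this over a file before training is a cheap way to catch lines that were saved with spaces instead of tabs.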

Word Embeddings

YASET supports two word embedding formats:

gensim models (Řehůřek and Sojka, 2010 [4])
word2vec models (Mikolov et al., 2013 [1])

If you want to use other types of embeddings, you must first convert them to one of these two formats. For instance, if you have computed word embeddings using GloVe (Pennington et al., 2014 [2]), you can convert the file to word2vec text format with the conversion script provided in the gensim library.
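
To show what such a conversion involves, here is a minimal sketch that does not depend on gensim. It relies on the fact that the GloVe text format has one word per line followed by its space-separated vector values, while the word2vec text format has the same lines preceded by a header giving the vocabulary size and the vector dimension. The function name `glove_to_word2vec` is hypothetical and is not the gensim script; the sketch also assumes a seekable input stream and tokens that contain no spaces.

```python
import io
from typing import TextIO, Tuple


def glove_to_word2vec(glove: TextIO, out: TextIO) -> Tuple[int, int]:
    """Copy GloVe text vectors to `out` in word2vec text format.

    Hypothetical helper for illustration only. Returns the pair
    (vocabulary size, vector dimension) written in the header.
    """
    count, dim = 0, 0
    # First pass: count the vectors and check that they share one dimension.
    for line in glove:
        parts = line.rstrip("\r\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        if count == 0:
            dim = len(parts) - 1
        elif len(parts) - 1 != dim:
            raise ValueError(f"inconsistent vector size for {parts[0]!r}")
        count += 1

    # Second pass: write the header, then copy the vectors unchanged.
    glove.seek(0)
    out.write(f"{count} {dim}\n")
    for line in glove:
        stripped = line.rstrip("\r\n")
        if len(stripped.split(" ")) < 2:
            continue
        out.write(stripped + "\n")
    return count, dim


source = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
target = io.StringIO()
header = glove_to_word2vec(source, target)
print(header)
print(target.getvalue(), end="")
```

Reading the input twice avoids holding the whole vocabulary in memory, which matters for large pretrained files. With real files, open both streams in text mode with an explicit `encoding="utf-8"`.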

References

[1]	Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. Computing Research Repository, 2013.

[2]	Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Doha, Qatar, 2014. Association for Computational Linguistics.

[3]	Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2003, 142–147. Edmonton, Canada, 2003. Association for Computational Linguistics.

[4]	Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta, 2010. European Language Resources Association.