Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings

Authors: Lefever, Els
Labat, Sofie
Singh, Pranaydeep
Document type: conference
Publication date: 2020
Publisher: European Language Resources Association (ELRA)
Keywords: Languages and Literatures / LT3 / cognate detection / multi-layer perceptron / orthographic similarity / cross-lingual word embeddings
Language: English
Permalink: https://search.fid-benelux.de/Record/base-29033471
Data source: BASE; original catalogue
Link(s): https://biblio.ugent.be/publication/8662200

This paper investigates the validity of combining more traditional orthographic information with cross-lingual word embeddings to identify cognate pairs in English-Dutch and French-Dutch. In a first step, lists of potential cognate pairs in English-Dutch and French-Dutch are manually labelled. The resulting gold standard is used to train and evaluate a multi-layer perceptron that can distinguish cognates from non-cognates. Fifteen orthographic features capture string similarities between source and target words, while the cosine similarity between their word embeddings represents the semantic relation between these words. By adding domain-specific information to pretrained fastText embeddings, we are able to obtain good embeddings for words that did not yet have a pretrained embedding (e.g. Dutch compound nouns). These embeddings are then aligned in a cross-lingual vector space by exploiting their structural similarity (cf. adversarial learning). Our results indicate that although the classifier already achieves good results on the basis of orthographic information, the performance further improves by including semantic information in the form of cross-lingual word embeddings.
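The abstract combines two kinds of evidence per word pair: orthographic string similarity between source and target word, and the cosine similarity of their (aligned) embeddings. The paper's fifteen orthographic features are not enumerated here, so the sketch below is a hedged illustration, not the authors' feature set: it computes one representative orthographic feature (normalized Levenshtein similarity) and the cosine similarity, the kind of values such a feature vector for the multi-layer perceptron would contain.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def orth_similarity(a: str, b: str) -> float:
    """Edit distance normalized by the longer string, as a similarity in [0, 1]."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical example pair: English "apple" vs Dutch "appel",
# with made-up 3-dimensional embeddings for illustration only.
emb_en = [0.2, 0.7, 0.1]
emb_nl = [0.25, 0.65, 0.12]
features = [orth_similarity("apple", "appel"), cosine(emb_en, emb_nl)]
```

A real system would extend `features` with the remaining orthographic measures and feed the resulting vectors to a trained multi-layer perceptron (e.g. scikit-learn's `MLPClassifier`) to predict the cognate/non-cognate label.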