Tehničko veleučilište u Zagrebu · Zagreb

Language identification: how to distinguish similar languages?

original scientific paper

Category conference contribution (in proceedings)
Type original scientific paper
Year 2007
Parent publication Proceedings of the 29th International Conference on Information Technology Interfaces : ITI 2007
Volume 1
Pages 541-546
DOI 10.1109/ITI.2007.4283829
Status published

Abstract

This paper discusses the problem of identifying Croatian, a language that even state-of-the-art language identification tools find hard to distinguish from similar languages such as Serbian, Slovenian, or Slovak. We developed a tool that checks each document against a list of the most frequent Croatian words, with a threshold that the document must satisfy. To this we added a specific-character elimination rule, second-order Markov model classification, and a rule of forbidden words. The resulting tool outperforms current tools in discriminating between these similar languages.
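
The four stages named in the abstract can be pictured as a chain of filters. The following is only a minimal sketch under stated assumptions, not the authors' tool: the word lists, the character set, the threshold `min_ratio`, the smoothing constant `ALPHABET_SIZE`, and the names `MarkovModel` and `is_croatian` are all hypothetical, and the character-level Markov model and the reading of the elimination rule (reject documents containing letters that Croatian does not use) are assumptions as well.

```python
import math
import re
from collections import Counter

# Illustrative placeholders only: none of these lists or thresholds
# come from the paper.
FREQUENT_HR = {"i", "je", "u", "se", "na", "da", "za", "su", "od", "s"}
FOREIGN_CHARS = set("áäéíóôúýĺľňŕťďě")  # letters Croatian does not use
FORBIDDEN = {"takođe", "uopšte", "tudi", "kjer"}  # hypothetical non-Croatian words
ALPHABET_SIZE = 64  # assumed alphabet size used for smoothing


def normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def tokenize(text):
    """Split into lowercase alphabetic tokens (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


class MarkovModel:
    """Character-level second-order Markov model with add-one smoothing."""

    def __init__(self, training_text):
        text = normalize(training_text)
        n = len(text) - 2
        # Count each character trigram and the two-character context before it.
        self.trigrams = Counter(text[i:i + 3] for i in range(n))
        self.contexts = Counter(text[i:i + 2] for i in range(n))

    def log_prob(self, text):
        """Sum of log P(c_i | c_{i-2} c_{i-1}) over the text."""
        text = normalize(text)
        total = 0.0
        for i in range(len(text) - 2):
            seen = self.trigrams[text[i:i + 3]] + 1
            ctx = self.contexts[text[i:i + 2]] + ALPHABET_SIZE
            total += math.log(seen / ctx)
        return total


def is_croatian(text, models, min_ratio=0.15):
    """Apply the four checks in order; `models` maps language code to model."""
    words = tokenize(text)
    if not words:
        return False
    # 1. Most frequent words: enough tokens must come from the Croatian list.
    if sum(w in FREQUENT_HR for w in words) / len(words) < min_ratio:
        return False
    # 2. Character elimination: reject on any letter foreign to Croatian.
    if any(ch in FOREIGN_CHARS for ch in text.lower()):
        return False
    # 3. Markov classification: the Croatian model must score highest.
    best = max(models, key=lambda lang: models[lang].log_prob(text))
    if best != "hr":
        return False
    # 4. Forbidden words: a single listed word rejects the document.
    return not any(w in FORBIDDEN for w in words)
```

In use, `models` would hold one `MarkovModel` per candidate language, each trained on a corpus of that language, with the Croatian model stored under the key `"hr"`. Putting the cheap word and character checks first means the Markov comparison only runs on documents that already look plausibly Croatian.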

Keywords

Written language identification; Croatian language; second-order Markov model; web-corpus; most frequent words method; forbidden words method