Tehničko veleučilište u Zagrebu · Zagreb

Error Pattern Discovery in Spellchecking Using Multi-Class Confusion Matrix Analysis for the Croatian Language

original scientific paper

Category journal contribution
Type original scientific paper
Year 2024
Journal Computers (Basel)
Volume 13
Issue 2
Pages 39, 23
DOI 10.3390/computers13020039
EISSN 2073-431X
Status published

Abstract

This paper introduces a novel approach to the creation and application of confusion matrices for error pattern discovery in spellchecking for the Croatian language. The experimental dataset has been derived from a corpus of mistyped words and user corrections collected since 2008 using the Croatian spellchecker available at ispravi.me. The important role of confusion matrices in enhancing the precision of spellcheckers, particularly within the diverse linguistic context of the Croatian language, is investigated. Common causes of spelling errors, emphasizing the challenges posed by diacritic usage, have been identified and analyzed. This research contributes to the advancement of spellchecking technologies and provides a more comprehensive understanding of linguistic details, particularly in languages with diacritic-rich orthographies, like Croatian. The presented user-data-driven approach demonstrates the potential for custom spellchecking solutions, especially considering the ever-changing dynamics of language use in digital communication.
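To make the idea of a character-level confusion matrix concrete, the following is a minimal sketch and not the authors' method. It assumes the input is a list of (mistyped word, user correction) pairs and counts only one-for-one character substitutions in pairs of equal length. The function name `substitution_confusions` and the sample pairs are hypothetical illustrations and are not taken from the ispravi.me corpus.

```python
import unicodedata
from collections import Counter


def substitution_confusions(pairs):
    """Count (typed_char, intended_char) substitutions.

    Simplifying assumption: only pairs of equal length are compared,
    position by position. Insertions, deletions, and transpositions
    would need an alignment step that is omitted here.
    """
    counts = Counter()
    for typed, corrected in pairs:
        # Normalize to NFC so that letters such as "č" are single code points.
        typed = unicodedata.normalize("NFC", typed)
        corrected = unicodedata.normalize("NFC", corrected)
        if len(typed) != len(corrected):
            continue  # skipped: lengths differ, no alignment attempted
        for t, c in zip(typed, corrected):
            if t != c:
                counts[(t, c)] += 1
    return counts


# Hypothetical sample pairs, chosen to illustrate omitted diacritics.
sample_pairs = [
    ("cetiri", "četiri"),
    ("macka", "mačka"),
    ("kuca", "kuća"),
    ("zelim", "želim"),
    ("sto", "što"),
    ("danass", "danas"),  # length mismatch, ignored by this sketch
]

confusions = substitution_confusions(sample_pairs)
for (typed_char, intended_char), n in confusions.most_common():
    print(f"{typed_char} -> {intended_char}: {n}")
```

Each key in `confusions` corresponds to one cell of a confusion matrix whose rows are typed characters and whose columns are intended characters. Sorting the cells by count is one simple way to surface the most frequent error patterns.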

Keywords

natural language processing; spellchecking; confusion matrix; Zipf–Mandelbrot law; spelling errors; language properties