Error-tagging of CroLTeC (computer learner corpus of Croatian as a foreign language)

Conference presentation abstract

Category conference contribution (in proceedings)
Type conference presentation abstract
Year 2019
Parent publication E-dictionaries and e-lexicography
Pages p. 103
Status published

Abstract

We describe the error-tagging scheme developed for the CroLTeC
learner corpus (http://teitok.iltec.pt/croltec/index.php?action=home),
the first computer learner corpus of Croatian as a foreign language.
CroLTeC contains essays collected from 755 students with 36 different
mother tongues, among which the most prominent were Spanish, English,
German, Polish, Chinese, French and Arabic. It consists of 6,213 essays,
out of which 1,217 were digitally born, while 4,996 essays were scanned,
transcribed in RTF format and converted into XML format. CroLTeC
has a total of 1,054,287 tokens, and essays were collected at all six
CEFR levels of language learning at Croaticum – Center for Croatian as
Second and Foreign Language at the Faculty of Humanities and Social
Sciences in Zagreb. All CroLTeC essays contain metadata about the title,
number and type of essay (homework, part of exam or field class, etc.).
The data were lemmatized and annotated with morphosyntactic tags using the
RELDI tagger (Ljubešić et al., 2016). The corpus is also searchable by
age, sex, language proficiency level and the mother tongue of the learner.
The error-tagging scheme is partially based on Solar (the developmental
corpus of Slovene) and on the error coding of the Cambridge
Learner Corpus, and is further tailored to the Croatian language. The goal
of developing the error-annotation scheme is to build a sub-corpus that will serve as a repository of authentic data about learners’
interlanguage. It should enable researchers and teachers of Croatian as a
foreign language to explore the interlanguage, to discover the aspects of
the grammar that are the most difficult to master and to tailor teaching
materials to different groups of learners (not only according to their
Croatian language proficiency level, but also to their first language).
Finally, the error-tagged sub-corpus should also serve as a starting point
for designing computer-aided tools that correct lexical errors and the
misuse of verb tenses, phrasal verbs and collocations.

Keywords

Error-tagging; learner corpus