Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina
July 25, 2017
diverse and representative corpus of contemporary Czech with manageable size
Category | # |
---|---|
Tokens | 10.8 M |
Words (excl. punct.) | 9 M |
Word forms (types) | 508 K |
Lemmas (types) | 204 K |
Sentences | 714 K |
Text samples | 3,334 |
Min. sample length | 1,000 |
Max. sample length | 4,731 |
divisions (classes) | % tokens | texts |
---|---|---|
Fiction (8) | 18 % | 564 |
Non-fiction (15) | 33 % | 1,067 |
Journalistic (12) | 27 % | 844 |
Letters (1) | 2 % | 109 |
divisions (classes) | % tokens | texts |
---|---|---|
Multi-directional (3) | 6.7 % | 263 |
Uni-directional (2) | 4.2 % | 158 |
divisions (classes) | % tokens | texts |
---|---|---|
Interactive (3) | 6.7 % | 258 |
Non-interactive (1) | 2.3 % | 71 |
More than 140 features, 130 used in pilot study
Type-based features – repertoires of pronouns, prepositions, conjunctions (normalised by zTTR, Cvrček & Chlumská 2015)
Lexical richness – Yule's K, thematic concentration (Popescu et al. 2007), repertoire of unigrams and bigrams (zTTR)
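As an illustration of one of the lexical richness measures named above, Yule's K can be sketched as follows. This is a minimal sketch using the textbook formula K = 10^4 · (Σ m²·V(m) − N) / N², where N is the number of tokens and V(m) is the number of types occurring exactly m times. The function name `yules_k` and the pre-tokenised input are assumptions for illustration; the slides do not say how the measure was actually computed for the corpus.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K for a list of tokens (hypothetical helper, textbook formula)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot compute Yule's K for an empty text")
    # Frequency of each type in the sample.
    type_freqs = Counter(tokens)
    # Frequency spectrum: V(m) = number of types that occur exactly m times.
    spectrum = Counter(type_freqs.values())
    sum_m2_vm = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (sum_m2_vm - n) / (n * n)


# Toy example: one type repeated, two hapaxes.
print(yules_k(["a", "a", "b", "c"]))
```

Higher values indicate more repetition (lower lexical richness); a sample in which every token is a distinct type yields 0.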
Most prominent features:
Features | Loadings |
---|---|
indicative forms | 0.91 |
finite verbs | 0.88 |
adverbs of time | 0.86 |
verbal aspect | 0.75 |
3rd person pronouns | 0.70 |
adverbs | 0.69 |
Features | Loadings |
---|---|
nominal post-modifiers without agreement | -0.94 |
abstract nouns | -0.92 |
nouns: genitive | -0.83 |
verbal nouns | -0.80 |
noun pre-modifiers with agreement | -0.77 |
complex prepositions | -0.75 |
Most prominent features:
Features | Loadings |
---|---|
contact expressions | 0.93 |
v- prothesis | 0.89 |
ý > ej vowel breaking (diphthongisation) in endings | 0.86 |
é > í narrowing in endings | 0.81 |
locative adverbs | 0.76 |
fillers | 0.76 |
Features | Loadings |
---|---|
clauses with wh-adverbs | -0.52 |
nouns: accusative | -0.50 |
nominal cases with prepositions | -0.48 |
preposition | -0.44 |
verbal aspect | -0.40 |
unigrams | -0.40 |
Most prominent features:
Features | Loadings |
---|---|
hypotactic correlative connectives | 0.65 |
repertoire of conjunctions | 0.56 |
repertoire of pronouns | 0.54 |
verbs: conditional | 0.46 |
predicative nouns | 0.46 |
verbal predicate completed by clause | 0.44 |
Features | Loadings |
---|---|
numerals | -0.43 |
adjectives denoting similarity | -0.42 |
clusters of same-case adjectives | -0.38 |
This presentation resulted