Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina
July 25, 2017
diverse and representative corpus of contemporary Czech with manageable size
| Category | # |
|---|---|
| Tokens | 10.8 M |
| Words (excl. punct.) | 9 M |
| Word forms (types) | 508 K |
| Lemmas (types) | 204 K |
| Sentences | 714 K |
| Text samples | 3,334 |
| Min. sample length | 1,000 |
| Max. sample length | 4,731 |
| divisions (classes) | % tokens | texts |
|---|---|---|
| Fiction (8) | 18 % | 564 |
| Non-fiction (15) | 33 % | 1067 |
| Journalistic (12) | 27 % | 844 |
| Letters (1) | 2 % | 109 |
| divisions (classes) | % tokens | texts |
|---|---|---|
| Multi-directional (3) | 6.7 % | 263 |
| Uni-directional (2) | 4.2 % | 158 |
| divisions (classes) | % tokens | texts |
|---|---|---|
| Interactive (3) | 6.7 % | 258 |
| Non-interactive (1) | 2.3 % | 71 |
More than 140 features, 130 used in pilot study
Type-based features – repertoires of pronouns, prepositions, conjunctions (normalised by zTTR, Cvrček & Chlumská 2015)
Lexical richness – Yule's K, thematic concentration (Popescu et al. 2007), repertoire of unigrams and bigrams (zTTR)
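As an illustration of one of the lexical richness measures named above, the following is a minimal sketch of Yule's K in its textbook form, K = 10^4 · (Σ m²·V_m − N) / N², where N is the number of tokens and V_m is the number of types occurring exactly m times. The function name `yules_k` and the plain-list input are assumptions made for this example; the tokenisation and any preprocessing used in the actual study are not specified here.

```python
from collections import Counter


def yules_k(tokens):
    """Return Yule's K for a sequence of tokens (hypothetical helper).

    K = 10^4 * (sum over m of m^2 * V_m - N) / N^2
    N   = total number of tokens
    V_m = number of types that occur exactly m times
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty sample")

    type_freqs = Counter(tokens)             # type -> frequency m
    spectrum = Counter(type_freqs.values())  # m -> V_m (frequency spectrum)

    # Second moment of the frequency spectrum
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


if __name__ == "__main__":
    sample = "a b c d".split()
    print(yules_k(sample))  # every type occurs once, so K is 0.0
```

Lower values indicate a richer vocabulary: a sample in which every token is a different type gives K = 0, while a sample consisting of a single repeated type gives the highest value for its length.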
Most prominent features:
| Features | Loadings |
|---|---|
| indicative forms | 0.91 |
| finite verbs | 0.88 |
| adverbs of time | 0.86 |
| verbal aspect | 0.75 |
| 3rd person pronouns | 0.70 |
| adverbs | 0.69 |
| Features | Loadings |
|---|---|
| nominal post-modifiers without agreement | -0.94 |
| abstract nouns | -0.92 |
| nouns: genitive | -0.83 |
| verbal nouns | -0.80 |
| noun pre-modifiers with agreement | -0.77 |
| complex prepositions | -0.75 |
Most prominent features:
| Features | Loadings |
|---|---|
| contact expressions | 0.93 |
| v- prothesis | 0.89 |
| ý > ej vowel breaking (diphthongisation) in endings | 0.86 |
| é > í narrowing in endings | 0.81 |
| locative adverbs | 0.76 |
| fillers | 0.76 |
| Features | Loadings |
|---|---|
| clauses with wh-adverbs | -0.52 |
| nouns: accusative | -0.50 |
| nominal cases with prepositions | -0.48 |
| preposition | -0.44 |
| verbal aspect | -0.40 |
| unigrams | -0.40 |
Most prominent features:
| Features | Loadings |
|---|---|
| hypotactic correlative connectives | 0.65 |
| repertoire of conjunctions | 0.56 |
| repertoire of pronouns | 0.54 |
| verbs: conditional | 0.46 |
| predicative nouns | 0.46 |
| verbal predicate completed by clause | 0.44 |
| Features | Loadings |
|---|---|
| numerals | -0.43 |
| adjectives denoting similarity | -0.42 |
| clusters of same-case adjectives | -0.38 |
This presentation resulted