Multi-dimensional analysis of Czech. Pilot study

Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina
July 25, 2017

Multi-dimensional analysis

  • study of register variation (Biber 1991; Biber & Conrad 2009)
  • interplay between register and genre (intratextual × extratextual)
  • first application to a Slavic language – Czech: rich inflection, distinctive morphology, close to diglossia (Bermel 2014)

Presentation overview

  • corpus compilation
  • feature operationalization
  • preliminary results
  • project outlook

The Koditex Corpus

  • diverse and representative corpus of contemporary Czech of manageable size
  • annotation: lemmatization, morphological tagging, phraseme annotation, NE recognition
  • tools: KonText, MorphoDiTa, ad-hoc scripts (Python, R, Perl)

Category                Count
Tokens                  10.8 M
Words (excl. punct.)    9 M
Word forms (types)      508 K
Lemmas (types)          204 K
Sentences               714 K
Text samples            3,334
Min. sample length      1,000
Max. sample length      4,731

Corpus composition: Sampling

  • goal: maximum diversity
  • between-stratum diversity: each class sampled separately
  • within-stratum diversity: sampled according to dissimilarity measure based on metadata
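
One way to picture within-stratum sampling by dissimilarity is a greedy farthest-point selection. This is a minimal sketch only: the slides do not specify the dissimilarity measure or the selection procedure, so the `dissimilarity` function (share of differing metadata fields), the metadata fields and the function names below are hypothetical.

```python
from typing import Dict, List

Metadata = Dict[str, str]


def dissimilarity(a: Metadata, b: Metadata) -> float:
    """Share of metadata fields on which two texts differ (assumed measure)."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys)


def sample_diverse(texts: List[Metadata], n: int) -> List[int]:
    """Greedy farthest-point selection of n mutually dissimilar texts.

    Returns indices into `texts`. Starts from the first text; a random
    starting point would work just as well.
    """
    if not texts or n <= 0:
        return []
    chosen = [0]
    while len(chosen) < min(n, len(texts)):
        best, best_score = -1, -1.0
        for i in range(len(texts)):
            if i in chosen:
                continue
            # distance to the closest text that is already in the sample
            score = min(dissimilarity(texts[i], texts[j]) for j in chosen)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


# Hypothetical metadata for four texts of one class (stratum)
stratum = [
    {"author": "A", "year": "2010", "publisher": "X"},
    {"author": "A", "year": "2010", "publisher": "Y"},
    {"author": "B", "year": "2015", "publisher": "Z"},
    {"author": "A", "year": "2011", "publisher": "X"},
]
picked = sample_diverse(stratum, 2)
```

Here the third text differs from the first on every field, so it is picked second. Between-stratum diversity would then amount to running the selection once per class.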

Corpus composition: Written mode

80 % of tokens, 2,584 texts

Division (classes)    % tokens    Texts
Fiction (8)           18 %        564
Non-fiction (15)      33 %        1,067
Journalistic (12)     27 %        844
Letters (1)           2 %         109

Corpus composition: Internet

11 % of tokens, 421 texts

Division (classes)      % tokens    Texts
Multi-directional (3)   6.7 %       263
Uni-directional (2)     4.2 %       158

Corpus composition: Spoken mode

9 % of tokens, 329 texts

Division (classes)      % tokens    Texts
Interactive (3)         6.7 %       258
Non-interactive (1)     2.3 %       71

What is a text? – Written mode

  • uninterrupted samples of 2,000–5,000 words (respecting sentence boundaries)
  • for letters and administrative texts, samples as short as 1,000 tokens were allowed
  • equal representation of beginnings, middle portions and endings of texts
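
Cutting an uninterrupted sample that respects sentence boundaries could look roughly like this. It is a sketch under assumptions: the function name, the representation of a text as a list of tokenised sentences, and the "take whole sentences until the upper limit is reached" strategy are not taken from the slides.

```python
from typing import List

Sentence = List[str]


def sample_chunk(sentences: List[Sentence], start: int,
                 min_len: int = 2000, max_len: int = 5000) -> List[Sentence]:
    """Take consecutive whole sentences from `start` without exceeding max_len.

    Returns an empty list if fewer than min_len words can be collected,
    e.g. when `start` is too close to the end of the text.
    """
    chunk: List[Sentence] = []
    size = 0
    for sent in sentences[start:]:
        if size + len(sent) > max_len:
            break  # adding this sentence would overshoot the limit
        chunk.append(sent)
        size += len(sent)
    return chunk if size >= min_len else []


# Toy text: 100 sentences of 10 words each, with scaled-down limits
sentences = [["w"] * 10 for _ in range(100)]
chunk = sample_chunk(sentences, start=0, min_len=50, max_len=95)
```

Varying `start` across texts (near the beginning, the middle, or far enough from the end) is one conceivable way to balance beginnings, middle portions and endings.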

What is a text? – Spoken mode

  • samples of 2,000–5,000 words from utterances by one speaker within one dialogue
  • discontinuous, as the lines of the other participant(s) in the dialogue were removed
  • text as a result of one author’s effort (× situation)

What is a text? – Internet

  • posts (fb, forum, discussions) – grouped by author and time of day into longer “texts” of 2,000–5,000 words
  • blogs, Wikipedia – uninterrupted samples of 2,000–5,000 words from one post or one article
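
Grouping short posts into longer "texts" by author and time of day might be sketched as follows. The four six-hour slots in `day_part`, the tuple layout of a post and the example data are hypothetical; the slides do not say how the time of day was binned.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

Post = Tuple[str, datetime, str]  # (author, timestamp, body)


def day_part(ts: datetime) -> str:
    """Assumed binning of the time of day into four 6-hour slots."""
    return ("night", "morning", "afternoon", "evening")[ts.hour // 6]


def group_posts(posts: List[Post]) -> Dict[Tuple[str, str], List[str]]:
    """Concatenate the words of posts sharing author and part of day."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for author, ts, body in sorted(posts, key=lambda p: p[1]):
        groups[(author, day_part(ts))].extend(body.split())
    return dict(groups)


posts = [
    ("anna", datetime(2016, 5, 1, 9, 30), "dobré ráno všem"),
    ("anna", datetime(2016, 5, 2, 8, 5), "zase pondělí"),
    ("petr", datetime(2016, 5, 1, 22, 0), "dobrou noc"),
]
texts = group_posts(posts)
```

Only groups that reach the 2,000–5,000 word range would then be kept as texts.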

Linguistic features

More than 140 features, 130 used in the pilot study

  • phonetic – v- prothesis, é > í narrowing, ý > ej vowel breaking…
  • morphological – cases, numbers, mood, tense, comparison…
  • derivation – adjectives denoting similarity, verbal nouns, diminutives…
  • lexicon – indefinite pron., thinking & reporting verbs, taboo expressions…

Linguistic features (cont'd)

  • pragmatics – contact expressions, fillers, intensifiers, downtoners…
  • syntax – types of noun modifiers, clusters of POS, wh-adverbs…
  • text/discourse – questions, phraseology, word repetition…

Type-based features – repertoires of pronouns, prepositions, conjunctions (normalised by zTTR, Cvrček & Chlumská 2015)
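
A rough sketch of the idea behind zTTR, assuming it is a z-score of the observed type-token ratio against reference TTR values obtained from texts of the same length. The reference values below are made up, and the exact procedure of Cvrček & Chlumská (2015) may differ in detail.

```python
from statistics import mean, stdev
from typing import Sequence


def ttr(tokens: Sequence[str]) -> float:
    """Plain type-token ratio."""
    return len(set(tokens)) / len(tokens)


def z_ttr(tokens: Sequence[str], reference_ttrs: Sequence[float]) -> float:
    """Observed TTR expressed in standard deviations from the reference mean.

    `reference_ttrs` is assumed to hold TTR values of reference samples
    that have the same length as `tokens`.
    """
    return (ttr(tokens) - mean(reference_ttrs)) / stdev(reference_ttrs)


tokens = "a b a c".split()        # 3 types / 4 tokens
reference = [0.5, 0.6, 0.7]       # hypothetical reference TTRs
z = z_ttr(tokens, reference)
```

For a repertoire feature, `tokens` would presumably be restricted to one word class, e.g. all pronoun tokens of a text.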

Lexical richness – Yule's K, thematic concentration (Popescu et al. 2007), repertoire of unigrams and bigrams (zTTR)
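
For reference, Yule's K in its usual formulation is K = 10^4 · (Σ m²·V_m − N) / N², where N is the number of tokens and V_m the number of types occurring exactly m times. A minimal sketch (the function name is illustrative, and the slides do not say which implementation was used):

```python
from collections import Counter
from typing import Sequence


def yules_k(tokens: Sequence[str]) -> float:
    """Yule's K: higher values mean more repetition, i.e. lower richness."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty text")
    type_freqs = Counter(tokens)              # type -> frequency
    spectrum = Counter(type_freqs.values())   # m -> V_m
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


k_repetitive = yules_k("a a b c".split())
k_all_unique = yules_k("a b c d".split())
```

A text with no repeated word gets K = 0, and K grows as a few types account for more of the tokens.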

Dimension 1: Dynamic vs. static

Most prominent features:

Positive

Feature                 Loading
indicative forms        0.91
finite verbs            0.88
adverbs of time         0.86
verbal aspect           0.75
3rd person pronouns     0.70
adverbs                 0.69


Negative

Feature                                     Loading
nominal post-modifiers without agreement    -0.94
abstract nouns                              -0.92
nouns: genitive                             -0.83
verbal nouns                                -0.80
noun pre-modifiers with agreement           -0.77
complex prepositions                        -0.75
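
In Biber-style multi-dimensional analysis, a text's position on a dimension is commonly obtained by standardising each feature across the corpus and subtracting the summed z-scores of negatively loading features from those of positively loading ones. The slides do not state how the scores were computed here, so the following is only a sketch of that common practice with invented frequencies.

```python
from statistics import mean, stdev
from typing import Dict, List


def zscores(values: List[float]) -> List[float]:
    """Standardise one feature across all texts."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(rates: Dict[str, List[float]],
                     positive: List[str],
                     negative: List[str]) -> List[float]:
    """Sum of z-scores of positive features minus sum for negative ones."""
    z = {f: zscores(rates[f]) for f in positive + negative}
    n_texts = len(next(iter(z.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]


# Invented normalised frequencies of two features in three texts
rates = {
    "finite verbs": [10.0, 20.0, 30.0],
    "abstract nouns": [30.0, 20.0, 10.0],
}
scores = dimension_scores(rates, ["finite verbs"], ["abstract nouns"])
```

With these toy numbers the third text, rich in finite verbs and poor in abstract nouns, lands on the positive ("dynamic") end.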

Distribution of text types:

[Figure: distribution of text types along Dimension 1]

Dimension 2: Interactive vs. Non-situationally anchored

Most prominent features:

Positive

Feature                                               Loading
contact expressions                                   0.93
v- prothesis                                          0.89
ý > ej vowel breaking (diphthongisation) in endings   0.86
é > í narrowing in endings                            0.81
locative adverbs                                      0.76
fillers                                               0.76


Negative

Feature                             Loading
clauses with wh-adverbs             -0.52
nouns: accusative                   -0.50
nominal cases with prepositions     -0.48
preposition                         -0.44
verbal aspect                       -0.40
unigrams                            -0.40

Distribution of text types:

[Figure: distribution of text types along Dimension 2]

Dimension 1 vs. Dimension 2

[Figure: Dimension 1 plotted against Dimension 2]

Dimension 3: Low vs. high level of cohesion

Most prominent features:

Positive

Feature                                 Loading
hypotactic correlative connectives      0.65
repertoire of conjunctions              0.56
repertoire of pronouns                  0.54
verbs: conditional                      0.46
predicative nouns                       0.46
verbal predicate completed by clause    0.44


Negative

Feature                             Loading
numerals                            -0.43
adjectives denoting similarity      -0.42
clusters of same-case adjectives    -0.38

Distribution of text types:

[Figure: distribution of text types along Dimension 3]

Project outlook

  • revision of (some) operationalizations
  • assessing recall and precision for features
  • finding optimal number of factors/dimensions
  • validation of factor analysis
  • intratextual classification (as a complement to extralinguistic classification)
  • release of the corpus (with tagged features) + tool for interpretation
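
Assessing recall and precision of a feature's operationalization could be done against a manually annotated sample, along these lines. The sketch assumes that both the automatic hits and the manual annotations are available as sets of token positions; the names and the data are illustrative.

```python
from typing import Set, Tuple


def precision_recall(found: Set[int], gold: Set[int]) -> Tuple[float, float]:
    """Precision and recall of automatically found hits against gold annotation."""
    true_pos = len(found & gold)
    precision = true_pos / len(found) if found else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


found = {1, 2, 3, 4}   # token positions matched by the feature query
gold = {2, 3, 5}       # token positions marked by a human annotator
precision, recall = precision_recall(found, gold)
```

Low precision points to a query that overgenerates, low recall to one that misses instances; either could motivate the revision of operationalizations listed above.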

References

  • Bermel, N. (2014): Czech Diglossia: Dismantling or Dissolution? In J. Árokay et al. (eds), Divided Languages? Springer.
  • Biber, D. (1991): Variation across speech and writing. Cambridge: Cambridge University Press.
  • Biber, D. & Conrad, S. (2009): Register, Genre, and Style. New York, NY: Cambridge University Press.
  • Cvrček, V. & Chlumská, L. (2015): Simplification in translated Czech: a new approach to type-token ratio. Russian Linguistics 39/3, 309–325.
  • Popescu, I., Best, K. & Altmann, G. (2007): On the dynamics of word classes in texts. Glottometrics 14, 58–71.

Acknowledgements

This presentation resulted

  • from the implementation of the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures and
  • from the implementation of the project Language Variation in the CNC (reg. no. CZ.02.1.01/0.0/0.0/16_013/0001758) supported from the Operational Programme Research, Development and Education within the call 02_16_013 “Research infrastructures”.

Thank you for your attention!