Multi-dimensional analysis of Czech. Pilot study

Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina
July 25, 2017

Multi-dimensional analysis

  • study of register variation (Biber 1991; Biber & Conrad 2009)
  • interplay between register and genre (intratextual × extratextual)
  • first application to a Slavic language – Czech: rich inflection, distinctive morphology, close to diglossia (Bermel 2014)

Overview

  • corpus compilation
  • feature operationalization
  • preliminary results
  • project outlook

The Koditex Corpus

A diverse and representative corpus of contemporary Czech of manageable size

Category                  Count
Tokens                    10.8 M
Words (excl. punct.)      9 M
Word forms (types)        508 K
Lemmas (types)            204 K
Sentences                 714 K
Text samples              3,334
Min. sample length        1,000
Max. sample length        4,731
  • annotation: lemmatization, morphological tagging, phraseme annotation, NE recognition
  • tools: KonText, MorphoDiTa, ad-hoc scripts (R, shiny, perl)

Corpus composition

Written (80 % of tokens, 2584 texts)

Divisions (classes)    % tokens    Texts
Fiction (8)            18 %        564
Non-fiction (15)       33 %        1067
Journalistic (12)      27 %        844
Letters (1)            2 %         109

Internet (11 % of tokens, 421 texts)

Divisions (classes)      % tokens    Texts
Multi-directional (3)    6.7 %       263
Uni-directional (2)      4.2 %       158

Spoken (9 % of tokens, 329 texts)

Divisions (classes)    % tokens    Texts
Interactive (3)        6.7 %       258
Non-interactive (1)    2.3 %       71


Sampling

  • goal: maximum diversity
  • between-stratum diversity: each class sampled separately
  • within-stratum diversity: sampled according to dissimilarity measure based on metadata
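
The within-stratum step can be illustrated with a greedy farthest-point selection. This is only a sketch: the slides do not specify the dissimilarity measure or the selection procedure, so the mismatch-based `dissimilarity`, the metadata field names and the function `sample_diverse` are all hypothetical.

```python
def dissimilarity(a, b, fields):
    """Share of metadata fields on which two texts differ (0 = same, 1 = all differ).

    ASSUMPTION: a simple mismatch ratio; the measure used for Koditex is not given.
    """
    return sum(a[f] != b[f] for f in fields) / len(fields)


def sample_diverse(texts, fields, n):
    """Greedily pick n texts from one stratum so that each new pick is as
    dissimilar as possible from the texts already chosen."""
    if n <= 0 or not texts:
        return []
    chosen = [texts[0]]          # arbitrary starting point
    remaining = list(texts[1:])
    while remaining and len(chosen) < n:
        # distance of a candidate to the chosen set = distance to its nearest member
        best = max(
            remaining,
            key=lambda t: min(dissimilarity(t, c, fields) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical metadata for one stratum (class)
FIELDS = ("author", "year", "publisher")
stratum = [
    {"id": "a", "author": "X", "year": 2010, "publisher": "P1"},
    {"id": "b", "author": "X", "year": 2010, "publisher": "P1"},
    {"id": "c", "author": "Y", "year": 2012, "publisher": "P2"},
    {"id": "d", "author": "X", "year": 2012, "publisher": "P1"},
]
picked = [t["id"] for t in sample_diverse(stratum, FIELDS, 3)]
```

Here text `b` is skipped because its metadata duplicate those of `a`. Between-stratum diversity would follow from running the selection once per class.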

What is a text?

Written mode

  • uninterrupted samples of 2,000–5,000 words (respecting sentence boundaries)
  • for letters and administrative texts, samples as short as 1,000 tokens were allowed
  • equal representation of beginnings, middle portions and endings of texts

Spoken mode

  • samples of 2,000–5,000 words from utterances by one speaker within one dialogue
  • discontinuous, as the lines of the other participant(s) in the dialogue were removed
  • text as a result of one author’s effort (× situation)

Internet communication

  • posts (fb, forum, discussions) – grouped by author and time of day into longer “texts” of 2,000–5,000 words
  • blogs, Wikipedia – uninterrupted samples of 2,000–5,000 words from one post or one article
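
The grouping of short posts into longer "texts" might look roughly like the following. It is a sketch under assumptions: the six-hour time-of-day slots, the tuple layout of `posts` and the whitespace word count are illustrative choices, not the procedure actually used for the corpus.

```python
from collections import defaultdict

MIN_WORDS, MAX_WORDS = 2000, 5000   # sample size range given on the slide


def time_slot(hour, width=6):
    """Map an hour (0-23) to a time-of-day slot. ASSUMPTION: 6-hour slots."""
    return hour // width


def _pack(texts, min_words, max_words):
    """Concatenate posts into samples without exceeding max_words;
    samples outside the allowed size range are discarded."""
    current, count = [], 0
    for text in texts:
        n = len(text.split())
        if count + n > max_words:
            if min_words <= count <= max_words:
                yield " ".join(current)
            current, count = [], 0
        current.append(text)
        count += n
    if min_words <= count <= max_words:
        yield " ".join(current)


def group_posts(posts, min_words=MIN_WORDS, max_words=MAX_WORDS):
    """posts: iterable of (author, hour, text) -> list of (author, slot, sample)."""
    buckets = defaultdict(list)
    for author, hour, text in posts:
        buckets[(author, time_slot(hour))].append(text)
    return [
        (author, slot, sample)
        for (author, slot), texts in buckets.items()
        for sample in _pack(texts, min_words, max_words)
    ]


# Toy data: three night-time posts by u1 add up to one sample; u2 has too little text
post_800 = " ".join(["w"] * 800)
post_500 = " ".join(["w"] * 500)
samples = group_posts(
    [("u1", 1, post_800), ("u1", 2, post_800), ("u1", 3, post_800), ("u2", 13, post_500)]
)
```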

Linguistic features

More than 140 features, 130 used in the pilot study

  • phonetic – v- prothesis, é > í narrowing, ý > ej vowel breaking…
  • morphological – cases, numbers, mood, tense, comparison…
  • derivation – adjectives denoting similarity, verbal nouns, diminutives…
  • lexicon – indefinite pron., thinking & reporting verbs, taboo expressions…
  • pragmatics – contact expressions, fillers, intensifiers, downtoners…
  • syntax – types of noun modifiers, clusters of POS, wh-adverbs…
  • text/discourse – questions, phraseology, word repetition…

Type-based features – repertoires of pronouns, prepositions, conjunctions (normalised by zTTR, Cvrček & Chlumská 2015)

Lexical richness – Yule's K, thematic concentration (Popescu et al. 2007), repertoire of unigrams and bigrams (zTTR)
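
Of the lexical richness measures listed, Yule's K has a simple closed form, K = 10^4 · (Σ m²·V_m − N) / N², where N is the number of tokens and V_m the number of types occurring exactly m times. A minimal sketch follows; tokenisation and the function name `yules_k` are assumptions for illustration, not the project's scripts.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K for a list of tokens: 10^4 * (sum(m^2 * V_m) - N) / N^2."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty text")
    type_freqs = Counter(tokens)              # frequency of each type
    spectrum = Counter(type_freqs.values())   # V_m: number of types with frequency m
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


k_repetitive = yules_k(["a", "a", "b", "b"])   # every type repeated
k_varied = yules_k(["a", "b", "c", "d"])       # no repetition at all
```

Higher values indicate more repetition, so K runs in the opposite direction to type-token measures.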

Dimension 1: Dynamic vs. static

Most prominent features:

Positive

Features Loadings
indicative forms 0.91
finite verbs 0.88
adverbs of time 0.86
verbal aspect 0.75
3rd person pronouns 0.70
adverbs 0.69


Negative

Features Loadings
nominal post-modifiers without agreement -0.94
abstract nouns -0.92
nouns: genitive -0.83
verbal nouns -0.80
noun pre-modifiers with agreement -0.77
complex prepositions -0.75

Distribution of text types:

[Figure: distribution of text types along Dimension 1]
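
To place texts on a dimension, Biber-style MD analysis commonly sums the standardised frequencies of features with salient positive loadings and subtracts those with salient negative loadings. The slides do not say whether scores were computed this way here, so the following is a sketch of that general technique with made-up frequencies; only the feature names are taken from the tables above.

```python
from statistics import mean, pstdev

# Two features from each pole of Dimension 1 (names from the slide)
POSITIVE = ["finite verbs", "adverbs of time"]
NEGATIVE = ["abstract nouns", "verbal nouns"]


def zscores(values):
    """Standardise one feature across all texts."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def dimension_scores(table, positive, negative):
    """table: {feature: [normalised frequency per text, ...]} -> one score per text."""
    z = {f: zscores(table[f]) for f in positive + negative}
    n_texts = len(next(iter(z.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]


# HYPOTHETICAL frequencies for two texts (e.g. a narrative and an administrative text)
table = {
    "finite verbs": [120, 60],
    "adverbs of time": [10, 4],
    "abstract nouns": [20, 80],
    "verbal nouns": [5, 25],
}
scores = dimension_scores(table, POSITIVE, NEGATIVE)
```

In this toy case the first text lands on the dynamic (positive) pole and the second on the static (negative) pole.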

Dimension 2: Interactive vs. Non-situationally anchored

Most prominent features:

Positive

Features Loadings
contact expressions 0.93
v- prothesis 0.89
ý > ej vowel breaking (diphthongisation) in endings 0.86
é > í narrowing in endings 0.81
locative adverbs 0.76
fillers 0.76


Negative

Features Loadings
clauses with wh-adverbs -0.52
nouns: accusative -0.50
nominal cases with prepositions -0.48
preposition -0.44
verbal aspect -0.40
unigrams -0.40

Distribution of text types:

[Figure: distribution of text types along Dimension 2]

Dimension 1 vs. Dimension 2

[Figure: text types plotted on Dimension 1 against Dimension 2]

Dimension 3: Low vs. high level of cohesion

Most prominent features:

Positive

Features Loadings
hypotactic correlative connectives 0.65
repertoire of conjunctions 0.56
repertoire of pronouns 0.54
verbs: conditional 0.46
predicative nouns 0.46
verbal predicate completed by clause 0.44


Negative

Features Loadings
numerals -0.43
adjectives denoting similarity -0.42
clusters of same-case adjectives -0.38

Distribution of text types:

[Figure: distribution of text types along Dimension 3]

Project outlook

  • revision of (some) operationalizations
  • assessing recall and precision for features
  • finding optimal number of factors/dimensions
  • validation of factor analysis
  • intratextual classification (as a complement to extralinguistic classification)
  • release of the corpus (with tagged features) + tool for interpretation
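
Assessing recall and precision for a feature amounts to comparing its automatic hits with a manually annotated sample. A minimal sketch, assuming hits are identified by token positions (the representation and the function name are illustrative):

```python
def precision_recall(gold, predicted):
    """Precision and recall of automatic feature hits against manual annotation."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# HYPOTHETICAL token positions of one feature in a checked sample
gold_hits = {1, 2, 3, 4}        # found by a human annotator
auto_hits = {2, 3, 4, 5, 6}     # found by the automatic query
precision, recall = precision_recall(gold_hits, auto_hits)
```

Low precision points to an operationalization that over-generates, low recall to one that misses instances; both cases would feed into the planned revisions.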

References

  • Bermel, N. (2014): Czech Diglossia: Dismantling or Dissolution? In J. Árokay et al. (eds), Divided Languages? Springer.
  • Biber, D. (1991): Variation across speech and writing. Cambridge: Cambridge University Press.
  • Biber, D. & Conrad, S. (2009): Register, Genre, and Style. New York, NY: Cambridge University Press.
  • Cvrček, V. & Chlumská, L. (2015): Simplification in translated Czech: a new approach to type-token ratio. Russian Linguistics 39/3, p. 309–325.
  • Popescu, I., Best, K. & Altmann, G. (2007): On the dynamics of word classes in texts. Glottometrics 14, p. 58–71.

Thank you for your attention!

This presentation resulted

  • from the implementation of the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures and
  • from the implementation of the project Language Variation in the CNC (reg. no. CZ.02.1.01/0.0/0.0/16_013/0001758) supported from the Operational Programme Research, Development and Education within the call 02_16_013 “Research infrastructures”.
