Multi-dimensional analysis of Czech. Pilot study

Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina
July 25, 2017

Multi-dimensional analysis

study of register variation (Biber 1991; Biber & Conrad 2009)
interplay between register and genre (intratextual × extratextual)
first application to a Slavic language – Czech: rich inflection, distinctive morphology, close to diglossia (Bermel 2014)

Overview

corpus compilation
feature operationalization
preliminary results
project outlook

The Koditex Corpus

diverse and representative corpus of contemporary Czech with manageable size

Category	#
Tokens	10.8 M
Words (excl. punct.)	9 M
Word forms (types)	508 K
Lemmas (types)	204 K
Sentences	714 K
Text samples	3,334
Min. sample length	1,000
Max. sample length	4,731

annotation: lemmatization, morphological tagging, phraseme annotation, NE recognition
tools: KonText, MorphoDiTa, ad-hoc scripts (R, shiny, perl)

Corpus composition

Written (80 % of tokens, 2584 texts)

divisions (classes)	% tokens	texts
Fiction (8)	18 %	564
Non-fiction (15)	33 %	1067
Journalistic (12)	27 %	844
Letters (1)	2 %	109

Internet (11 % of tokens, 421 texts)

divisions (classes)	% tokens	texts
Multi-directional (3)	6.7 %	263
Uni-directional (2)	4.2 %	158

Spoken (9 % of tokens, 329 texts)

divisions (classes)	% tokens	texts
Interactive (3)	6.7 %	258
Non-interactive (1)	2.3 %	71

Sampling

goal: maximum diversity
between-stratum diversity: each class sampled separately
within-stratum diversity: sampled according to dissimilarity measure based on metadata
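
The within-stratum step can be illustrated with a greedy farthest-point selection. This is only a minimal sketch: the slides do not specify the dissimilarity measure or the selection procedure, so the `dissimilarity` function (share of differing metadata fields), the metadata keys and the greedy strategy are all assumptions made for illustration.

```python
def dissimilarity(a: dict, b: dict) -> float:
    """Share of metadata fields on which two texts differ (0 = identical).
    Hypothetical measure; the one used for Koditex is not given in the slides."""
    keys = sorted(set(a) | set(b))
    if not keys:
        return 0.0
    return sum(a.get(k) != b.get(k) for k in keys) / len(keys)


def sample_diverse(texts: list, n: int) -> list:
    """Greedy farthest-point selection within one stratum: repeatedly add the
    text whose smallest dissimilarity to the already chosen texts is largest."""
    if n <= 0 or not texts:
        return []
    chosen = [texts[0]]          # arbitrary seed; a random start works as well
    remaining = list(texts[1:])
    while remaining and len(chosen) < n:
        best = max(remaining,
                   key=lambda t: min(dissimilarity(t, c) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Running `sample_diverse` once per class reproduces the between-stratum separation; within a class, texts that share author, publisher or year with already selected ones are picked last.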

What is a text?

Written mode

uninterrupted samples of 2,000–5,000 words (respecting sentence boundaries)
for letters and administrative texts, samples as short as 1,000 tokens were allowed
equal representation of beginnings, middle portions and endings of texts
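
A minimal sketch of how an uninterrupted sample that respects sentence boundaries might be cut from a tokenised text. The function name `take_sample`, its parameters and the stopping rule (stop as soon as the minimum length is reached) are assumptions, not the procedure actually used for the corpus.

```python
def take_sample(sentences, start, min_words=2000, max_words=5000):
    """Collect whole sentences from index `start` until at least `min_words`
    words are gathered, never exceeding `max_words`.

    `sentences` is a list of sentences, each a list of word tokens.
    Returns the list of selected sentences, or None if no sample of the
    required size can be cut from this starting point."""
    sample, n_words = [], 0
    for sentence in sentences[start:]:
        if n_words + len(sentence) > max_words:
            break                      # adding this sentence would overshoot
        sample.append(sentence)
        n_words += len(sentence)
        if n_words >= min_words:
            return sample              # long enough, ends on a sentence boundary
    return None
```

Choosing `start` at the first sentence, somewhere in the middle, or close to the end of the source text is one way to balance beginnings, middle portions and endings.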

Spoken mode

samples of 2,000–5,000 words from utterances by one speaker within one dialogue
discontinuous, as the lines of the other participant(s) in the dialogue were removed
text as a result of one author’s effort (× situation)

Internet communication

posts (Facebook, forums, discussions) – grouped by author and time of day into longer “texts” of 2,000–5,000 words
blogs, Wikipedia – uninterrupted samples of 2,000–5,000 words from one post or one article
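
The grouping of short posts into longer “texts” might look roughly like the sketch below. The four-way split of the day in `time_of_day`, the tuple layout of `posts`, and the decision to drop groups outside the size range (rather than split or trim them) are hypothetical choices; the slides do not describe these details.

```python
from collections import defaultdict
from datetime import datetime


def time_of_day(ts: datetime) -> str:
    # Hypothetical bins of six hours each; the real bins are not specified.
    return ("night", "morning", "afternoon", "evening")[ts.hour // 6]


def group_posts(posts, min_words=2000, max_words=5000):
    """posts: iterable of (author, timestamp, text) tuples.
    Returns {(author, time_of_day): concatenated text} for every group whose
    length in whitespace-separated words falls within the allowed range."""
    groups = defaultdict(list)
    for author, ts, text in sorted(posts, key=lambda p: p[1]):  # chronological
        groups[(author, time_of_day(ts))].append(text)
    result = {}
    for key, texts in groups.items():
        joined = " ".join(texts)
        if min_words <= len(joined.split()) <= max_words:
            result[key] = joined
    return result
```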

Linguistic features

More than 140 features, 130 used in pilot study

phonetic – v- prothesis, é > í narrowing, ý > ej vowel breaking…
morphological – cases, numbers, mood, tense, comparison…
derivation – adjectives denoting similarity, verbal nouns, diminutives…
lexicon – indefinite pron., thinking & reporting verbs, taboo expressions…
pragmatics – contact expressions, fillers, intensifiers, downtoners…
syntax – types of noun modifiers, clusters of POS, wh-adverbs…
text/discourse – questions, phraseology, word repetition…

Type-based features – repertoires of pronouns, prepositions, conjunctions (normalised by zTTR, Cvrček & Chlumská 2015)

Lexical richness – Yule's K, thematic concentration (Popescu et al. 2007), repertoire of unigrams and bigrams (zTTR)
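
As an illustration of one of the lexical richness measures, the sketch below computes Yule's K in its standard form, K = 10^4 · (Σ m²·V_m − N) / N², where N is the number of tokens and V_m the number of types occurring exactly m times. Tokenisation and the choice between word forms and lemmas are left open here and are not taken from the slides.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K for a sequence of tokens (higher K = more repetitive text)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    type_freq = Counter(tokens)               # frequency of each type
    spectrum = Counter(type_freq.values())    # V_m: number of types with frequency m
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)
```

A text in which every token is a different type yields K = 0; repetition pushes the value up.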

Dimension 1: Dynamic vs. static

Most prominent features:

Positive

Features	Loadings
indicative forms	0.91
finite verbs	0.88
adverbs of time	0.86
verbal aspect	0.75
3rd person pronouns	0.70
adverbs	0.69

Negative

Features	Loadings
nominal post-modifiers without agreement	-0.94
abstract nouns	-0.92
nouns: genitive	-0.83
verbal nouns	-0.80
noun pre-modifiers with agreement	-0.77
complex prepositions	-0.75
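
Loadings such as those above are commonly turned into a per-text dimension score following Biber's procedure: standardise each feature's frequency across all texts, then sum the z-scores of the positively loading features and subtract those of the negatively loading ones. The sketch below assumes this procedure; the slides do not state how the scores behind the plots were computed, and the input layout `rates` is hypothetical.

```python
from statistics import mean, stdev


def dimension_scores(rates, positive, negative):
    """rates: {text_id: {feature: normalised frequency}} (at least two texts).
    positive / negative: lists of feature names loading on each pole.
    Returns {text_id: dimension score}."""
    text_ids = list(rates)
    z = {t: {} for t in text_ids}
    for feature in list(positive) + list(negative):
        values = [rates[t][feature] for t in text_ids]
        mu, sd = mean(values), stdev(values)
        for t in text_ids:
            # A feature with no variation carries no information: z = 0.
            z[t][feature] = (rates[t][feature] - mu) / sd if sd else 0.0
    return {
        t: sum(z[t][f] for f in positive) - sum(z[t][f] for f in negative)
        for t in text_ids
    }
```

With "finite verbs" as a positive and "abstract nouns" as a negative feature, a text rich in finite verbs and poor in abstract nouns ends up on the dynamic (positive) side of Dimension 1.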

Distribution of text types:

[Figure: distribution of text types along Dimension 1]

Dimension 2: Interactive vs. Non-situationally anchored

Most prominent features:

Positive

Features	Loadings
contact expressions	0.93
v- prothesis	0.89
ý > ej vowel breaking (diphthongisation) in endings	0.86
é > í narrowing in endings	0.81
locative adverbs	0.76
fillers	0.76

Negative

Features	Loadings
clauses with wh-adverbs	-0.52
nouns: accusative	-0.50
nominal cases with prepositions	-0.48
prepositions	-0.44
verbal aspect	-0.40
unigrams	-0.40

Distribution of text types:

[Figure: distribution of text types along Dimension 2]

Dimension 1 vs. Dimension 2

Dimension 3: Low vs. high level of cohesion

Most prominent features:

Positive

Features	Loadings
hypotactic correlative connectives	0.65
repertoire of conjunctions	0.56
repertoire of pronouns	0.54
verbs: conditional	0.46
predicative nouns	0.46
verbal predicate completed by clause	0.44

Negative

Features	Loadings
numerals	-0.43
adjectives denoting similarity	-0.42
clusters of same-case adjectives	-0.38

Distribution of text types:

[Figure: distribution of text types along Dimension 3]

Project outlook

revision of (some) operationalizations
assessing recall and precision for features
finding optimal number of factors/dimensions
validation of factor analysis
intratextual classification (as a complement to extralinguistic classification)
release of the corpus (with tagged features) + tool for interpretation
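
For the planned assessment of recall and precision, one plausible setup compares the positions where a feature was detected automatically with a manually annotated gold standard. The sketch below assumes that both are available as sets of token positions; the function name and the data layout are illustrative only.

```python
def precision_recall(detected: set, gold: set):
    """Precision and recall of an automatic feature detector.
    detected: positions flagged by the operationalization
    gold:     positions marked in a manual annotation"""
    true_positives = len(detected & gold)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Low precision points to an operationalization that is too permissive, low recall to one that misses genuine instances of the feature.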

References

Bermel, N. (2014): Czech Diglossia: Dismantling or Dissolution? In J. Árokay et al. (eds), Divided Languages? Springer.
Biber, D. (1991): Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. & Conrad, S. (2009): Register, Genre, and Style. New York, NY: Cambridge University Press.
Cvrček, V. & Chlumská, L. (2015): Simplification in translated Czech: a new approach to type-token ratio. Russian Linguistics 39/3, p. 309–325.
Popescu, I., Best, K. & Altmann, G. (2007): On the dynamics of word classes in texts. Glottometrics 14, p. 58–71.

Thank you for your attention!

This presentation resulted

from the implementation of the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures and
from the implementation of the project Language Variation in the CNC (reg. no. CZ.02.1.01/0.0/0.0/16_013/0001758) supported from the Operational Programme Research, Development and Education within the call 02_16_013 “Research infrastructures”.
