Multi-dimensional analysis of Czech. Pilot study

Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina
July 25, 2017

Multi-dimensional analysis

study of register variation (Biber 1991; Biber & Conrad 2009)
interplay between register and genre (intratextual × extratextual)
first application to a Slavic language – Czech: rich inflection, distinctive morphology, close to diglossia (Bermel 2014)

Presentation overview

corpus compilation
feature operationalization
preliminary results
project outlook

The Koditex Corpus

diverse and representative corpus of contemporary Czech of manageable size
annotation: lemmatization, morphological tagging, phraseme annotation, NE recognition
tools: KonText, MorphoDiTa, ad-hoc scripts (Python, R, Perl)

Category	#
Tokens	10.8 M
Words (excl. punct.)	9 M
Word forms (types)	508 K
Lemmas (types)	204 K
Sentences	714 K
Text samples	3,334
Min. sample length	1,000
Max. sample length	4,731

Corpus composition: Sampling

goal: maximum diversity
between-stratum diversity: each class sampled separately
within-stratum diversity: sampled according to dissimilarity measure based on metadata
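The within-stratum step can be illustrated with a minimal sketch. It assumes a simple dissimilarity (the share of metadata fields on which two texts differ) and a greedy farthest-first selection; the slides do not specify the actual measure or selection procedure, and the field names and records below are invented for illustration.

```python
def dissimilarity(a, b, fields):
    """Share of metadata fields on which two texts differ (0 = identical, 1 = all differ).

    Assumption: the source only says the measure is "based on metadata".
    """
    return sum(a[f] != b[f] for f in fields) / len(fields)


def sample_diverse(texts, k, fields):
    """Greedily pick k texts, each time adding the text farthest from those already picked."""
    if k <= 0 or not texts:
        return []
    chosen = [texts[0]]  # deterministic seed: the first text in the stratum
    remaining = list(texts[1:])
    while remaining and len(chosen) < k:
        # A candidate's distance to the chosen set is its distance to the nearest chosen text.
        best = max(
            remaining,
            key=lambda t: min(dissimilarity(t, c, fields) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical metadata records for one stratum (class).
stratum = [
    {"id": "a", "author": "X", "year": 2010, "publisher": "P1"},
    {"id": "b", "author": "X", "year": 2010, "publisher": "P2"},
    {"id": "c", "author": "Y", "year": 2015, "publisher": "P3"},
    {"id": "d", "author": "X", "year": 2011, "publisher": "P1"},
]
picked = sample_diverse(stratum, 2, ["author", "year", "publisher"])
picked_ids = [t["id"] for t in picked]
print(picked_ids)
```

Running the selection separately for each class gives the between-stratum diversity; the greedy rule inside a class favours texts that share as little metadata as possible with those already sampled.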

Corpus composition: Written mode

80 % of tokens, 2584 texts

divisions (classes)	% tokens	texts
Fiction (8)	18 %	564
Non-fiction (15)	33 %	1067
Journalistic (12)	27 %	844
Letters (1)	2 %	109

Corpus composition: Internet

11 % of tokens, 421 texts

divisions (classes)	% tokens	texts
Multi-directional (3)	6.7 %	263
Uni-directional (2)	4.2 %	158

Corpus composition: Spoken mode

9 % of tokens, 329 texts

divisions (classes)	% tokens	texts
Interactive (3)	6.7 %	258
Non-interactive (1)	2.3 %	71

What is a text? – Written mode

uninterrupted samples of 2,000–5,000 words (respecting sentence boundaries)
for letters and administrative texts, samples as short as 1,000 tokens were allowed
equal representation of beginnings, middle portions and endings of texts
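A sample that respects sentence boundaries can be cut roughly as follows. This is a sketch under assumptions: sentences are represented as lists of tokens, and the function name `cut_sample` and the rule "stop before exceeding the maximum" are illustrative, not taken from the project.

```python
def cut_sample(sentences, start, min_words=2000, max_words=5000):
    """Collect whole sentences from index `start` without exceeding max_words.

    Returns the list of sentences, or None if fewer than min_words were collected.
    """
    sample, n_words = [], 0
    for sent in sentences[start:]:
        if n_words + len(sent) > max_words:
            break  # adding this sentence would overshoot, so stop at the boundary
        sample.append(sent)
        n_words += len(sent)
    return sample if n_words >= min_words else None


# Toy document with tiny limits so the behaviour is easy to follow.
doc = [["a"] * 4, ["b"] * 3, ["c"] * 5, ["d"] * 2]
middle = cut_sample(doc, start=1, min_words=5, max_words=9)
too_short = cut_sample(doc, start=3, min_words=5, max_words=9)
print(sum(len(s) for s in middle), too_short)
```

Varying `start` (first sentence, a sentence in the middle, a sentence near the end) is one way to balance beginnings, middle portions and endings across the corpus.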

What is a text? – Spoken mode

samples of 2,000–5,000 words from utterances by one speaker within one dialogue
discontinuous, as the lines of the other participant(s) in the dialogue were removed
text as a result of one author’s effort (× situation)

What is a text? – Internet

posts (fb, forum, discussions) – grouped by author and time of day into longer “texts” of 2,000–5,000 words
blogs, Wikipedia – uninterrupted samples of 2,000–5,000 words from one post or one article
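Grouping short posts into longer "texts" might look like the sketch below. The four-way split of the day, the tuple layout of a post and whitespace tokenisation are all assumptions made for illustration; the slides do not say how time of day was bucketed.

```python
from collections import defaultdict
from datetime import datetime


def time_of_day(ts):
    """Hypothetical four-way split of the day (not defined in the source)."""
    if ts.hour < 6:
        return "night"
    if ts.hour < 12:
        return "morning"
    if ts.hour < 18:
        return "afternoon"
    return "evening"


def group_posts(posts, min_words=2000, max_words=5000):
    """Merge posts sharing author and time of day into texts within the length limits."""
    groups = defaultdict(list)
    for author, ts, text in posts:
        groups[(author, time_of_day(ts))].append(text.split())
    texts = {}
    for key, chunks in groups.items():
        words = []
        for chunk in chunks:
            if len(words) + len(chunk) > max_words:
                break  # keep whole posts only
            words.extend(chunk)
        if len(words) >= min_words:  # groups that stay too short are dropped
            texts[key] = words
    return texts


# Invented posts, with tiny limits for readability.
posts = [
    ("ann", datetime(2017, 1, 1, 8, 0), "good morning all"),
    ("ann", datetime(2017, 1, 2, 9, 30), "morning again everyone here"),
    ("ann", datetime(2017, 1, 1, 22, 0), "good night"),
    ("bob", datetime(2017, 1, 1, 8, 15), "hi"),
]
texts = group_posts(posts, min_words=5, max_words=10)
print(list(texts))
```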

Linguistic features

More than 140 features, 130 used in pilot study

phonetic – v- prothesis, é > í narrowing, ý > ej vowel breaking…
morphological – cases, numbers, mood, tense, comparison…
derivation – adjectives denoting similarity, verbal nouns, diminutives…
lexicon – indefinite pron., thinking & reporting verbs, taboo expressions…

Linguistic features (cont'd)

pragmatics – contact expressions, fillers, intensifiers, downtoners…
syntax – types of noun modifiers, clusters of POS, wh-adverbs…
text/discourse – questions, phraseology, word repetition…

Type-based features – repertoires of pronouns, prepositions, conjunctions (normalised by zTTR, Cvrček & Chlumská 2015)

Lexical richness – Yule's K, thematic concentration (Popescu et al. 2007), repertoire of unigrams and bigrams (zTTR)
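Of the lexical richness measures, Yule's K is simple enough to show directly. The sketch below uses the standard textbook formula, K = 10^4 · (Σ i²·V_i − N) / N², where N is the number of tokens and V_i the number of types occurring exactly i times; whether the project computes it on word forms or lemmas is not stated, so the input here is just a list of tokens.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot compute Yule's K for an empty text")
    type_freqs = Counter(tokens)             # type -> frequency
    spectrum = Counter(type_freqs.values())  # frequency i -> number of types V_i
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


k_repetitive = yules_k("a a b c".split())  # one type repeated
k_all_unique = yules_k("a b c d".split())  # every token is a new type
print(k_repetitive, k_all_unique)
```

Higher values indicate more repetition; a text in which every token is a distinct type scores 0.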

Dimension 1: Dynamic vs. static

Most prominent features:

Positive

Features	Loadings
indicative forms	0.91
finite verbs	0.88
adverbs of time	0.86
verbal aspect	0.75
3rd person pronouns	0.70
adverbs	0.69

Negative

Features	Loadings
nominal post-modifiers without agreement	-0.94
abstract nouns	-0.92
nouns: genitive	-0.83
verbal nouns	-0.80
noun pre-modifiers with agreement	-0.77
complex prepositions	-0.75

Distribution of text types:

[Figure: distribution of text types along Dimension 1]
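One common way to place texts on a dimension, following Biber's original procedure, is to standardise each feature across texts and then add the z-scores of positively loaded features and subtract those of negatively loaded ones. The slides do not say whether this or regression-based factor scores were used, so the sketch below is an assumption, and the frequencies in it are invented.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardise one feature across all texts (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(freqs, positive, negative):
    """freqs maps a feature name to its normalised frequency in each text.

    Score per text = sum of z-scores of positive features
                     minus sum of z-scores of negative features.
    """
    z = {feature: zscores(values) for feature, values in freqs.items()}
    n_texts = len(next(iter(freqs.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]


# Invented frequencies for three texts, one positive and one negative feature.
freqs = {
    "finite verbs": [10, 20, 30],
    "abstract nouns": [30, 20, 10],
}
scores = dimension_scores(freqs, ["finite verbs"], ["abstract nouns"])
print(scores)
```

In this toy example the first text lands on the static (nominal) pole and the third on the dynamic (verbal) pole.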

Dimension 2: Interactive vs. Non-situationally anchored

Most prominent features:

Positive

Features	Loadings
contact expressions	0.93
v- prothesis	0.89
ý > ej vowel breaking (diphthongisation) in endings	0.86
é > í narrowing in endings	0.81
locative adverbs	0.76
fillers	0.76

Negative

Features	Loadings
clauses with wh-adverbs	-0.52
nouns: accusative	-0.50
nominal cases with prepositions	-0.48
prepositions	-0.44
verbal aspect	-0.40
unigrams	-0.40

Distribution of text types:

[Figure: distribution of text types along Dimension 2]

Dimension 1 vs. Dimension 2

Dimension 3: Low vs. high level of cohesion

Most prominent features:

Positive

Features	Loadings
hypotactic correlative connectives	0.65
repertoire of conjunctions	0.56
repertoire of pronouns	0.54
verbs: conditional	0.46
predicative nouns	0.46
verbal predicate completed by clause	0.44

Negative

Features	Loadings
numerals	-0.43
adjectives denoting similarity	-0.42
clusters of same-case adjectives	-0.38

Distribution of text types:

[Figure: distribution of text types along Dimension 3]

Project outlook

revision of (some) operationalizations
assessing recall and precision for features
finding optimal number of factors/dimensions
validation of factor analysis
intratextual classification (as a complement to extralinguistic classification)
release of the corpus (with tagged features) + tool for interpretation
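Assessing recall and precision for a feature amounts to comparing its automatic identification with a manually annotated sample. A minimal sketch, assuming both are available as sets of token positions (the data below are invented):

```python
def precision_recall(predicted, gold):
    """Compare automatically tagged positions with manually annotated ones."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


predicted = {3, 7, 12, 20}  # positions the feature script flagged
gold = {3, 7, 15}           # positions a human annotator marked
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Low precision points to an operationalization that over-generates, low recall to one that misses instances; both would feed into the planned revision of operationalizations.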

References

Bermel, N. (2014): Czech Diglossia: Dismantling or Dissolution? In J. Árokay et al. (eds), Divided Languages? Springer.
Biber, D. (1991): Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. & Conrad, S. (2009): Register, Genre, and Style. New York, NY: Cambridge University Press.
Cvrček, V. & Chlumská, L. (2015): Simplification in translated Czech: a new approach to type-token ratio. Russian linguistics 39/3, (p. 309–325).
Popescu, I., Best, K. & Altmann, G. (2007): On the dynamics of word classes in texts. Glottometrics 14, (p. 58–71).

Acknowledgements

This presentation resulted

from the implementation of the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures and
from the implementation of the project Language Variation in the CNC (reg. no. CZ.02.1.01/0.0/0.0/16_013/0001758) supported from the Operational Programme Research, Development and Education within the call 02_16_013 “Research infrastructures”.

Thank you for your attention!

https://korpus.cz