Theoretical question: How well do web corpora cover the variability present in a given language? What do they represent?
Practical question: To what extent can we replace (more expensive) traditional corpus data with (cheaper) web corpus data?
Comparison sneak peek
Representativeness in corpus linguistics
Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993, p. 243).
Thus a corpus design can be evaluated for the extent to which it includes: (1) the range of text types in a language, and (2) the range of linguistic distributions in a language (Biber 1993, p. 243).
\(\Rightarrow\) comparing corpora w.r.t. the variability they cover
Outline of the experiment
compile a “traditional” corpus, as diverse as possible
chart the space of variation with MDA
take an opportunistic web-crawled corpus and make sample(s)
… in Slavic languages – specific morphology, inflection, free word order
… in Czech – situation bordering on diglossia (Bermel 2014): Standard × Common Czech
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory.
“traditional” carefully designed corpus covering all available text types
guiding principles: diverse, contemporary, text length control
text excerpts = chunks, 2000–5000 words (not whole texts)
3 modes – written, spoken, web
8 divisions, 45 classes, \(\approx\) 200,000 words per class
| Category | # |
|---|---|
| Tokens | 10.8 M |
| Words (excl. punct.) | 9 M |
| Lemmata (types) | 204 K |
| Text chunks | 3334 |
Koditex: internet communication
only web-specific “genres”, carefully separated
Multi-directional = user posts on Facebook, in forums & comment sections – aggregated according to author and time into chunks of 2000–5000 words
Uni-directional = blogs, Wikipedia – continuous samples of 2000–5000 words
11 % of tokens, 421 of 3334 text chunks, 5 of 45 classes
| Division | % tokens | Chunks | Classes |
|---|---|---|---|
| Multi-directional | 6.6 % | 263 | 3 |
| Uni-directional | 4.5 % | 158 | 2 |
Features and their operationalization
Originally 140+ features, 122 in final list, covering:
phonetics
morphology
derivation
lexicon
pragmatics
syntax
text/discourse
Statistical evaluation: Factor analysis
3292 text chunks × 122 features
factor analysis:
R environment, using the fa function from the psych package
number of factors/dimensions: 8
variance explained: 56 %
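As a rough illustration of what a "variance explained" figure summarizes, the sketch below computes the proportion of variance accounted for by each factor from a loading matrix. It assumes standardized features and uncorrelated factors, which may not match the rotation used in the study; the function name and the toy loadings are invented for the example, and Python is used here only for illustration in place of R.

```python
def proportion_variance_explained(loadings):
    """loadings: one row per feature, one column per factor."""
    n_features = len(loadings)
    n_factors = len(loadings[0])
    # Sum of squared loadings per factor (column-wise).
    ss = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]
    # Each standardized feature contributes one unit of variance,
    # so the total variance equals the number of features.
    return [s / n_features for s in ss]


# Toy loadings: 4 features x 2 factors (invented numbers, not from the study).
toy_loadings = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9], [0.2, 0.6]]
per_factor = proportion_variance_explained(toy_loadings)
total = sum(per_factor)
```

In the study the analogous total, summed over 8 factors and 122 features, is reported as 56 %.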
Interpretation: Dimensions of variability
Dim 1: dynamic (+) × static (-)
Dim 2: spontaneous (+) × prepared (-)
Dim 3: higher (+) × lower (-) level of cohesion
Dim 4: polythematic (+) × monothematic (-)
Dim 5: higher (+) × lower (-) amount of addressee coding
Dim 6: general (+) × particular (-)
Dim 7: prospective (+) × retrospective (-)
Dim 8: attitudinal (+) × factual (-)
Dim 1: dynamic (+) × static (-)
Dim 2: spontaneous (+) × prepared (-)
2D plot: dimensions 1 & 2
Sampling a web-crawled corpus: the Araneum Bohemicum
Collaboration with Vlado Benko
(author of the Aranea family of comparable web corpora covering a range of different languages)
Araneum Bohemicum
part of the Aranea Project (Benko 2014, 2016)
Araneum Bohemicum Maximum 15.04
crawled in several sessions during May and June 2013
yielded 9.5 million documents (web pages)
approx. 5.5 billion tokens of text
after filtering and deduplication \(\Rightarrow\) 5.2 million documents and 3.3 billion tokens
opportunistic design
representation of “searchable” web
Web-crawled samples
Two samples (code-named WS-K1 and WS-K2), based on Araneum Bohemicum
two batches with 5000 text excerpts each
text length distributions modeled after Koditex (2000–5000 words per excerpt)
subsequent processing analogous to Koditex texts
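One way to picture "text length distributions modeled after Koditex" is to draw each target excerpt length from the observed Koditex chunk lengths and then cut a continuous excerpt of that length from a web document. The sketch below does exactly that; it is an assumption about the general idea, not the authors' actual pipeline, and `sample_excerpt` and `reference_lengths` are hypothetical names with invented values.

```python
import random


def sample_excerpt(words, reference_lengths, rng):
    """Cut one continuous excerpt whose length is drawn from reference_lengths."""
    target = rng.choice(reference_lengths)
    if len(words) < target:
        return None  # document too short for the drawn length
    start = rng.randrange(len(words) - target + 1)
    return words[start:start + target]


rng = random.Random(0)
# Hypothetical chunk lengths standing in for the Koditex distribution.
reference_lengths = [2000, 2500, 3200, 4100, 5000]
document = [f"w{i}" for i in range(6000)]
excerpt = sample_excerpt(document, reference_lengths, rng)
```

Drawing lengths from the reference corpus rather than from a fixed value keeps the two length distributions comparable, which matters for the length effect discussed in the methodological postscript.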
Comparison: Koditex vs. Web-crawled samples
Comparison methodology
evaluate the set of 122 linguistic features assembled for the original MD model on the WS-K1 and WS-K2 batches
calculate “positions” of WS chunks in MD space
comparison of the ranges covered by Koditex vs. WS
for each dimension individually
aggregated across dimensions
2-D visualization
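The second step, placing web chunks in the existing MD space, can be sketched as follows: standardize each feature with the means and standard deviations of the reference corpus, then combine the standardized values with a matrix of score weights. The weights are assumed to come from the already fitted model; how they were actually derived is not stated in the slides, and all names and numbers below are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(reference_rows):
    """Per-feature mean and SD estimated on the reference corpus only."""
    columns = list(zip(*reference_rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def project(row, means, sds, weights):
    """Map one chunk's feature values to dimension scores.

    weights[i][d] is the weight of feature i on dimension d.
    Assumes every SD is non-zero.
    """
    z = [(x - m) / s for x, m, s in zip(row, means, sds)]
    n_dims = len(weights[0])
    return [sum(z[i] * weights[i][d] for i in range(len(z))) for d in range(n_dims)]


# Toy data: 3 reference chunks x 2 features, identity weights (invented).
reference = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
means, sds = fit_scaler(reference)
weights = [[1.0, 0.0], [0.0, 1.0]]
origin = project([3.0, 20.0], means, sds, weights)
shifted = project([5.0, 20.0], means, sds, weights)
```

Reusing the reference means and SDs, rather than re-estimating them on the web sample, is what keeps both corpora in the same coordinate system.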
Per-dimension comparison
Per-dimension comparison (2nd–98th percentile)
Interpretation
position of the median does not seem to vary substantially
dispersions differ significantly: tests of homogeneity of variances (Bartlett, Fligner-Killeen) give p < 0.01
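For readers unfamiliar with the Bartlett test, the sketch below computes its statistic for two groups of scores on one dimension. With two groups the statistic is referred to a chi-square distribution with one degree of freedom, whose tail probability can be written with `math.erfc`. The data are invented, and this is only a minimal stdlib illustration, not the code used in the study.

```python
import math
from statistics import variance


def bartlett_two_groups(a, b):
    """Bartlett's test of equal variances for exactly two groups."""
    groups = [a, b]
    k = len(groups)
    n = [len(g) for g in groups]
    s2 = [variance(g) for g in groups]  # sample variances (n - 1 denominator)
    total_n = sum(n)
    pooled = sum((ni - 1) * si for ni, si in zip(n, s2)) / (total_n - k)
    numerator = (total_n - k) * math.log(pooled) - sum(
        (ni - 1) * math.log(si) for ni, si in zip(n, s2)
    )
    correction = 1 + (sum(1 / (ni - 1) for ni in n) - 1 / (total_n - k)) / (3 * (k - 1))
    stat = max(numerator / correction, 0.0)  # guard against tiny negative rounding
    # Chi-square survival function for df = k - 1 = 1.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented scores: same centre, very different spread.
narrow = [x / 10 for x in range(-10, 11)]
wide = [float(x) for x in range(-10, 11)]
stat, p_value = bartlett_two_groups(narrow, wide)
```

Bartlett's test is sensitive to departures from normality, which is presumably why the rank-based Fligner-Killeen test is reported alongside it.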
Average proportions of shared and corpus-specific variation ranges:

| Web batch | Intersection with Koditex | Koditex complement | WS complement |
|---|---|---|---|
| WS-K1 | 78.00% | 14.60% | 7.39% |
| WS-K2 | 77.10% | 15.10% | 7.82% |
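The three proportions in each row sum to roughly 100 %, which suggests they are shares of the combined range covered by both corpora on a dimension. The sketch below computes such shares for one dimension, assuming each corpus's range is its 2nd–98th percentile interval as in the per-dimension plots. This is a reading of the slides, not a documented procedure, and the function names and scores are invented.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def range_shares(ref_scores, web_scores, lo_q=2, hi_q=98):
    """Shares of the covered range: (shared, reference only, web only)."""
    r_lo, r_hi = percentile(ref_scores, lo_q), percentile(ref_scores, hi_q)
    w_lo, w_hi = percentile(web_scores, lo_q), percentile(web_scores, hi_q)
    shared = max(0.0, min(r_hi, w_hi) - max(r_lo, w_lo))
    ref_only = (r_hi - r_lo) - shared
    web_only = (w_hi - w_lo) - shared
    total = shared + ref_only + web_only  # assumed non-zero
    return shared / total, ref_only / total, web_only / total


# Invented scores on one dimension: two partly overlapping ranges.
ref_scores = list(range(0, 101))
web_scores = list(range(50, 151))
shared, ref_only, web_only = range_shares(ref_scores, web_scores)
```

Averaging the three shares over all eight dimensions would then give one row of figures of the kind reported above.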
Major differences
Dim 1: Koditex brings in the dynamic extreme, whereas WSs add texts widening the spectrum towards the static pole
Dim 2: the most salient difference – the Koditex complement introduces a range of texts on the spontaneous extreme from private conversations
Dim 5: variation of WSs is fully covered by Koditex, which adds texts with more explicit addressee focus, mainly dialogues in fiction
Dim 8: WSs in general tend to lean towards the factual extreme
2-D Comparison I: fully covered by web
2-D Comparison II: web genres
2-D Comparison III: only in traditional
Methodological postscript: Why text length matters in MDA
One additional web-crawled sample
WS-S (“web sample short”):
similar to WS-K1 and WS-K2, but excerpt lengths are one order of magnitude shorter
2000–5000 words per chunk → 200–500 words per chunk
to make it harder for WS-S to cover the extremes, only 1000 chunks vs. 5000 in WS-K1/2
again, subsequent processing analogous to Koditex texts
2-D Comparison: dimensions 1 & 2
2-D Comparison: dimensions 3 & 6
Coverage comparison across dimensions
Average proportions of shared and corpus-specific variation ranges:

| Web batch | Intersection with Koditex | Koditex complement | WS complement |
|---|---|---|---|
| WS-K1 | 78.00% | 14.60% | 7.39% |
| WS-K2 | 77.10% | 15.10% | 7.82% |
| WS-S | 73.40% | 9.82% | 16.70% |
→ If we hadn’t controlled for text chunk length, we could have reached the opposite conclusion!
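One plausible mechanism behind this effect, which the slides do not spell out, is sampling noise: a relative frequency measured on a chunk ten times shorter fluctuates about √10 ≈ 3.2 times more, so short chunks reach more extreme values even when the underlying text type is the same. The simulation below illustrates this under a deliberately simple assumption (each word independently carries a feature with a fixed probability); all parameters are invented.

```python
import random
from statistics import pstdev


def simulated_rates(n_chunks, chunk_len, p, rng):
    """Relative frequency of a feature with per-word probability p, per chunk."""
    return [
        sum(rng.random() < p for _ in range(chunk_len)) / chunk_len
        for _ in range(n_chunks)
    ]


rng = random.Random(42)
long_rates = simulated_rates(200, 3000, 0.05, rng)   # chunks of 3000 words
short_rates = simulated_rates(200, 300, 0.05, rng)   # chunks of 300 words
spread_ratio = pstdev(short_rates) / pstdev(long_rates)
```

Under these assumptions `spread_ratio` should come out near 3, which is in line with the wider ranges observed for WS-S despite its smaller number of chunks.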
Conclusions
Traditional vs. web-crawled corpora
large overlap in text categories which are easy to obtain on the web as well as when building an offline-text corpus (journalistic and non-fiction texts)
web-crawled texts occasionally tend towards their own distinctive regions of the MD space (static, less cohesive, factual and focused on particular referents)
unique text categories occupying distinct areas are only found in the traditional Koditex corpus – spoken informal (intimate) discourse, written private correspondence and some types of fiction (dynamic and addressee-oriented)
…and more in Cvrček et al. (forthcoming)
Main take-aways
Some text categories cannot be substituted by general web-crawled data and represent an irreplaceable and unique source of linguistic variation.
Web communication has its specificities and web corpora are useful, but they should not be used as an argument against investing in more expensive sources of data.
Text length matters in MDA. For meaningful comparisons, use excerpts of comparable lengths.
Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.
Baroni, M., Kilgarriff, A., Pomikálek, J., & Rychlý, P. (2006). WebBootCaT: a web tool for instant corpora. Proceedings of the EuraLex Conference, 123–132.
Benko, V. (2014). Aranea: Yet another family of (comparable) web corpora. International Conference on Text, Speech, and Dialogue, 257–264. Springer.
Benko, V. (2016). Two Years of Aranea: Increasing Counts and Tuning the Pipeline. LREC, 4245–4248.
Bermel, N. (2014). Czech Diglossia: Dismantling or Dissolution? In J. Arokay, J. Gvozdanovic, & D. Miyajima (Eds.), Divided Languages? Diglossia, Translation and the Rise of Modernity in Japan, China, and the Slavic World (1st ed., pp. 21–37). Dordrecht: Springer International Publishing.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Biber, D. (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge, England: Cambridge University Press.
Biber, D., & Conrad, S. (2009). Register, Genre, and Style. Cambridge, England: Cambridge University Press.
Biber, D., & Egbert, J. (2016). Register Variation on the Searchable Web: A Multi-Dimensional Analysis. Journal of English Linguistics, 44(2), 95–137.
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory. [Ahead of print]
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., Zasina, A. J., & Benko, V. (forthcoming). Comparing web-crawled and traditional corpora.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen corpus family. 7th International Corpus Linguistics Conference CL, 125–127.
Sharoff, S. (2018). Functional Text Dimensions for the annotation of web corpora. Corpora, 13(1), 65–95.
Acknowledgments
This research was supported by the ERDF project Language Variation in the CNC no. CZ.02.1.01/0.0/0.0/16_013/0001758.
It builds upon work made possible by the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.