Can a web-crawled corpus be as diverse as a traditional one?

Comparing ranges of variability through a multidimensional model

Adrian Jan Zasina

David Lukeš

https://trnka.korpus.cz/~lukes/slides/cl2019

Introduction

Aim

Comparison of ranges of variability covered by two different types of corpora – traditional vs. web-crawled:

  • design: careful × opportunistic
  • text categories/genres: well-motivated; subject to quotas × rough; uncontrolled proportions
  • metadata: detailed, manually reviewed × insufficient or absent
  • data: mostly print + some spoken and/or web × web

Corpus linguistics and the web

Last 15+ years – advent of web-crawled corpora:

  • (+) large (cf. the 14-billion-word iWeb corpus, Davies 2018)
  • (+) cheap
  • (+) families of “comparable” corpora: WaC family (Baroni et al. 2009), TenTen family (Jakubíček et al. 2013), Aranea family (Benko 2014)
  • (+) DIY: AntCorGen (Anthony 2019), WebBootCaT (Baroni et al. 2006)
  • (-) lack of metadata (URLs only)
  • (-) uncertainty regarding the composition (cf. Biber & Egbert 2016; Sharoff 2018)

Two key questions

  1. Theoretical question: How well do web corpora cover the variability present in a given language? What do they represent?
  2. Practical question: To what extent can we replace (more expensive) traditional corpus data with (cheaper) web corpus data?

Comparison sneak peek

Representativeness in corpus linguistics

Representativeness refers to the extent to which a sample includes the full range of variability in a population. (Biber 1993, p. 243).

Thus a corpus design can be evaluated for the extent to which it includes: (1) the range of text types in a language, and (2) the range of linguistic distributions in a language. (Biber 1993, p. 243).

\(\Rightarrow\) comparing corpora w.r.t. the variability they cover

Outline of the experiment

  1. compile a “traditional” corpus, as diverse as possible
  2. chart the space of variation with MDA
  3. take an opportunistic web-crawled corpus and make sample(s)
  4. project web corpus sample(s) onto MD space
  5. compare ranges of variation within dimensions

Multi-dimensional analysis

Principles of multi-dimensional analysis (MDA)

Biber 1995; Biber & Conrad 2009

  • systemic & functional variability (× random, sociolinguistic…)
    • motivated by context & situation
  • text production process involves interrelated choices
  • dimensions of variation (“intratextual” perspective)

Methodology of MDA

  1. corpus compilation
  2. features: operationalization & extraction
  3. statistical analysis (factor analysis, FA) \(\rightarrow\) dimensions
  4. interpretation of results

MDA of Czech

Czech National Corpus: MDA team

MDA of Czech

  • inspiration from English and other languages
  • expected challenges / highlights of MDA…
    • … in Slavic languages – specific morphology, inflection, free word order
    • … in Czech – situation bordering on diglossia (Bermel 2014): Standard × Common Czech

Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory.

Data: Koditex corpus

  • “traditional” carefully designed corpus covering all available text types
  • guiding principles: diverse, contemporary, text length control
    • text excerpts = chunks, 2000–5000 words (not whole texts)
    • 3 modes – written, spoken, web
      • 8 divisions, 45 classes, \(\approx\) 200,000 words per class

Category                Count
Tokens                  10.8 M
Words (excl. punct.)    9 M
Lemmata (types)         204 K
Text chunks             3,334

Koditex: internet communication

  • only web-specific “genres”, carefully separated
  • Multi-directional = user posts on Facebook, in forums & comment sections – aggregated according to author and time into chunks of 2000–5000 words
  • Uni-directional = blogs, Wikipedia – continuous samples of 2000–5000 words
  • 11 % of tokens, 421 of 3334 text chunks, 5 of 45 classes

Division             % of tokens    Chunks    Classes
Multi-directional    6.6 %          263       3
Uni-directional      4.5 %          158       2
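
The aggregation of multi-directional posts "according to author and time" could look roughly like the sketch below. The record layout (author, timestamp, text), whitespace tokenization, and the policy of dropping undersized leftovers are assumptions made for illustration; the slides do not specify how Koditex handled these details.

```python
from collections import defaultdict

MIN_WORDS, MAX_WORDS = 2000, 5000  # chunk size limits stated on the slide


def chunk_posts(posts):
    """Group posts by author, order them by time, and pack them into chunks.

    `posts` is an iterable of (author, timestamp, text) tuples; this layout
    is an assumption of the example, not a documented Koditex format.
    """
    by_author = defaultdict(list)
    for author, timestamp, text in posts:
        by_author[author].append((timestamp, text.split()))

    chunks = []
    for author, items in by_author.items():
        items.sort(key=lambda item: item[0])  # chronological order
        current = []
        for _, words in items:
            # Close the running chunk if this post would push it past the cap.
            if current and len(current) + len(words) > MAX_WORDS:
                if len(current) >= MIN_WORDS:
                    chunks.append((author, current))
                current = []  # undersized leftovers are dropped (assumption)
            current.extend(words[:MAX_WORDS])  # truncate a single huge post
        if len(current) >= MIN_WORDS:
            chunks.append((author, current))
    return chunks
```

Authors who never accumulate 2000 words contribute no chunk under this policy.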

Features and their operationalization

Originally 140+ features, 122 in final list, covering:

  • phonetics
  • morphology
  • derivation
  • lexicon
  • pragmatics
  • syntax
  • text/discourse

Statistical evaluation: Factor analysis

  • 3292 text chunks × 122 features
  • factor analysis:
    • R environment, using fa function from psych package
    • number of factors/dimensions: 8
    • variance explained: 56 %
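
For orientation, the common factor model behind this step can be written in generic textbook notation (the symbols below are not taken from the slides):

\[
z_i = \Lambda f_i + \varepsilon_i, \qquad \operatorname{Cov}(z) = \Lambda \Phi \Lambda^{\top} + \Psi
\]

Here \(z_i\) is the vector of 122 standardized feature values for text chunk \(i\), \(\Lambda\) is the \(122 \times 8\) loading matrix, \(f_i\) holds the chunk's 8 factor scores (its position on the dimensions), \(\Phi\) is the factor correlation matrix (the identity if the factors are uncorrelated), and \(\Psi\) is the diagonal matrix of feature-specific variances. Which rotation was used, and hence whether \(\Phi\) is the identity, is not stated here.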

Interpretation: Dimensions of variability

  1. dynamic (+) × static (-)
  2. spontaneous (+) × prepared (-)
  3. higher (+) × lower (-) level of cohesion
  4. polythematic (+) × monothematic (-)
  5. higher (+) × lower (-) amount of addressee coding
  6. general (+) × particular (-)
  7. prospective (+) × retrospective (-)
  8. attitudinal (+) × factual (-)

Dim 1: dynamic (+) × static (-)

Dim 2: spontaneous (+) × prepared (-)

2D plot: dimensions 1 & 2

Sampling a web-crawled corpus: the Araneum Bohemicum

Collaboration with Vlado Benko

Vlado Benko

(author of the Aranea family of comparable web corpora covering a range of different languages)

Araneum Bohemicum

  • part of the Aranea Project (Benko 2014, 2016)
  • Araneum Bohemicum Maximum 15.04
  • crawled in several sessions during May and June 2013
    • yielded 9.5 million documents (web pages)
    • approx. 5.5 billion tokens of text
    • after filtering and deduplication \(\Rightarrow\) 5.2 million documents and 3.3 billion tokens
  • opportunistic design
  • representation of “searchable” web

Web-crawled samples

Two samples (code-named WS-K1 and WS-K2), based on Araneum Bohemicum

  • two batches with 5000 text excerpts each
  • text length distributions modeled after Koditex (2000–5000 words per excerpt)
  • subsequent processing analogous to Koditex texts
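
One way to draw excerpts whose lengths are "modeled after Koditex" is to resample observed Koditex chunk lengths and cut a contiguous window of that length from a randomly chosen page. This is a minimal sketch under that assumption; the function and parameter names are hypothetical, and the actual sampling procedure applied to Araneum Bohemicum is not described in the slides.

```python
import random


def sample_excerpts(documents, reference_lengths, n_excerpts, seed=0):
    """Draw contiguous excerpts whose lengths follow a reference distribution.

    documents         -- list of token lists, one per web page
    reference_lengths -- chunk lengths observed in the reference corpus
    """
    rng = random.Random(seed)
    excerpts = []
    attempts = 0
    # Cap the number of attempts so the loop ends even if pages are too short.
    while len(excerpts) < n_excerpts and attempts < 100 * n_excerpts:
        attempts += 1
        target = rng.choice(reference_lengths)  # resample an observed length
        doc = rng.choice(documents)
        if len(doc) < target:
            continue  # page too short for this target length
        start = rng.randint(0, len(doc) - target)  # bounds are inclusive
        excerpts.append(doc[start:start + target])
    return excerpts
```

Note that only pages at least as long as the target length can contribute, so very short web pages never enter such a sample.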

Comparison: Koditex vs. Web-crawled samples

Comparison methodology

  • evaluate the set of 122 linguistic features assembled for the original MD model on the WS-K1 and WS-K2 batches
  • calculate “positions” of WS chunks in MD space
  • comparison of the ranges covered by Koditex vs. WS
    • for each dimension individually
    • aggregated across dimensions
    • 2-D visualization
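
Calculating the "positions" of WS chunks amounts to reusing a model fitted on Koditex without refitting it. The sketch below assumes that positions are weighted sums of features standardized with Koditex means and standard deviations; the `weights` matrix is a hypothetical stand-in for whatever factor-scoring weights the original model produced.

```python
from statistics import mean, stdev


def fit_scaler(reference_rows):
    """Per-feature mean and standard deviation of the reference corpus."""
    columns = list(zip(*reference_rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def project(rows, means, sds, weights):
    """Place chunks in the reference space.

    weights[d][j] is the (hypothetical) scoring weight of feature j on
    dimension d, estimated once on the reference corpus and then frozen.
    Assumes every feature varies in the reference corpus (no zero sd).
    """
    scores = []
    for row in rows:
        # Standardize with the reference statistics, not the new sample's own.
        z = [(x - m) / s for x, m, s in zip(row, means, sds)]
        scores.append([sum(w * zj for w, zj in zip(dim, z)) for dim in weights])
    return scores
```

Standardizing with the reference statistics is what keeps both corpora on a common scale; rescaling each web sample by its own mean and variance would hide exactly the differences in range that the comparison is after.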

Per-dimension comparison

Per-dimension comparison (2nd–98th percentile)

Interpretation

  • position of the median does not seem to vary substantially
  • dispersions differ significantly (Bartlett and Fligner-Killeen tests of homogeneity of variances, p < 0.01)
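
As an illustration of the first of these tests, Bartlett's statistic for two groups can be computed with the standard library alone. This is a sketch of the textbook formula applied to the scores of two corpora on one dimension, not the code used in the study; the function name is made up for the example.

```python
import math
from statistics import variance


def bartlett_two_groups(a, b):
    """Bartlett's test for equal variances, restricted to two groups.

    Returns (statistic, p_value). With two groups the statistic has one
    degree of freedom under the null hypothesis, so the chi-square tail
    probability reduces to erfc(sqrt(T / 2)).
    """
    groups = [a, b]
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [variance(g) for g in groups]  # unbiased sample variances
    n_total = sum(sizes)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    statistic = max(numerator / correction, 0.0)  # guard against rounding below zero
    return statistic, math.erfc(math.sqrt(statistic / 2))
```

Bartlett's test is sensitive to departures from normality, which is a common reason for pairing it with a rank-based test such as Fligner-Killeen.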

Average proportions of shared and corpus-specific variation ranges:

Web batch    Intersection with Koditex    Koditex complement    WS complement
WS-K1        78.00%                       14.60%                7.39%
WS-K2        77.10%                       15.10%                7.82%
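
The three columns can be read as a partition of the combined range on each dimension. The sketch below assumes that ranges are taken between the 2nd and 98th percentiles and that shares are expressed relative to the union of both ranges; the exact definition behind the reported figures is not spelled out in the slides, so this is one plausible reading.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def range_shares(ref_scores, web_scores, low=2, high=98):
    """Split the union of two ranges into shared and corpus-specific parts.

    Returns (intersection, reference_only, web_only) as fractions of the
    combined length. Assumes at least one range has non-zero width.
    """
    r_lo, r_hi = percentile(ref_scores, low), percentile(ref_scores, high)
    w_lo, w_hi = percentile(web_scores, low), percentile(web_scores, high)
    shared = max(0.0, min(r_hi, w_hi) - max(r_lo, w_lo))
    ref_only = (r_hi - r_lo) - shared  # "Koditex complement"
    web_only = (w_hi - w_lo) - shared  # "WS complement"
    total = shared + ref_only + web_only
    return shared / total, ref_only / total, web_only / total
```

Averaging the three shares over all eight dimensions would then yield one row of the table.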

Major differences

  • Dim 1: Koditex brings in the dynamic extreme, whereas WSs add texts widening the spectrum towards the static pole
  • Dim 2: the most salient difference – the Koditex complement introduces a range of texts on the spontaneous extreme from private conversations
  • Dim 5: variation of WSs is fully covered by Koditex, which adds texts with more explicit addressee focus, mainly dialogues in fiction
  • Dim 8: WSs in general tend to lean towards the factual extreme

2-D Comparison I: fully covered by web

2-D Comparison II: web genres

2-D Comparison III: only in traditional

Methodological postscript: Why text length matters in MDA

One additional web-crawled sample

WS-S (“web sample short”):

  • similar to WS-K1 and WS-K2, but excerpt lengths are one order of magnitude shorter
    • 2000–5000 words per chunk → 200–500 words per chunk
  • to make it harder for WS-S to cover the extremes, only 1000 chunks vs. 5000 in WS-K1/2
  • again, subsequent processing analogous to Koditex texts

2-D Comparison: dimensions 1 & 2

2-D Comparison: dimensions 3 & 6

Coverage comparison across dimensions

Average proportions of shared and corpus-specific variation ranges:

Web batch    Intersection with Koditex    Koditex complement    WS complement
WS-K1        78.00%                       14.60%                7.39%
WS-K2        77.10%                       15.10%                7.82%
WS-S         73.40%                       9.82%                 16.70%

→ With short chunks, the WS complement (16.70%) exceeds the Koditex complement (9.82%). If we hadn’t controlled for text chunk length, we could have reached the opposite conclusion!
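
One statistical mechanism that could contribute to this effect is plain sampling noise: the rate of a feature estimated from n words has variance p(1 − p)/n, so shorter chunks scatter more widely around the same underlying rate. The simulation below isolates that mechanism with a single made-up feature occurring independently at p = 0.02 per word; it is an illustrative assumption, not the explanation given in the slides, and real features are neither independent nor stationary.

```python
import random
from statistics import pstdev


def simulated_rates(chunk_len, n_chunks, p, rng):
    """Per-1000-word rates of a feature occurring with probability p per word."""
    rates = []
    for _ in range(n_chunks):
        hits = sum(rng.random() < p for _ in range(chunk_len))
        rates.append(1000 * hits / chunk_len)
    return rates


rng = random.Random(42)  # fixed seed so the run is repeatable
long_rates = simulated_rates(3000, 200, 0.02, rng)   # Koditex-like chunk length
short_rates = simulated_rates(300, 200, 0.02, rng)   # WS-S-like chunk length

spread_long = pstdev(long_rates)
spread_short = pstdev(short_rates)
print(f"spread at 3000 words: {spread_long:.2f} per 1000 words")
print(f"spread at 300 words:  {spread_short:.2f} per 1000 words")
```

With a tenfold difference in length, theory predicts a spread roughly √10 ≈ 3.2 times larger for the short chunks, even though both sets are drawn from the same "language".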

Conclusions

Traditional vs. web-crawled corpora

  • large overlap in text categories which are easy to obtain on the web as well as when building an offline-text corpus (journalistic and non-fiction texts)
  • web-crawled texts occasionally tend towards their own distinctive regions of the MD space (static, less cohesive, factual and focused on particular referents)
  • unique text categories occupying distinct areas are only found in the traditional Koditex corpus – spoken informal (intimate) discourse, written private correspondence and some types of fiction (dynamic and addressee-oriented)
  • …and more in Cvrček et al. (forthcoming)

Main take-aways

  1. Some text categories cannot be substituted by general web-crawled data and represent an irreplaceable and unique source of linguistic variation.
  2. Web communication has its specificities and web corpora are useful, but they should not be used as an argument against investing in more expensive sources of data.
  3. Text length matters in MDA. For meaningful comparisons, use excerpts of comparable lengths.

References

  • Anthony, L. (2019). AntConc (Version 3.5.8). Retrieved from http://www.laurenceanthony.net/software
  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.
  • Baroni, M., Kilgarriff, A., Pomikálek, J., & Rychlý, P. (2006). WebBootCaT: a web tool for instant corpora. Proceedings of the EuraLex Conference, 123–132.
  • Benko, V. (2014). Aranea: Yet another family of (comparable) web corpora. International Conference on Text, Speech, and Dialogue, 257–264. Springer.
  • Benko, V. (2016). Two Years of Aranea: Increasing Counts and Tuning the Pipeline. LREC, 4245–4248.
  • Bermel, N. (2014). Czech Diglossia: Dismantling or Dissolution? In J. Arokay, J. Gvozdanovic, & D. Miyajima (Eds.), Divided Languages? Diglossia, Translation and the Rise of Modernity in Japan, China, and the Slavic World (1st ed., pp. 21–37). Dordrecht: Springer International Publishing.
  • Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
  • Biber, D. (1995). Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge, England: Cambridge University Press.
  • Biber, D., & Conrad, S. (2009). Register, Genre, and Style. Cambridge, England: Cambridge University Press.
  • Biber, D., & Egbert, J. (2016). Register Variation on the Searchable Web: A Multi-Dimensional Analysis. Journal of English Linguistics, 44(2), 95–137.
  • Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory. [Ahead of print]
  • Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., Zasina, A. J., & Benko, V. (forthcoming). Comparing web-crawled and traditional corpora.
  • Davies, M. (2018). The 14 Billion Word iWeb Corpus. Retrieved from https://www.english-corpora.org/iweb/
  • Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen corpus family. 7th International Corpus Linguistics Conference CL, 125–127.
  • Revelle, W. (2018). psych: Procedures for Psychological, Psychometric, and Personality Research.
  • Sharoff, S. (2018). Functional Text Dimensions for the annotation of web corpora. Corpora, 13(1), 65–95.

Acknowledgments

This research was supported by the ERDF project Language Variation in the CNC no. CZ.02.1.01/0.0/0.0/16_013/0001758.

It builds upon work made possible by the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.
