Last 15+ years – advent of web-crawled corpora
Crucial question: What do the web corpora represent?
Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993, p. 243).
Thus a corpus design can be evaluated for the extent to which it includes: (1) the range of text types in a language, and (2) the range of linguistic distributions in a language (Biber 1993, p. 243).
\(\Rightarrow\) comparing corpora w.r.t. the variability they cover
Biber 1995; Biber & Conrad 2009
Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory.
wri, spo, web
Category | Count |
---|---|
Tokens | 10.8 M |
Words (excl. punct.) | 9 M |
Lemmata (types) | 204 K |
Text chunks | 3 334 |
11 % of tokens, 421 text chunks
Division (class) | % of tokens | Text chunks |
---|---|---|
Multi-directional (3) | 6.6 % | 263 |
Uni-directional (2) | 4.5 % | 158 |
Originally 140+ features, final list 122, e.g.:
Two samples (named WS-K1 and WS-K2), based on Araneum Bohemicum
Average proportions of shared and corpus-specific variation ranges:
WS batch | Intersection with Koditex | Koditex complement | WS complement |
---|---|---|---|
WS-K1 | 78.00% | 14.60% | 7.39% |
WS-K2 | 77.10% | 15.10% | 7.82% |
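The three columns in each row sum to roughly 100 %, which suggests they are shares of the combined (union) range. A minimal sketch of how such shares could be computed for one dimension and then averaged, assuming each corpus is summarised by a (min, max) interval of scores per dimension; the function names, the interval representation, and the example numbers are illustrative assumptions, not the procedure actually used in the study:

```python
def range_shares(koditex, ws):
    """Split the union of two score ranges into shared and corpus-specific parts.

    Each argument is a hypothetical (low, high) interval of dimension scores.
    Returns proportions of the union length (they sum to 1).
    """
    k_lo, k_hi = koditex
    w_lo, w_hi = ws
    # Length of the overlap; zero if the intervals are disjoint.
    overlap = max(0.0, min(k_hi, w_hi) - max(k_lo, w_lo))
    k_len = k_hi - k_lo
    w_len = w_hi - w_lo
    # Total length covered by at least one of the two corpora.
    union = k_len + w_len - overlap
    if union <= 0:
        raise ValueError("both ranges are empty")
    return {
        "intersection": overlap / union,
        "koditex_only": (k_len - overlap) / union,
        "ws_only": (w_len - overlap) / union,
    }


def average_shares(pairs):
    """Average the three shares over several dimensions."""
    results = [range_shares(k, w) for k, w in pairs]
    return {key: sum(r[key] for r in results) / len(results) for key in results[0]}


# Made-up ranges for two dimensions, for illustration only.
example = [((-2.0, 6.0), (0.0, 8.0)), ((0.0, 10.0), (0.0, 10.0))]
shares = average_shares(example)
```

Under this reading, a high intersection share means both corpora span largely the same part of a dimension, while the two complements show how much of the combined range only one corpus reaches.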
spo-int
Main take-aways:
Follow-up questions:
This research was supported by the ERDF project Language Variation in the CNC no. CZ.02.1.01/0.0/0.0/16_013/0001758.
It builds upon work made possible by the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.