Methodological issues in multi-dimensional analysis:

Insights from a from-scratch MDA of Czech

David Lukeš

https://korpus.cz

Preliminaries

Project(s)

  • MDA: “Language Variation in the CNC” no. CZ.02.1.01/0.0/0.0/16_013/0001758, supported by the European Regional Development Fund
  • Data & infrastructure: Czech National Corpus project (LM2018137) funded by the Ministry of Education, Youth and Sports of the Czech Republic
  • Additional information & resources: https://korpus.cz/mda

Team 1/2

Team 2/2

Koditex corpus

1. MDA crash course

1.1. 1 feature in 1 text

                     AA
TEXT_CHUNK_ID
08A014N/tumuf 32,157.21

1.2. 1 feature in 3292 texts

                      AA
TEXT_CHUNK_ID
08A014N/tumuf  32,157.21
08A015N/botava 29,443.25
08A021N/lazuk  45,610.43
08A041N/cecab  29,011.79
08A048N/lekof  35,779.48
08A069N/tikisa 17,415.22
08A089N/jepiv  30,684.10
08A103N/botava 26,707.76
08H007N/vakuga 34,646.42
08M006N/bazad  36,538.46
08M016N/jiroka 46,138.07
09A009N/jeziga 23,266.42
09A020N/tuzina 49,914.53
09A031N/hevuk  43,365.13
09A036N/lepih  45,995.89
09A044N/hocuc  13,648.29
09A044N/pugada 20,296.64
09A046N/tikisa 24,429.97
09A057N/cocir  54,427.65
09A057N/monuj  62,784.65
...

3292 rows (text chunks) × 1 column (linguistic feature)

1.3. 122 features in 3292 texts

                      AA    AAV       ABS    ACM      ADN       AMP     ARCH   ASIM  ...
TEXT_CHUNK_ID
08A014N/tumuf  32,157.21   0.00  5,359.54   0.00     0.00 15,631.98   446.63   0.00
08A015N/botava 29,443.25   0.00  1,606.00   0.00     0.00  8,565.31     0.00 535.33
08A021N/lazuk  45,610.43 685.87  5,486.97   0.00     0.00  8,916.32     0.00   0.00
08A041N/cecab  29,011.79   0.00  7,706.26   0.00     0.00  4,986.40     0.00   0.00
08A048N/lekof  35,779.48   0.00  3,650.97   0.00     0.00  6,571.74     0.00   0.00
08A069N/tikisa 17,415.22 916.59  4,582.95   0.00     0.00  5,499.54     0.00   0.00
08A089N/jepiv  30,684.10   0.00  5,533.20 503.02     0.00 13,581.49   503.02   0.00
08A103N/botava 26,707.76   0.00  1,027.22 513.61     0.00 10,272.21     0.00   0.00
08H007N/vakuga 34,646.42   0.00  8,542.95 474.61     0.00 12,814.43   949.22   0.00
08M006N/bazad  36,538.46 769.23  7,692.31   0.00     0.00 16,538.46   384.62   0.00
08M016N/jiroka 46,138.07 341.76  8,544.09   0.00     0.00  8,202.32     0.00   0.00
09A009N/jeziga 23,266.42   0.00  3,193.43   0.00     0.00  9,124.09     0.00   0.00
09A020N/tuzina 49,914.53 341.88 15,042.74   0.00     0.00 12,991.45     0.00   0.00
09A031N/hevuk  43,365.13   0.00  7,372.07   0.00     0.00  6,938.42     0.00   0.00
09A036N/lepih  45,995.89   0.00  9,034.91   0.00     0.00  8,624.23     0.00   0.00
09A044N/hocuc  13,648.29   0.00  7,874.02   0.00 1,049.87  3,674.54     0.00   0.00
09A044N/pugada 20,296.64   0.00  2,341.92   0.00 2,732.24 12,490.24 1,561.28   0.00
09A046N/tikisa 24,429.97   0.00  8,957.65   0.00     0.00  5,700.33   407.17   0.00
09A057N/cocir  54,427.65 431.97  1,295.90   0.00   863.93 12,527.00 2,591.79   0.00
09A057N/monuj  62,784.65 325.31  8,132.73   0.00   325.31 11,060.51   975.93   0.00
...

3292 rows (text chunks) × 122 columns (linguistic features)
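The table above can be thought of as the output of a simple counting-and-normalizing step. Below is a minimal, self-contained sketch (not the project's actual pipeline): raw per-chunk feature counts are converted to relative frequencies and assembled into a chunk × feature DataFrame. The per-million normalization base and the toy data are assumptions, not taken from the talk.

    # Hedged sketch: raw counts -> relative frequencies -> chunk × feature matrix.
    # Toy data and the per-million normalization base are assumptions.
    import pandas as pd

    raw_counts = {                                   # raw hits per feature in each chunk
        "CHUNK_A": {"AA": 95, "ABS": 12, "AMP": 40},
        "CHUNK_B": {"AA": 60, "ABS": 4, "ASIM": 2},
    }
    chunk_lengths = {"CHUNK_A": 3100, "CHUNK_B": 1900}   # tokens per chunk

    def relative_freqs(counts, n_tokens, per=1_000_000):
        """Normalize raw counts to frequency per `per` tokens."""
        return {feat: hits / n_tokens * per for feat, hits in counts.items()}

    X = pd.DataFrame.from_dict(
        {cid: relative_freqs(cnt, chunk_lengths[cid]) for cid, cnt in raw_counts.items()},
        orient="index",
    ).fillna(0.0)
    X.index.name = "TEXT_CHUNK_ID"
    print(X.round(2))                                # chunks as rows, features as columns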

1.4. Squeezing the columns

A.k.a. reducing the dimensionality: factor analysis condenses the 122 (partially intercorrelated) features into a small number of latent dimensions.

                GLS1  GLS2  GLS3  GLS4  GLS5  GLS6  GLS7  GLS8
TEXT_CHUNK_ID
08A014N/tumuf   1.07  3.69 -0.82 -0.62  1.61 -1.09  0.97 -0.21
08A015N/botava  1.67  3.71 -2.22 -0.28 -0.03 -0.22  0.74 -1.48
08A021N/lazuk   0.99  3.64 -1.14  0.26 -0.58 -0.74  0.53 -0.22
08A041N/cecab   1.62  3.43 -1.59 -1.15  1.54 -0.22  0.70  0.21
08A048N/lekof   0.96  4.29 -1.89 -0.04  1.53 -0.18 -0.87 -1.88
08A069N/tikisa  2.13  2.46 -1.87 -0.62  1.27  0.01  0.04 -2.74
08A089N/jepiv   1.14  3.85 -1.71 -0.20  0.18 -1.81  1.39  1.37
08A103N/botava  1.68  3.87 -2.22 -0.56  0.04 -0.16  0.38 -0.01
08H007N/vakuga  1.59  3.31 -1.38 -0.26 -0.82  0.17  0.49  0.16
08M006N/bazad   0.93  4.18 -1.37 -1.03 -0.19 -0.78  0.75  1.28
08M016N/jiroka  1.83  2.59 -1.73 -0.42  1.07 -0.35  1.03 -0.93
09A009N/jeziga  1.15  5.09 -1.47 -1.38  1.83 -1.73 -0.54 -0.79
09A020N/tuzina  1.04  3.86 -1.06 -0.10 -0.22  0.19  0.76 -0.69
09A031N/hevuk   1.18  3.42 -1.41 -0.14 -0.80 -0.71  0.78 -0.45
09A036N/lepih   1.00  3.92 -1.59  0.05  0.91 -0.41  0.49 -0.75
09A044N/hocuc   1.33  2.88 -1.52 -0.51  0.85 -0.52  0.25 -2.20
09A044N/pugada  1.47  3.01 -1.51 -0.87  0.84 -0.60 -0.57 -0.24
09A046N/tikisa  2.09  2.86 -2.15 -0.61  0.58 -0.34  0.42 -2.69
09A057N/cocir   0.52  4.57 -1.46 -0.83  0.92  0.17 -0.54  0.83
09A057N/monuj   0.67  2.86 -1.02 -0.14  0.00 -0.54  1.37  1.72
...

3292 rows (text chunks) × 8 columns (latent dimensions of variation)
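One way to obtain such a table is an exploratory factor analysis of the standardized feature matrix. The sketch below uses the Python factor_analyzer package on synthetic stand-in data; the promax rotation and minres extraction are assumptions (the talk's actual toolchain is not specified here, and factor_analyzer does not offer GLS factoring). The same fitted model also yields the feature loadings shown in the next section.

    # Hedged sketch of the dimensionality-reduction step via factor analysis.
    # Synthetic stand-in data with 8 planted latent dimensions; rotation and
    # extraction method are assumptions.
    import numpy as np
    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(3292, 8))                    # "true" dimension scores
    W = rng.normal(scale=0.5, size=(8, 122))               # "true" loadings
    X = pd.DataFrame(latent @ W + rng.normal(size=(3292, 122)),
                     columns=[f"FEAT{j:03d}" for j in range(122)])

    Z = (X - X.mean()) / X.std()                           # z-score each feature

    fa = FactorAnalyzer(n_factors=8, rotation="promax", method="minres")
    fa.fit(Z)

    scores = pd.DataFrame(fa.transform(Z), index=Z.index,
                          columns=[f"DIM{i}" for i in range(1, 9)])   # 3292 × 8
    loadings = pd.DataFrame(fa.loadings_, index=Z.columns,
                            columns=scores.columns)                   # 122 × 8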

1.5. Linguistic features load onto latent dimensions

DIMENSION→  GLS1  GLS2  GLS3  GLS4  GLS5  GLS6  GLS7  GLS8
FEATURE↓
AA         -0.78 -0.18 -0.06 -0.02 -0.06  0.24  0.03  0.02
AAV        -0.50 -0.15 -0.11 -0.30  0.04  0.19 -0.06  0.01
ABS        -0.72 -0.19  0.20 -0.37 -0.03  0.04  0.15 -0.02
ACM         0.05 -0.08  0.29 -0.03 -0.10 -0.04  0.12  0.08
ADN        -0.29 -0.03  0.13 -0.29  0.11  0.02  0.01  0.09
AMP         0.12  0.27  0.13  0.05  0.06 -0.13 -0.03  0.57
ARCH       -0.51 -0.08  0.13 -0.16  0.02 -0.07 -0.13  0.34
ASIM       -0.17  0.00 -0.27  0.04 -0.03  0.29  0.11  0.01
ATA1       -0.72 -0.18  0.16  0.01 -0.12  0.16 -0.08 -0.03
ATA21      -0.31  0.14 -0.04  0.01  0.05  0.20 -0.02 -0.04
ATA22      -0.40 -0.09 -0.07 -0.18 -0.05  0.21 -0.16 -0.05
BIG        -0.02 -0.27  0.16  0.76 -0.05 -0.02 -0.16  0.26
BYTS        0.11  0.25 -0.14 -0.10  0.35  0.02 -0.11 -0.02
CAS        -0.16 -0.62 -0.08  0.08 -0.04 -0.02 -0.13 -0.06
CIR124      0.55  0.21  0.17 -0.17 -0.13  0.11  0.16  0.03
CIR3        0.55  0.05  0.02  0.13 -0.08 -0.08 -0.09  0.43
CLUA       -0.70 -0.09 -0.14 -0.01 -0.07  0.16 -0.09  0.04
CLUAC      -0.67 -0.09 -0.12  0.02 -0.10  0.29 -0.04  0.02
CLUAD       0.27  0.45 -0.15  0.03 -0.01  0.00 -0.08  0.49
CLUN       -0.69 -0.22  0.08 -0.01 -0.04 -0.33  0.00 -0.13
...

122 rows (features) × 8 columns (dimensions)

2. Visualization

2.1. MDAvis

3. The case for text excerpts (chunks)

3.1. A corpus of text chunks

3.2. Advantages of chunking

  • Controlled length (2,000–5,000 words); see the chunking sketch after this list
    • Min. length: Reliability vs. volatility of feature measurements
    • Max. length: Homogeneity vs. heterogeneity of observation units
  • Diversity of corpus
  • Dispersion in MD space
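A minimal sketch of the chunking referenced above: splitting a tokenized text into chunks within the 2,000–5,000-token window. The real Koditex segmentation presumably respects higher-level boundaries (paragraphs, turns), so this even, token-count-based splitter is only an illustration.

    import math

    def chunk_text(tokens, min_len=2000, max_len=5000):
        """Split into the fewest chunks of <= max_len tokens, sized as evenly
        as possible; texts shorter than min_len are returned whole."""
        n = len(tokens)
        n_chunks = max(1, math.ceil(n / max_len))
        size = math.ceil(n / n_chunks)
        return [tokens[i:i + size] for i in range(0, n, size)]

    text = ["tok"] * 12_345                          # toy tokenized text
    print([len(c) for c in chunk_text(text)])        # -> [4115, 4115, 4115]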

3.3. Diversified stratified random sampling

Within each class:

  1. Pick the first chunk at random.
  2. Rank the remaining candidate chunks by their dissimilarity from the chunks selected so far, computed over selected metadata fields.
  3. Pick the next chunk at random from the top N most dissimilar candidates.
  4. Repeat steps 2–3 until the desired number of chunks is reached.
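A sketch of the procedure above for a single text class. The way dissimilarities are aggregated (summed over already-selected chunks), the metadata fields and the value of N are illustrative assumptions.

    import random

    def metadata_dissimilarity(cand, selected, fields):
        """Sum over selected chunks of how many metadata fields differ."""
        return sum(sum(cand[f] != s[f] for f in fields) for s in selected)

    def diversified_sample(chunks, k, fields, top_n=10, seed=0):
        """Pick k chunks from one class, preferring metadata diversity."""
        rng = random.Random(seed)
        pool = list(chunks)
        selected = [pool.pop(rng.randrange(len(pool)))]               # step 1
        while len(selected) < k and pool:
            pool.sort(key=lambda c: metadata_dissimilarity(c, selected, fields),
                      reverse=True)                                   # step 2
            idx = rng.randrange(min(top_n, len(pool)))                # step 3
            selected.append(pool.pop(idx))                            # step 4
        return selected

    # toy usage: chunks from one text class with two metadata fields
    chunks = [{"id": i, "author": f"a{i % 7}", "year": 1990 + i % 5}
              for i in range(100)]
    sample = diversified_sample(chunks, k=10, fields=("author", "year"))
    print([c["id"] for c in sample])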

3.4. Testing chunk length influence empirically

Corpus   No. of text chunks  Mean chunk length (words)  Std. dev.
Koditex                3292                     2745.8      748.6
WS-K1                  5000                     2743.3      772.1
WS-K2                  5000                     2748.4      771.2
WS-S                   1000                      290.8       74.4
  • WS-K* = web sample from Araneum Bohemicum Maximum 15.04 corpus, Koditex-like chunk lengths
  • WS-S = ditto, but short chunks

3.5. Influence of chunk length on dispersion

Araneum sample  Intersection with Koditex  Koditex complement  Araneum complement
WS-K1                               78.00               14.60                7.39
WS-K2                               77.10               15.10                7.82
WS-S                                73.40                9.82               16.70

Values in percent (each row sums to roughly 100: intersection plus the two complements).
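The slides do not spell out how these percentages were obtained. One conceivable approach, sketched below on stand-in data, is an occupancy grid over two MD dimensions: cells occupied by both corpora count towards the intersection, cells occupied by only one towards the respective complement. The grid resolution and the restriction to two dimensions are assumptions.

    import numpy as np

    def occupied_cells(scores_2d, edges):
        """Return the set of grid cells containing at least one text chunk."""
        hist, _, _ = np.histogram2d(scores_2d[:, 0], scores_2d[:, 1], bins=edges)
        return set(zip(*np.nonzero(hist)))

    rng = np.random.default_rng(0)
    koditex = rng.normal(0.0, 1.0, size=(3292, 2))   # stand-in Dim1 × Dim2 scores
    araneum = rng.normal(0.3, 0.9, size=(5000, 2))   # stand-in web sample

    lo = min(koditex.min(), araneum.min())
    hi = max(koditex.max(), araneum.max())
    edges = np.linspace(lo, hi, 41)                  # 40 × 40 grid over both sets

    k_cells = occupied_cells(koditex, edges)
    a_cells = occupied_cells(araneum, edges)
    union = k_cells | a_cells
    print(f"intersection  {len(k_cells & a_cells) / len(union):6.1%}")
    print(f"Koditex only  {len(k_cells - a_cells) / len(union):6.1%}")
    print(f"Araneum only  {len(a_cells - k_cells) / len(union):6.1%}")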

3.6. Influence of chunk length on dispersion: Dim1 × Dim2

3.7. Influence of chunk length on dispersion: Dim3 × Dim6

4. How many dimensions?

4.1. No silver bullet

  • It’s up to the researcher to pick between competing models with varying numbers of dimensions.
  • Biber (1995: 120): “there is no mathematically exact method for determining the number of factors to be extracted”
  • Revelle (2017: 38): “Each of the procedures has its advantages and disadvantages. […] The scree test is quite appealing but can lead to differences of interpretation as to when the scree ‘breaks’. Extracting interpretable factors [i.e. the number of factors which yields the most plausible interpretation] means that the number of factors reflects the investigator’s creativity more than the data. […] The eigen value of 1 rule, although the default for many programs, seems to be a rough way of dividing the number of variables by 3 and is probably the worst of all criteria.”

4.2. Scree plot

Scree plot of factor eigenvalues. Dashed line corresponds to eigenvalue = 1.
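A minimal sketch of how such a scree plot can be produced: eigenvalues of the feature correlation matrix plotted in decreasing order, with the conventional eigenvalue = 1 reference line (random stand-in data is used here in place of the real feature matrix).

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3292, 122))                 # stand-in feature matrix
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]  # descending

    plt.plot(range(1, len(eigvals) + 1), eigvals, marker="o")
    plt.axhline(1.0, linestyle="--")                 # eigenvalue = 1 rule of thumb
    plt.xlabel("Factor number")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()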

4.3. Tidiness: Yet another heuristic

  • Think of dimensions as groups of features.
  • Suppose someone hands us the “correct” groups.
  • Then: pick the model which groups features as similarly as possible.

4.4. Tidy group mapping

Example of a “tidy” relationship between the “true” grouping of linguistic features into dimensions (on the left) and an MD model (= groups of features inferred via FA, on the right).

4.5. Messy group mapping

Example of a “tangled” relationship between the “true” grouping of linguistic features into dimensions (on the left) and an MD model (on the right).

4.6. Theoretical background

Comparing two groupings of features using information-theoretic measures: mutual information and joint entropy.

Tidiness(A, B) = I(A; B) / H(A, B), i.e. the mutual information of the two groupings normalized by their joint entropy; the result ranges from 0 (unrelated groupings) to 1 (identical groupings up to relabeling).

Details: https://github.com/czcorpus/mda
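A standalone sketch of the tidiness measure using toy groupings; see the czcorpus/mda repository for the actual implementation.

    import pandas as pd
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    def tidiness(groups_a, groups_b):
        """I(A; B) / H(A, B) for two groupings of the same features (range 0-1)."""
        mi = mutual_info_score(groups_a, groups_b)                       # nats
        joint = pd.crosstab(pd.Series(groups_a), pd.Series(groups_b)).to_numpy()
        joint_h = entropy(joint.ravel() / joint.sum())                   # nats
        return mi / joint_h

    true_groups  = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # "true" dimension of each feature
    model_groups = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # grouping induced by an MD model
    print(round(tidiness(true_groups, model_groups), 3))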

4.7. In practice

5. Application: Corpus comparison

5.1. Traditional vs. web corpora

  • Theoretical question: How well do web corpora cover the variability present in a given language? What do they represent?
  • Practical question: To what extent can we replace (more expensive) traditional corpus data with (cheaper, more abundant) web corpus data?

5.2. A lot of overlap

5.3. But not everything is on the web

Koditex classes which occupy noticeably different regions of the space defined by the first 2 dimensions of the MD model compared to web-crawled Araneum samples.

5.4. Also: there’s web, and then there’s web

Comparison of web-based Koditex text classes with Araneum web-crawled data.

6. Application: Identifying sources of variation

6.1. Elicited texts

CPACT project (Computational Psycholinguistic Analysis of Czech Text, Dalibor Kučera)

6.2. Author vs. scenario: per dimension

Proportions of linguistic variation attributable to author and scenario, estimated via effect size measures for ANOVA, the Kruskal-Wallis test and a linear mixed model (LMM). The residual variation is accounted for by other effects not considered here.
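For one dimension, the ANOVA-based proportions can be estimated e.g. as eta squared from a two-way ANOVA, as sketched below on synthetic data. The Kruskal-Wallis and LMM variants are analogous but not reproduced here; the column names and data are illustrative assumptions.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "author":   np.repeat([f"a{i}" for i in range(20)], 5),   # 20 authors
        "scenario": np.tile([f"s{j}" for j in range(5)], 20),     # 5 scenarios each
    })
    df["dim_score"] = (rng.normal(size=100)
                       + df["scenario"].str[1:].astype(int) * 0.8   # scenario effect
                       + df["author"].str[1:].astype(int) * 0.05)   # weaker author effect

    model = smf.ols("dim_score ~ C(author) + C(scenario)", data=df).fit()
    aov = sm.stats.anova_lm(model, typ=2)
    eta_sq = aov["sum_sq"] / aov["sum_sq"].sum()   # author, scenario, residual shares
    print(eta_sq.round(3))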

6.3. Author vs. scenario: summary

Method          Scenario  Author
ANOVA              0.612   0.388
Kruskal-Wallis     0.628   0.372
LMM                0.727   0.273

Average proportion of variation explained by author vs. scenario across dimensions, weighted by the importance of each dimension to the MD model. Before averaging, the per-dimension proportions were rescaled to exclude variation not attributable to either explanatory variable.
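A sketch of this summary step with made-up numbers: per-dimension proportions are rescaled to sum to 1 over the two explanatory variables and then averaged with dimension-importance weights (the actual weights are not given in the slides; they are treated here as an input).

    import numpy as np

    # illustrative per-dimension proportions (author, scenario) and weights
    author   = np.array([0.30, 0.45, 0.20, 0.10])
    scenario = np.array([0.50, 0.35, 0.55, 0.30])
    weights  = np.array([0.35, 0.30, 0.20, 0.15])      # importance of each dimension

    total = author + scenario                          # drop unattributed variation
    author_rescaled, scenario_rescaled = author / total, scenario / total

    print("Author:  ", round(np.average(author_rescaled, weights=weights), 3))
    print("Scenario:", round(np.average(scenario_rescaled, weights=weights), 3))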

6.4. Distances between elicited texts

Two types of distances between texts in the CPACT corpus. Black arrows represent distances between texts by the same author; red ones represent distances between texts based on the same scenario.

6.5. Distances in MD space

6.6. Within-author vs. within-scenario distances

Distance type  Mean     SD  Median    MAD
Same author    1.46  0.539    1.42  0.546
Same scenario  1.10  0.375    1.05  0.359

Mean, median, standard deviation and MAD (median absolute deviation) values for two types of distances between pairs of texts in the CPACT corpus.
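A sketch of how such a summary might be computed: Euclidean distances between texts in the 8-dimensional MD space, grouped into same-author and same-scenario pairs. The distance metric and the stand-in data are assumptions.

    from itertools import combinations
    import numpy as np
    import pandas as pd
    from scipy.stats import median_abs_deviation

    rng = np.random.default_rng(0)
    authors   = np.repeat([f"a{i}" for i in range(20)], 5)   # 20 authors × 5 texts
    scenarios = np.tile([f"s{j}" for j in range(5)], 20)     # 5 scenarios
    scores    = rng.normal(size=(100, 8))                    # stand-in MD dimension scores

    rows = []
    for i, j in combinations(range(len(scores)), 2):
        d = np.linalg.norm(scores[i] - scores[j])
        if authors[i] == authors[j]:
            rows.append(("same author", d))
        if scenarios[i] == scenarios[j]:
            rows.append(("same scenario", d))

    dists = pd.DataFrame(rows, columns=["type", "distance"])
    print(dists.groupby("type")["distance"]
              .agg(["mean", "std", "median", median_abs_deviation]))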

7. Application: Diachronic study of registers

7.1. Disclaimer: not my work

7.2. Czech newspapers, 1995–2018

7.3. Czech newspapers, 1995–2018

Thank you for your attention!