David Lukeš
TEXT_CHUNK_ID | AA |
---|---|
08A014N/tumuf | 32,157.21 |
08A015N/botava | 29,443.25 |
08A021N/lazuk | 45,610.43 |
08A041N/cecab | 29,011.79 |
08A048N/lekof | 35,779.48 |
08A069N/tikisa | 17,415.22 |
08A089N/jepiv | 30,684.10 |
08A103N/botava | 26,707.76 |
08H007N/vakuga | 34,646.42 |
08M006N/bazad | 36,538.46 |
08M016N/jiroka | 46,138.07 |
09A009N/jeziga | 23,266.42 |
09A020N/tuzina | 49,914.53 |
09A031N/hevuk | 43,365.13 |
09A036N/lepih | 45,995.89 |
09A044N/hocuc | 13,648.29 |
09A044N/pugada | 20,296.64 |
09A046N/tikisa | 24,429.97 |
09A057N/cocir | 54,427.65 |
09A057N/monuj | 62,784.65 |
... | ... |
3292 rows (text chunks) × 1 column (linguistic feature)
TEXT_CHUNK_ID | AA | AAV | ABS | ACM | ADN | AMP | ARCH | ASIM | ... |
---|---|---|---|---|---|---|---|---|---|
08A014N/tumuf | 32,157.21 | 0.00 | 5,359.54 | 0.00 | 0.00 | 15,631.98 | 446.63 | 0.00 | ... |
08A015N/botava | 29,443.25 | 0.00 | 1,606.00 | 0.00 | 0.00 | 8,565.31 | 0.00 | 535.33 | ... |
08A021N/lazuk | 45,610.43 | 685.87 | 5,486.97 | 0.00 | 0.00 | 8,916.32 | 0.00 | 0.00 | ... |
08A041N/cecab | 29,011.79 | 0.00 | 7,706.26 | 0.00 | 0.00 | 4,986.40 | 0.00 | 0.00 | ... |
08A048N/lekof | 35,779.48 | 0.00 | 3,650.97 | 0.00 | 0.00 | 6,571.74 | 0.00 | 0.00 | ... |
08A069N/tikisa | 17,415.22 | 916.59 | 4,582.95 | 0.00 | 0.00 | 5,499.54 | 0.00 | 0.00 | ... |
08A089N/jepiv | 30,684.10 | 0.00 | 5,533.20 | 503.02 | 0.00 | 13,581.49 | 503.02 | 0.00 | ... |
08A103N/botava | 26,707.76 | 0.00 | 1,027.22 | 513.61 | 0.00 | 10,272.21 | 0.00 | 0.00 | ... |
08H007N/vakuga | 34,646.42 | 0.00 | 8,542.95 | 474.61 | 0.00 | 12,814.43 | 949.22 | 0.00 | ... |
08M006N/bazad | 36,538.46 | 769.23 | 7,692.31 | 0.00 | 0.00 | 16,538.46 | 384.62 | 0.00 | ... |
08M016N/jiroka | 46,138.07 | 341.76 | 8,544.09 | 0.00 | 0.00 | 8,202.32 | 0.00 | 0.00 | ... |
09A009N/jeziga | 23,266.42 | 0.00 | 3,193.43 | 0.00 | 0.00 | 9,124.09 | 0.00 | 0.00 | ... |
09A020N/tuzina | 49,914.53 | 341.88 | 15,042.74 | 0.00 | 0.00 | 12,991.45 | 0.00 | 0.00 | ... |
09A031N/hevuk | 43,365.13 | 0.00 | 7,372.07 | 0.00 | 0.00 | 6,938.42 | 0.00 | 0.00 | ... |
09A036N/lepih | 45,995.89 | 0.00 | 9,034.91 | 0.00 | 0.00 | 8,624.23 | 0.00 | 0.00 | ... |
09A044N/hocuc | 13,648.29 | 0.00 | 7,874.02 | 0.00 | 1,049.87 | 3,674.54 | 0.00 | 0.00 | ... |
09A044N/pugada | 20,296.64 | 0.00 | 2,341.92 | 0.00 | 2,732.24 | 12,490.24 | 1,561.28 | 0.00 | ... |
09A046N/tikisa | 24,429.97 | 0.00 | 8,957.65 | 0.00 | 0.00 | 5,700.33 | 407.17 | 0.00 | ... |
09A057N/cocir | 54,427.65 | 431.97 | 1,295.90 | 0.00 | 863.93 | 12,527.00 | 2,591.79 | 0.00 | ... |
09A057N/monuj | 62,784.65 | 325.31 | 8,132.73 | 0.00 | 325.31 | 11,060.51 | 975.93 | 0.00 | ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3292 rows (text chunks) × 122 columns (linguistic features)
A.k.a. reducing the dimensionality.
TEXT_CHUNK_ID | GLS1 | GLS2 | GLS3 | GLS4 | GLS5 | GLS6 | GLS7 | GLS8 |
---|---|---|---|---|---|---|---|---|
08A014N/tumuf | 1.07 | 3.69 | -0.82 | -0.62 | 1.61 | -1.09 | 0.97 | -0.21 |
08A015N/botava | 1.67 | 3.71 | -2.22 | -0.28 | -0.03 | -0.22 | 0.74 | -1.48 |
08A021N/lazuk | 0.99 | 3.64 | -1.14 | 0.26 | -0.58 | -0.74 | 0.53 | -0.22 |
08A041N/cecab | 1.62 | 3.43 | -1.59 | -1.15 | 1.54 | -0.22 | 0.70 | 0.21 |
08A048N/lekof | 0.96 | 4.29 | -1.89 | -0.04 | 1.53 | -0.18 | -0.87 | -1.88 |
08A069N/tikisa | 2.13 | 2.46 | -1.87 | -0.62 | 1.27 | 0.01 | 0.04 | -2.74 |
08A089N/jepiv | 1.14 | 3.85 | -1.71 | -0.20 | 0.18 | -1.81 | 1.39 | 1.37 |
08A103N/botava | 1.68 | 3.87 | -2.22 | -0.56 | 0.04 | -0.16 | 0.38 | -0.01 |
08H007N/vakuga | 1.59 | 3.31 | -1.38 | -0.26 | -0.82 | 0.17 | 0.49 | 0.16 |
08M006N/bazad | 0.93 | 4.18 | -1.37 | -1.03 | -0.19 | -0.78 | 0.75 | 1.28 |
08M016N/jiroka | 1.83 | 2.59 | -1.73 | -0.42 | 1.07 | -0.35 | 1.03 | -0.93 |
09A009N/jeziga | 1.15 | 5.09 | -1.47 | -1.38 | 1.83 | -1.73 | -0.54 | -0.79 |
09A020N/tuzina | 1.04 | 3.86 | -1.06 | -0.10 | -0.22 | 0.19 | 0.76 | -0.69 |
09A031N/hevuk | 1.18 | 3.42 | -1.41 | -0.14 | -0.80 | -0.71 | 0.78 | -0.45 |
09A036N/lepih | 1.00 | 3.92 | -1.59 | 0.05 | 0.91 | -0.41 | 0.49 | -0.75 |
09A044N/hocuc | 1.33 | 2.88 | -1.52 | -0.51 | 0.85 | -0.52 | 0.25 | -2.20 |
09A044N/pugada | 1.47 | 3.01 | -1.51 | -0.87 | 0.84 | -0.60 | -0.57 | -0.24 |
09A046N/tikisa | 2.09 | 2.86 | -2.15 | -0.61 | 0.58 | -0.34 | 0.42 | -2.69 |
09A057N/cocir | 0.52 | 4.57 | -1.46 | -0.83 | 0.92 | 0.17 | -0.54 | 0.83 |
09A057N/monuj | 0.67 | 2.86 | -1.02 | -0.14 | 0.00 | -0.54 | 1.37 | 1.72 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3292 rows (text chunks) × 8 columns (latent dimensions of variation)
FEATURE | GLS1 | GLS2 | GLS3 | GLS4 | GLS5 | GLS6 | GLS7 | GLS8 |
---|---|---|---|---|---|---|---|---|
AA | -0.78 | -0.18 | -0.06 | -0.02 | -0.06 | 0.24 | 0.03 | 0.02 |
AAV | -0.50 | -0.15 | -0.11 | -0.30 | 0.04 | 0.19 | -0.06 | 0.01 |
ABS | -0.72 | -0.19 | 0.20 | -0.37 | -0.03 | 0.04 | 0.15 | -0.02 |
ACM | 0.05 | -0.08 | 0.29 | -0.03 | -0.10 | -0.04 | 0.12 | 0.08 |
ADN | -0.29 | -0.03 | 0.13 | -0.29 | 0.11 | 0.02 | 0.01 | 0.09 |
AMP | 0.12 | 0.27 | 0.13 | 0.05 | 0.06 | -0.13 | -0.03 | 0.57 |
ARCH | -0.51 | -0.08 | 0.13 | -0.16 | 0.02 | -0.07 | -0.13 | 0.34 |
ASIM | -0.17 | 0.00 | -0.27 | 0.04 | -0.03 | 0.29 | 0.11 | 0.01 |
ATA1 | -0.72 | -0.18 | 0.16 | 0.01 | -0.12 | 0.16 | -0.08 | -0.03 |
ATA21 | -0.31 | 0.14 | -0.04 | 0.01 | 0.05 | 0.20 | -0.02 | -0.04 |
ATA22 | -0.40 | -0.09 | -0.07 | -0.18 | -0.05 | 0.21 | -0.16 | -0.05 |
BIG | -0.02 | -0.27 | 0.16 | 0.76 | -0.05 | -0.02 | -0.16 | 0.26 |
BYTS | 0.11 | 0.25 | -0.14 | -0.10 | 0.35 | 0.02 | -0.11 | -0.02 |
CAS | -0.16 | -0.62 | -0.08 | 0.08 | -0.04 | -0.02 | -0.13 | -0.06 |
CIR124 | 0.55 | 0.21 | 0.17 | -0.17 | -0.13 | 0.11 | 0.16 | 0.03 |
CIR3 | 0.55 | 0.05 | 0.02 | 0.13 | -0.08 | -0.08 | -0.09 | 0.43 |
CLUA | -0.70 | -0.09 | -0.14 | -0.01 | -0.07 | 0.16 | -0.09 | 0.04 |
CLUAC | -0.67 | -0.09 | -0.12 | 0.02 | -0.10 | 0.29 | -0.04 | 0.02 |
CLUAD | 0.27 | 0.45 | -0.15 | 0.03 | -0.01 | 0.00 | -0.08 | 0.49 |
CLUN | -0.69 | -0.22 | 0.08 | -0.01 | -0.04 | -0.33 | 0.00 | -0.13 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
122 rows (features) × 8 columns (dimensions)
Within each class:
Corpus | No. of text chunks | Mean chunk length | Std. dev. |
---|---|---|---|
Koditex | 3292 | 2745.8 | 748.6 |
WS-K1 | 5000 | 2743.3 | 772.1 |
WS-K2 | 5000 | 2748.4 | 771.2 |
WS-S | 1000 | 290.8 | 74.4 |
Araneum sample | Intersection with Koditex | Koditex complement | Araneum complement |
---|---|---|---|
WS-K1 | 78.00 | 14.60 | 7.39 |
WS-K2 | 77.10 | 15.10 | 7.82 |
WS-S | 73.40 | 9.82 | 16.70 |
Numeric values in percentage points.
Scree plot of factor eigenvalues. Dashed line corresponds to eigenvalue = 1.
Example of a “tidy” relationship between the “true” grouping of linguistic features into dimensions (on the left) and an MD model (= groups of features inferred via FA, on the right).
Example of a “tangled” relationship between the “true” grouping of linguistic features into dimensions (on the left) and an MD model (on the right).
Comparing two groupings of features using information-theoretic measures: mutual information and joint entropy.
Tidiness = Mutual information / Joint entropy
Details: https://github.com/czcorpus/mda
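As a rough illustration of the tidiness ratio, here is a minimal Python sketch. It assumes each grouping is given as a mapping from feature name to group label, and that every feature carries equal weight when the joint distribution is estimated; both are assumptions made for this example and may differ from the implementation in the linked repository. The function name `tidiness` and the toy groupings are hypothetical.

```python
from collections import Counter
from math import log2


def tidiness(true_groups, model_groups):
    """Mutual information divided by joint entropy of two feature groupings.

    Both arguments map feature name -> group label (assumed input format).
    """
    features = sorted(set(true_groups) & set(model_groups))
    n = len(features)
    # Count how often each (true group, model group) pair occurs.
    joint = Counter((true_groups[f], model_groups[f]) for f in features)
    left = Counter(true_groups[f] for f in features)
    right = Counter(model_groups[f] for f in features)

    mutual_information = sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )
    joint_entropy = -sum((c / n) * log2(c / n) for c in joint.values())
    if joint_entropy == 0:
        raise ValueError("tidiness is undefined when all features share one group")
    return mutual_information / joint_entropy


# Hypothetical toy groupings of four features.
true_groups = {"AA": 1, "AAV": 1, "ABS": 2, "ACM": 2}
tidy_model = {"AA": "x", "AAV": "x", "ABS": "y", "ACM": "y"}
tangled_model = {"AA": "x", "AAV": "y", "ABS": "x", "ACM": "y"}

print(tidiness(true_groups, tidy_model))     # 1.0
print(tidiness(true_groups, tangled_model))  # 0.0
```

Under these assumptions the ratio is 1 when the two groupings match up to relabelling and 0 when knowing a feature's group in one grouping says nothing about its group in the other.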
Koditex classes which occupy noticeably different regions of the space defined by the first 2 dimensions of the MD model compared to web-crawled Araneum samples.
Comparison of web-based Koditex text classes with Araneum web-crawled data.
CPACT project (Computational Psycholinguistic Analysis of Czech Text, Dalibor Kučera)
Proportions of linguistic variation attributable to author and scenario estimated as effect size measures for ANOVA, the Kruskal-Wallis test and LMM. The residual variation is accounted for by other effects not considered here.
Method | Scenario | Author |
---|---|---|
ANOVA | 0.612 | 0.388 |
Kruskal-Wallis | 0.628 | 0.372 |
LMM | 0.727 | 0.273 |
Average proportion of variation explained by author vs. scenario across dimensions, weighted by the importance of each dimension to the MD model. Rescaled (prior to averaging) to exclude variation not attributable to either of the explanatory variables.
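The rescaling and weighted averaging described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: the per-dimension effect sizes and the dimension weights are made-up numbers, and the function name `weighted_shares` is hypothetical. How the importance of a dimension is actually measured is not specified here, so the weights are simply passed in.

```python
def weighted_shares(effects, weights):
    """Weighted average share of variation for scenario vs. author.

    effects: {dimension: (scenario_effect, author_effect)}  (assumed format)
    weights: {dimension: importance of the dimension}       (assumed format)
    """
    total_weight = sum(weights[dim] for dim in effects)
    scenario_share = 0.0
    for dim, (scenario, author) in effects.items():
        explained = scenario + author
        if explained == 0:
            raise ValueError(f"no variation attributed to either factor in {dim}")
        # Rescale first, so the two shares sum to 1 within each dimension ...
        rescaled_scenario = scenario / explained
        # ... then accumulate the weighted contribution of this dimension.
        scenario_share += weights[dim] * rescaled_scenario
    scenario_share /= total_weight
    return scenario_share, 1.0 - scenario_share


# Hypothetical effect sizes and weights for two dimensions.
effects = {"GLS1": (0.30, 0.10), "GLS2": (0.20, 0.20)}
weights = {"GLS1": 3.0, "GLS2": 1.0}
scenario, author = weighted_shares(effects, weights)
print(round(scenario, 4), round(author, 4))
```

Because rescaling happens before averaging, the two resulting shares always sum to 1, which matches the rows of the table above.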
Two types of distances between texts in the CPACT corpus. Black arrows represent distances between texts of the same author, red ones represent distances between texts based on identical scenarios.
Distance type | Mean | SD | Median | MAD |
---|---|---|---|---|
Same author | 1.46 | 0.539 | 1.42 | 0.546 |
Same scenario | 1.10 | 0.375 | 1.05 | 0.359 |
Mean, standard deviation (SD), median and MAD (median absolute deviation) values for two types of distances between pairs of texts in the CPACT corpus.
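A summary of this kind could be computed along the following lines. The sketch assumes that each text is represented by its coordinates in the space of the MD model together with an author and a scenario label, and that distances are Euclidean; the record layout, the function names and the toy data are all hypothetical. The MAD here is unscaled by default. Some tools (for instance R's `mad`) multiply it by 1.4826 by default, and which convention the table uses is not stated, so the scaling is left as a parameter.

```python
from itertools import combinations
from math import dist
from statistics import mean, median, stdev


def pair_distances(texts, key):
    """Euclidean distances between all pairs of texts sharing the same `key` value."""
    return [
        dist(a["coords"], b["coords"])
        for a, b in combinations(texts, 2)
        if a[key] == b[key]
    ]


def mad(values, scale=1.0):
    """Median absolute deviation from the median, optionally rescaled."""
    center = median(values)
    return scale * median(abs(v - center) for v in values)


def summarize(values):
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "mad": mad(values),
    }


# Hypothetical toy corpus: two authors, two scenarios, two dimensions.
texts = [
    {"author": "a1", "scenario": "s1", "coords": (0.0, 0.0)},
    {"author": "a1", "scenario": "s2", "coords": (3.0, 4.0)},
    {"author": "a2", "scenario": "s1", "coords": (0.0, 1.0)},
    {"author": "a2", "scenario": "s2", "coords": (3.0, 0.0)},
]

print(summarize(pair_distances(texts, "author")))
print(summarize(pair_distances(texts, "scenario")))
```

In the toy data every author writes exactly one text per scenario, so a same-author pair always differs in scenario and a same-scenario pair always differs in author.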