We have a new corpus of casual spoken Czech with manual phonetic transcription → ORTOFON v1
Google: korpus ortofon
This presentation: https://trnka.korpus.cz/~lukes/slides/slavicorp2018/ortofon
Our wiki: https://wiki.korpus.cz/doku.php/cnk:ortofon
KonText query interface: https://kontext.korpus.cz/first_form?corpname=ortofon_v1
All the data via LINDAT:
corpus | size | tagging | time span |
---|---|---|---|
ORTOFON | 1M | ✓ | 2012–2017 |
ORAL | 5.4M | ✓ | 2002–2011 |
↳ ORAL2013 | 2.8M | ✗ | 2008–2011 |
↳ ORAL2008 | 1M | ✗ | 2002–2007 |
↳ ORAL2006 | 1M | ✗ | 2002–2006 |
BMK | 490k | ✗ | 1994–1999 |
PMK | 675k | ✗ | 1988–1996 |
corpus | size | tagging | time span |
---|---|---|---|
DIALEKT | 100k | ✓ | 1957–2015 |
LINDSEI_CZ | 120k | ✗ | 2012–2015 |
SCHOLA2010 | 790k | ✗ | 2005–2008 |
# of … | |
---|---|
… tokens | 1,236,508 |
… tokens without punctuation, hesitations and interjections | 1,014,786 |
… different word forms | 65,294 |
… conversations recorded | 332 |
… unique speakers | 624 |
→ length of recordings [hh:mm:ss.ms] | 102:41:14.247 |
On the basis of the following metadata:
Resulting number of categories: \(2 \times 2 \times 2 \times 10 = 80\)
Ideally: equal representation of these 80 categories, at least 5 speakers per category.
→ Target number of words per category: \(\frac{1\ 000\ 000}{80} = 12\ 500\)
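The arithmetic above can be checked with a short sketch. The list of level counts mirrors the factors 2 × 2 × 2 × 10; the variable names are illustrative, and the minimum speaker total is only what follows from requiring 5 speakers in every category.

```python
import math

# Number of levels of each metadata variable (the factors 2 x 2 x 2 x 10).
levels = [2, 2, 2, 10]
target_corpus_size = 1_000_000      # words
min_speakers_per_category = 5

n_categories = math.prod(levels)
words_per_category = target_corpus_size // n_categories
min_speakers = n_categories * min_speakers_per_category

print(n_categories, words_per_category, min_speakers)  # 80 12500 400
```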
final devoicing:
<hrad (nom.), hradu (gen.)> → [hrat, hradu]
regressive / anticipatory assimilation of voicing, even across word boundaries:
<hrad, hrad byl> → [hrat, hrad bil]
in Moravia/Silesia (the eastern part of the country), also triggered by sonorants [r, l, m, n, j…]:
<tak jako> → [tag jako]
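A toy sketch of the three rules above, operating on the last segment of each word only. It is a simplification under stated assumptions: the obstruent pairs in `DEVOICE` are a partial, hand-picked set, the function name and the `sonorants_trigger` flag are hypothetical, other contexts (e.g. a following vowel) are left unmodelled, and vowel spelling is not converted (so it yields `byl`, not `bil`).

```python
# Partial set of voiced -> voiceless obstruent pairs (an assumption, not exhaustive).
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "ž": "š", "v": "f"}
VOICE = {voiceless: voiced for voiced, voiceless in DEVOICE.items()}
SONORANTS = set("rlmnj")


def apply_voicing(words, sonorants_trigger=False):
    """Adjust the voicing of each word-final segment.

    sonorants_trigger=True mimics the Moravian/Silesian pattern in which
    a following sonorant also voices a preceding obstruent.
    """
    result = []
    for i, word in enumerate(words):
        last = word[-1]
        following = words[i + 1][0] if i + 1 < len(words) else None
        if following is None:
            # Before a pause: final devoicing.
            last = DEVOICE.get(last, last)
        elif following in DEVOICE or (sonorants_trigger and following in SONORANTS):
            # Voiced trigger follows: regressive voicing.
            last = VOICE.get(last, last)
        elif following in VOICE:
            # Voiceless obstruent follows: regressive devoicing.
            last = DEVOICE.get(last, last)
        # Any other context is left unchanged in this sketch.
        result.append(word[:-1] + last)
    return result


print(apply_voicing(["hrad"]))                                 # ['hrat']
print(apply_voicing(["hrad", "byl"]))                          # ['hrad', 'byl']
print(apply_voicing(["tak", "jako"], sonorants_trigger=True))  # ['tag', 'jako']
```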
WORD | FREQ |
---|---|
tak | 651 |
bych | 348 |
už | 239 |
těch | 224 |
když | 193 |
teď | 186 |
vod | 184 |
jak | 169 |
vůbec | 127 |
pak | 122 |
WORD | FREQ |
---|---|
tak | 1288 |
už | 483 |
jak | 345 |
když | 267 |
teď | 190 |
fakt | 181 |
vůbec | 162 |
víš | 155 |
bych | 153 |
pak | 145 |
RANK | WORD | ENTROPY |
---|---|---|
1 | ježišmarja | 3.803729 |
2 | samozřejmě | 3.717063 |
3 | protože | 3.603883 |
4 | sedmdesát | 3.127680 |
5 | takovýhle | 3.110014 |
6 | sedmnáct | 3.096503 |
7 | člověk | 3.037660 |
8 | ježíšmarjá | 2.947005 |
9 | šestnáct | 2.927707 |
10 | ježíš | 2.883297 |
11 | tohleto | 2.880382 |
12 | normálně | 2.843373 |
13 | povídám | 2.782390 |
RANK | WORD | ENTROPY |
---|---|---|
14 | nějakého | 2.752697 |
15 | takového | 2.682409 |
16 | ježiš | 2.680650 |
17 | podívat | 2.678791 |
18 | tadyhle | 2.676441 |
19 | vůbec | 2.671444 |
20 | potřebovat | 2.637769 |
21 | čtyřicet | 2.619200 |
22 | myslíš | 2.586492 |
23 | přijít | 2.574731 |
24 | takovýho | 2.573642 |
25 | osmnáct | 2.565948 |
26 | ježíšmarja | 2.523211 |
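For readers unfamiliar with the measure, here is a minimal sketch of Shannon entropy over a frequency distribution. It assumes, without confirmation from the slides, that the ENTROPY column is computed over the distribution of attested variants of each word and expressed in bits (log base 2). The variant labels and counts below are made up purely for illustration.

```python
import math
from collections import Counter


def entropy(variant_counts):
    """Shannon entropy (in bits) of a frequency distribution of variants."""
    total = sum(variant_counts.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in variant_counts.values()
        if n > 0
    )


# Hypothetical counts: one dominant variant and two rarer ones.
counts = Counter({"variant_a": 2, "variant_b": 1, "variant_c": 1})
print(entropy(counts))  # 1.5
```

Under this reading, a word with a single attested variant has entropy 0, and the value grows as the variants become more numerous and more evenly distributed.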
abc and zbc:
in practice, deletion and substitution (~ formal simplification) much more common than addition (epenthesis)
→ high normalized Levenshtein distance ~ high amount of simplification
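A minimal sketch of the distance measure mentioned above. The slides do not say how the distance is normalized; dividing by the length of the longer string is an assumption made here, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Distance divided by the longer length (assumed normalization), in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("abc", "zbc"))  # 1 (one substitution)
print(levenshtein("abc", "ac"))   # 1 (one deletion)
```

Comparing an orthographic form with its phonetic transcription this way gives 0 for identical strings and values closer to 1 the more segments are deleted or substituted.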
This research was supported by the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.