Pronunciation of casual spoken Czech:

A quantitative survey

David Lukeš

Zuzana Komrsková

Marie Kopřivová

Petra Poukarová

Slavicorp 2018 | September 25th, 2018

Introduction

Overview

Point of departure

We have a new corpus of casual spoken Czech with manual phonetic transcription: ORTOFON v1

Goals

  1. Here’s some cool stuff you can do with it.
  2. And by you, I mean you! 👉 😱 👈

Resources

Google: korpus ortofon

This presentation: https://trnka.korpus.cz/~lukes/slides/slavicorp2018/ortofon

Our wiki: https://wiki.korpus.cz/doku.php/cnk:ortofon

KonText query interface: https://kontext.korpus.cz/first_form?corpname=ortofon_v1

All the data via LINDAT:

Corpora of spoken Czech at the CNC I

corpus size tagging time span
ORTOFON 1M 2012–2017
ORAL 5.4M 2002–2011
↳ ORAL2013 2.8M 2008–2011
↳ ORAL2008 1M 2002–2007
↳ ORAL2006 1M 2002–2006
BMK 490k 1994–1999
PMK 675k 1988–1996

Corpora of spoken Czech at the CNC II

corpus size tagging time span
DIALEKT 100k 1957–2015
LINDSEI_CZ 120k 2012–2015
SCHOLA2010 790k 2005–2008

About ORTOFON v1

Geographically

ORTOFON data provenance

Some figures

# of …
… tokens 1,236,508
… tokens without punctuation, hesitations and interjections 1,014,786
… different word forms 65,294
… conversations recorded 332
… unique speakers 624
→ length of recordings [hh:mm:ss.ms] 102:41:14.247

Per-document metadata

  • 12 pre-defined situation types
  • year, month and location of recording
  • relationship between speakers in recording
  • gender mix of speakers in recording
  • generation mix of speakers in recording
  • … and more.

Per-speaker metadata: balancing

On the basis of the following metadata:

  • gender: male × female
  • age: under 35 × over 35
  • education: tertiary × other
  • dialect region of childhood residence: 10 regions

Resulting number of categories: \(2 \times 2 \times 2 \times 10 = 80\)

Ideally: equal representation of these 80 categories, at least 5 speakers per category.

→ Target number of words per category: \(\frac{1\ 000\ 000}{80} = 12\ 500\)
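The balancing arithmetic above can be checked with a short sketch. The region labels are placeholders (the slides only state that there are 10 regions), and all variable names are illustrative.

```python
from itertools import product

genders = ["male", "female"]
age_groups = ["under 35", "over 35"]
education_levels = ["tertiary", "other"]
# Placeholder labels: the slides give the number of regions, not their names.
regions = [f"region_{i}" for i in range(1, 11)]

# Every combination of the four balancing criteria is one category.
categories = list(product(genders, age_groups, education_levels, regions))

target_corpus_size = 1_000_000
words_per_category = target_corpus_size / len(categories)

print(len(categories))      # 80
print(words_per_category)   # 12500.0
```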

Transcribed using ELAN

ORTOFON in ELAN

Accessible via KonText (but also LINDAT)

conversation view

Phonetic transcription

  • simplified phonetic transcription
    • use regular Czech alphabet graphemes as much as possible
    • no phonetic diacritics
  • stress group boundaries
  • aligned with basic transcript: tokens match \(1:1\)

Case studies

Assimilation of voicing in Czech

  • final devoicing:

    <hrad (nom.), hradu (gen.)> → [hrat, hradu]

  • regressive / anticipatory assimilation of voicing, even across word boundaries:

    <hrad, hrad byl> → [hrat, hrad bil]

  • triggered by voiced obstruents
  • in Moravia/Silesia (the eastern part of the country), also triggered by sonorants [r, l, m, n, j…]:

    <tak jako> → [tag jako]
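A minimal sketch of how candidates for assimilation to sonorants might be spotted in the simplified phonetic transcription. Because of final devoicing, a word-final obstruent is expected to be voiceless before a sonorant, so a voiced one is a candidate. The character sets and function names below are assumptions made for illustration, not the procedure used in the study, and the sketch ignores pauses and stress group boundaries.

```python
# Assumed character classes for the simplified transcription (illustrative only).
VOICED_OBSTRUENTS = set("bdďgzžvh")
SONORANTS = set("rlmnňj")


def voiced_before_sonorant(word, next_word):
    """True if `word` ends in a voiced obstruent and `next_word` starts with a sonorant."""
    if not word or not next_word:
        return False
    return word[-1] in VOICED_OBSTRUENTS and next_word[0] in SONORANTS


def find_candidates(phonetic_tokens):
    """Return adjacent token pairs that look like assimilation to a sonorant."""
    return [
        (word, next_word)
        for word, next_word in zip(phonetic_tokens, phonetic_tokens[1:])
        if voiced_before_sonorant(word, next_word)
    ]


print(find_candidates(["tag", "jako"]))   # [('tag', 'jako')]
print(find_candidates(["tak", "jako"]))   # []
print(find_candidates(["hrad", "bil"]))   # [] (trigger is an obstruent, not a sonorant)
```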

Traditional dialect regions

regions

Assimilation of voicing ~ Region of childhood residence

Which forms assimilate to sonorants?

“Bohemia”

WORD FREQ
tak 651
bych 348
239
těch 224
když 193
teď 186
vod 184
jak 169
vůbec 127
pak 122

“Moravia”

WORD FREQ
tak 1288
483
jak 345
když 267
teď 190
fakt 181
vůbec 162
víš 155
bych 153
pak 145

# of variants ~ Frequency (\(\rho = 0.76\))
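A sketch of how a rank correlation between word frequency and number of pronunciation variants could be computed from aligned (orthographic, phonetic) token pairs. Using Spearman's coefficient for \(\rho\) is an assumption, as are all function names; the toy pairs are invented for illustration and are not corpus data.

```python
from collections import defaultdict


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def frequency_and_variants(pairs):
    """Per word form: token frequency and number of distinct pronunciations."""
    variants = defaultdict(set)
    frequency = defaultdict(int)
    for orthographic, phonetic in pairs:
        variants[orthographic].add(phonetic)
        frequency[orthographic] += 1
    words = sorted(frequency)
    return words, [frequency[w] for w in words], [len(variants[w]) for w in words]


# Invented toy input, not taken from ORTOFON.
toy_pairs = [
    ("protože", "protože"), ("protože", "protoe"), ("protože", "proe"),
    ("tak", "tak"), ("tak", "tag"),
    ("hrad", "hrat"),
]
words, freqs, n_variants = frequency_and_variants(toy_pairs)
print(spearman(freqs, n_variants))
```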

In KonText: protože

protože

In KonText: samozřejmě

samozřejmě

In KonText: normálně

normálně

Identify competition with entropy

RANK WORD ENTROPY
1 ježišmarja 3.803729
2 samozřejmě 3.717063
3 protože 3.603883
4 sedmdesát 3.127680
5 takovýhle 3.110014
6 sedmnáct 3.096503
7 člověk 3.037660
8 ježíšmarjá 2.947005
9 šestnáct 2.927707
10 ježíš 2.883297
11 tohleto 2.880382
12 normálně 2.843373
13 povídám 2.782390
14 nějakého 2.752697
15 takového 2.682409
16 ježiš 2.680650
17 podívat 2.678791
18 tadyhle 2.676441
19 vůbec 2.671444
20 potřebovat 2.637769
21 čtyřicet 2.619200
22 myslíš 2.586492
23 přijít 2.574731
24 takovýho 2.573642
25 osmnáct 2.565948
26 ježíšmarja 2.523211
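A minimal sketch of the entropy measure: for one word form, take the distribution of its attested pronunciations and compute Shannon entropy. The base-2 logarithm and the function name are assumptions; the slides do not state which base was used. High entropy means several variants compete with similar frequencies.

```python
from collections import Counter
from math import log2


def variant_entropy(pronunciations):
    """Shannon entropy (in bits) of the pronunciation variants of one word form."""
    counts = Counter(pronunciations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# One dominant variant: no competition.
print(variant_entropy(["a", "a", "a", "a"]) == 0)   # True
# Four equally frequent variants: maximal competition for four outcomes.
print(variant_entropy(["a", "b", "c", "d"]))        # 2.0
```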

Formally reduced pronunciations

  • motivated by:
    • frequency, length (across languages)
    • lexical effects
  • edit distance between abc and zbc:
    • Levenshtein: 1
    • normalized Levenshtein: 0.33
  • in practice, deletion and substitution (~ formal simplification) are much more common than addition (epenthesis)

    → high normalized Levenshtein distance ~ high amount of simplification
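The abc / zbc example can be reproduced with a short sketch of Levenshtein distance. Normalizing by the length of the longer string is an assumption: it matches the 0.33 figure above, but the slides do not spell out the normalization.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer string (assumed)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("abc", "zbc"))                        # 1
print(round(normalized_levenshtein("abc", "zbc"), 2))   # 0.33
```

Applied to aligned pairs of orthographic and phonetic tokens, a higher value would indicate a pronunciation further removed from the written form.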

Normalized Lev. dist. ~ Frequency

Instead of a conclusion…

Google: korpus ortofon

This presentation: https://trnka.korpus.cz/~lukes/slides/slavicorp2018/ortofon

Our wiki: https://wiki.korpus.cz/doku.php/cnk:ortofon

KonText query interface: https://kontext.korpus.cz/first_form?corpname=ortofon_v1

All the data via LINDAT:

Thank you for your attention!

Acknowledgments

This research was supported by the Czech National Corpus project (LM2015044) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.

Slides: https://trnka.korpus.cz/~lukes/slides/slavicorp2018/ortofon