New tools for working with the ORAL series corpora of spoken Czech: AchSynku and MluvKonk

David Lukeš, Prague
October 21st, 2015

Introduction

Two simple web-based tools for working with the ORAL series corpora of informal spoken Czech.

AchSynku
- compensates for lack of lemmatization in the corpora
MluvKonk
- intuitive multi-layer visualization of spoken concordances

→ Supplement the features of the standard KonText interface

Motivation (I)

why still no lemmatization?
- greater variation in spoken language → the morphological dictionary has to be extended first
- unruly syntax (false starts, aposiopeses, apo koinou constructions) → harder to disambiguate
structure of spoken language
- informal interactions are multi-layered: many speakers taking turns, overlaps, back-channelling
- classic concordance format based on linear structure of written language is inappropriate

Motivation (II)

[Figure: a spoken corpus concordance inside KonText]

AchSynku

The name

A pun on the folk song:

Ach SYNku SYNku, doma-li jsi
...
ORAL jsem ORAL, ale málo

SYN is the CNC's flagship series of written corpora.

Overview

goal: maximize recall of user queries on ORAL series corpora by expanding a target form into its full paradigm, including regional and speaking style variants
existing experimental lemmatization of ORAL series corpora too unreliable to be included directly in official corpus release, but can form basis of external tool
priorities:
- recall over precision
- ease-of-use over configurability

(Dis)advantages

any word form occurring in the ORAL series corpora, as well as any corresponding lemma, can be used to seed the variant / paradigm search

several lemmas can sometimes be conflated in the results, e.g.:
- query moci
- matches both a form of the noun moc and the infinitive of the verb moci/moct
→ both paradigms are returned

Interface

Try searching e.g. for dělat or protože.

[Screenshot: AchSynku, blank interface]

[Screenshot: AchSynku, results for the query "dělat"]

Implementation (I)

SQLite database with a word2lemma table:

word	lemma
…	…
ale	ale
Ale	ale
ále	ale
…	…
Honzoj	Honza
…	…

pro: ubiquitous dependency
con: wasteful storage (Finite State Automata are more efficient at this, see e.g. MorphoDiTa POS-tagging framework)
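
A minimal sketch of such a table, assuming Python's standard sqlite3 module and an in-memory database. The rows are the ones visible in the excerpt above; the function name build_db and the index on lemma are illustrative assumptions, not details taken from AchSynku itself.

```python
import sqlite3

# Sample rows copied from the excerpt of the table shown on the slide.
ROWS = [
    ("ale", "ale"),
    ("Ale", "ale"),
    ("ále", "ale"),
    ("Honzoj", "Honza"),
]


def build_db(rows):
    """Create an in-memory database holding a word2lemma table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE word2lemma (word TEXT, lemma TEXT)")
    conn.executemany("INSERT INTO word2lemma VALUES (?, ?)", rows)
    # Assumption: an index on lemma, so that lookups by lemma avoid a full scan.
    conn.execute("CREATE INDEX idx_word2lemma_lemma ON word2lemma (lemma)")
    conn.commit()
    return conn


conn = build_db(ROWS)
forms = [
    word
    for (word,) in conn.execute(
        "SELECT word FROM word2lemma WHERE lemma = ?", ("ale",)
    )
]
print(forms)
```

Each word form is stored once per lemma it can belong to, which is what makes the storage wasteful compared with an automaton, but it keeps the lookup a plain SQL query.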

Implementation (II)

input string $ i $ is lowercased and matched against known lowercased lemmas and word forms
all word forms $ x $ are returned such that at least one of the following holds:
- $ lc(i) \in lc(lemma(x)) $ ($ i $ is one of the lemmas of $ x $)
- $ \left\vert{lc(lemma(i)) \cap lc(lemma(x))}\right\vert > 0 $ ($ i $ and $ x $ belong to a shared lemma)
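
The two conditions above can be sketched in plain Python with sets. This is only an illustration of the logic, not the tool's code: the names lc, build_lemma_index and expand are hypothetical, and the word/lemma pairs are a small hand-made sample built around the moci example.

```python
from collections import defaultdict

# Hand-made illustrative sample; not taken from the actual database.
PAIRS = [
    ("ale", "ale"),
    ("Ale", "ale"),
    ("ále", "ale"),
    ("moci", "moc"),   # form of the noun moc
    ("mocí", "moc"),
    ("moci", "moci"),  # infinitive of the verb moci/moct
    ("můžu", "moci"),
]


def lc(s):
    """Lowercase helper corresponding to lc() in the formulas."""
    return s.lower()


def build_lemma_index(pairs):
    """Map each lowercased word form to the set of its lowercased lemmas."""
    lemmas_of = defaultdict(set)
    for word, lemma in pairs:
        lemmas_of[lc(word)].add(lc(lemma))
    return lemmas_of


def expand(i, pairs):
    """Return all word forms x whose lemmas include i or overlap with lemma(i)."""
    lemmas_of = build_lemma_index(pairs)
    # Target lemmas: the input itself (condition 1) plus its own lemmas (condition 2).
    targets = {lc(i)} | lemmas_of.get(lc(i), set())
    return sorted({word for word, lemma in pairs if lc(lemma) in targets})


print(expand("moci", PAIRS))
```

Querying moci collects forms from both the noun and the verb paradigm, which is the conflation described earlier: recall is favoured over precision.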

Implementation (III)

In SQL terms (the placeholder $query_string stands for the input $ i $):

SELECT DISTINCT word
FROM word2lemma
WHERE lemma IN
    (SELECT '$query_string'
     UNION SELECT lemma
     FROM word2lemma
     WHERE word = '$query_string');
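
A hedged sketch of running the same query from Python's sqlite3 module, with bound parameters (?) in place of the interpolated $query_string, so that apostrophes or other special characters in user input cannot alter the statement. The function name variants and the sample rows are assumptions for illustration; the table here uses SQLite's default collation to keep the sketch minimal, so matching is case-sensitive.

```python
import sqlite3

# Same query as above, with '?' placeholders instead of string interpolation.
QUERY = """
SELECT DISTINCT word
FROM word2lemma
WHERE lemma IN
    (SELECT ?
     UNION SELECT lemma
     FROM word2lemma
     WHERE word = ?)
"""


def variants(conn, query_string):
    """Return all word forms sharing a lemma with query_string."""
    rows = conn.execute(QUERY, (query_string, query_string))
    return sorted(word for (word,) in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word2lemma (word TEXT, lemma TEXT)")
conn.executemany(
    "INSERT INTO word2lemma VALUES (?, ?)",
    [("ale", "ale"), ("Ale", "ale"), ("ále", "ale"), ("Honzoj", "Honza")],
)

print(variants(conn, "Honza"))   # seeded with a lemma
print(variants(conn, "Honzoj"))  # seeded with a word form
```

Both a lemma and one of its word forms lead to the same result set, which is the behaviour described for seeding the search.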

Technical details

word and lemma columns declared with COLLATE NOCASE → case-insensitive comparisons
but SQLite engine able to casefold only ASCII range
- $ lc("ALE") == "ale" $
- $ lc("ÁČKO") == "ÁČko" $
so all strings represented using Unicode Normalization Form Canonical Decomposition (NFD), which decouples base characters from combining diacritics
- $ lc("A´CˇKO") == "a´cˇko" $
tip of the hat to anonymous reviewer who suggested this elegant solution :)
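
The effect can be checked with a small sketch, assuming Python's sqlite3 and unicodedata modules. The helper count_matches is hypothetical: it stores one word under a given normalization form in a NOCASE column and counts how many rows an uppercase query matches.

```python
import sqlite3
import unicodedata


def count_matches(form, stored, query):
    """Store `stored` normalized to `form`, then count rows matching `query`."""

    def norm(s):
        return unicodedata.normalize(form, s)

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word2lemma "
        "(word TEXT COLLATE NOCASE, lemma TEXT COLLATE NOCASE)"
    )
    conn.execute(
        "INSERT INTO word2lemma VALUES (?, ?)", (norm(stored), norm(stored))
    )
    (n,) = conn.execute(
        "SELECT count(*) FROM word2lemma WHERE word = ?", (norm(query),)
    ).fetchone()
    return n


# Precomposed characters (NFC): built-in NOCASE cannot fold Á or Č.
print(count_matches("NFC", "áčko", "ÁČKO"))
# Decomposed characters (NFD): only the ASCII base letters need folding.
print(count_matches("NFD", "áčko", "ÁČKO"))
```

With NFD, the combining marks are identical in both strings and the base letters fall into the ASCII range, so SQLite's built-in NOCASE collation is sufficient. Results should be converted back to NFC before display if precomposed characters are expected downstream.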

Discussion & future

why not just publish experimental lemmatization?
- keeping it in an external tool serves as a warning to users: be careful when working with this information
more refined searches?
- e.g. allow POS specifications
- would distinguish between moci as a noun vs. verb
- but danger of cluttering UI
definitive solution: a merged ORAL series corpus is planned for publication, hopefully with tagging and lemmatization natively available

MluvKonk

Motivation

classic concordance format: one line per hit
great for written texts
- direct comparisons of key words in their contexts (vertical line-up)
not so great for speech
- many speakers, turns, overlaps
- hard to spot recurrent structural dialogue patterns intuitively
→ How to represent these?

The KonText solution: structural tags

The MluvKonk solution: multi-tier display

one tier per speaker
inspired by well-established tools for speech transcription / annotation:
- ELAN (Annotation Mode)
- EXMARaLDA (Partitur notation)
- Praat (TextGrid tiers)
input: a concordance exported as .csv from KonText
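
A rough sketch of the one-tier-per-speaker idea. Everything here is an assumption made for illustration: the input is a hypothetical list of segments, each a dict mapping a speaker label to what that speaker says in the segment (two keys in one segment stand for overlapping speech), and the function name to_tiers is invented. It does not reflect the actual format of a KonText .csv export or MluvKonk's own code.

```python
def to_tiers(segments):
    """Lay out segments as one text line per speaker, aligned column by column."""
    # Collect speakers in order of first appearance.
    speakers = []
    for segment in segments:
        for speaker in segment:
            if speaker not in speakers:
                speakers.append(speaker)

    tiers = {speaker: [] for speaker in speakers}
    for segment in segments:
        # Every tier gets a cell as wide as the longest utterance in the segment.
        width = max((len(text) for text in segment.values()), default=0)
        for speaker in speakers:
            tiers[speaker].append(segment.get(speaker, "").ljust(width))

    return {s: " ".join(cells).rstrip() for s, cells in tiers.items()}


# Invented example: speaker B back-channels while A keeps talking.
segments = [
    {"A": "so yeah"},
    {"A": "and then", "B": "mhm"},
    {"B": "right"},
]
tiers = to_tiers(segments)
for speaker, line in tiers.items():
    print(f"{speaker}: {line}")
```

Overlapping speech ends up in the same column on different tiers, which is the visual cue that a linear, one-line-per-hit concordance cannot provide.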

Interface

[Screenshot: MluvKonk, widgets]

[Screenshot: MluvKonk, concordance]

[Screenshot: MluvKonk, statistics]

Implementation

no public API to KonText → user required to export concordance to .csv manually and load it into MluvKonk
- 5 MB size limit imposed by server – acceptable in practice
a single-page application (asynchronous communication with backend)
both backend and frontend implemented in R using the Shiny web application framework
responsive: backend optimized to render only the part of the concordance currently requested by user
Statistics tab: only a modest showcase for Shiny's powerful capabilities in terms of data-driven graphics
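
The idea of rendering only the requested part of the concordance can be sketched as simple paging. The real backend is written in R with Shiny; the Python function below, its name visible_page and the default page size are assumptions used purely to illustrate the principle.

```python
def visible_page(rows, page, page_size=20):
    """Return only the slice of concordance rows needed for the given page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    return rows[start:start + page_size]


# Invented stand-in for a loaded concordance of 50 hits.
concordance = [f"hit {n}" for n in range(1, 51)]
print(visible_page(concordance, page=3))
```

Only the returned slice would then be converted to the multi-tier layout and sent to the browser, so the cost of each request stays independent of the overall size of the concordance.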

Discussion & future

re-work if API to KonText becomes available
- greater ease of use
- no need for manual export of concordances
- would require some integration on the part of KonText
definitive solution: integrate a multi-layered view into KonText itself
or use a third-party concordancer which provides these capabilities in addition to KonText, e.g. ANNIS

Conclusion

In summary...

AchSynku
- searching for regional / speaking style variants and/or entire paradigms in the ORAL series corpora, which do not feature lemmatization
MluvKonk
- visualizing ORAL series corpora concordances in multiple layers, with one tier per speaker
Both are open-source under the GNU GPL v3.

Feedback

AchSynku and MluvKonk are stopgap solutions, but test-driving and feedback are welcome!

AchSynku
- live at http://trnka.korpus.cz/~lukes/achsynku
- or via our wiki
- feedback via https://github.com/dlukes/achsynku
MluvKonk
- live at http://trost.korpus.cz/shiny/lukes/mluvkonk
- feedback via https://github.com/dlukes/mluvkonk

Thank you for your attention!

This paper resulted from the implementation of the Czech National Corpus project (LM2011023) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.

Slides available at https://trnka.korpus.cz/~lukes/slovko.