New tools for working with the ORAL series corpora of spoken Czech: AchSynku and MluvKonk

David Lukeš, Prague
October 21st, 2015

Introduction

Two simple web-based tools for working with the ORAL series corpora of informal spoken Czech.

  1. AchSynku
    • compensates for lack of lemmatization in the corpora
  2. MluvKonk
    • intuitive multi-layer visualization of spoken concordances

→ Supplement the features of the standard KonText interface

Motivation (I)

  1. why still no lemmatization?
    • greater variation in spoken language → the morphological dictionary has to be extended first
    • unruly syntax (false starts, aposiopeses, apo koinou constructions) → harder to disambiguate
  2. structure of spoken language
    • informal interactions are multi-layered: many speakers taking turns, overlaps, back-channelling
    • classic concordance format based on linear structure of written language is inappropriate

Motivation (II)

A spoken corpus concordance inside KonText:

[Figure: ORAL concordance in KonText]

AchSynku

The name

A pun on the folk song:

Ach SYNku SYNku, doma-li jsi ("Oh son, son, are you at home")
...
ORAL jsem ORAL, ale málo ("I ploughed, I ploughed, but only a little")

SYN is the flagship series of written corpora of the Czech National Corpus (CNC).

Overview

  • goal: maximize recall of user queries on ORAL series corpora by expanding a target form into its full paradigm, including regional and speaking style variants
  • existing experimental lemmatization of ORAL series corpora too unreliable to be included directly in official corpus release, but can form basis of external tool
  • priorities:
    • recall over precision
    • ease-of-use over configurability

(Dis)advantages

  • any word form occurring in the ORAL series corpora, as well as any corresponding lemma, can be used to seed the variant / paradigm search
  • several lemmas can sometimes be conflated in the results, e.g.:

    • query moci
    • matches both several forms of the noun moc and the infinitive of the verb moci/moct

    → both paradigms are returned

Interface

Try searching e.g. for dělat or protože.

[Figure: AchSynku, blank]

[Figure: AchSynku, query "dělat"]

Implementation (I)

SQLite database with a word2lemma table:

word    lemma
ale     ale
Ale     ale
ále     ale
Honzoj  Honza
 …
  • pro: ubiquitous dependency
  • con: wasteful storage (Finite State Automata are more efficient at this, see e.g. MorphoDiTa POS-tagging framework)

Implementation (II)

  • input string \( i \) is lowercased and matched against known lowercased lemmas and word forms
  • all word forms \( x \) are returned for which at least one of the following holds:
    • \( lc(i) \in lc(lemma(x)) \) (\( i \) is one of the lemmas of \( x \))
    • \( \left\vert lc(lemma(i)) \cap lc(lemma(x)) \right\vert > 0 \) (\( i \) and \( x \) share at least one lemma)
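
The matching rule above can be sketched in plain Python. This is only an illustration under stated assumptions, not the tool's actual code: the PAIRS data and the function name expand are hypothetical, and Python's str.lower() stands in for \( lc \).

```python
from collections import defaultdict

# Hypothetical (word, lemma) pairs for illustration only; the real table is
# derived from the experimental lemmatization of the ORAL series corpora.
PAIRS = [
    ("ale", "ale"),
    ("Ale", "ale"),
    ("ále", "ale"),
    ("Honzoj", "Honza"),
    ("Honza", "Honza"),
    ("moci", "moc"),    # form of the noun "moc"
    ("moci", "moci"),   # infinitive of the verb "moci"
    ("můžu", "moci"),
]

def expand(query, pairs=PAIRS):
    """Return all word forms related to the query via a shared lemma."""
    i = query.lower()
    lemmas_of = defaultdict(set)  # lc(word) -> set of lc(lemma)
    for word, lemma in pairs:
        lemmas_of[word.lower()].add(lemma.lower())
    # Target lemmas: i itself (if i is a lemma) plus all lemmas of i.
    targets = {i} | lemmas_of.get(i, set())
    return sorted({word for word, lemma in pairs if lemma.lower() in targets})

print(expand("moci"))
print(expand("honza"))
```

With this toy data, a query for moci reaches both the noun and the verb paradigm, which is the conflation described earlier.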

Implementation (III)

In SQL terms ($query_string == \( i \)):

SELECT DISTINCT word
FROM word2lemma
WHERE lemma IN
    (SELECT '$query_string'
     UNION SELECT lemma
     FROM word2lemma
     WHERE word = '$query_string');
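
A minimal sketch of running this query from Python's standard sqlite3 module, assuming an in-memory database with a few hypothetical rows. The function name variants is an assumption, and the query string is passed as a bound parameter (:q) instead of being interpolated, which avoids quoting problems with user input. Only ASCII rows are used here; handling of diacritics is a separate matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word2lemma ("
    " word TEXT COLLATE NOCASE,"
    " lemma TEXT COLLATE NOCASE)"
)
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO word2lemma VALUES (?, ?)",
    [("ale", "ale"), ("Ale", "ale"), ("Honzoj", "Honza"), ("Honza", "Honza")],
)

QUERY = """
SELECT DISTINCT word
FROM word2lemma
WHERE lemma IN
    (SELECT :q
     UNION SELECT lemma
     FROM word2lemma
     WHERE word = :q)
"""

def variants(conn, query_string):
    """Return the word forms sharing a lemma with query_string."""
    rows = conn.execute(QUERY, {"q": query_string}).fetchall()
    return [row[0] for row in rows]

print(variants(conn, "honzoj"))
```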

Technical details

  • word and lemma columns initialized with COLLATE NOCASE → case insensitive
  • but SQLite engine able to casefold only ASCII range
    • \( lc("ALE") == "ale" \)
    • \( lc("ÁČKO") == "ÁČko" \)
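  • so all strings represented using Unicode Normalization Form Canonical Decomposition (NFD), which decouples base characters from combining diacritics
    • \( lc("A\u0301C\u030CKO") == "a\u0301c\u030Cko" \) (the combining acute U+0301 and caron U+030C follow their base letters and are left untouched)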
  • tip of the hat to anonymous reviewer who suggested this elegant solution :)
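
The effect of the normalization can be checked with a short Python sketch using the standard sqlite3 and unicodedata modules. The helper name nocase_equal is an assumption made for this illustration.

```python
import sqlite3
import unicodedata

def nocase_equal(conn, a, b):
    """Compare two strings using SQLite's built-in NOCASE collation."""
    row = conn.execute("SELECT ? = ? COLLATE NOCASE", (a, b)).fetchone()
    return bool(row[0])

conn = sqlite3.connect(":memory:")

# Precomposed (NFC) vs. decomposed (NFD) representations of the same words.
upper_nfc = unicodedata.normalize("NFC", "ÁČKO")
lower_nfc = unicodedata.normalize("NFC", "áčko")
upper_nfd = unicodedata.normalize("NFD", "ÁČKO")
lower_nfd = unicodedata.normalize("NFD", "áčko")

print(nocase_equal(conn, "ALE", "ale"))          # ASCII only
print(nocase_equal(conn, upper_nfc, lower_nfc))  # precomposed letters
print(nocase_equal(conn, upper_nfd, lower_nfd))  # base letters + combining marks
```

The first and third comparisons succeed and the second fails: in NFD the base letters fall within the ASCII range that NOCASE can fold, while the combining marks are identical in both strings.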

Discussion & future

  • why not just publish experimental lemmatization?
    • keeping it in a separate tool acts as a warning to users: be careful when working with this information
  • more refined searches?
    • e.g. allow POS specifications
    • would distinguish between moci as a noun vs. verb
    • but danger of cluttering UI
  • definitive solution: a merged ORAL series corpus is planned for publication, hopefully with tagging and lemmatization natively available

MluvKonk

Motivation

  • classic concordance format: one line per hit
  • great for written texts
    • direct comparisons of key words in their contexts (vertical line-up)
  • not so great for speech

    • many speakers, turns, overlaps
    • hard to spot recurrent structural dialogue patterns intuitively

    → How to represent these?

The KonText solution: structural tags

[Figure: ORAL concordance in KonText]

The MluvKonk solution: multi-tier display

  • one tier per speaker
  • inspired by well-established tools for speech transcription / annotation
  • input: a concordance exported as .csv from KonText
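
The idea of one tier per speaker can be illustrated with a small Python sketch. It is not MluvKonk's implementation (which is written in R), it does not model overlaps, and the input format, a list of (speaker, text) turns in temporal order, is a hypothetical simplification and not the layout of a KonText .csv export.

```python
def render_tiers(turns):
    """Lay out (speaker, text) turns on one tier per speaker, left to right."""
    speakers = []
    for speaker, _ in turns:
        if speaker not in speakers:
            speakers.append(speaker)
    tiers = {speaker: "" for speaker in speakers}
    for speaker, text in turns:
        width = len(text) + 1  # one space of padding between turns
        for name in speakers:
            # Only the current speaker's tier gets text; others get blanks.
            cell = text if name == speaker else ""
            tiers[name] += cell.ljust(width)
    return [f"{name}: {tiers[name].rstrip()}" for name in speakers]

# Hypothetical turns for illustration only.
turns = [("A", "no a co ty"), ("B", "hmm"), ("A", "tak jo")]
for line in render_tiers(turns):
    print(line)
```

Each turn occupies its own horizontal span, so reading across the tiers shows who spoke when.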

Interface

[Figure: MluvKonk, widgets]

[Figure: MluvKonk, concordance]

[Figure: MluvKonk, statistics]

Implementation

  • no public API to KonText → user required to export concordance to .csv manually and load it into MluvKonk
    • 5 MB size limit imposed by server – acceptable in practice
  • a single-page application (asynchronous communication with backend)
  • both backend and frontend implemented in R using the Shiny web application framework
  • responsive: backend optimized to render only the part of the concordance currently requested by user
  • Statistics tab: only a modest showcase for Shiny's powerful capabilities in terms of data-driven graphics

Discussion & future

  • re-work if API to KonText becomes available
    • greater ease of use
    • no need for manual export of concordances
    • would require some integration on the part of KonText
  • definitive solution: integrate a multi-layered view into KonText itself
  • or use a third-party concordancer which provides these capabilities in addition to KonText, e.g. ANNIS

Conclusion

In summary...

  1. AchSynku
    • searching for regional / speaking style variants and/or entire paradigms in the ORAL series corpora, which do not feature lemmatization
  2. MluvKonk
    • visualizing ORAL series corpora concordances in multiple layers, with one tier per speaker

Both are open-source under the GNU GPL v3.

Feedback

AchSynku and MluvKonk are stopgap solutions, but test-driving and feedback are welcome!

Thank you for your attention!

This paper resulted from the implementation of the Czech National Corpus project (LM2011023) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.

Slides available at https://trnka.korpus.cz/~lukes/slovko.