David Lukeš, Prague
October 21st, 2015
Two simple web-based tools for working with the ORAL series corpora of informal spoken Czech.
→ Supplement the features of the standard KonText interface
A spoken corpus concordance inside KonText:
A pun on the folk song:
Ach SYNku SYNku, doma-li jsi ("Oh son, son, are you at home")
...
ORAL jsem ORAL, ale málo ("I ploughed, I ploughed, but only a little")
SYN is the CNC's flagship series of written corpora.
Several lemmas can sometimes be conflated in the results, e.g. moci is both a form of the noun moc and the infinitive of the verb moci/moct
→ both paradigms are returned
Try searching e.g. for dělat or protože.
SQLite database with a word2lemma table:
word | lemma |
---|---|
… | … |
ale | ale |
Ale | ale |
ále | ale |
… | … |
Honzoj | Honza |
… | … |
In SQL terms ($query_string == \( i \)):
SELECT DISTINCT word
FROM word2lemma
WHERE lemma IN
(SELECT '$query_string'
UNION SELECT lemma
FROM word2lemma
WHERE word = '$query_string');
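The lookup above can be tried out with a minimal sketch like the following, which uses Python's standard sqlite3 module and an in-memory database. The sample rows, the `build_db` and `expand` helper names, and the use of a bound parameter in place of the interpolated `$query_string` are assumptions made for illustration; this is not the tool's actual code.

```python
import sqlite3

# Hypothetical sample rows; the real word2lemma table is far larger.
ROWS = [
    ("ale", "ale"),
    ("ále", "ale"),
    ("Honzoj", "Honza"),
    ("Honza", "Honza"),
    ("moc", "moc"),
    ("moci", "moc"),    # form of the noun moc
    ("mocí", "moc"),
    ("moci", "moci"),   # infinitive of the verb moci/moct
    ("moct", "moci"),
    ("můžu", "moci"),
]

# Same query as in the text, with a bound parameter (:q) instead of
# string interpolation, so the search term cannot alter the SQL.
QUERY = """
SELECT DISTINCT word
FROM word2lemma
WHERE lemma IN
  (SELECT :q
   UNION SELECT lemma
   FROM word2lemma
   WHERE word = :q);
"""


def build_db(rows):
    """Create an in-memory word2lemma table with NOCASE columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word2lemma ("
        "word TEXT COLLATE NOCASE, lemma TEXT COLLATE NOCASE)"
    )
    conn.executemany("INSERT INTO word2lemma VALUES (?, ?)", rows)
    return conn


def expand(conn, query_string):
    """Return all word forms sharing a lemma with query_string."""
    cur = conn.execute(QUERY, {"q": query_string})
    return {row[0] for row in cur}


if __name__ == "__main__":
    conn = build_db(ROWS)
    for term in ("HONZOJ", "moci", "ÁLE"):
        print(term, "->", sorted(expand(conn, term)))
```

With these sample rows, searching for moci returns the forms of both the noun and the verb, which is the lemma conflation described earlier. One caveat worth checking against real data: SQLite's built-in NOCASE collation folds only ASCII letters, so in this sketch an upper-case query such as ÁLE does not match ále.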
word and lemma columns initialized with COLLATE NOCASE
→ case insensitive
moci as a noun vs. verb
not so great for speech
→ How to represent these?
.csv manually and load it into MluvKonk
MluvKonk
Both are open-source under the GNU GPL v3.
AchSynku and MluvKonk are stopgap solutions, but test-driving and feedback are welcome!
This paper resulted from the implementation of the Czech National Corpus project (LM2011023) funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures.
Slides available at https://trnka.korpus.cz/~lukes/slovko.