Phonetic Corpus of Estonian Spontaneous Speech
The aim of the corpus is to compile a large collection of high-quality recordings of spontaneous Estonian speech and to segment them phonetically at different levels. The project started in autumn 2006 and has been funded by the National Programme for Estonian Language Technology.
Structure of the corpus
- The corpus contains 60 hours of high-quality recordings of spontaneous (not read) speech. Speakers come from different age groups and have different dialectal and social backgrounds. They are invited to participate in person. To keep the situation spontaneous, dialogues are recorded between speakers who already know each other. Each recording session is approximately 30 minutes long.
- Every participant fills in a questionnaire about his or her background. To preserve the anonymity of the participants, speakers are identified in the corpus by codes. A speaker who participates in several recordings keeps the same code name.
- Both monologues and dialogues are recorded. Most of the recordings are made in a recording studio, some during fieldwork. The signal of each speaker is recorded on a separate channel. For the studio recordings large-diaphragm microphones are used; the distance between the speakers is about 1.5-3 meters to minimize the effect of overlaps. For the fieldwork recordings headset microphones are used. Recordings are saved in uncompressed PCM WAV format. Background information about the recordings is collected in a text file.
- Segmentation and annotation files are saved as Praat TextGrid files and have the same filenames as the recordings they annotate.
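Because each TextGrid file shares its filename with the recording it annotates, recordings and annotations can be paired by name. The following is a minimal Python sketch, not part of the corpus tooling. It assumes that the files sit in a single directory and use the extensions .wav and .TextGrid; the directory layout, the extensions, and the function name pair_recordings are assumptions made for illustration.

```python
from pathlib import Path


def pair_recordings(corpus_dir):
    """Map each recording to its TextGrid (or None) by shared base filename.

    Assumes a flat directory with .wav and .TextGrid files (hypothetical layout).
    """
    corpus_dir = Path(corpus_dir)
    # Index annotation files by base name, e.g. "rec01" for "rec01.TextGrid".
    grids = {p.stem: p for p in corpus_dir.glob("*.TextGrid")}
    pairs = {}
    for wav in sorted(corpus_dir.glob("*.wav")):
        # A recording without an annotation file is mapped to None.
        pairs[wav] = grids.get(wav.stem)
    return pairs
```

Recordings that map to None in the result have not been annotated yet, or their annotation file is stored under a different name.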
Segmentation and annotation
Segmentation and annotation are done with the Praat program (www.praat.org). Recordings are segmented manually at different levels (an automatic segmentation program has also been developed and tested).
The following tiers are used:
- phonetic and linguistic tiers: words (in orthographic spelling), speech sounds (SAMPA adjusted for Estonian is used for transcription), sound structures (CV-structures), syllables (short – long, open – closed), feet, utterances;
- dialogue units: turns and pauses;
- changes in voice quality (e.g. creaky voice, breathy voice, whisper);
- paralinguistic phenomena (e.g. expiration and inspiration (also speaking during inspiration), sighing, yawning, sneezing, coughing etc.);
- emotional states (e.g. laughter, weeping, whimpering);
- other tiers (e.g. smacking of the lips or tongue).
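To check which tiers a given annotation file contains, the tier names can be read directly from the TextGrid text. The sketch below is a minimal illustration, assuming the file is in Praat's long (verbose) text format. The tier names "words" and "events" in the sample are hypothetical and are not the names used in the corpus, and the function name list_tiers is likewise invented for this example.

```python
import re

# Matches the class and name lines of each tier in a long-format TextGrid.
TIER_PATTERN = re.compile(r'class = "(\w+)"\s+name = "([^"]*)"')


def list_tiers(textgrid_text):
    """Return (tier_name, tier_class) pairs in the order they appear."""
    return [(name, cls) for cls, name in TIER_PATTERN.findall(textgrid_text)]


# Hypothetical sample with one interval tier and one (empty) point tier.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "tere"
    item [2]:
        class = "TextTier"
        name = "events"
        xmin = 0
        xmax = 1.5
        points: size = 0
'''

tiers = list_tiers(SAMPLE)
```

When reading real files from disk, the text encoding has to be chosen to match how the file was saved, since Praat can write text files in more than one Unicode encoding.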
Using the corpus
Currently, the web-based search engine lets you search for the orthographic form or the phonetic transcription of a word-level segment.
If the web-based search options do not fit your needs or you have other questions related to the phonetic corpus of Estonian spontaneous speech, please write to the corpus administrator Pärtel Lippus: partel.lippus [ät] ut.ee.
Institute of Estonian and General Linguistics
University of Tartu
Word frequencies in the phonetic corpus
The frequency list of the 1000 most frequent words in the Phonetic Corpus of Estonian Spontaneous Speech was created on December 7, 2015, based on a total of 470 033 words (61 hours 44 minutes of speech) in the corpus. The words were lemmatized using Filosoft's morphological analyzer (see the list of Estmorf's word classes).
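A frequency list of this kind can be sketched in a few lines of Python once the lemmas are available. This is a minimal illustration and not the procedure actually used for the corpus: the function name frequency_list and the toy input are assumptions, and in practice the lemmas would come from the output of a morphological analyzer.

```python
from collections import Counter


def frequency_list(lemmas, top_n=1000):
    """Return (lemma, count, relative_frequency) for the top_n lemmas."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    # most_common() sorts by descending count; an empty input yields [].
    return [(lemma, n, n / total) for lemma, n in counts.most_common(top_n)]


# Toy input for illustration only.
lemmas = ["olema", "ja", "olema", "see", "olema", "ja"]
top = frequency_list(lemmas, top_n=2)
```

The relative frequency is the count divided by the total number of tokens, which makes lists from corpora of different sizes comparable.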
Word frequencies in written Estonian can be found here.