GlobalPhone (since 1995)

The GlobalPhone project addresses the development and evaluation of large speech processing systems in the most widespread languages of the world. The GlobalPhone corpus is designed to be uniform across languages with respect to the amount of text and audio data per language, the audio data quality (microphone, noise, channel), the collection scenario (task, setup, speaking style, etc.), and the transcription conventions. It therefore supplies an excellent basis for research in the areas of (1) multilingual speech recognition, (2) rapid deployment of speech processing systems to new languages, (3) language and speaker identification tasks, (4) monolingual speech recognition in a large variety of languages, as well as (5) comparisons across major languages based on text and speech data.

To date, the GlobalPhone corpus covers 19 languages: Arabic, Bulgarian, Chinese (Mandarin and Shanghainese), Croatian, Czech, French, German, Hausa, Japanese, Korean, Portuguese, Polish, Russian, Spanish, Swedish, Tamil, Thai, Turkish, and Vietnamese. In each language, about 100 adult speakers were recorded with close-speaking microphones while reading about 100 sentences each. The entire corpus contains over 300 hours of speech spoken by more than 1,500 native adult speakers.

To date, GlobalPhone covers the following 19 languages:

  • Arabic (Tunisia and Palestine)
  • Bulgarian
  • Chinese (Mandarin and Shanghai dialect)
  • Croatian
  • Czech
  • French
  • German
  • Hausa
  • Japanese
  • Korean
  • Portuguese (Brazil)
  • Polish
  • Russian
  • Spanish (Costa Rica)
  • Swedish
  • Tamil
  • Thai
  • Turkish
  • Vietnamese

More information can be found on our publications page and on the GlobalPhone webpage!