Screen readers read the text elements displayed on screen and convey the information either as synthetic speech or as signals sent to a connected braille display. The display then renders the corresponding text as tactile braille.
Synthetic speech output relies on synthesizers such as Elo or eSpeak to produce the sounds. These applications use integrated dictionaries that contain the pronunciation of most words. However, it is not easy to produce natural-sounding speech this way, because words that are spelled the same are often pronounced differently depending on their meaning or syntactic position (e.g. read, wind, tear, dove). If a synthesizer does not take capitalization and word meaning into account, the word 'Polish' (relating to Poland) could end up being pronounced like 'polish' (to make something smooth or shiny by rubbing it).
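The 'Polish' versus 'polish' case can be illustrated with a minimal Python sketch of a case-sensitive dictionary lookup. The lexicon entries, the informal phonetic spellings, and the function name `pronounce` are hypothetical and chosen only for illustration; real synthesizers use far larger dictionaries and additional context analysis.

```python
# Hypothetical mini-lexicon; the phonetic spellings are informal and illustrative only.
LEXICON = {
    "Polish": "POH-lish",  # relating to Poland
    "polish": "POL-ish",   # to make something shiny
}

def pronounce(word: str, case_sensitive: bool = True) -> str:
    """Return a pronunciation hint for a word.

    With case_sensitive=True the exact spelling is tried first, then the
    lower-case form. Unknown words are returned unchanged.
    """
    key = word if case_sensitive else word.lower()
    return LEXICON.get(key, LEXICON.get(word.lower(), word))

print(pronounce("Polish"))                        # exact-case entry is found
print(pronounce("Polish", case_sensitive=False))  # collapses to the lower-case entry
```

A lookup that ignores case returns the same pronunciation for both spellings, which is exactly the error described above. Case alone is not sufficient either: a sentence such as "Polish the table" begins with a capital letter for a different reason, so a real system would also need to consider the surrounding words.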
Intonation poses further problems, particularly in interrogative sentences. The way the voice rises and falls often influences the meaning. 'OK' and 'OK?' have different meanings, and the difference only becomes audible through intonation. For the speech synthesizer to render the second utterance correctly as a question, the voice must rise at the end of the sentence; otherwise, the user will mistakenly interpret the question as a statement.
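As a rough sketch of the 'OK' versus 'OK?' distinction, the following Python function picks a pitch contour from the final punctuation mark. The function name `pitch_contour` and the labels "rising" and "falling" are assumptions made for this example, and the rule is deliberately simplified: in natural English, not every question ends with a rising voice.

```python
def pitch_contour(sentence: str) -> str:
    """Choose a sentence-final pitch contour from the closing punctuation.

    Simplified assumption: a question mark means the voice rises,
    anything else means it falls.
    """
    # Ignore trailing whitespace and closing quotes or brackets.
    stripped = sentence.rstrip().rstrip("\"')”’")
    return "rising" if stripped.endswith("?") else "falling"

for utterance in ("OK", "OK?"):
    print(utterance, "->", pitch_contour(utterance))
```

If the synthesizer discards punctuation before this step, both utterances receive the same contour and the question is heard as a statement.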
In speech synthesis, the following quality features are of particular importance:
- Word stress: The synthesizer should not only pronounce each individual word correctly, but should also be able to emphasize certain words.
- Syllable transitions: If the synthesizer composes the language from syllabic blocks, the transitions must be fluid to produce understandable words.
- Intonation: If the rise and fall of the voice is relevant to the meaning of the sentence (as marked in the text, for example by punctuation), the synthesizer must be able to reproduce it.
- Speech rhythm: The synthesizer should imitate the natural rhythm of speech so that the output sounds natural to the user.
- Speech tempo: The speed of reading aloud is also important for the user. Ideally, it can be set by the user.
- Breaks: Formatting elements such as paragraphs and line breaks should be marked with pauses to make it easier to recognize where a passage ends and the next one begins.
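Two of the features above, speech tempo and breaks, can be expressed in the W3C Speech Synthesis Markup Language (SSML), which provides a `prosody` element with a `rate` attribute and a `break` element with a `time` attribute. The following Python sketch converts plain text into such markup. The function name `to_ssml`, its parameters, and the pause lengths are assumptions for illustration, and whether a particular screen reader or synthesizer accepts SSML is not claimed here.

```python
from xml.sax.saxutils import escape

def to_ssml(text: str, rate: str = "medium",
            paragraph_pause_ms: int = 600, line_pause_ms: int = 250) -> str:
    """Wrap plain text in SSML with pauses at line and paragraph breaks.

    Paragraphs are separated by blank lines. The pause lengths are
    illustrative defaults, not recommended values.
    """
    line_break = f'<break time="{line_pause_ms}ms"/>'
    para_break = f'<break time="{paragraph_pause_ms}ms"/>'
    rendered = []
    for paragraph in text.split("\n\n"):
        # Escape &, < and > so the text cannot break the markup.
        lines = [escape(line.strip())
                 for line in paragraph.splitlines() if line.strip()]
        if lines:
            rendered.append("<p>" + line_break.join(lines) + "</p>")
    body = para_break.join(rendered)
    # A complete SSML document would also declare version and namespace
    # attributes on <speak>; they are omitted to keep the sketch short.
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'

print(to_ssml("Hello\nworld\n\nBye & thanks", rate="slow"))
```

Exposing `rate` as a parameter reflects the point that the reading speed should ideally be set by the user rather than fixed by the application.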
Computational linguistics research has made considerable progress in recent years. Google’s Tacotron 2 system already comes very close to human speech quality. The driving force behind the latest developments is the ability of modern synthesizers to learn on their own. Roughly speaking, the program learns the language like an infant and then 'builds' speech from recordings of real voices. It is particularly striking that Tacotron 2 is relatively robust to typographical errors and handles punctuation and sentence stress well (e.g. words written in capital letters). However, synthesized speech still lacks emotion, and foreign words can also cause difficulties for Tacotron 2. It remains to be seen when this synthesizer and similarly capable competitors will be made available to a wider public.