TTS, or Text-To-Speech, is the creation of audible speech from computer-readable text. TTS is separate from speech recognition. You can think of TTS as “talking” and speech recognition as “listening”. There is some shared technology, but neither is just the reverse of the other. The talking/listening analogy is limited too: neither technology really involves much language understanding.
People new to the idea of TTS often underestimate the difficulty of the task. After all, humans typically learn to speak in early childhood. They talk, listen, understand, and even translate without much apparent effort. Humans do all this work without even being aware of it in most cases, but that doesn’t make it easy. If programmers could create software that really understood human language, we could avoid most of the guesswork in TTS, but that hasn’t happened yet. Until then, TTS is more like learning to read a foreign language aloud without ever understanding the words. With a good dictionary, grammar rules, etc. you can get better and better, but you will still occasionally make mistakes that are obvious to native speakers.
TTS is often described as two conceptual stages. In the first stage, the system decides how the text should be spoken: how each word should be pronounced, what length and pitch each phoneme should have, etc.
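To make the first stage more concrete, here is a minimal Python sketch of what "deciding how the text should be spoken" might look like. Everything in it is an assumption made for illustration: the tiny lexicon, the ARPAbet-style phoneme symbols, the names `pronounce` and `plan_utterance`, and the duration and pitch numbers are hypothetical and are not taken from any real TTS system. A real front end would use a far larger dictionary, trained letter-to-sound rules, and a proper prosody model.

```python
from dataclasses import dataclass

# Toy pronunciation dictionary (hypothetical entries, ARPAbet-style symbols).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Phonemes treated as vowels for the crude duration rule below (assumption).
VOWELS = {"AH", "OW", "ER"}


@dataclass
class Segment:
    phoneme: str
    duration_ms: int
    pitch_hz: float


def pronounce(word):
    """Return phonemes for a word, guessing when the word is unknown."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    # Crude fallback: one symbol per letter. Guesses like this are the
    # source of the mistakes that native speakers notice.
    return [ch.upper() for ch in key if ch.isalpha()]


def plan_utterance(text, base_pitch_hz=120.0):
    """Assign a length and a pitch to every phoneme in the text."""
    phonemes = []
    for word in text.split():
        phonemes.extend(pronounce(word.strip(".,!?")))

    segments = []
    n = len(phonemes)
    for i, phoneme in enumerate(phonemes):
        # Assumed rule: vowels are held longer than consonants.
        duration = 120 if phoneme in VOWELS else 70
        # Assumed rule: pitch falls linearly by 20% across the utterance.
        pitch = base_pitch_hz * (1.0 - 0.2 * i / max(n - 1, 1))
        segments.append(Segment(phoneme, duration, round(pitch, 1)))
    return segments


if __name__ == "__main__":
    for segment in plan_utterance("Hello world."):
        print(segment)
```

The output of this stage is only a plan: a list of phonemes, each with a length and a pitch. No sound has been produced yet. Note that a word missing from the lexicon, such as "cat", falls through to the letter-by-letter guess, which shows how quickly dictionary gaps turn into audible errors.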










