Hear me out. Yes, Google's been synthesising speech for years, but now their text-to-speech really learns.
Traditional speech synthesis takes a lot of work, a lot of research, and a tonne of speech samples from a single person. The samples are carefully cut up into distinct sounds, categorised, and then cleverly blended by an algorithm that tries to make the result sound as natural as possible. It's all very hard, as you can probably guess from the fact that, despite all that effort, the output is still obviously robotic.
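That stitch-together pipeline can be sketched in a few lines. Everything here is illustrative: the phoneme labels, the tiny two-sample "recordings", and the complete lack of smoothing at the joins (which real systems spend enormous effort on).

```python
# Toy concatenative synthesis: look up a pre-recorded clip for each
# phoneme and stitch the clips together end to end. The clip values
# are placeholders, not real audio.
unit_database = {
    "HH": [0.01, 0.02],
    "EH": [0.30, 0.20],
    "L":  [0.10, 0.10],
    "OW": [0.40, 0.30],
}

def synthesise(phonemes):
    samples = []
    for p in phonemes:
        samples.extend(unit_database[p])  # concatenate recorded units
    return samples

audio = synthesise(["HH", "EH", "L", "OW"])  # "hello"
```

The robotic quality comes from exactly this structure: the joins between units are audible, and no amount of clever blending fully hides them.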
But wait a minute, babies aren't linguistics experts; how do they learn to talk, and talk so well? That's the general idea behind this new WaveNet, which uses a neural network to learn to speak without being explicitly told about all the ins and outs of language.
The results turn out better, and there are a few interesting side effects:
- It can learn from multiple people. You don't need to record hours and hours of speech from a single person.
- It learns better from multiple people. The neural network becomes more natural-sounding and robust if trained with multiple speakers.
- It can have multiple voices. When it's learning from several people, it not only learns to talk, but it learns about accents and timbre for free.
Nifty stuff! But perhaps the most tangible benefit of those little bonuses is that Google Assistant now comes with both male and female voices.
A bit of speculation here, but expect many more types of voices to be available in the future. Other AI research into style transfer has shown that you can do very intuitive "math" with styles - e.g. "King - man + woman = queen". WaveNet definitely shows signs of having developed a sense of styles in its own understanding of speech, so the groundwork's in place.
There's strong transfer learning too, so adding new voices doesn't require starting from scratch: once it's been trained on a diverse enough set of data, it's just a matter of calculating the nearest "style" parameters that match the new accent, and voilà.
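That "style math" plus nearest-style lookup can be sketched with toy vectors. The embeddings below are invented for illustration; real models learn hundreds of dimensions, but the arithmetic-then-nearest-neighbour step works the same way.

```python
import numpy as np

# Hypothetical 3-d style embeddings (made up for this sketch).
styles = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

def nearest(vec, table):
    """Return the stored style closest to vec by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(table, key=lambda name: cos(vec, table[name]))

# "king - man + woman" lands nearest to "queen" in this toy space.
target = styles["king"] - styles["man"] + styles["woman"]
print(nearest(target, styles))  # -> queen
```

Matching a new accent would work the same way: embed it, then find the closest point in the space of styles the network already knows.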
And that's more than just OK, Google.
You can thank DeepMind for the slick new voice that emanates from Google's Home speaker and Assistant app. This time last year, Google's London-based AI division announced a new way to synthesise speech. Its software, called WaveNet, tore up the regular rule book for generating human-like voices: instead of stitching together chunks of recorded sound, which creates the clunky robotic voices we're used to, it generates the whole audio waveform from scratch, one sample at a time. The result was far smoother, with more natural intonation than other speech synthesis approaches.
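That one-sample-at-a-time loop is the core of the autoregressive idea. Here's a minimal sketch; note that `predict_next` is a hand-written stand-in (a damped echo, chosen only so the signal stays bounded), not WaveNet's actual deep convolutional predictor.

```python
def predict_next(history):
    # Hypothetical stand-in for the trained model: a damped echo
    # of the last two samples.
    return 0.9 * history[-1] - 0.5 * history[-2]

def generate(seed, n_samples):
    audio = list(seed)
    for _ in range(n_samples):
        # Each new sample is conditioned on everything generated so far.
        audio.append(predict_next(audio))
    return audio

waveform = generate(seed=[0.0, 1.0], n_samples=16000)  # ~1 second at 16 kHz
```

Doing this for real is expensive: at 16,000 samples per second, the model runs once per sample, which is why WaveNet was initially too slow for production use.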