Introduction to New Generation TTS Voices: Google Wavenet

Introduction

This page demonstrates Wavenet, the new generation of high-quality text-to-speech (TTS) voices from Google. According to Google, WaveNet generates speech that sounds more natural than other text-to-speech systems, synthesizing speech with more human-like emphasis and inflection on syllables, phonemes, and words.

Great speakers know that adding syntax-based pauses to their speech makes it easier to understand and remember. Similarly, when syntax-based pauses are added to Wavenet voices, the result is more understandable speech, as demonstrated in the samples below. (The pauses are added to the TTS voices by means of tags in the text.)
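To illustrate how pause tags can be placed in the text, here is a minimal sketch that turns the "|" and "¶" markers shown in the source texts on this page into SSML break tags. It assumes that "|" marks a short pause and "¶" a longer one, and the break durations are placeholder values; the page does not state which tags or durations were actually used.

```python
import re
from xml.sax.saxutils import escape

# Assumed pause lengths; the page does not state the values it used.
SHORT_PAUSE = '<break time="200ms"/>'
LONG_PAUSE = '<break time="600ms"/>'


def markers_to_ssml(text: str) -> str:
    """Convert '|' and '¶' pause markers into SSML <break> tags."""
    ssml = escape(text)                  # protect &, <, > in the spoken text
    ssml = ssml.replace("\u2223", "|")   # treat the '∣' variant as a plain '|'
    ssml = re.sub(r"\s*\|\s*", f" {SHORT_PAUSE} ", ssml)
    ssml = re.sub(r"\s*¶\s*", f" {LONG_PAUSE} ", ssml)
    return f"<speak>{ssml.strip()}</speak>"


print(markers_to_ssml("Multi-syllable words| can be really tricky.¶"))
```

The resulting string can be passed to a TTS service that accepts SSML input in place of plain text.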

The Wavenet voices also allow important individual words to be emphasized, which can change the meaning of a sentence. An example is given in Section 4.

Finally, we demonstrate how effectively the voices read at various commonly used speeds, from very slow (for children and beginners) to very fast (for review and browsing).

The aim is to show to what extent these voices can replace real voices in e-learning, training, and audio book applications.

Evaluation Criteria

The TTS voices are evaluated on the following criteria:

  1. Realistic and human sounding (not robotic)
  2. Correct pronunciation of words and phrases
  3. Improved understandability when syntax-based pauses are added
  4. Ability to emphasize individual words of importance
  5. Ability to speak effectively at different commonly used speeds

Sample texts

Sample texts are taken from e-learning material and audio books: selections 1 and 2 are e-learning texts (Rachel's English), and selection 3 is an audio book text (Call of the Wild).

Voices Used

For each selection, the following Wavenet voices are used:
- Wavenet F (female)
- Wavenet D (male)

For reference, the following samples are added:
- real voice sample
- sample of a previous generation TTS voice from Acapela Group

Voice Speeds

Speed      Value  WPM (orig/pauses)  Uses
fast       1.00   190/180            radio, TV commentators
normal     0.85   160/150            conversational, audio books
slow       0.70   130/120            presentations
very fast  1.15   220/210            auctioneers, for review
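As a worked example of what these words-per-minute figures mean in practice, the following sketch estimates the listening time of a text at each speed. The speed values and WPM figures are copied from the table; the function name and the 480-word lesson are hypothetical and only serve as illustration.

```python
# Speed value and WPM (original text, text with pauses) copied from the table.
SPEEDS = {
    "slow":      {"value": 0.70, "wpm": (130, 120)},
    "normal":    {"value": 0.85, "wpm": (160, 150)},
    "fast":      {"value": 1.00, "wpm": (190, 180)},
    "very fast": {"value": 1.15, "wpm": (220, 210)},
}


def listening_minutes(word_count: int, speed: str, with_pauses: bool = False) -> float:
    """Estimate playback time in minutes for a text of word_count words."""
    wpm = SPEEDS[speed]["wpm"][1 if with_pauses else 0]
    return word_count / wpm


# Hypothetical 480-word lesson read at the "normal" setting.
print(f"{listening_minutes(480, 'normal'):.1f} min")                    # 3.0 min
print(f"{listening_minutes(480, 'normal', with_pauses=True):.1f} min")  # 3.2 min
```

In this example, adding pauses lengthens the lesson by about 0.2 minutes (roughly 12 seconds) at the normal speed.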


Voice Pitch

The Wavenet voices can be adjusted for pitch. In this presentation, the female F voice is used with different pitch values for comparison:
- pitch = 0.0 in sections 1 and 3
- pitch = -2.00 in section 2 (lower pitch, more mature-sounding voice)
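For readers who want to reproduce these settings, the sketch below builds a request body in the shape used by the Google Cloud Text-to-Speech REST method text:synthesize, where pitch is given in semitones and the speaking rate is a multiplier. The voice name "en-US-Wavenet-F", the language code, and the mapping of the speed values on this page to speakingRate are assumptions; the page does not say how its samples were generated. Sending the request (endpoint, authentication) is left out.

```python
import json


def synthesize_request(ssml: str, voice_name: str, rate: float, pitch: float) -> dict:
    """Build a JSON-serializable body for a text:synthesize request."""
    return {
        "input": {"ssml": ssml},
        # Language code and voice name are assumed for illustration.
        "voice": {"languageCode": "en-US", "name": voice_name},
        "audioConfig": {
            "audioEncoding": "MP3",
            "speakingRate": rate,  # e.g. 0.85 for the "normal" speed on this page
            "pitch": pitch,        # semitones; -2.0 as used in section 2
        },
    }


body = synthesize_request(
    "<speak>English is a stress-timed language.</speak>",
    "en-US-Wavenet-F",
    rate=0.85,
    pitch=-2.0,
)
print(json.dumps(body, indent=2))
```

Keeping the voice, rate, and pitch in one function makes it easy to generate the same text at every speed for side-by-side comparison.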

Contents

Introduction to New Generation TTS Voices: Google Wavenet

Section 1: Wavenet TTS voices for e-Learning - Multi-Syllable Words
Section 2: Wavenet TTS voices for e-Learning - Stress-Timed Language
Section 3: Wavenet TTS voices for Audio Books - Call of the Wild
Section 4: Emphasizing individual words with Wavenet TTS voices


Section 1: Wavenet TTS voices for e-Learning

Sample From Rachel's English - Multi-Syllable Words

Source: Multi-syllable words

 

Contents:
1. Comparison of Wavenet F voice with Rachel's real voice and with Acapela TTS voice Sharon
2. Wavenet F (female) original - at all speeds
3. Wavenet F (female) with pauses - at all speeds
4. Wavenet D (male) original - at all speeds
5. Wavenet D (male) with pauses - at all speeds

 


Source text

Source: http://rachelsenglish.com/practice-multi-syllable-words/

Multi-syllable words | can be really tricky. There are so many sounds | and transitions in them. So today | we’re going to talk | about how to work | on multi-syllable words.¶

I encourage you | to keep a running list | of long words that have come up in conversation for you | that are hard | for you to say. Maybe | they are words that relate to your field of study | or work.¶

Let’s use | as an example | the word ‘underestimate’. First, look it up | in the dictionary | and get the I P A.

But | what I really want to talk about today | is, make sure you know | which syllables | are stressed.¶

This is a five syllable word | with stress | on the middle syllable. There is secondary stress | in this word | marked | by the little line | at the bottom. I’m going to say, don’t worry about that. They’re more like un-stressed syllables | than stressed syllables.¶

Let’s start | by practicing | the stressed syllable. Do you know the shape | of a stressed syllable? I made a video | a long time ago | about how the voice should curve up | and then down | in a stressed syllable.

The sounds | are the most important | in this stressed syllable. They should be the clearest | in your word.

Just | practice | the stressed syllable | using a hand movement.

The shape | really is important | in making the word sound natural.¶

Now let’s look | at the rest | of the syllables. We have two | before | and two | after. Practice | these syllables | together. There’s no need | to practice them | separately | like the stressed syllable.¶

At the beginning of the video, I talked about | how long words | can be hard | because there are so many sounds. But I want you to see that | in unstressed syllables | the sounds don’t have to be fully formed | and fully pronounced.

These sounds | are quieter, flatter in pitch, faster, simpler. This | should make long words easier, but that doesn’t mean you don’t have to practice them. You do, you need to repeat a new word | over | and over. But the point is to break it up | into simplified | and stressed | syllables.¶

Put together a list | of long words | and work through them | this way. I really think | that breaking up a word | into stressed | and unstressed | syllables | is the best way to master it, along with repetition. The more | you get used | to the contrast | of stressed | and unstressed | syllables, the better. Stress | really matters | in American English.

 


Section 2: Wavenet TTS voices for e-Learning

Sample From Rachel's English - Stress-Timed Language

Source: Stress-Timed Language

 Contents:
1. Comparison of Wavenet F with real voice Rachel and with Acapela TTS voice Sharon
2. Wavenet F (female) original - at all speeds
3. Wavenet F (female) with pauses added - at all speeds
4. Wavenet D (male) original - at all speeds
5. Wavenet D (male) with pauses added - at all speeds

 


Source text

Source: http://rachelsenglish.com/english-stress-timed-language/

In this American English pronunciation video, we’re going to discuss why some words sound different | when they’re said on their own | than they do when they’re said | as part of a sentence.¶

A lot of people think, when they’re studying a language | and they’re new to it, that they need to pronounce each word | fully | and clearly | in order to be well-understood. But in English | that’s actually not the case. English | is a stress timed language. That means | some syllables will be longer, and some | will be shorter. Many languages, however, are syllable timed, which means | each syllable has the same length. Examples of syllable-timed languages: French, Spanish, Cantonese.¶

So, when an American hears | a sentence of English | with each syllable | having the same length, it takes just a little bit longer | to get the meaning. This is because we are used to stressed syllables, syllables that will pop out of the line | because they’re longer | and they have more shape. Our ears, our brains, go straight | to those words. Those | are the content words. When all syllables | are the same length, then there’s no way | for the ear to know | which words | are the most important.¶

So this is why stress | is so important in American English. It’s a stress timed language. When you give us | nice shape in your stressed syllables, you’re giving us the meaning | of the sentence. This means | that other syllables need to be unstressed, flatter, quicker, so that the stressed syllables are what the ear goes to.¶

This is why it’s so important | to reduce function words that can reduce | in American English. When those function words | are part of a whole, part of a sentence, they are pronounced differently.

Let's look | at some examples.¶

Do you know what I’m saying? A native speaker | might not either.¶

This | is really | of primary importance | in American English pronunciation. As you’re working on pronunciation, keep | in mind | this idea | of a word | being part | of a whole.¶

That’s it, and thanks so much | for using Rachel’s English.¶

I’m excited to announce | that I’m running another | online course, so do check out my website | for details. You’ll find on there | all sorts of information | about the course, who should take the course, and requirements.¶

I’ve had a blast | with my first | online course, and I’m looking forward | to getting to know you.

 


Section 3: Wavenet TTS voices for Audio Books

Sample from Jack London's Call of the Wild

Source: Call of the Wild by Jack London

Contents:
1. Comparison of Wavenet D (male) with real voice John Lee and Acapela TTS voice Ryan
2. Wavenet F (female) original - at all speeds
3. Wavenet F (female) with pauses - at all speeds
4. Wavenet D (male) original - at all speeds
5. Wavenet D (male) with pauses - at all speeds



Source text

He was not | so large. He weighed | only one hundred and forty pounds, for his mother, Shep, had been a Scotch shepherd dog. Nevertheless, one hundred and forty pounds, to which was added | the dignity that comes of good living | and universal respect, enabled him | to carry himself | in right | royal fashion.

During the four years | since his puppyhood | he had lived the life | of a sated aristocrat. He had | a fine pride in himself, was even | a trifle egotistical, as country gentlemen | sometimes become | because of their insular situation. But he had saved himself | by not | becoming a mere | pampered | house dog. Hunting | and kindred | outdoor delights | had kept down the fat | and hardened his muscles | and to him, as to the cold-tubbing races, the love of water | had been a tonic | and a health preserver.

And this | was the manner of dog Buck was | in the fall of 1897, when the Klondike strike | dragged men | from all the world | into the frozen North. But Buck did not | read the newspapers, and he did not | know | that Manuel, one of the gardener's helpers, was an undesirable acquaintance. Manuel had | one besetting sin. He loved | to play Chinese lottery. Also, in his gambling, he had | one besetting weakness: faith | in a system, and this | made his damnation certain. For to play a system | requires money, while the wages of a gardener's helper | do not | lap | over the needs of a wife | and numerous progeny.

The Judge | was at a meeting | of the Raisin Growers' Association, and the boys | were busy organizing an athletic club, on the memorable night | of Manuel's treachery. No one | saw him and Buck go off through the orchard | on what Buck imagined | was merely a stroll. And with the exception | of a solitary man, no one | saw them arrive | at the little flag station | known as College Park. This man | talked with Manuel, and money | chinked between them.

“You might wrap up the goods | before you deliver'im,” the stranger said gruffly, and Manuel doubled a piece of stout rope | around Buck's neck | under the collar.

“Twist it, an' you'll choke'im plentee,” said Manuel, and the stranger grunted | a ready affirmative.¶

Buck had accepted the rope | with quiet dignity. To be sure, it was an unwonted performance: but he had learned to trust in men he knew, and to give them credit for a wisdom | that outreached his own. But when the ends of the rope | were placed | in the stranger's hands, he growled menacingly. He had merely intimated his displeasure, in his pride believing | that to intimate | was to command. But to his surprise | the rope | tightened around his neck, shutting off his breath. In quick rage | he sprang | at the man, who met him halfway, grappled him close by the throat, and | with a deft twist | threw him over | on his back. Then the rope | tightened mercilessly, while Buck struggled in a fury, his tongue | lolling out of his mouth | and his great chest | panting futilely. Never | in all his life | had he been so | vilely treated, and never | in all his life | had he been so angry.


Section 4: Emphasizing individual words with Wavenet TTS voices

Emphasizing Individual Words in a Sentence

 

1. Wavenet F and Wavenet D emphasizing individual words in a sentence

 

 

 

Source text

I never said she ate your sandwich.

I never said she ate your sandwich.

I never said she ate your sandwich.

I never said she ate your sandwich.

I never said she ate your sandwich.

I never said she ate your sandwich.

I never said she ate your sandwich.
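The seven readings of this sentence differ only in which word is stressed. One way to produce such variants is with the SSML emphasis element, as in the sketch below. This is an illustration under assumptions: the page does not say which markup was used for its samples, the function name is hypothetical, and how strongly a given voice renders the emphasis may vary.

```python
from typing import List
from xml.sax.saxutils import escape

SENTENCE = "I never said she ate your sandwich."


def emphasis_variants(sentence: str, level: str = "strong") -> List[str]:
    """Return one SSML string per word, each emphasizing a different word."""
    words = sentence.split()
    variants = []
    for target in range(len(words)):
        parts = []
        for index, word in enumerate(words):
            core = word.rstrip(".,!?")   # keep trailing punctuation outside the tag
            tail = word[len(core):]
            if index == target:
                parts.append(f'<emphasis level="{level}">{escape(core)}</emphasis>{tail}')
            else:
                parts.append(escape(word))
        variants.append("<speak>" + " ".join(parts) + "</speak>")
    return variants


for ssml in emphasis_variants(SENTENCE):
    print(ssml)
```

Each of the seven strings can be synthesized separately, so that the listener hears how the stressed word shifts the meaning of the sentence.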