11.9 Voices: talking, singing and birdsong

A. The anatomy of the human voice

I am grateful for guidance and support from Sten Ternström, Johan Sundberg and Joe Wolfe in writing this section.

Depending on your point of view, the human voice is either the oldest of musical instruments, or it is not really an “instrument” at all because it doesn’t involve a constructed device of any kind. In any case, singing seems to be a universal human activity: mothers sing to their children, children in turn start to sing from a very young age, and collective singing forms an important part of cultural bonding and ritual in societies across the world and through time.

Humans seem to be driven to make music using anything that can make controllable sounds. Once people discover musical possibilities in an activity, they then make progressive refinements in order to make better music. Singing is a striking example: the gradual refinement process has led to a wide range of specialised singing styles and techniques, from Chinese or western operatic styles to Welsh male voice choirs to Tuva overtone singing.

This section will mainly be concerned with human singing, although we will begin with a little bit about normal speaking, and at the end we will have a brief look at the mechanics of birdsong. Humans can make a very wide variety of sounds, which they combine in a rapid and virtuosic manner when talking or singing. For the vast majority of those sounds, the power source is air-flow from the lungs. There are exceptions: when you click your tongue the sound is generated by your tongue slapping against the lining of your mouth, so that the mechanism of sound generation is rather like that of a hand-clap. In the “click languages” spoken by some African peoples, such clicks are incorporated in a sophisticated way into speech (see for example https://en.wikipedia.org/wiki/Khoisan_languages).

However, most sounds are driven by the lungs, so that in some sense the human voice is a kind of wind instrument, but it doesn’t fit at all neatly within the classification system for wind instruments that we have been using in this chapter. Figure 1 shows a schematic diagram of the “instrument”, the human vocal tract. Air from the lungs enters at the bottom, passes through the vocal folds (the preferred modern term instead of “vocal cords”), then enters a duct of complicated shape, made up of the throat, mouth and nasal cavity, before exiting to the outside world through the mouth and nose. For the purpose of this schematic diagram, I have rotated the vocal folds by $90^\circ$: the opening really occurs in the side-to-side orientation, not the front-to-back one shown here. I have also tried to minimise the amount of anatomical detail and jargon in this section, but the next link gives a brief description if you are interested.

SEE MORE DETAIL

Figure 1. Schematic diagram of the human vocal tract. Note that strictly speaking the opening of the vocal folds has been shown in the wrong orientation here: it really occurs in the side-to-side direction, but that doesn’t matter for the purposes of this preliminary discussion.

Figure 1 doesn’t really look like any of the wind instruments we have studied so far, but perhaps it comes closest to resembling a brass instrument. The vocal folds behave in a rather similar way to a brass player’s lips, with flaps of squashy flesh being set into vibration by the air flow. The fluctuating pressure produced by that vibration then interacts with a duct with its own resonance frequencies, before emerging to radiate sound into the outside air. So far, so similar to a brass instrument. But there are several very important ways in which the voice is quite different.

First, the shape of the vocal tract is nothing like the shape of any brass instrument we have looked at. Furthermore, that shape is not fixed. The walls of this “duct” are not hard: they consist of soft deformable flesh. Embedded in that flesh are many muscles that allow the owner to actively change several aspects of the shape. The configuration of the vocal folds can be altered, the tongue and the lips can move around, the mouth opening can be varied, and the soft palate can be moved in order to open or close the passage connecting to the nasal cavity. We will see in a moment that rapid reconfiguration of all these things lies at the heart of our ability to speak or sing.

The next important difference from a brass instrument concerns the pitch of a sung note, compared to the length of the vocal tract. A typical male vocal tract is about 17 cm long, whereas a brass instrument capable of producing a similar range of notes, such as the trombone, needs to be far longer: a tenor trombone at full extension is around 2.7 m long. We can immediately deduce that the fundamental frequency of a sung note does not (ordinarily) fall close to a duct resonance. Instead, the sung pitch is normally determined directly by the singer setting the oscillation frequency of the vocal folds. The brass equivalent would be the range of pitches that a player can produce when they “buzz” their lips against a bare mouthpiece, without having the tubing of the instrument attached. Resonances of the vocal tract do play an important role in shaping the sound of speech or singing, though, as we will see shortly.

The next step is to think about what we do to make various sounds, via some simple examples. We will investigate speech first, then move on to singing. Say the word “pay”, and think about what you are doing with your mouth. To make the initial “p” sound you close your lips, build up some air pressure behind them, then open them suddenly to make a small explosion. We will think about the ensuing “ay” sound in a moment: first, we will investigate other ways to start a word, sticking to the same “ay” ending.

Say the word “bay”. You should find that everything is very similar to “pay”, but with a slightly stronger and more emphatic initial “lip explosion”. Now say “day”. This time, you use your tongue rather than your lips to create the explosion, and it comes out sounding a little different. Now say “lay”. Again you use your tongue to create a partial blockage, but this time there is not really an explosion. Instead, you vocalise while you manipulate your tongue – if you want to, you can sustain the “llll” sound for a while, then release the tongue to give the “ay” sound.

Next, say the word “say”. This time, to make the initial “ssss” noise you use the tongue to create a rather narrow opening, then blow some air through it. The turbulence in this air flow makes the “ssss” sound, without any vocalising from your vocal folds. Finally, combine the last two examples by saying the word “slay”. Notice how quickly you have to move your tongue from the “ssss” position to the “llll” position, while switching on vocalisation to make the “llll”. This is a first inkling of the virtuosity involved in normal speaking.

Now we can think about sustained sounds. Most of these are associated with vowels (“aaaa”, “oooo” etc.), or combinations of vowels called “diphthongs” (like our “ay” sound from the earlier examples). However, there is also the sound of humming (“mmmm”), or hissing (“ssss”), or whistling. The sounds of vowels or humming involve oscillation of the vocal folds, modulating the air flow from the lungs. Hissing and whistling are different: we have already mentioned hissing, while the sound of whistling is generated by the periodic production of vortex rings (a bit like smoke rings) when air is blown or sucked through rounded lips, at a frequency governed by a Helmholtz resonance in the mouth cavity. This sound doesn’t play a role in normal English speech or singing, so we won’t dwell on it.

You can get a first idea of how you create different vowel sounds by trying a few more simple examples. Say the words “bah”, “be”, “boo” and “bore”. Say them slowly, with the vowel sounds continuing while you think about what you are doing with your lips and tongue. These words all start with the same “lip explosion”, but to make the different vowel sounds you shape your mouth in four different ways. These different shapes create different resonance frequencies of your vocal tract, and this is the key to the perceptual effects of the different vowels.

B. The source-filter model and vowels

We can get a good understanding of how vowel sounds are associated with particular sets of vocal tract resonances by using a simple argument called a source-filter model. First, look at Fig. 2, which shows the configuration of the vocal tract when a typical vowel is being sung. The nasal cavity has been blocked by a movement of the soft palate. (You can convince yourself that this happens by singing any vowel, then pinching your nose shut — nothing changes, because the nasal cavity is not connected to the mouth and lungs.) When the vowel is sung at a particular pitch, the vocal folds open and close at the frequency of the note, much like a brass-player’s lips. They are shown in the figure at a moment when they are closed.

Figure 2. A version of the vocal tract diagram from Fig. 1, showing the conditions under which a typical vowel sound might be produced. The vocal folds are shown at a moment when they are closed, and the nasal cavity has been blocked off by movement of the soft palate.

If we were to model this system by the same approach that we used for brass instruments, we would use a procedure that is summarised in the upper plot of Fig. 3. There is a feedback loop: the varying flow rate through the vocal folds excites the resonances of the vocal tract, and the resulting pressure variation acts back on the vocal folds. The particular waveforms of flow rate and pressure are determined in a rather complicated way by this feedback process.

Figure 3. Upper diagram: feedback loop for the voice modelled like a brass instrument; lower diagram: simplified version appropriate to the source-filter model, without the feedback link from the pressure to the vocal folds.

But we have already noted that the pitch of a sung note is not much influenced by the vocal tract resonances (because the tract is short, so they are too high in frequency). Instead, the pitch seems to be determined mainly by a resonance frequency of the vocal folds themselves, as set by the singer through their muscular action. This suggests that, at least for a preliminary understanding, we might get away with forgetting about the feedback and using the simplified procedure summarised in the lower plot of Fig. 3.

This is the source-filter model: we treat the two stages of the process entirely separately. First, steady air-flow from the lungs causes the vocal folds to vibrate, giving a waveform of volume flow rate past the vocal folds rather like that shown in Fig. 4. For part of each cycle the folds are closed so that there is no flow, then they open to let a pulse of flow through. The example shown here is artificially generated using a formula suggested by Titze [1], chosen to give a reasonable representation of earlier measurements.
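
To make the shape of such a source waveform concrete, here is a minimal sketch of one cycle of pulsed flow. It is not the Titze formula used for Fig. 4: as a stand-in it uses a simple Rosenberg-style pulse (a raised-cosine rise followed by a quarter-cosine fall), and the function names and the parameters `open_quotient` and `rise_fraction` are illustrative assumptions.

```python
import math

def glottal_flow(phase, open_quotient=0.6, rise_fraction=0.67):
    """Normalised flow (peak 1) at a given phase of the cycle.

    phase          position within the cycle; any real number, wrapped to 0..1
    open_quotient  assumed fraction of the cycle for which the folds are open
    rise_fraction  assumed fraction of the open phase spent on the rising side
    """
    phase %= 1.0
    t_rise = open_quotient * rise_fraction   # duration of the opening part
    t_fall = open_quotient - t_rise          # duration of the closing part
    if phase < t_rise:
        # smooth rise from zero flow to the peak
        return 0.5 * (1.0 - math.cos(math.pi * phase / t_rise))
    if phase < open_quotient:
        # faster fall from the peak back to zero
        return math.cos(0.5 * math.pi * (phase - t_rise) / t_fall)
    return 0.0                               # folds closed: no flow

def flow_waveform(f0, sample_rate, n_samples):
    """Sampled flow waveform for a steady note of fundamental frequency f0."""
    return [glottal_flow(i * f0 / sample_rate) for i in range(n_samples)]
```

The closed part of each cycle gives exactly zero flow, and the asymmetric rise and fall skew the pulse so that the flow shuts off more abruptly than it starts.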

Figure 4. Idealised plot of the volume flow rate past the vocal folds, as they open and close once per cycle of the sung pitch.

The second stage is to take this flow waveform and use it as input to a suitable frequency response function describing the acoustics of the vocal tract, with pressure at the mouth as the output. If the vocal tract had been a simple cylindrical pipe, we would have known what this frequency response needed to be — we already looked at this case, back in section 4.2. Figure 5 reproduces Fig. 12 from that section, showing the first few mode shapes. Each shape consists of a number of quarter-cycles of a sine wave, with a pressure antinode at the closed end (corresponding to the vocal folds) and a node at the open end (corresponding to the singer’s mouth). The lowest mode has no nodal points within the pipe, the second mode has one node, the third mode has two nodes, and so on in an orderly sequence.

Figure 5. Pressure variation associated with the first few acoustic modes of a cylindrical pipe that is closed at one end and open at the other, reproduced from Fig. 12 of section 4.2. This gives a very crude model of what we should expect the resonances of the vocal tract to look like.

Of course, the real vocal tract has a more complicated shape, as indicated schematically in Fig. 2. If we imagine morphing the cylindrical pipe gradually into the correct shape, the mode shapes will change gradually — but the qualitative features of those shapes will remain the same. The lowest mode will have no internal nodal points, the second mode will have one, and so on. Similarly, the resonance frequencies will all change during the morphing process, but (because the length of the pipe/tract remains the same) the average spacing of those frequencies will remain pretty much the same.
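
As a rough numerical check on the cylindrical-pipe picture, the resonances of an ideal pipe closed at one end and open at the other fall at odd multiples of $c/(4L)$, where $c$ is the speed of sound and $L$ the length. The sketch below assumes $c = 343$ m/s (the air in the vocal tract is warm and humid, so the true value is a little higher) and ignores any end correction; the function name is hypothetical.

```python
def closed_open_resonances(length_m, count=3, speed_of_sound=343.0):
    """Resonance frequencies (Hz) of an ideal closed-open cylindrical pipe.

    Mode n has (2n - 1) quarter-wavelengths along the pipe, so
    f_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]

# A pipe the length of a typical male vocal tract (0.17 m)
for n, f in enumerate(closed_open_resonances(0.17), start=1):
    print(f"mode {n}: {f:.0f} Hz")
```

For a length of 0.17 m this gives resonances near 500, 1500 and 2500 Hz, with a uniform spacing of $c/(2L)$, roughly 1 kHz. It is this average spacing that we expect to survive when the pipe is morphed into a more realistic shape.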

Figure 6 shows some measured results for vocal tract resonance frequencies, taken from Ladefoged and Johnson [2]. But there isn’t a single set of frequencies, because the singer can shape their vocal tract in many different ways by moving the tongue and changing the mouth opening. And, of course, this is exactly what we noticed earlier when we spoke or sang the different vowels of the words “bah”, “be”, “boo” and “bore”. The three sets of measured resonance frequencies shown in different colours in Fig. 6 correspond to the vowel sounds in the words “hard” (in red), “food” (in blue) and “bed” (in green). The vertical dashed lines indicate the frequencies for the cylindrical pipe of Fig. 5, with the length of a typical male vocal tract. You can see that the pattern is as we expected: the individual frequencies move around, but the average spacing stays more or less the same so that there are always three in this frequency range.

Figure 6. The first three formant frequencies for the three vowels giving rise to the filter characteristics plotted in Fig. 7 and the synthesised sounds in Sound 1. The vertical dashed lines show the first three resonance frequencies of a cylindrical tube of length 0.17 m, the length of a typical male vocal tract. Colours correspond to those used in Fig. 7: red for the vowel sound in “hard”, blue for “food”, green for “bed”.

Plausible approximations to the frequency response functions for these three vowels are shown in Fig. 7. We are now ready to synthesise some example sounds with the source-filter model. An input waveform based on Fig. 4 was constructed, consisting of 1 s bursts of three different pitches. This input waveform, exactly the same in each case, was filtered by the three frequency response functions from Fig. 7. The result is in Sound 1: you should hear one vowel “sung” at three pitches, followed by a different vowel at the three pitches, followed by a third. You should listen out for two things. First, does each of the three pitches sound like the same vowel? Second, do the three different frequency response functions lead to sounds that are at least somewhat recognisable as the three target vowels (“hard”, “food” and “bed”)?
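
The two-stage procedure can be sketched in a few lines of code. This is not the synthesis used for Sound 1: it is a minimal illustration under stated assumptions. The source is a crude half-sine flow pulse, each vocal tract resonance is represented by a standard two-pole digital resonator, and the sample rate, bandwidth, pitches and formant values in the usage note are all illustrative guesses rather than values read from Figs. 6 and 7.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed, high enough to cover formants below 3 kHz

def pulse_train(f0, duration, open_fraction=0.6):
    """Stand-in source: half-sine flow pulse while open, zero while closed."""
    n = int(duration * SAMPLE_RATE)
    out = []
    for i in range(n):
        phase = (i * f0 / SAMPLE_RATE) % 1.0
        if phase < open_fraction:
            out.append(math.sin(math.pi * phase / open_fraction))
        else:
            out.append(0.0)
    return out

def resonator(signal, freq, bandwidth):
    """Two-pole digital resonator standing in for one vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth / SAMPLE_RATE)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    a2 = -r * r
    gain = 1.0 - a1 - a2          # normalises the gain to 1 at zero frequency
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesise(formants, pitches, duration=1.0, bandwidth=80.0):
    """Filter the same source, at each pitch in turn, through one vowel filter."""
    sound = []
    for f0 in pitches:
        burst = pulse_train(f0, duration)
        for f in formants:        # cascade one resonator per formant
            burst = resonator(burst, f, bandwidth)
        sound.extend(burst)
    return sound
```

For example, `synthesise((700, 1100, 2500), (110, 165, 220))` would give three one-second bursts with a filter loosely resembling an open “ah”-like vowel. The source is identical whatever the formant set, so any difference in vowel quality comes from the filter alone.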

Figure 7. Plots of the impedance functions used to simulate the vowel sounds in Sound 1. Each curve shows the external pressure at the mouth for unit volume flow at the vocal folds. The formant frequencies are as shown in Fig. 6: the red curve is for the vowel sound in “hard”, the blue curve for “food”, the green curve for “bed”.

Sound 1. Three synthesised vowels, each “sung” at three different pitches. These sounds have been computed by the source-filter model, using the source flow rate waveform plotted in Fig. 4 and the three filters plotted in Fig. 7.

The next link gives some details of how all the ingredients of this synthesis were done. I should emphasise that the aim here was to produce the simplest synthesis that could demonstrate the vowel effect clearly. It would no doubt have been possible to add more “bells and whistles” to give a more realistic, human-sounding result, but that would have been to address a different question.

SEE MORE DETAIL

If you would like to explore the link between vocal tract shape and vowel sounds some more, there are several excellent resources available online. The most directly available is a web site called Pink Trombone. This is an interactive model allowing you to emulate a wide variety of speech and singing sounds. I suggest you switch off “Pitch wobble”, then play around with the controls. You can alter the sung pitch, you can move the tongue, and by clicking in various places around the mouth opening you can make different articulatory sounds like “p”, “b”, “l”, “m” and so on.

Two other free resources take the form of Windows executables. The program “Madde” available here allows you to construct a wide variety of steady sung notes, some of which sound remarkably like realistic human singing. The program “VocalTractLab” available here is a full articulatory model, like Pink Trombone but more sophisticated.

C. High notes, intelligibility and resonance tracking

Our ability to distinguish different vowel sounds largely independently of the pitch of a sung note depends on the interaction between the harmonics of the note and the frequency response of the vocal tract. We have already touched on this issue, back in section 5.3. Figure 8 is a repeat of Fig. 6 from that section, and it illustrates schematically what is going on. The red lines in the animation indicate the harmonic amplitudes and frequencies, as a chromatic scale is sung. The dashed curve shows a frequency response with two broad resonances, qualitatively similar to those of the vocal tract. This curve is “sampled” by the harmonics, and provided these are sufficiently dense, the pattern of their amplitudes gives an indication of the positions of the resonant peaks. This remains true as the note changes as it moves up the scale — this illustrates the phenomenon of formants.

Figure 8. The interaction between the harmonics of a sung chromatic scale and the frequency response of the vocal tract, reproduced from Fig. 6 of section 5.3. The red lines mark the harmonics produced by a singer, performing a one-octave chromatic scale starting at $G_3$ (196 Hz). The dashed line shows a schematic version of the frequency response of the vocal tract, in a configuration corresponding to a particular vowel.

Figure 8 deliberately made use of rather low notes, so that the harmonics are sufficiently close together that the formant pattern can be seen. But you may have spotted that something will go wrong if a very high note is sung. Suppose the fundamental frequency of the note had been 750 Hz, with harmonics at 1500 Hz, 2250 Hz and 3000 Hz. Look where those frequencies fall on the dashed curve: neither resonance would come through clearly as a peak in the sound spectrum, because these frequencies are simply too far apart to resolve the structure of the frequency response.
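
The arithmetic behind this sampling problem is simple enough to sketch. The harmonics of a note are spaced by the fundamental frequency, so a resonance peak can lie as much as half the fundamental away from the nearest harmonic. The function names below are hypothetical, and the 3 kHz upper limit is just the range discussed in the text.

```python
def harmonics_up_to(f0, f_max):
    """Harmonic frequencies n * f0 that do not exceed f_max."""
    count = int(f_max // f0)
    return [n * f0 for n in range(1, count + 1)]

def worst_case_offset(f0):
    """Largest possible distance from a resonance peak to the nearest harmonic."""
    return f0 / 2.0

for f0 in (196, 750):   # the low G3 of Fig. 8 and the high note in the text
    hs = harmonics_up_to(f0, 3000)
    print(f0, "Hz:", len(hs), "harmonics, worst-case offset",
          worst_case_offset(f0), "Hz")
```

At 196 Hz there are fifteen harmonics below 3 kHz to trace out the frequency response; at 750 Hz there are only four, and a peak could sit up to 375 Hz away from the nearest one.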

This gives a problem for sopranos, and also leads to an opportunity. It is indeed the case that different vowels cannot be clearly distinguished when sung at very high pitch. This is not the fault of the singer: it is an inevitable consequence of the laws of physics, and composers and conductors need to be aware of it.

But an opera singer is in the business of making enough sound to be heard clearly, and it would seem a pity to let a good resonance go to waste if they are singing a note with a fundamental lying above the first formant frequency for the vowel in question. So, as you might perhaps guess, a trained singer will adjust their vocal tract for very high-pitched notes in an effort to match the first vocal tract resonance to the fundamental frequency. The effect is illustrated very clearly in Fig. 9, reproduced from Joliveau, Smith and Wolfe [3]. Eight experienced sopranos were asked to sing the words “hard”, “who’d”, “hoard” and “heard” at a range of pitches, to give four different vowel sounds. Simultaneously, the authors used an ingenious method for non-invasive measurement of the vocal tract frequency response, so that they could pin down the first vocal tract resonance frequency (labelled “$R1$” in the plot) associated with each sung note.

Figure 9. Resonance tracking by a soprano singer, copyrighted by Joliveau, Smith and Wolfe [3] and reproduced by permission. Eight classically trained singers produced sustained notes at a range of pitches, based on four different vowel sounds. Simultaneously, the frequency response of their vocal tract was measured, to locate the peak frequency of its first resonance. At lower sung pitches the four vowels have distinct frequencies of this first resonance, but as the pitch rises the four different vowels tend to converge near the dashed line which represents matching between the pitch frequency and the resonance frequency.

The result of plotting this frequency against the fundamental frequency of the note is shown in Fig. 9. The sloping dashed line indicates where these two frequencies would become equal. At relatively low pitches, the four vowels are clearly separated to mark out four approximately horizontal lines in the plot. The singers are using the usual vocal tract configuration for each vowel to produce four different resonance frequencies $R1$. But as each of the lines in turn approaches the dashed line, it curves upwards and tracks the dashed line. Or at least, it tracks it until the extreme right-hand side of the plot, where perhaps it becomes physically challenging for the singer to shape their mouth to give such a high resonance while still being able to sing the note.
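
The qualitative shape of these curves can be captured by a very crude idealisation: the resonance stays at the vowel's usual value until the fundamental reaches it, then follows the fundamental, up to some upper limit. The sketch below is only a caricature of the measured behaviour. The function name, the example vowel resonance of 700 Hz and the limit of 1000 Hz are all assumptions chosen for illustration, not values taken from Fig. 9.

```python
def tuned_r1(f0, r1_vowel, r1_limit=1000.0):
    """Idealised first-resonance frequency (Hz) for a note of fundamental f0.

    r1_vowel  the vowel's usual first resonance at low pitch (assumed value)
    r1_limit  assumed highest resonance the singer can physically reach
    """
    return min(max(r1_vowel, f0), r1_limit)

# Sweep the sung pitch upwards for a vowel whose usual R1 is taken as 700 Hz
for f0 in (300, 500, 700, 900, 1100):
    print(f0, tuned_r1(f0, r1_vowel=700.0))
```

With these numbers the resonance sits at 700 Hz for the lower notes, rises with the fundamental between 700 Hz and 1000 Hz, and then falls behind once the assumed limit is reached.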

In later work from the same research group, Garnier, Henrich, Smith and Wolfe [4] demonstrated a further twist to this story. It turns out that some sopranos learn to tune the second vocal tract resonance to the fundamental frequency of very high notes, and in that way extend their range of notes that can be sung with strong tone. But in order to achieve this, they have to do something counter-intuitive when they make the transition from first to second resonance: the mouth has to be closed somewhat, whereas the tuning of the first resonance frequency requires the mouth to be progressively opened as the pitch rises.

Perhaps the most striking manifestation of vocal tract tuning is given by overtone singing. This is a technique practised by peoples around the world: a well-known example is Tuvan or Mongolian throat singing. On the principle of a picture being worth a thousand words (and a video worth even more), before I describe the phenomenon watch this short clip of singer Anna-Maria Hefele demonstrating inside an MRI scanner.

Figure 10. Anna-Maria Hefele demonstrating overtone singing inside an MRI scanner.

Your immediate impression may be that she is able to sing two tunes at the same time: one at very low pitch, the other at high pitch. That is true, but she did not have a free choice of which notes could be produced simultaneously. The high note is always an exact harmonic of the low note, but for a given low note she is able to pick out and emphasise different harmonics by adjusting a vocal tract resonance, mainly by what she does with her tongue. So with a single low note she can emphasise a bugle-like sequence of harmonics, while to produce other high notes she has to vary the low note appropriately.
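
Because the high note must be a whole-number multiple of the low one, the set of melody notes available above a given drone is easy to list. The sketch below does this, reporting the nearest equal-tempered note name and the deviation from it in cents. The function names are hypothetical, and the drone of 110 Hz in the usage note is an arbitrary example, not a value taken from the video.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq, a4=440.0):
    """Nearest equal-tempered note name and the deviation from it in cents."""
    semitones = 12.0 * math.log2(freq / a4)   # distance from A4 in semitones
    nearest = round(semitones)
    cents = 100.0 * (semitones - nearest)
    midi = 69 + nearest                       # MIDI note number; A4 is 69
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    return name, cents

def overtone_notes(f0, harmonic_numbers):
    """(harmonic number, frequency, note name, cents) for each chosen harmonic."""
    result = []
    for n in harmonic_numbers:
        name, cents = nearest_note(n * f0)
        result.append((n, n * f0, name, cents))
    return result
```

For a drone at 110 Hz, `overtone_notes(110.0, [4, 5, 6, 8])` gives the notes A4, C#5, E5 and A5, a bugle-like arpeggio. Some harmonics, such as the 5th and 7th, fall noticeably flat of the nearest equal-tempered note.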

D. The singer’s formant

Opera singers have another problem with making themselves heard: they are often competing with the sound of a full orchestra. Figure 11 shows typical averaged sound spectra, taken from the work of Johan Sundberg [5]. It reveals that an orchestra alone (black solid curve) has a very similar distribution of sound energy across the frequency range to a normal speaking voice (dashed curve). It follows that a speaker could easily be masked by the orchestral sound.

Figure 11. Long-time-average spectra for the sound of a symphony orchestra with and without a singer soloist (black and red solid curves) and for normal speech (dashed curve). The “singer’s formant” constitutes the major difference between the orchestra with and without the singer soloist. Copyright Johan Sundberg, reproduced by permission from Sundberg [5].

However, an orchestra with a trained singer soloist (in this case the late Jussi Björling) gave the spectrum plotted in red. In a range around 2–3 kHz, the red curve rises well above the black curve. This boosted level in a well-defined bandwidth is known as the singer’s formant, and it allows the soloist to be heard despite the orchestral “background noise”. It also gives the voice a characteristic tone quality, which accounts for a large part of what it means to “sound like an opera singer”.

The precise details of the physics behind the singer’s formant have been a matter of some debate. Experiments with ducts made to mimic the vocal tract profile inferred from MRI scans of singers led Sundberg to conclude that this formant is not caused by a single resonance but by a cluster of resonances [6]. The singer does something to the configuration of their vocal tract that causes the 3rd, 4th and 5th resonances to shift so that they are close together, in the vicinity of 3 kHz, rather than being more uniformly spaced out as Fig. 6 suggested. This clustering can persist independently of which vowel is sung, so that the beneficial effect of the singer’s formant also persists. A brief description of what a singer is believed to do in order to create the singer’s formant is included in the first side link above, section 11.9.1.

[1] Ingo R. Titze: “Sensitivity of odd-harmonic amplitudes to open quotient and skewing quotient in glottal airflow (L)”; Journal of the Acoustical Society of America 137, 502–504 (2015).

[2] Peter Ladefoged and Keith A. Johnson: “A course in phonetics”. Wadsworth (2011).

[3] Elodie Joliveau, John Smith and Joe Wolfe: “Tuning of vocal tract resonance by sopranos”, Nature 427, 116 (January 2004).

[4] Maëve Garnier, Natalie Henrich, John Smith and Joe Wolfe: “Vocal tract adjustments in the high soprano range”; Journal of the Acoustical Society of America 127, 3771–3780 (2010).

[5] Johan Sundberg: “The science of the singing voice”, Northern Illinois University Press (1987).

[6] Johan Sundberg: “The singer’s formant revisited”, Voice 4, 106-119 (1995). The same text is also available here.