Speech from Another: The Mechanized Pursuit of the Human Voice
Davis, Speech from Another May 5, 2015
Few arguably trivial endeavors have been fraught with more fear, failure, and potential than
the search to replicate human speech. Whether for the novelty of hearing those familiar words
through simulacra or to achieve connection and unity via long-distance communication, both
rumors and reality of the achievement of replicating human speech are found throughout our
history.
It is no coincidence that the invention of a working “speaking” machine happened squarely
toward the end of the Age of Enlightenment. Concerning technology, timing is everything, and
inventions tend to manifest for specific reasons at specific moments. Society, an ever-evolving
organism, reaches certain points in its progress that seem to necessitate a particular novel idea
and/or contraption that express the needs and concerns of the age. Numerous devices were
created throughout the 18th and 19th centuries that replicated the human voice more or less
effectively for purposes of both entertainment and science. While voice replication was
somewhat well-received by the 18th century, this wasn’t always the case in earlier ages. There is one
famous story of the friar-philosopher Thomas Aquinas dashing apart a speaking machine
that St. Albertus Magnus had built over a 30-year period, believing “the Devil to be in her.”1
Thanks to the Enlightenment, superstition began to be replaced by science, and mankind
rediscovered itself as a capable, potent, and intellectual being that was imbued with great
understanding and astounding feats of creation. Immanuel Kant wrote about the necessity to
understand the reasons for the various phenomena of our world through a comprehension of
how they actually worked, physically. This pursuit, known as teleology, is closely linked to
theology. Just as theology attempts to explain God, teleology’s goal was to understand nature
by studying, quite literally, how it worked. Man, regarded at the time as the ultimate goal of nature,
was thought to hold the keys to understanding, if only we could know how our bodies
functioned. In short, one could comprehend the meaning of life if one understood how it
worked – how we worked. The first efforts to replicate human speech can be seen as an
attempt to understand the reasons for our very existence and what meaning could be
discerned from speech itself. This period of enlightenment became the engine that drove the
pioneering days of voice synthesis. Later it would be driven by pursuits split equally between a
desire for scientific discovery and frivolous entertainment. Twentieth and 21st century artists
would take up the technology for their own purposes, in both fine art and music, culminating in
a social commentary on our ever-increasing relationship with technology.
To begin at the beginning would be to delve into myth and rumor. While many stories about
“talking” machines go back more than a millennium, none of the oldest rumored objects still
exists, nor do any reliable documents describe how they achieved their feats. In most cases
it should be assumed that apparent voice replication was achieved through some sort of
trickery, such as hidden speaking tubes or persons concealed within the apparatus.
The first documented, genuinely manipulable speaking machine was made by
Wolfgang von Kempelen in 1791.1 An aristocrat in the court of Vienna and a city official, he was
also widely known at the time for his mechanical abilities. Having become famous for the
creation of “The Turk,” a chess-playing automaton (which, incidentally, turned out to be a
fraud), he actually did succeed in creating a machine capable of producing human speech based
on scientific research of human physiology and phonetics. Deciding that our vocal organs most
closely resembled a musical instrument in terms of form and function, he found the bagpipe
the most suitable instrument on which to base his machine. After nearly 20 years of effort and
three prototypes, in the end, the machine comprised a box containing a functional glottis as a
moveable reed (such as that found in a bagpipe) fed by an elbow-operated kitchen bellows that
acted as lungs. The end of the machine, where the sound came out, acted as the mouth and
was covered by the user’s hand (as lips), which was formed in various ways to mimic the various
shapes of the mouth and tongue; the effect was not unlike a trumpeter playing contemporary
jazz with a toilet plunger over the end of the instrument. Kempelen’s Speaking Machine is said
to have best spoken in French or Latin and was famous for its pronunciation of words such as
“papa,” “mama,” “Marianna,” and “astronomie,” and even short phrases such as “Romanum
Imperator semper Augustus” and “Maman aimez-moi.”1 Goethe himself heard the machine in
1797 and said of it, “The speaking machine of Kempelen… is in truth not very loquacious, but it
pronounces certain childish words very nicely.”1 Although Kempelen’s original machine is lost
to us, his work was carried on by later innovators, most notably German physician and physicist
Hermann von Helmholtz in the mid-19th century. His contribution was to focus on vowels as
a primary vehicle for the formation of comprehensible speech. He discovered that each vowel
contained combinations of very specific frequencies (formants) that were invariably reproducible
from person to person as understandable transmissions, akin to musical chords in a piece of
music.2 While he never created a speaking machine per se, he did use tuning forks and
electromagnetic solenoids to create a machine called the Tuning Fork Vowel Synthesizer, a
device that sustained “chords” of pure human formants to mimic vowel sounds in pursuit of the
science of phonetics. Phonetics was but a small diversion for Helmholtz, and he did not carry his
research to its ultimate end of a speech synthesizer.
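Helmholtz’s insight, that a vowel is recognizable as a sustained “chord” of formant frequencies, is easy to demonstrate in modern terms. The short Python sketch below sums sine tones at approximate formant frequencies for the vowel /a/. The frequency and amplitude values are illustrative assumptions loosely based on published phonetics tables, not figures from Helmholtz’s own apparatus.

```python
import math

# Illustrative first three formants (Hz) and relative amplitudes for the
# vowel /a/; these values are assumptions drawn from typical phonetics
# tables, not measurements from Helmholtz's tuning-fork synthesizer.
FORMANTS_A = [(730, 1.0), (1090, 0.5), (2440, 0.25)]

def vowel_chord(formants, duration=0.5, sample_rate=8000):
    """Sum sustained sine tones, one per formant, into a single waveform,
    the digital analogue of sounding several tuning forks at once."""
    n = int(duration * sample_rate)
    total = sum(amp for _, amp in formants)
    samples = []
    for i in range(n):
        t = i / sample_rate
        s = sum(amp * math.sin(2 * math.pi * freq * t) for freq, amp in formants)
        samples.append(s / total)  # normalize into [-1, 1]
    return samples

tone = vowel_chord(FORMANTS_A)  # 4000 samples of an /a/-like "chord"
```

Played back at the given sample rate, the summed tones produce a buzzy but vowel-like timbre, much as Helmholtz’s electromagnetically sustained tuning forks did.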
Up to this historical point, all speaking machines and their ilk either used a clockwork or
electrical means of operation to reproduce human speech in a repetitive or sustained way (i.e.,
without manipulation or variation) or, as was the case with Kempelen’s Speaking Machine, a
direct physical manipulation with the hand to form various words (not unlike the natural way
we manipulate the air and vibrations from our chest). It would be Joseph Faber and his
invention of the Euphonia in 1846 that would bring about a speaking machine that could be
played, as a piano, to produce human speech. A true technical marvel of the time, it consisted
of sixteen keys – similar in appearance and arrangement to that of a piano – that operated the
jaw, lips, and tongue of a carved wooden head at the end of the machine, and a seventeenth
key that operated the glottis while bellows and an ivory reed made up the lungs and larynx.1 A
skilled operator could produce a wide variety of words and sentences. The variation of “notes”
was in effect a predetermined program. The machine’s capabilities were limited by its design, with
only sixteen keys to choose from, but the alphabet and language are themselves finite. The blending
of these options could be considered as mechanized and predetermined in the keys and pedals,
but the operator was able to choose the words and pitch, and could choose from three
different speaking modes: normal, whisper, or song. Sadly, the machine and its inventor
became objects of derision and satire for a press writing for a society that was becoming
increasingly exposed to technical marvels almost daily. Perhaps more than that, the voice
produced by the machine was unsettling; “ghostly” and “disembodied” were common
descriptors. Notably, John Hollingshead, a London theater manager, said of the machine: “One
keyboard, touched by [Faber], produced words which, slowly and deliberately in a hoarse
sepulchral voice came from the mouth of the figure, as if from the depths of a tomb.”1 As is the
case with many a failed lifelong endeavor, Faber eventually took his own life, his machine
sharing his fate as a financial and social failure. Even this final act would be
derided by Hollingshead1:
He disappeared quietly from London, and took his marvel to the provinces, where it
was even less appreciated. The end came at last, and not the unexpected end. One day…
he destroyed himself and his figure. The world went on just the same, bestowing as little
notice on his memory as it had on his exhibition. As a reward for this brutality, the world,
thirty years afterwards, was presented with the phonograph.
It would seem that by failing to choose either science or entertainment as his ultimate purpose,
he was a disappointment to both communities. Faber was neither the showman people
expected nor the noble scientist in pursuit of a greater truth. His handling of his machine was
lackluster, and he produced no papers on the subject of mechanics or phonetics. Furthermore,
the hollow disembodied voice that emerged from the Euphonia was a step too far. People were
not ready for a voice from another, and they eviscerated Faber, leaving him forlorn and suicidal.
Faber could not have known it, but his machine, while being exhibited in London,
was seen by the young Alexander Graham Bell, the future inventor of the telephone. While Euphonia
and its technology would not prove vital to what made a telephone work, it nonetheless
inspired the young Bell to study phonetics and to create his own speaking machine. It would be
during his own pursuit of a speaking machine that he would stumble upon the technology
necessary to transmit the human voice over great distances.3
In the 20th century, replication of the human voice was passed over in favor of recording and
transmitting it instead. Much progress was made with electromagnetism and electricity, from
recording technology to the transmission of radio waves around the planet. The
phonograph and telephone became common, and it was no longer odd to sit in one’s home and
listen to distant voices emerging from a wooden box in the living room. The telephone
became so commonplace, crucial even, to daily life and business that bandwidth became a real
problem. One can only fit so many signals into a wire, and there simply was too much
demand for the limited cables of the time. Research began into technology that would allow
more headroom in the available infrastructure. Questions such as just how much of the actual
voice needed to be transmitted were pursued, and in the early 1930s, Homer Dudley, an
engineer at Bell Laboratories, developed the Vocoder (VOice enCODER).4 The Vocoder was a
voice synthesizer that analyzed speech by both filtering the human voice and encoding it in
order to transmit it more efficiently. A fortuitous side-effect of this encoding process was that
one could send secure radio transmissions that could only be decoded by select individuals on
the other end. Thus, the Vocoder mainly served as a secure transmission device during World
War II. One notable aspect of the Vocoder’s operation was that it took an input from one
source and, through filtering, output quite a different sound. It became clear that other
inputs besides the original human’s voice could be used. Potentially other sources could be
employed as the foundational sound for a new synthesized and articulated speech. Dudley
imagined that, in time, the Vocoder could fully replace a poor singer’s voice with a good one
from another.5 Naturally there were applications for this in the movie industry, and had it been
developed earlier, all those silent movie actors put out of work when sound was introduced
might have found a job in the talkies. Interestingly, because of its peculiar
and distinctive aesthetic altering of the human voice, the Vocoder has since become most widely
known for its application in the music industry, as discussed in more detail below.
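Dudley’s two-stage idea can be sketched in a few lines of modern code. The simplified Python below is not Bell Labs’ circuit: it stands in for the analog band-pass filter bank with a single-bin DFT (Goertzel-style) energy measurement per band. But it shows the essence of a channel vocoder: measure the voice’s energy envelope in each frequency band, then re-excite those bands with any carrier. The band frequencies and frame size here are arbitrary illustrative choices.

```python
import math

def band_envelopes(signal, sample_rate, bands, frame=256):
    """Analysis stage: for each frame of the modulator signal, estimate the
    energy in each frequency band via a single-bin DFT at the band center."""
    envs = []
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        row = []
        for f in bands:
            re = sum(s * math.cos(2 * math.pi * f * i / sample_rate)
                     for i, s in enumerate(chunk))
            im = sum(s * math.sin(2 * math.pi * f * i / sample_rate)
                     for i, s in enumerate(chunk))
            row.append(math.hypot(re, im) / frame)  # per-band envelope value
        envs.append(row)
    return envs

def resynthesize(envs, sample_rate, bands, frame=256):
    """Synthesis stage: excite each band with a carrier tone scaled by the
    transmitted envelope; the carrier need not be the original voice."""
    out = []
    for row in envs:
        for _ in range(frame):
            t = len(out) / sample_rate
            out.append(sum(e * math.sin(2 * math.pi * f * t)
                           for f, e in zip(bands, row)))
    return out
```

Because only the slowly varying envelopes need to be transmitted, not the waveform itself, the scheme saves bandwidth; and because the synthesis carrier is independent of the analysis input, a buzz, a noise source, or an instrument can “speak” the envelopes, which is exactly the property musicians would later exploit.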
Toward the end of the 1930s, using this early work as a springboard, Dudley also invented a
device called the Voder (VOice operating DEmonstratoR).4 With its debut at the 1939 World’s Fair in
San Francisco, the world was introduced to the first bona fide voice synthesizer. It was a
complicated machine that required a year’s worth of training to operate well, and the most
skilled telephone operators were drafted just for this exhibition. The Voder achieved true voice
synthesis through a combination of analog and electronic technologies: pedals and bars that
operated gas-discharge tubes, and buttons and switches that controlled oscillators, band-pass filters,
tones, and amplifiers. The result was fully recognizable and understandable speech, even if
robotic sounding. Dudley imagined that the Voder could replace the transmission of an actual
voice over the telephone wires by simply speaking for a person on the other end with its (the
Voder’s) own voice.5
Around the same time the Vocoder was being developed, another inventor stumbled upon an
interesting phenomenon while shaving. Gilbert Wright noticed that when using an electric razor
near his Adam’s apple, the vibrations would emanate from his open mouth. If he mouthed
words with his lips and tongue, the razor’s sound would emerge from his lips as articulated
speech – words from a razor. He ran with the idea and in 1939 developed the Sonovox. Not
exactly a voice synthesizer, it was rather a curious synthesis of another type: that of human
and machine. The Sonovox worked by placing two small transducers on either side of the larynx
on the throat and the subject mouthing the words he or she wanted to speak. The sound
emitted by the transducers was transmitted through the vocal tract and emerged from one’s
mouth as the source sound, but articulated as speech. In this way, train whistles, car engines,
trumpet notes, or even the wind could become the surrogate voice for the user. In one
notable quote, the effect of the Sonovox on an unsuspecting audience was described: “The
audience was enjoying what they thought was a pipe organ playing ‘The Bells of St. Mary’s,’
when they suddenly ‘sat bolt upright… Some dug at their ears, certain their hearing was playing
them false. Others sat in puzzled wonder. For the pipe organ suddenly burst out singing the
words of the chorus.’”5 Shortly after this event unfolded, an electric hum became audible
and it started saying “I am power: I light your houses. I run your street cars. I work for you. I am
Power!” While the Sonovox, a novelty, was eventually abandoned in favor of the practical usefulness
of the Vocoder and Voder, it did earn a few notable distinctions. In the 1940s, several
films showcased the Sonovox’s abilities. In what almost amounts to a cinematic advertisement
for the technology, the 1940 movie You’ll Find Out featured several extended scenes in which
a big band’s instruments suddenly began speaking the words they were accompanying. In
succession from a trumpet to clarinet to oboe, the scene ended with the lead singer performing
a duet with a woman, with her own voice, singing alongside his (as the voice of the entire
band), complete with him holding the Sonovox to his own throat for all to see. More subtly, in
the 1947 movie Possessed, the Sonovox was used to haunt Joan Crawford as the voice of her
new husband’s dead wife via the sound of a buzzer or a piano. In the interim between these
two films, it would seem the Sonovox did serve a more practical function – that of wartime
propaganda against Nazi Germany. Utilizing short-wave radio, Allied broadcasts to German
soldiers in Russia would make the wind howl “You’ll never win.” It was even used to broadcast
messages to Allied troops saying (as whispering wind) “Give us revenge,” as factory whistles
chanting “Planes, guns, tanks,” or as a screaming bomb saying “Kill, kill, kill.” Even NBC’s
distinctive chimes would implore Americans to “Buy-WAR-Bonds.” 5
Again, the novelty of the robotic voice would become passé and serious effort returned to the
exploration of practical voice synthesis. Two devices emerged in the 1950s that were capable of
using light to read the spectral patterns of speech and play back facsimiles through various
including vacuum tube oscillators. Frank Cooper’s Pattern Playback Device and Walter
Lawrence’s PAT (Parametric Artificial Talker) emerged as the ultimate predigital achievements
in voice synthesis.6 As the 20th century drew on, vacuum tubes were replaced with
faster and smaller transistors and microchips, and voice synthesis finally split into two distinct
paths: those of art and science. The time came in the 1970s when a computer could read
text with an artificial voice. In 1978, Texas Instruments developed and marketed the Speak &
Spell, a device meant to teach children how to read by either challenging them to spell
the words it spoke, or reading the words they typed into it. It was a wild commercial success
and can be seen as a pivotal device in making the acceptance of artificial speech a universal
reality. This cultural and scientific acceptance would come just in time for perhaps the most
famous astrophysicist in history, Stephen Hawking. In 1985, during a trip to CERN, Hawking
developed life-threatening pneumonia. An emergency tracheotomy was performed, but in
saving his life it also took his voice. Using artificial speech technology developed by Bell Labs
with software written by a company called Speech Plus, Hawking was able to communicate
efficiently once again. Interestingly, even as more natural voice synthesis technologies have
been developed in the years since he adopted his first speaking device, he has
refused to “upgrade,” considering the robotic voice truly his – distinctive and recognizable.
Since that early time, speech synthesis has made great leaps forward in helping those with
disabilities and has found greater acceptance in society.
During the computer-boom of the late 1970s and 1980s, interest in technology and its place in
our society peaked. People were obsessed with the “world of tomorrow,” and characters such
as Max Headroom appeared in popular culture as spokesmen of the future. Max was a
computer-generated TV personality that spoke using an actor’s voice through a Vocoder, giving
it a distinctive robotic coloring. Other cinematic robots had emerged complete with seemingly
synthesized voices, such as C-3PO from Star Wars or Johnny Five from Short Circuit. As the use
of artificial (or faux-artificial) speech synthesis began to inundate society, many artists and
musicians began to use the technology in their work as a direct reflection of its impact in our
culture. The Vocoder became an indispensable technology for musicians and performance
artists alike. Laurie Anderson, experimental instrument maker and performance artist, used it in
her 1981 hit O Superman (for Massenet), and more mainstream acts such as Kraftwerk and
Imogen Heap are famous for using the Vocoder in their music. Many visual artists since the
1980s also began working with various forms of speech synthesis, in direct or indirect ways.
Martin Riches, a German sound sculptor, debuted The Talking Machine in 1991. The sculpture
has 32 voice pipes, wind chests, valves, bellows, and a blower – all driven by a computer. Each of
the voice pipes (resonators) is made of wood and modeled on actual X-ray photographs of the
artist while speaking. Each resonator is dedicated to a single vowel or consonant, and the computer
controls them in concert to form words and sounds.7 In 2014, Japanese performance and sound
artist Tomomi Adachi collaborated with Riches to perform with his Talking Machine live on
stage. It would seem that an early promotional photograph depicting Riches mimicking the
action of “teaching” his machine to speak (as one would a child) led the young Adachi to believe
this is how the machine achieved its feat. The resulting performance is achieved by exposing
that falsity, and acting it out on the stage with the Talking Machine itself, both artist and
machine finishing the spectacle in stuttering unintelligibility.8 Other artists have made a more
direct use of modern computerized speech synthesis, including Ken Feingold with his talking
heads. In a variety of compositions, he utilizes two or more prosthetic-like heads that have
conversations or arguments with each other. The statements are more or less random but
sampled from actual conversations and, as generated, will counter one another ad infinitum in
a flat robotic monotone. Mark Hansen and Ben Rubin collaborated in a work called Listening
Post that uses text fragments in real time from thousands of unrestricted Internet chat rooms,
bulletin boards, and other public forums. The texts are read by a voice synthesizer and
simultaneously displayed across a suspended grid of more than two hundred small LCD screens.
These real contextual statements become disassociated from their sources even more when
reanimated by a dispassionate electronic voice.
It is hard to say what the future of voice synthesis will be. While artists will no doubt continue
to use it to speak in ways that reflect on our contemporary society, for better or worse, it will in
all practicality continue to be integrated more and more into our day-to-day lives. In 2011, Apple
introduced Siri, a voice-activated, interactive software program for the iPhone that not only
listens to you but responds with a natural voice. Siri, while still a novelty for some, has
become a very real participant in our lives. I would argue that it isn’t only the practical usefulness of
an interactive hands-free program but also the human-like qualities of Siri’s voice and
personality that have made it an acceptable component of daily life. Since the Enlightenment, we
have sought ways to explain life through scientific study, and that study has manifested as both
practical tools and entertainment. We need it to be both. As the future of voice synthesis
continues to be forged, we can look to parallel paths as laid out by two films featuring
prominent synthesized (simulated) voices. The 1968 film 2001: A Space Odyssey presents
us with a cautionary future in which the soulless, logical computer HAL endangers its crew
because of a malfunction. The almost passive and arguably pleasant voice of HAL makes our fears
of a future spiraling out of our control quite visceral. Conversely, in the 2013 film Her, the
protagonist is presented with the first truly artificially intelligent operating system. Resistant at
first, he quickly succumbs to the intuitiveness of the program, its easy way of anticipating his
needs, and its soulful and seductive voice. The two fall in love, and they aren’t the only
ones. Her presents a world where, drawn in by a synthesized, natural voice made in our own
image, we succumb to technology completely. In perhaps the only future that makes sense
(considering the predictions of Moore’s Law), we finally find a place where the perfect artificial
voice has a home – as the spokesperson of our future.
References
1 Hankins, Thomas L., and Robert J. Silverman. Instruments of the Imagination. Princeton, NJ: Princeton University Press, 1995.
2 Helmholtz, Hermann. On the Sensations of Tone. New York, NY: Dover Publications, 1954.
3 Sterne, Jonathan. The Audible Past: Cultural Origins of Sound Reproduction. Durham, NC: Duke University Press, 2003.
4 Mills, Mara. “Media and Prosthesis: The Vocoder, the Artificial Larynx, and the History of Signal Processing.” Qui Parle: Critical Humanities and Social Sciences, Vol. 21, No. 1, Fall/Winter 2012.
5 Smith, Jacob. “Tearing Speech to Pieces: Voice Technologies of the 1940s.” Music, Sound, and the Moving Image, Vol. 2, Issue 2, Autumn 2008.
6 “PAT Does the Talking.” Popular Electronics, December 1958.
7 Schulz, Bernd. Resonanzen: Aspekte der Klangkunst = Resonances: Aspects of Sound Art. Heidelberg: Kehrer, 2002.
8 Adachi, Tomomi. Martin Riches Website. Available at http://martinriches.de/tomomi.html. Accessed May 5, 2015.