Every time I get on the phone with some corporation or other, I find myself reflecting on why voice interfaces are so uniquely infuriating. Clearly, I’m not the only one who thinks so, or sites like dialahuman.com and gethuman.com wouldn’t exist. I suspect the problem lies not only in wretched usability, but also in the fact that voice interaction sets higher expectations for reasonable, human-like behavior. If humans interact with computers as if they were also human, as discussed by Byron Reeves and Clifford Nass in The Media Equation, this seems even more true for computers and other software-powered devices that accept voice input in addition to using voice output; after all, if it can understand what you’re saying, it must be able to think, right? In their very readable 2005 book, Wired for Speech, Nass and another colleague, Scott Brave, assert that this is indeed true. Hearing a human say, “I’m sorry, I didn’t understand that” three or four times in a row would be enough to inspire violent impulses in the most dedicated pacifist, and many people have similar reactions to voice interfaces. So is a more “human” interface necessarily better?
Even though we know we’re talking to a machine, we humans respond to perceived emotion even in recorded voices. For several days after we installed a new phone system in our offices, people continually commented on the doleful female voice that responded to deleted phone messages by saying “duuh-leted,” dragging out the first syllable and drooping at the end, kind of like a mopey teenager asked to take out the garbage. Discontented machines are especially noticeable, though excessive perkiness is irritating in some circumstances: “I’m sorry, you’ve been on hold for 20 minutes, so your session has expired.”
Then there’s the question of voice interface etiquette. I don’t need to feel like I’m talking to a 1970s Cylon that responds to my requests with a metallic and subservient “By your command,” but I want to smack my bank’s voice system for its presumption when it says something like, “If you’d like to speak to an agent, say ‘Agent, please.’” I believe in saying “please” to other humans, but I don’t politely ask the cat to move off the couch, and I’m certainly not going to extend the courtesy to a computer (though I think it ought to apologize to me when it can’t help). According to Nass and Brave, I’m not alone. Most people in their experiments didn’t respond well to synthesized voices using the first person, and even recordings of real voices didn’t get a warm reception when using the first person to deliver bad news. In fact, the use of the first person for bad news only increased listener perception of the system’s incompetence, perhaps since listeners were more likely to judge it by human standards.
Perhaps the most interesting point Nass and Brave demonstrate is how contrast of any kind draws attention to system shortcomings. This makes intuitive sense from everyday life; you might be content driving your five-year-old economy car until you ride in a colleague’s brand new sports car, or think Madonna sings well until you hear Ella Fitzgerald in her prime. In audible interfaces, the unfortunate contrasts underscore the ways in which the technology simply can’t replace a human.
Your voice system will be better received if you avoid these typical contrasts:
- Inconsistency in personality and content. There’s a reporter on one of the Bay Area TV news shows who tends to report on the death toll from the latest global catastrophe with a smile on her face, which always makes me wonder what kind of strange things are going on in her head. Similarly, people are less likely to enjoy or trust their interactions with a system that cheerfully reports an inability to help, or that seems terse or unfriendly in the course of ordinary transactions.
- Combining high quality output with low fidelity input. If a system talks in complete sentences using a recorded human voice but can’t parse a simple request or recognize common words, it comes across not only as a technologically limited system, but as a deliberately obtuse and infuriating person. Clear but obviously synthesized speech leads to lower expectations of “intelligence.”
- Mixing recorded human voices with synthesized output. Dynamic content—such as email, news, and Web site content—is difficult or impossible to construct from pre-recorded bits of human voices, so synthesized output is sometimes necessary. Having a human voice speak part of the content while a synthesized voice speaks the remainder is distracting.
If you haven’t heard much from me in the Journal lately, it’s because I’ve been immersed in writing the comprehensive book on Cooper’s methods, from planning research to writing specs, and there are still about 150 pages between today and my content completion deadline at the end of August. However, I thought I’d start sharing some snippets of thinking from the book, which won’t hit shelves until January. This is my first installment.