A while back, several of us in the studio had a little spontaneous discussion about voice user interfaces over email. We thought we’d share some highlights. Please pile on in the comments section.
Steve Calde: What are people’s experiences with voice user interfaces? [A client] is interested in learning more about how to document voice-activated systems, and wondered if we had any experience to share.
Alan Cooper: You could also suggest to them that voice interfaces are inherently bad and will never work very well.
Dave Cronin: Why are they inherently bad?
I agree that they often are bad, but it seems to be more an implementation issue than something intrinsic about voice commands.
Stefan Klocek: The reason they are inherently flawed is that we use our voice for other, more important things in addition to the system-level input we would like to give to our DVD player. There is no way for the voice interface to understand that the context has changed and that I am no longer giving it a command; rather, I am now giving my child a command or simply muttering to myself. Of course we could imagine a system in which we indicate context by saying “DVD player – pause”, but this is adjusting my input to the deficiencies of the system. Voice input works “moderately” well in extremely constrained situations; Google 411 and the Bay Area’s 511 traffic line are both voice command systems of this kind. When I call these systems I have to roll up the windows in the car, turn off the radio, and ask any passengers to stop talking for a minute. If I do these three things, the experience is usually flawless and quick. The brevity of the interaction itself contributes to this success. The fact that I have to change modes (make a call) and prep the environment (eliminate competing voices and noises) gives the system a good chance of performing well.
This is not the case with the home DVD player. The DVD player must be in an always-on state, waiting in case I issue a command. In this always-on state it is unable to distinguish when I am talking to it, and when I am talking to someone else. I don’t see any way, even with a magic technology solution, of getting around the fact that a voice system will be unable to detect the nuances of the context. The reason this isn’t a problem with keyboards is that the very fact I bring my fingers to the keys indicates a context that always means “input” to the computer. The keyboard itself acts as context, which distinguishes all other movements I make with my fingers and hands from those I specifically intend for the computer.
It could be argued that the keyboard has the same limitation as voice because of the context-of-input problem. But our hands are well suited to volitional types of activities; the keyboard is a natural fit for how our hands work. Our voices, and sound in general, have a more fluid and varied presence in life. More than one person can talk at a time, and the sound mixes into a single blob of input; not so with the keyboard (only one set of hands really fits on it at any one time). Ambient sound such as traffic or the radio can directly interfere with a voice command system; not so with the keyboard (ambient movement, such as someone else tapping their hands on the desk, does not affect my keyboard).
So while yes… it is an implementation issue, in this case the implementation issue can’t be solved without requiring the human to adjust their behavior to the system if it is going to work well. That is an inherent flaw.
Dave: Respectfully, I disagree.
Here’s an experiment: first record yourself saying “honey, do you want me to pause this?” (imagining that your sweetie has just gotten up to answer the phone); then record yourself telling an imaginary DVD player the command “pause.” Now download Audacity and isolate the “pause” from the initial statement. Compare the two. Can you tell the difference between the two utterances? If you can, so can a computer. Don’t assume that your ability to discern the difference is due to some supra-aural ESP skills you might have. You’re decoding nuanced information from changes in pitch, volume and cadence. Fairly primitive DSP (digital signal processing) can interpret this stuff quite capably, especially in the world of set-top boxes and game consoles used as video players.
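To make Dave's experiment a little more concrete, here is a minimal sketch of the kind of "fairly primitive DSP" he has in mind: measuring volume, pitch and length for two clips of the word "pause" and comparing the numbers. Everything in it is an assumption for illustration. The function names are made up, the pitch estimate is a plain autocorrelation that only behaves well on clean, steady sounds, and the two "recordings" are synthetic tones standing in for real speech. It is not how any particular product works.

```python
import math


def rms_volume(samples):
    """Root-mean-square level of the clip: a rough stand-in for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(samples, sample_rate, min_hz=75, max_hz=400):
    """Estimate pitch by finding the lag at which the clip best matches itself.

    Plain autocorrelation; the 75-400 Hz search range is an assumed range
    for a speaking voice, not a measured one.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def describe(samples, sample_rate):
    """Collect the three cues from the discussion: pitch, volume, length."""
    return {
        "pitch_hz": estimate_pitch_hz(samples, sample_rate),
        "volume": rms_volume(samples),
        "duration_s": len(samples) / sample_rate,
    }


def tone(freq_hz, amplitude, seconds, sample_rate):
    """Synthetic sine tone used here in place of a real recording."""
    count = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(count)]


RATE = 8000
# Hypothetical stand-ins: a softer, lower "pause" buried in a question,
# and a louder, higher, longer "pause" aimed at the DVD player.
asked = describe(tone(120, 0.3, 0.10, RATE), RATE)
commanded = describe(tone(200, 0.8, 0.15, RATE), RATE)

for name, features in (("asked", asked), ("commanded", commanded)):
    print(name, {k: round(v, 3) for k, v in features.items()})
```

With real recordings you would load the two isolated "pause" clips instead of calling `tone`, and you would likely need something sturdier than raw autocorrelation. The point is only that the differences a listener hears show up as ordinary numbers.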
Tim McCoy: Mac OS X has had this ability since, I think, 10.2: it responds to a set of voice commands (and more you can define) by listening for a keyword of your devising. Saying “Honey, recycle these catalogs and open the mail” would do nothing (unless your honey was around) while “Computer, open the mail” would bring you to your inbox.
I’m not vouching for how well this works in practice, just that it’s out there.
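The keyword idea is easy to sketch once speech has already been turned into text. The snippet below is a guess at the general shape, not Apple's implementation: `parse_command`, the default keyword and the command list are all hypothetical, and it assumes some recognizer has already produced a transcript string.

```python
import re

# Hypothetical command vocabulary, borrowed from examples in this discussion.
KNOWN_COMMANDS = ("open the mail", "pause", "eject the disk")


def parse_command(transcript, keyword="computer", commands=KNOWN_COMMANDS):
    """Return the command only if the utterance starts with the keyword.

    Anything not addressed to the keyword, or not in the known command
    list, is ignored by returning None.
    """
    # Lowercase and replace punctuation with spaces so "Computer," matches.
    words = re.sub(r"[^\w\s]", " ", transcript.lower()).split()
    key = keyword.lower().split()
    if words[:len(key)] != key:
        return None
    rest = " ".join(words[len(key):])
    return rest if rest in commands else None


print(parse_command("Computer, open the mail"))
print(parse_command("Honey, recycle these catalogs and open the mail"))
print(parse_command("DVD player, pause", keyword="DVD player"))
```

The first and third calls yield a command and the second yields `None`, since it is addressed to "Honey" rather than the keyword. Note that this is exactly the "DVD player – pause" pattern Stefan objects to: it works by asking the person to mark the context themselves.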
What if my DVD’s remote had one button and a microphone, and I could speak into it like a phone?
And GOD does that interminable voicemail lady on cell phones tick me off. Just imagine how many 30-second messages they turn into 1:03 phone calls!
Kim Goodwin: I’ve been pleasantly surprised by how well my Mac dictation software can differentiate between commands and content it should type, and unless I pause to think in mid-sentence, it’s not bad at punctuation. (Oddly, the Windows version, which uses the same engine, is less adept.)
Doug LeMoine: During the Agfa project of olden times, Nate and I saw some pretty amazing voice recognition software … The radiologist “trained” it to map the unique ways that each person has of saying specific words, and this would take – generally – around 15 minutes. Building on what Dave said about the computer’s ability to determine subtle voice inflections, it seems to me that a brief “training” session (2 minutes) could be a way of refining the computer’s knowledge AND serve as a pedagogical vector for the user – these are the commands, this is how you say them.
(Also, perhaps they could be configured to include profanity for users like Alan?)
Still, it’s going to be weird. I would guess that, no matter how subtle, you’ll always need to have a “DVD voice” – in a somewhat similar way that you learn your “pack leader” voice when you get a dog – and you’ll need to talk to the DVD in an (albeit subtly) different way.
“Do you want me to tell it to eject the disk?”
“EJECT the disk.”
Which is not tragic, really, but also not totally “human.”