A conversation about voice interactions

A while back, several of us in the studio had a little spontaneous discussion about voice user interfaces over email. We thought we’d share some highlights. Please pile on in the comments section.

Steve Calde: What are people’s experiences with voice user interfaces? [A client] is interested in learning more about how to document voice-activated systems, and wondered if we had any experience to share.

Alan Cooper: You could also suggest to them that voice interfaces are inherently bad and will never work very well.

Dave Cronin: Why are they inherently bad?

I agree that they often are bad, but it seems to be more an implementation issue than something intrinsic about voice commands.

Stefan Klocek: The reason they are inherently flawed is that we use our voices for other, more important things in addition to the system-level input we would like to give to our DVD player. There is no way for the voice interface to understand that the context has changed and that I am no longer giving it a command, but rather giving my child a command or simply muttering to myself. Of course we could imagine a system in which we indicate context by saying “DVD player – pause,” but that is adjusting my input to the deficiencies of the system. Voice input works “moderately” well in extremely constrained situations; Google 411 and the Bay Area traffic 511 are both voice command systems. When I call these systems I have to roll up the windows in the car, turn off the radio, and ask any passengers to stop talking for a minute. If I do these three things, the experience is usually flawless and quick. The brevity of the interaction itself contributes to this success. The fact that I have to change modes (make a call) and prep the environment (eliminate competing voices and noises) gives the system a good chance of performing well.

This is not the case with the home DVD player. The DVD player must be in an always-on state, waiting in case I issue a command. In this always-on state it is unable to distinguish when I am talking to it, and when I am talking to someone else. I don’t see any way, even with a magic technology solution, of getting around the fact that a voice system will be unable to detect the nuances of the context. The reason this isn’t a problem with keyboards is that the very fact I bring my fingers to the keys indicates a context that always means “input” to the computer. The keyboard itself acts as context, which distinguishes all other movements I make with my fingers and hands from those I specifically intend for the computer.

It could be argued that the keyboard suffers from the same context-of-input problem as voice. But our hands are well suited to volitional activities; the keyboard is a natural fit for how our hands work. Our voices, and sound in general, have a more fluid and varied presence in life. More than one person can talk at a time, and the sounds mix into a single blob of input; not so with the keyboard (only one set of hands really fits on it at any one time). Ambient sound such as traffic or the radio can interfere directly with a voice command system; not so with the keyboard (ambient movement, such as someone else tapping their hands on the desk, does not affect my keyboard).

So while yes… it is an implementation issue, in this case the implementation issue can’t be solved without requiring humans to adjust their behavior to the system for it to work well. That is an inherent flaw.

Dave: Respectfully, I disagree.

Here’s an experiment: first record yourself saying “honey, do you want me to pause this?” (imagining that your sweetie has just gotten up to answer the phone); then record yourself telling an imaginary DVD player the command “pause.” Now download Audacity, isolate the “pause” from the initial statement, and compare the two. Can you tell the difference between the two utterances? If you can, so can a computer. Don’t assume that your ability to discern the difference is due to some supra-aural ESP skills you might have. You’re decoding nuanced information from changes in pitch, volume and cadence. Fairly primitive DSP (digital signal processing) can interpret this stuff quite capably, especially in the world of set-top boxes and game consoles used as video players.
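For the curious, here is a rough sketch of that comparison in code. It assumes librosa for the signal analysis, and the two file names are hypothetical stand-ins for the clips you recorded and trimmed in Audacity; the point is only that duration, loudness, and pitch fall straight out of a few library calls.

```python
# A rough sketch of Dave's comparison, assuming librosa for the analysis.
# The two file names are hypothetical stand-ins for the clips you recorded
# and trimmed in Audacity.
import librosa
import numpy as np

def describe(path):
    y, sr = librosa.load(path, sr=None)          # load at the native sample rate
    duration = librosa.get_duration(y=y, sr=sr)  # cadence proxy: length of the word
    loudness = librosa.feature.rms(y=y).mean()   # volume proxy: mean signal energy
    f0, _, _ = librosa.pyin(                     # pitch contour of the voiced frames
        y, sr=sr,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
    )
    return duration, loudness, np.nanmean(f0)

for label, path in [("conversational", "pause_conversational.wav"),
                    ("command", "pause_command.wav")]:
    duration, loudness, pitch = describe(path)
    print(f"{label:>14}: {duration:.2f}s  rms={loudness:.4f}  pitch={pitch:.0f} Hz")
```

Even this crude summary will usually show the command utterance as shorter, louder, and flatter in pitch than the same word buried in conversation.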

Tim McCoy: Mac OS X has had this ability since, I think, 10.2—it responds to a set of voice commands (and more you can define) by listening for a keyword of your devising. Saying “Honey, recycle these catalogs and open the mail” would do nothing (unless your honey was around), while “Computer, open the mail” would bring you to your inbox.

I’m not vouching for how well this works in practice, just that it’s out there.
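To illustrate the keyword gating Tim describes, here is a toy sketch; the wake word and command table are invented for the example, and a real system would get the transcript from a speech recognizer rather than a hard-coded string.

```python
# A toy illustration of keyword gating: a command only fires when the
# transcript starts with the wake word. The wake word and command table are
# invented for the example; a real system would get the transcript from a
# speech recognizer rather than a hard-coded string.
WAKE_WORD = "computer"

COMMANDS = {
    "open the mail": "opening your inbox",
    "pause": "pausing playback",
    "eject the disk": "ejecting the disc",
}

def handle(transcript):
    text = transcript.lower().strip()
    if not text.startswith(WAKE_WORD):
        return None                          # not addressed to the machine; ignore it
    command = text[len(WAKE_WORD):].strip(" ,")
    return COMMANDS.get(command)             # unrecognized commands are also ignored

print(handle("Honey, recycle these catalogs and open the mail"))  # None
print(handle("Computer, open the mail"))                          # opening your inbox
```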

What if my DVD’s remote had one button and a microphone, and I could speak into it like a phone?

And GOD does that interminable voicemail lady on cell phones tick me off—just imagine how many 30-second messages she turns into 1:03 phone calls!

Kim Goodwin: I’ve been pleasantly surprised by how well my Mac dictation software can differentiate between commands and content it should type, and unless I pause to think in mid-sentence, it’s not bad at punctuation. (Oddly, the Windows version, which uses the same engine, is less adept.)

Doug LeMoine: During the Agfa project of olden times, Nate and I saw some pretty amazing voice recognition software … The radiologist “trained” it to map the unique way each person has of saying specific words, and this generally took around 15 minutes. Building on what Dave said about the computer’s ability to detect subtle voice inflections, it seems to me that a brief “training” session (2 minutes) could be a way of refining the computer’s knowledge AND serve as a pedagogical vector for the user – these are the commands, this is how you say them.

(Also, perhaps they could be configured to include profanity for users like Alan?)

Still, it’s going to be weird. I would guess that, no matter how subtle, you’ll always need to have a “DVD voice” – somewhat similar to the way you learn your “pack leader” voice when you get a dog – and you’ll need to talk to the DVD in an (albeit subtly) different way.

“Do you want me to tell it to eject the disk?”
“EJECT the disk.”

Which is not tragic, really, but also not totally “human.”

4 Comments

Colleen Jones
Enjoyed this conversation! It thoughtfully touches on many of the key issues at hand. I think emotion also plays a part in voice interaction, more so than I would have expected when I first started working in it. My friend, Darnell Clayton, and I recently wrote a brief article exploring it. But it merits more exploration. I hope you keep this conversation going because we need more quality resources on speech interaction.
Peter
Seems to me that adjusting the way you speak to suit the audience is in fact quite human. Talking to your boss the same way you talk to your six-year-old would invariably lead to misinterpretation. Why hold your DVD player to a higher standard than you would your fellow human beings?
Adam
One of the important considerations is how the product sets expectations about what kind of voice input it will accept. Systems that speak to you in a conversational way create an expectation that they can parse natural language responses. When they fail, it's monumentally frustrating because they have set a high bar for themselves. It's OK if the system only understands simple commands, as long as it prompts you and provides feedback on the same level. As Peter says, most people naturally modulate their speech patterns based on context (age, setting, native vs. non-native speakers, etc.), and they do that largely based on contextual cues. The same holds true of VUIs. BTW -- We did some documentation of this for the Visteon project, way back when. If you haven't looked at it already, it might be worth checking out, both in terms of how we represented the interactions in scenarios in the Approach phase and in the F&B Spec. It's not necessarily a perfect approach to documenting a VUI, but it's at least one point of reference.
Gina
I don't see why we need to work from the perspective that we can't address our technology. When there is more than one person in a room and you want to give one of them an instruction, you generally preface the instruction with the person's name, in order to clarify that you are talking to that person, so that person knows the instruction is intended for them. Removes ambiguity, and all that other stuff... On top of that, the assumption is now that only the DVD player will be speech-enabled. What if my oven, fridge, TV, vacuum and washing machine were all capable of understanding voice commands? In many homes (I'm specifically thinking about small apartments), you might find all these items within one zone of the house. Many of these items might share the same commands, so how else would you distinguish between them, other than by saying their name and calling them to attention?
