Talking to machines is primitive and frustrating, but it's gonna be huge

At Cooper, we care a lot about how to make technology work for humans. For much of our history, that has meant figuring out how to help people design great products for mice and keyboards, then for touch devices, and then for services that involve many screens (or no screens at all!) and people over time.

And now there is this new interest in talking to computers. A lot of hype! In reality, conversational UIs are a primitive and frustrating experience. But they will end up being a much bigger deal than most people think.

Conversational UIs are a new medium for a big chunk of the UX profession (there are plenty of veterans, but it’s new to a lot of folks). Cooper’s goal this year is to explore and help refine the best practices for the industry, the way we have for 25 years. We’ve teamed up with the venerable SF-based design/dev shop Carbon Five to design and develop conversational applications for consumers and enterprises, starting with Alexa skills for Amazon Echo. You’ll see more in the Cooper Journal about that collaboration in the coming weeks and months.

But I also want to take a few minutes to describe why we think talking and chatting with computers is a big deal, and why voice UIs and conversational applications aren’t a flash in the pan or a passing fad.

Suddenly everyone’s talking about talking to computers

A lot has been said and written about conversational UIs over the past year; the spread of chatbots and the launch of Amazon Echo were two watershed developments in particular. The stars have aligned, and the technology to support these applications is suddenly cheap, ubiquitous, and powerful. Advances in machine learning have brought voice recognition from a high error rate (>25%) down to a low one (<5%), which is to say it basically works almost all the time. And the hardware to support these talking robots is cheap and easy to produce, far less sophisticated than an everyday smartphone.

Platforms like Echo and Google Home are rapidly expanding their capabilities, and analysts expect new entrants in the “smart speaker” category in 2017 from at least one of the other big consumer hardware players (Apple, Samsung, Microsoft). The Chinese digital economy pretty much runs on WeChat these days, and Facebook Messenger is positioning itself as the face of the so-called “post-app” era in the West by letting you chat with a menagerie of bots that do your bidding. The race is on, and it’s being driven by talking computers.

It’s early to speculate about the size of the opportunity behind conversational UIs, but the signs point to: huge. WeChat alone processes somewhere around $550 billion in person-to-person payments every year (twice PayPal’s volume). When this much commerce happens in one place, it’s evidence of... something, for sure.

Other circumstantial evidence:

  • Bots are already woven into the fabric of our social networks and knowledge systems (9-15% of active Twitter accounts are bots; Wikipedia bots account for a significant proportion of total edits on the platform).
  • “Voice-first” device (Amazon Echo, Google Home) unit shipments are growing exponentially in the US, and the number of skills in the Alexa Skills store is accelerating, recently passing 10,000. Analysts project about 30 million of these devices in American homes by the end of this year.
Available Alexa Skills: June 2016 to January 2017 (source: voicelabs.co)

Voice-First Device Footprint (source: voicelabs.co)

Challenges

Before we talk about the opportunities, let’s start with the counter-arguments. There are a bunch of reasons why “natural” language as a UI will not be the next big thing. Benedict Evans at Andreessen Horowitz summarizes a few of them well:

  • You have to map each voice request to a manually-curated query (i.e. create “all the dialog boxes” by hand)
  • Users don’t know what they can and cannot ask, and adding more allowed commands arguably makes the problem worse, not better
  • Telling users they can talk to a machine sets up expectations of HAL 9000, but what they get instead is a dumb computer

Evans concludes that voice is “a big deal, but we’ll have to wait a bit longer for the next platform shift.” As a venture capitalist, Evans wants to know when the next equivalent of mobile computing will arrive. He wants to see the next world-eating paradigm. Voice is not that.

As an interaction designer, I’m interested in when the next equivalent of touch interfaces will arrive. Conversational UIs are that new frontier for interacting with machines, well suited for particular times and places, and we will see increasing attention and investment directed toward them in the coming months and years.

Big Challenge #1: Syntax and Dumb Computers

One of the biggest challenges with voice as an interface is that we don’t have anything close to “general” AI to go with it. When you unwrap the so-called “smart speaker,” you expect to be able to start chatting with it, and suddenly have a new buddy. But bots don’t really understand what we’re saying, so as soon as we ask something in a way that doesn’t quite fit their underlying ontology, things go haywire.

So you end up having to just remember how to ask the computer for certain things, which can be really frustrating.

Now, what do you say to Alexa to get a pizza? If you want pizza from Pizza Hut, you say “Alexa, ask Pizza Hut to place an order.” If you want a Domino’s pizza, you might say “Alexa, open Domino’s and place my Easy Order.” For Amazon Restaurants, you could say “Alexa, order pizza from Amazon Restaurants.”

Ways of ordering Pizza through Amazon Alexa

Each service defines a bunch of synonymous phrases for you to describe your “intent,” so there’s likely to be some overlap. For example, you’ll probably be able to say “Alexa, ask [Pizza Hut/Domino’s] to place an order” for both pizza-specific services. But the concept of an “Easy Order” appears to be Domino’s-specific. (Or wait, does “Easy Order” apply to Pizza Hut?)
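
To make “intents” concrete, here’s a minimal sketch of the interaction model behind a hypothetical “Pizza Palace” skill. The shape follows Amazon’s interaction-model JSON schema, but the skill name, intent, and utterances are invented for illustration:

```typescript
// Hypothetical interaction model for a "Pizza Palace" Alexa skill.
// The structure mirrors Amazon's interaction-model JSON schema;
// the skill, intent, and sample utterances are invented.
const interactionModel = {
  interactionModel: {
    languageModel: {
      invocationName: "pizza palace", // "Alexa, ask Pizza Palace to..."
      intents: [
        {
          name: "PlaceOrderIntent",
          slots: [{ name: "size", type: "SizeType" }],
          // Every phrasing the skill should recognize has to be
          // enumerated by hand; anything else falls through.
          samples: [
            "place an order",
            "order a pizza",
            "order a {size} pizza",
            "get me a {size} pizza",
          ],
        },
      ],
      types: [
        {
          name: "SizeType",
          values: [{ name: { value: "small" } }, { name: { value: "large" } }],
        },
      ],
    },
  },
};
```

If a user says “I’d like a pizza” and that phrasing isn’t in the samples list (or close enough for the platform’s matching), the request simply fails. This is Evans’ “create all the dialog boxes by hand” problem in miniature.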

The point is, I have to memorize all this stuff, and it’s hard. With a visual interface, I can “re-learn” on the fly, so if the button says “Place an Order” vs “Order a Pizza,” I don’t mind too much because I can figure out what the button does by reading the label.

(Note: chatbots have a significant advantage over voice-only services in this regard because they have a visual interface, and often incorporate buttons or other GUI elements inline in the chat window. But this starts to defeat the alleged purpose of conversational UIs. As Evans points out, “...you can solve some of this by adding a screen, as is rumored for the Amazon Echo - but then, you could also add a touch screen, and some icons for different services. You could call it a 'Graphical User Interface', perhaps, and make the voice part optional...”)

Because so much of this interaction involves memorizing how to ask the computer to do things, we should expect:

  1. usage will gravitate strongly toward a small number of voice applications (because memorizing a lot of commands is hard).
  2. even if someone tries a voice app or chatbot, the fall-off in usage will be dramatic (because memorizing any command syntax is hard, and you’ll forget most of it if the value isn’t high enough).

And in fact we see both of these outcomes in the data. Of the 12 most common tasks Alexa users have tried at least once, three of the top five are timer-related (set a timer, set an alarm, check the time), and only six tasks crack the 50% mark.

How many people use Amazon's Virtual Assistant

Almost none of the more than 10,000 skills available for Amazon Echo have a review (indicative of low usage). Alexa skill user retention after two weeks is a dismal 3% (compared to 10-11% for a typical mobile app on iOS or Android), and 40% of users abandon a chatbot after just a single interaction.

Alexa skill retention

Big Challenge #2: Conversational Systems Have Limited Domains

As clever as these devices are, they only know how to talk about specific things (remember: computers are dumb). Constraining the domain is what maintains the illusion of intelligence.

Unfortunately, this ends up backfiring because it places a second burden on users: in addition to remembering how to ask things, you also have to remember what you’re allowed to ask about. I might be able to ask Siri about NBA scores but maybe not cricket scores. Or maybe I can ask about cricket scores, but not in real time (as a game is being played). Or perhaps I can ask about MLB scores, and it will tell me how many balls and strikes a batter has, but maybe not who’s on base.

I really have no idea what the boundaries of the domains are, because I would have to go experiment endlessly with Siri and Alexa and all the others, and I don’t have the patience. But that’s the point: I have no idea even roughly what I’m likely to be able to ask about. And it’s a moving target because the platform makers assume that more content is better, so they shovel new content into the system as fast as they can. So in a very real sense, the burden of memorizing the list of commands is increasing over time, as the system “improves.”

Here’s an interesting illustration of the intersection of challenges #1 and #2: this screenshot of Siri-enabled apps promoted on the iOS App Store a few months ago shows the insanity of having to remember both syntax and domain at the same time. Want to find pictures of something? You can do that in several ways:

Ways of searching for images in Siri-enabled apps
  • “Find images of Kate Spade in Vogue Runway”
  • “Find images of outdoor lights in Pinterest”
  • “Find images of food in Canva”

Implicitly, the user needs to know (with no visual cues):

  • How to ask (“Find images of [x] in [app_name]”)
  • Allowable domains (i.e. which [app_name]s do I have, and which ones support Siri queries?)
  • Allowable domain parameters (i.e. which [x] maps to which [app_name])

The user also needs to realize that they can’t just “find images of [x]” across all apps that might have images, which would be the “natural” way of asking for images. This is, of course, crazy. And it is a serious speed bump in our quest for voice-driven interfaces that are easy to use.

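To see how quickly this compounds, consider the lookup table the user is effectively asked to carry around in their head. A toy sketch, with the apps and subjects taken from the screenshot above (the table itself and the canAsk helper are hypothetical, for illustration only):

```typescript
// A toy model of what a Siri user must memorize: which installed apps
// are Siri-enabled, and which subjects each one will accept.
// The platform knows this table; the user gets no visual cue to it.
const siriImageSearch: Record<string, string[]> = {
  "Vogue Runway": ["Kate Spade"], // fashion imagery only
  "Pinterest": ["outdoor lights"], // broad, but you can't know how broad
  "Canva": ["food"], // design-related subjects
};

// Hypothetical helper: a query only works if both the app and the
// subject line up with an entry the user happens to remember.
function canAsk(appName: string, subject: string): boolean {
  return siriImageSearch[appName]?.includes(subject) ?? false;
}

canAsk("Pinterest", "outdoor lights"); // true
canAsk("Canva", "outdoor lights"); // false: wrong app for that subject
// And the "natural" query, "Find images of outdoor lights" with no
// app named, has no entry in the table at all.
```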

Big Challenge #3: Conversational Systems Just Aren’t Very Good at Conversation

If we’re honest with ourselves, it’s a bit premature to call most of these things “conversational” in the first place. (It reminds me of the old saying about the dancing bear: what’s interesting is not that the bear dances well, but that the bear dances at all.) Very few applications or devices today can string together anything you might recognize as a conversation. The best most of them can muster is some kind of linear script of multiple choice questions (see: most chatbots).

The big issue here is that building in the elements that natural conversation entails (context, environment, experience, humor, etiquette, relationships, common sense, and so on) is beyond today’s capabilities. Amazon has very helpfully offered a $2.5 million prize pool for a team that can build a system able to carry on a convincing conversation for 20 minutes. It’s a class of problem at least as hard as building a self-driving car. Probably harder. Now multiply by all the languages in the world.

Why chat and voice UIs are a bigger deal than you think

So, there are a bunch of challenges. But despite those challenges, natural language UIs are a big deal. The presence of bots in the social fabric of the world is already bigger than most of us realize. But this is really just the beginning, because most of what will end up being transformative about voice- and chat-driven interactions hasn’t been conceived of yet.

Talking to a computer is pretty cool

The Amazon Echo sales numbers and the corresponding hype around the device are partly driven by the fact that the idea of talking to a computer is pretty cool, plus the fact that the devices themselves are so cheap ($50). You can just talk into the air, and stuff happens. Even the most jaded tech-news pundit has to agree that this is compelling.

Of course, being cool is not enough. Google Glass was pretty cool, but it turns out nobody wants to wear one, let alone pay $1,500 for the privilege. Also, the “augmented reality” part of that equation never really lived up to the promise.

Google Glass

Cool, until you think about it for a minute

The Nintendo Switch is undeniably cool, and will probably be very commercially successful. But it’s not going to change much about everyday life. It won’t change the way we work or interact with other people in the long run. It’s not fundamental.

Nintendo Switch

Cool, but not changing the world in any big way

Talking to computers is pretty cool, which is necessary but not sufficient.

We’ve been dreaming of this moment for a long time

Pick a culture, past or present, and you’ll find a story somewhere about someone who creates something living out of something inanimate. Ovid’s Pygmalion. Collodi’s Pinocchio. Shelley’s Frankenstein’s monster. Lucas’ C3PO. Breathing a voice into inert matter is one of humanity’s oldest moonshots. A ticket to the actual moon will cost you a hundred million dollars or so, but anyone can get a taste of this pinnacle of human hubris and determination for just $50.

My point is, there’s demand. Demand pent up since the dawn of history.

Alexa + C3PO

People love this stuff.

Talking is a “natural” interaction

One of the big issues with conversational systems is our heightened expectations: we assume that anything we can talk to knows how to talk back. This isn’t true, but we believe it anyway, because humans treat pretty much everything like another human, even when we know it isn’t one. Even rocks pass a primitive form of the Turing test.

A lot of research examines this peculiarity of humans. The Media Equation (Reeves and Nass) is a good start, if you’re interested in learning more about the human behavior at play. Kate Darling at MIT Media Lab has done more research on several of these questions (e.g. why do people feel revulsion about bashing a cute robot pet with a hammer? And should we make it illegal, maybe?).

But all of this subconscious confusion about whether a thing is human or not just goes to show the bigger point, which is that talking is a natural way of interacting with another agent. Much like touch interfaces were revolutionary because they removed a bunch of mediating layers between you and the computer (trackball, keyboard, stylus, etc.), conversational systems take advantage of an interface that virtually all humans can access: language. This potential for universal access means that even small advances in the technology can theoretically have a large impact on the welfare of the entire world.

Conversational UIs open us up to new and interesting business possibilities

It’s all hype unless there are some new business opportunities, and conversational UIs are already proving their mettle. Aside from platforms like WeChat where a lot of commerce is already taking place, companies are finding new ways to engage customers with chat-based interfaces.

A new way to launch

An early innovator was Digit, a service that automatically takes tiny sips from your checking account and drops the money into a savings account, every day or two. The idea is that it happens in the background, invisibly, and that it moves just enough money based on your balance and spending habits that you don’t really notice. And then a few months later, you have some money socked away.

Digit’s innovation was to implement the service as SMS-only. You could check your account balance on a website, but otherwise all transactions and interactions happened over text message. To understand the genius here, consider how much they spent on visual design ($0), front-end development (near $0), app store revenue sharing ($0), app store submission and maintenance ($0), and all the other typical costs of launching a mobile app. Also consider how fast they could push a new “version” of a service that existed only as a server application: instantly. They could segment users and test features without complication. And deploying on new chat-based platforms like Facebook Messenger is easy. Fast Company ranked Digit the #2 “most innovative company in finance,” behind (ironically?) Goldman Sachs.
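
For a sense of how thin that front end can be, here is a sketch of an SMS-only service endpoint. Digit’s actual stack isn’t public; this assumes a Twilio-style inbound-SMS webhook, and the BALANCE/PAUSE commands are invented for illustration:

```typescript
import express from "express";

const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies

// The entire "UI" is this one webhook: parse the user's text, run the
// business logic, and text back a reply. No app store, no front-end
// build, and a new "version" ships with every server deploy.
app.post("/sms", (req, res) => {
  const text = String(req.body.Body || "").trim().toUpperCase();

  let reply: string;
  if (text === "BALANCE") {
    reply = "You've saved $142.17 so far."; // would come from the real ledger
  } else if (text === "PAUSE") {
    reply = "Okay, I'll stop saving until you text RESUME.";
  } else {
    reply = "Try BALANCE or PAUSE."; // graceful fallback doubles as a command list
  }

  // Respond in TwiML, Twilio's XML reply format (escaping omitted for brevity).
  res.type("text/xml").send(`<Response><Message>${reply}</Message></Response>`);
});

app.listen(3000);
```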

A new way to engage customers

Amazon has been on a tear, getting big brands to build things on the Alexa platform. Johnnie Walker recently launched a skill that takes you on a “guided tasting.” You tell Alexa which Johnnie Walker bottle you have, and Alexa tells you all about the whiskey and walks you through sipping it, adding drops of water, and appreciating various flavors and aromas. It’s definitely not for everyone, and it’s not life-changing, but it points to a new dimension of brand engagement, where conversational computers contribute to social interactions rather than simply mediating them. The experience of Alexa talking to me about smoke and honey and hazelnut notes as I sipped on a dram of Double Black was vastly different from what visiting a website would have been. Brands have an opportunity here, and it’s mostly unexplored.

A new way to scale

There is increasing interest in exploring natural language UIs as a way to scale a personalized service. An entirely human-powered service is hard to scale (think: a customer service phone line), but adding some bots to the mix can be an effective strategy. A conversational UI is the perfect front-end for a human/bot hybrid service.

Fin is a good example of this strategy. Fin is a chat-based take on the virtual assistant: you chat with the service, and sometimes a bot handles your request, sometimes a human does, but (importantly) the user can’t tell which. The chat interface obscures the agent. As the software gets “smarter” over time, Fin’s costs fall, because the bots can handle a larger volume of requests for every human operator, but users don’t see any change (other than maybe speed). This gives Fin a clear path forward, and avoids all of the syntax and “what am I allowed to ask” problems that bot-only solutions suffer from.
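
At its core, a hybrid service like this comes down to one routing decision per message. A minimal sketch, with the caveat that Fin’s internals aren’t public, and the classify/queueForHuman/sendReply functions and the threshold here are all hypothetical:

```typescript
// Toy router for a human/bot hybrid assistant. The user sees a single
// chat thread either way, so the handoff is invisible to them.
interface BotGuess {
  reply: string;
  confidence: number; // 0..1, from whatever NLU model is in use
}

declare function classify(message: string): BotGuess; // hypothetical NLU call
declare function queueForHuman(userId: string, message: string): void;
declare function sendReply(userId: string, reply: string): void;

const CONFIDENCE_THRESHOLD = 0.9; // raised or lowered as the model matures

function route(userId: string, message: string): void {
  const guess = classify(message);
  if (guess.confidence >= CONFIDENCE_THRESHOLD) {
    sendReply(userId, guess.reply); // bot handles it at near-zero marginal cost
  } else {
    queueForHuman(userId, message); // an operator replies in the same thread
  }
}
```

As the model improves, more traffic clears the threshold and the cost per request falls, with no visible change on the user’s side.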


It’s all just getting started

The field is nascent, and the platforms are still pretty weak. But these language-based systems have momentum, and there’s a lot of headroom to grow. Sales of voice-first devices are accelerating, as are submissions to Alexa’s skill store. Businesses everywhere are developing proofs-of-concept and experimenting with conversational systems.

Analysts expect not just a lot of new users, but a lot of new platform capabilities. Amazon is said to be working on features like telephony for the Echo. Voice identification opens up a bunch of interesting use cases, and could add a layer of security to a fundamentally insecure tool. It’s like the early days of the iPhone, when the big new feature was the touchscreen itself; then you add GPS and LTE and ten generations of camera improvements, and suddenly you have Snapchat and Lyft (not to mention more than $1 trillion in revenue for Apple).

That’s where we are today with conversational computers. Designing for voice and chat will be a sought-after skill in the UX profession in the very near future (now, in fact). The platforms will battle for market share, and they will add capabilities rapidly. The SDKs themselves will evolve to be more turnkey, and third parties will join the fray to create tools for makers.

As a UX community, let’s continue to share what we discover, and learn from each other.