What voice UI is good for (and what it isn’t)

Main illustration: Lynn Scurfield

Voice is either a genius technology whose time has finally come, or the most overhyped waste of time we’ve seen since bots, blockchain, or, winding the clock back further, gamification.

The reality is less dramatic, more nuanced. There is now a new broadly available input/output interface to use and design for, and the most useful thing product and design folk can do is learn when and how that matters.

The end of the beginning

The recent emergence of Alexa, Siri, Cortana and “Okay Google” doesn’t mean voice has “finally” arrived. Quite the opposite, it means we’re finally getting going. The phase of concept demos, hype-cycles, and over-promising has ended. From here onwards, it’s real technology supporting real use cases, or pack up and go home.

There is a famed “long nose” of innovation that every significant new technology must pass through. Bill Buxton, principal researcher at Microsoft Research, has lived through every new UI form and estimates that it takes 30 years to go from research project to full maturity (defined as generating a billion-dollar business).

So these things will take a while, and when they arrive we shouldn’t expect them to conquer every existing input mechanism; they’ll complement them.

Replacement is rare

New technologies stack on top of old ones
New input devices don’t kill their predecessors, they stack on top of them. Voice won’t kill touchscreens. Touchscreens didn’t kill the mouse. The mouse didn’t kill the command line. Analysts yearn for a simple narrative where the birth of every new technology instantly heralds the death of the previous one, but interfaces are inherently multimodal. The more the merrier. Every new technology starts in a new underserved niche and slowly expands until it finds all the areas it’s best suited for. And voice has a great niche to start in…

Placeonas

How locations place limits on the type of interactions we can have with our devices
Bill Buxton introduced the concept of a “place-ona”, adapting the concept of a persona (which we all love to hate) to show how a location can place limits on the types of interaction that make sense. There is no “one best input” or “one best output”. It all depends on where you are, which in turn defines what you have free to use.

At a very simple level, humans have hands, eyes, ears and a voice. (Let’s ignore the ability to ‘feel’ vibrations as that’s alert-only for the moment). Let’s look at some real world scenarios:

  • The “in a library wearing headphones” placeona is “hands free, eyes free, ears free, voice restricted”.
  • The “cooking” placeona is “hands dirty, eyes free, ears free, voice free”.
  • The “nightclub” placeona is “hands free, eyes free, ears busy (you can’t hear), voice busy (you likely can’t speak/can’t be heard)”.
  • The “driving” placeona is “hands busy, eyes busy, ears free, voice free”.

Based on the above, you can see which scenarios a voice UI is useful in, and more generally the role of voice as an input mechanism.
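To make the placeona idea concrete, here’s a minimal sketch in Python (entirely hypothetical names and rules, not anything from Buxton) of how the scenarios above could be encoded as data and used to pick an input mode:

    from dataclasses import dataclass

    @dataclass
    class Placeona:
        """Which human I/O channels a given context leaves available."""
        name: str
        hands_free: bool
        eyes_free: bool
        ears_free: bool
        voice_free: bool

    # The scenarios from the list above, expressed as data.
    PLACEONAS = [
        Placeona("library, headphones on", hands_free=True, eyes_free=True,
                 ears_free=True, voice_free=False),
        Placeona("cooking", hands_free=False, eyes_free=True,
                 ears_free=True, voice_free=True),
        Placeona("nightclub", hands_free=True, eyes_free=True,
                 ears_free=False, voice_free=False),
        Placeona("driving", hands_free=False, eyes_free=False,
                 ears_free=True, voice_free=True),
    ]

    def best_input(p: Placeona) -> str:
        """Prefer touch/typing when hands and eyes are free; otherwise fall back to voice."""
        if p.hands_free and p.eyes_free:
            return "touch/keyboard"
        if p.voice_free:
            return "voice"
        return "no good input available"

    for p in PLACEONAS:
        print(f"{p.name}: {best_input(p)}")

Run against the four scenarios above, only “cooking” and “driving” fall through to voice, which is roughly the niche the rest of this piece is concerned with.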


While Benedict Evans is going for his signature cocktail of insight mixed with pointed humour in this tweet, it’s safe to say that’s not the point of voice. Or rather, voice isn’t optimal in most placeonas.

But voice is slow and buggy

Speed and accuracy are worse with voice than with any other user interface. Yes, we can talk faster than we can type, but even the most advanced audio processing still relegates us to slower, over-enunciated speech and still results in errors. Secondly, listening is far slower than reading, especially listening to a digital voice. We can scan and skip through text far quicker than we can listen to it. This is why visual voicemail was such a hit (as Benedict again pointed out).


So two things are clear:

  1. Voice is a substandard input/output mechanism
  2. There are a lot of scenarios where it’s the best option, by virtue of being the only suitable one

So how big can voice get?

This question has been asked at countless conference panels on the matter, and the answer is typically ‘it depends’, but I think it’s better to ask more specific questions:

How often is voice preferable?

Today it seems that driving and “playing music while walking around your house” lend themselves well to a voice interface, but how many other scenarios will present themselves, and will the use cases move towards productivity or remain casual? Will people ever want to have their email read out through their AirPods?

How good can audio processing and playback get, and when?

The vast majority of the world can speak faster than they can type, but today’s technologies can’t keep up reliably. How far away is that from changing?

When will true multi-modal messaging take off?

The library-driver problem - how to communicate when different input/output options are available to the participants
While most messaging products today include asynchronous voice clips, they require that messages are received in the same way they were composed. Users have to agree on a medium for the conversation, which doesn’t work when they’re in different contexts. This leads to what I call the “library-driver problem”: if Michelle is in a library and Alice is driving a car, how can they communicate?

Alice is driving, so she can’t use her hands or eyes, and Michelle can’t speak or make noise in a library. In an ideal messaging app, each user could compose messages any way they want and consume them any way they want, without ever blocking the conversation.
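As a rough sketch of what that could look like under the hood (hypothetical Python, with stub speech-to-text and text-to-speech functions standing in for real services; this isn’t how any existing messenger works), a message could be stored in a canonical form and rendered to suit each recipient’s placeona:

    from dataclasses import dataclass
    from typing import Optional, Union

    def speech_to_text(audio: bytes) -> str:
        """Stub: a real app would call a speech recognition service here."""
        return "<transcript of voice clip>"

    def text_to_speech(text: str) -> bytes:
        """Stub: a real app would call a text-to-speech service here."""
        return text.encode("utf-8")

    @dataclass
    class Message:
        text: str                      # canonical transcript, always present
        audio: Optional[bytes] = None  # original voice clip, if the sender spoke it

    def compose(raw: Union[str, bytes]) -> Message:
        """Accept either typed text or a recorded voice clip."""
        if isinstance(raw, bytes):
            return Message(text=speech_to_text(raw), audio=raw)
        return Message(text=raw)

    def deliver(msg: Message, recipient_eyes_free: bool) -> Union[str, bytes]:
        """Render the message to suit the recipient's placeona."""
        if not recipient_eyes_free:                     # driving: read it aloud
            return msg.audio or text_to_speech(msg.text)
        return msg.text                                 # library: show it silently as text

    # Michelle types silently in the library; Alice hears it read aloud in the car.
    note = compose("Running late, see you at 7")
    print(deliver(note, recipient_eyes_free=False))

The point of the sketch is simply that composing and delivering are decoupled: Michelle can type, Alice can listen, and neither has to know or care how the other is participating.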

Bringing voice into normal ubiquitous messaging would represent a tipping point of sorts, normalising the idea of people talking at their devices to control them.

Neither a platform nor a paradigm…

So whilst it’s true that voice isn’t a platform, or, as is often claimed, the new UI paradigm, it is another new interface that we must design for and deliver on; otherwise we risk sounding like some of these folks…
Early criticism of the mouse