Siri and natural language voice control: an important step-change in user-device interface design?


By Jon Tee

‘I heard you talking, and thought there was some one with you.’
‘Oh,’ he replied, with a smile, ‘I was only entering my diary.’
‘Your diary?’ I asked him in surprise.
‘Yes,’ he answered. ‘I keep it in this.’ As he spoke he laid his hand on the phonograph. I felt quite excited over it, and blurted out: -
‘Why, this beats even shorthand! May I hear it say something?’
‘Certainly,’ he replied with alacrity, and stood up to put it in train for speaking. Then he paused, and a troubled look overspread his face... ‘You see, I do not know how to pick out any particular part of the diary... Although I have kept the diary for months past, it never once struck me how I was going to find any particular part of it in case I wanted to look it up?’
‘Then, Dr Seward, you had better let me copy it out for you on my typewriter.’

Conversation between Mina Harker and Dr. Seward in Bram Stoker’s Dracula (1897)

Apple launched its ‘intelligent personal assistant’, Siri, in October 2011 with typical marketing flair, and the launch has whipped up a huge amount of interest in the potential of natural language voice interaction with electronic devices. The hype has drawn increased attention to a number of other recent voice-interaction-related announcements and developments, such as:

  • Microsoft’s Xbox Kinect voice recognition and in-built TellMe voice control in Windows Phone 7.5 ('Mango')
  • Samsung’s CES-debuted voice and gesture controlled Smart TV
  • rumours of a natural language improvement to Google Voice Actions for Android (allegedly codenamed ‘Majel’)
  • AT&T’s announcement of its development of a natural language in-car voice-enabled virtual assistant in collaboration with Panasonic Automotive Systems Company of America and QNX (a Research in Motion subsidiary)
  • voice interface software developers such as Nuance and Vlingo*, with the former’s software reported to be deployed within Siri and the latter’s natural language Virtual Assistant app available on Android phones

* Nuance is in the process of buying Vlingo.

What do these developments mean for vendors of connected devices and for network operators and service providers?

Natural language voice control can be thought of as a continuation of a trend towards easier human-device interaction – from punched cards, to keyboards, graphical user interfaces, the mouse, trackpads, gestures, facial recognition and voice control. While some of these (such as punched cards) have fallen by the wayside, replaced by better, easier-to-use technologies, others co-exist, often in the same device (even touch-screen tablets tend to have a virtual keyboard for ease of entering words). Natural language voice control has been enabled by the continuing increase in mobile and wireless network availability and coverage, the increasing speeds of broadband networks, and the falling cost and increasing power of computing resources.

One of the things that is distinctive about Apple’s Siri implementation is the way it has brought together a number of elements (many voice control implementations do some of these but without bringing them all together so fluently):

  • use of the cloud to enable high-powered off-device natural language processing (avoiding the battery drain of on-device processing)
  • sensitivity to context (e.g. user location or recent actions)
  • presentation of formatted, filtered content (rather than simply showing web pages)

The year when a majority of human-device interaction shifts to voice?

A number of major issues suggest that any expectation that 2012 could be the year when a majority of human-device interaction shifts to voice should be rapidly lowered.

In general, key enablers for new technology adoption include cost, availability, usefulness and ease of use. Voice interaction with devices is not, of course, new. But well-functioning natural language voice interaction removes the barrier of having to learn device-specific commands (a barrier that made just pressing a couple of buttons the easier option). Nonetheless, a number of significant issues remain that will hold back widespread voice-device interaction:

  1. Voice control struggles with background noise, with accuracy falling off rapidly as the noise increases. Most people will use their various devices (mobile phones, tablets, laptops, TVs and so on) in situations with significant background noise (for example in crowded living rooms or on noisy platforms or streets). Voice interaction is significantly less useful if it only works well when the user is in artificially quiet conditions.
  2. Talking out loud to control a device means that the user’s interactions with it are public (unless voice control is only used when alone). Typing an SMS, email, or web search query is typically a private activity. Speaking those same items out loud is disconcertingly public. Talking out loud can also be simply inappropriate in a range of situations.
  3. Even the ‘natural language’ Siri requires learning some of its quirks, and sending longer messages requires learning how to dictate. This is a learned skill rather than a ‘natural’ activity – and like learning to type or to play a musical instrument, acquiring this skill takes time and so has a cost attached. While dictation can be faster than typing once that skill is learnt (especially for those who find it difficult to type on small screens), the time cost of learning a new skill is a barrier to use and widespread adoption. Speaking one’s punctuation marks isn’t something that is instinctive semi-colon in fact it can feel downright odd exclamation mark
  4. Where implementations require network connectivity, poor or absent mobile or WiFi coverage is a fundamental barrier (‘I’m really sorry about this, but I can’t take any requests right now. Please try again in a little while’).

These are significant barriers and there is no certainty that natural language voice control will move to mass adoption. However, if the enablers of cost, availability, usefulness and ease of use are met and these barriers are overcome (i.e. if voice control of devices is found to be useful, affordable, easy to use and available when required), then users will begin to form new habits of interaction with their devices that will gradually, and perhaps rapidly, alter expectations as to how these devices should operate.

Mass adoption of natural language voice control would hold a number of implications for suppliers

If voice control came to be regarded as an essential feature by most consumers (as the keyboard, mouse and, increasingly, the touchscreen have over recent years), then there are a number of implications for companies that develop consumer electronic devices or software for them, or that deliver services such as Internet TV that come with connected devices:

  • Mobile operators would have most to gain if mobile devices become easier to use through voice interaction. Assuming adequate in-building coverage, increased ease of use of the mobile device could help mobile’s final encroachment on the fixed line if user habits shift from remembering numbers to an expectation of being able to just say ‘call [...]’. Operators would need to ensure adequate network coverage and to tariff data packages carefully, enabling sufficient usage without causing network congestion.
  • Handset vendors are already actively exploring how to improve or add natural language voice control (either through native OS development or through third-party app implementations). Any that do not would risk their handsets rapidly losing market share.
  • Connected TV and games console vendors such as Samsung and Microsoft are already promoting voice interaction as a capability of their devices; voice interface software vendor Vlingo has announced its Virtual Assistant for Smart TV. The challenge for successful implementation of distant voice control in noisy living rooms is significant, but other TV service providers and console vendors will need to ensure they can follow Samsung and Microsoft’s lead if these barriers are surmounted and voice control comes to be regarded as a desirable feature.
  • App developers have the opportunity to offer their customers a voice interface to their software’s functionality (if they can obtain permission and negotiate terms with the owners of the device’s voice control implementation). In-app voice control could become a distinctive and desirable feature if it is well implemented and easy to use.

Some of the challenges of voice interaction with devices were evident right from the beginning (as the conversation between Mina Harker and Dr. Seward, quoted above, illustrates). And voice control, like the use of Edison’s phonograph for voice dictation, could conceivably remain a niche proposition.

But there are some big players committed to making propositions work, and it seems likely that well-implemented natural language voice control will increasingly form a complementary part of a number of small form-factor device interfaces. Mobile OS and app developers will need to carefully consider how best to fully integrate voice control with the range of apps most users will have on their devices. Fixed service providers need to watch these developments to ensure they can position their device strategies to avoid a threat of increased fixed-mobile substitution and being further displaced from their customers.

Jonathan Tee, Principal, Telecoms and New Media, Innovation Observatory
jontee@innovationobservatory.com
