The trouble with VoiceXML (part 2)

(continued from part 1)

VoiceXML remains to this day one of the most successful standards to come out of the W3C. Even though the Voice Browser Working Group, the committee that designed it, never got the same visibility as the groups that designed CSS or XML, it continues to be one of the biggest working groups at the W3C. Most of the IVR industry is represented in it, and it has managed to produce a true industry standard. The working group is currently designing VoiceXML 3.0. However, VoiceXML will probably never again enjoy the success it once had.

With online access becoming ubiquitous, people don’t call IVRs as much as they used to. It has become far easier to order pizza or book a flight online than to go through the lengthier process of doing it on the phone. In fact, there is not much innovation left in IVR applications (perhaps better speech recognition allowing mixed-initiative dialogue, but that is rarely found), and the most popular applications, like voicemail or simple menu-based information services, are sufficiently stable and familiar to users that there is no need for real innovation (this is not true everywhere, though; see below). You don’t find many job ads for IVR designers these days, and there hasn’t been a book published on VoiceXML since 2002.

That does not mean that voice applications are dead, though. Open source telephony software is going strong with Asterisk and FreeSWITCH, which started out as PABXs but also integrate IVR functionality through proprietary scripting languages. Moreover, companies like Voxeo and Twilio offer cloud services for developing speech and SMS apps, and are enjoying a certain amount of success. Interestingly, Voxeo offers two ways of designing applications: using VoiceXML, or with an API for various standard languages (PHP, Python, etc.). The latter is far more widely used, because many more hackers are familiar with those languages than with VoiceXML. The programmatic approach makes it especially easy to integrate voice into other Web applications, build mashups, and do everything Web 2.0. Even though VoiceXML 2.1 extended the language specifically for that purpose, it hardly changed things. Another recent evolution is that speech applications are now found on mobile devices: iOS’s Siri and Android’s Voice Search are the best known examples. Those applications aren’t written in VoiceXML either, partly because they aren’t based on voice alone but combine it with visual and tactile interaction.

VoiceXML is catching up, though. Version 3 will handle multimodal interaction, for instance. And the working group is developing complementary standards, such as SCXML, which can declaratively describe interactive applications independently of whether the application is visual, audio, or a combination of both. But until those standards are finished and implementations are available, the procedural approach offered by Tropo, Android, or Microsoft is gaining in popularity. Moreover, standardisation is happening in that area too: the W3C’s HTML Speech Incubator Group and Speech API Community Group are very active in defining a speech API for web pages, some of which is already implemented in browsers.

All in all, it will take three things for VoiceXML to make a comeback: the working group finishing version 3, implementations being released publicly (breaking the monopoly of mobile operators and their walled gardens), and VoiceXML supporters convincing developers that building elaborate speech applications procedurally leads to extremely complex code, and that this can only be avoided by using declarative markup.


Note: the above currently doesn’t apply to developing countries, by the way. Voice-based web access in Africa is at the heart of five projects that the Web Foundation (my employer) is running, and all the IVRs we design use VoiceXML. For various reasons (illiteracy, cost of smartphones, culture), the voice channel, as opposed to SMS or internet access, remains one of the most important, for human-to-human communication but also for human-to-machine communication. That means there is great potential for voice-only online applications that has yet to be tapped. Many projects coming out of the Foundation’s “mobile entrepreneurship labs” are signs of that potential. Yet there is little doubt that, with the rise of the mobile web in Africa, the voice channel will eventually follow the trend described above.
