The trouble with VoiceXML (part 1)

Following up on the previous entry, I thought I’d go into more technical detail about how, at the Web Foundation, we’re designing our radio platform.

In general, voice applications share the same architecture as standard websites. Just replace “browser” with “voice browser” and “HTML” with “VoiceXML” (the most widespread language for voice applications). The one difference is that the browser doesn’t run on the user’s computer but on the web, and usually not where the application server is, since it’s often provided by a third party such as a telco.

Voice apps vs Web apps

Because VoiceXML is the HTML of Interactive Voice Response applications, you can do just as you would in a standard web application and generate the served files with PHP.

Here’s a basic (simplified) VoiceXML file:

<vxml>
  <form>
    <field name="year">
      <prompt>Please say the year you were born</prompt>
      <grammar src="year.srgs"/>
      <noinput>You did not say anything</noinput>
      <nomatch>I did not understand</nomatch>
      <filled>
        <if cond="year &gt; 1980">
          <submit next="senior.vxml.php" namelist="year"/>
        <else/>
          <submit next="senior.vxml.php" namelist="year"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>

Unlike standard HTML, and unsurprisingly, there is some logic in the application itself. In fact a large portion of the VoiceXML specification describes the Form Interpretation Algorithm, which goes far beyond simple <if> statements and includes features like error recovery, events and exceptions. These things are barely visible in the language’s syntax, but are rather complex. Barely visible, that is, when you’re writing simple examples. In a real application, things become quite complex and the resulting VoiceXML files can be hard to read (a bit like XSLT).
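
To give an idea of what that machinery looks like once you start relying on it, here is a minimal sketch of the same year field with escalating handlers. The count attributes, the error.badfetch handler, the <log> line and the prompt wording are illustrative assumptions, not code from our platform; only year.srgs and senior.vxml.php come from the example above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <!-- Document-level handler: runs when fetching a grammar, an audio file
       or the next page fails while this document is executing. -->
  <catch event="error.badfetch">
    <!-- _event and _message are set by the interpreter inside a handler -->
    <log>Fetch failed: <value expr="_event"/> <value expr="_message"/></log>
    <prompt>Sorry, the service is unavailable. Please call again later.</prompt>
    <exit/>
  </catch>

  <form>
    <field name="year">
      <prompt>Please say the year you were born</prompt>
      <grammar src="year.srgs"/>
      <!-- The interpreter counts how often each event has occurred in this
           field and selects the handler with the highest count that does
           not exceed it. -->
      <noinput count="1">You did not say anything. <reprompt/></noinput>
      <noinput count="3">Still nothing. Goodbye. <exit/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
      <filled>
        <submit next="senior.vxml.php" namelist="year"/>
      </filled>
    </field>
  </form>
</vxml>
```

None of the counting or handler selection appears in the markup; it all happens inside the interpreter, which is exactly why larger files get hard to follow.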

And you can add to that the complexity of PHP, because server-side logic is mandatory. Indeed, since a VoiceXML application is just a set of forms, each one has to <submit> its contents back to the server, which then generates and serves the next VoiceXML file.

And little by little you end up with code like what I put at the end of this post. What was originally a simple VoiceXML file has become a horrible mix of two languages. Despite the ugliness it’s still code that looks familiar to many PHP developers. But again, this isn’t just PHP generating HTML, this is PHP generating VoiceXML, itself a programming language. (Yes, HTML can also contain JavaScript. Guess what, so can VoiceXML).

I’m not the first to notice it. In 2007 the W3C’s Voice Browser Working Group released VoiceXML 2.1, which adds a small number of features that can help us: the <data> tag, which lets you do XMLHttpRequest-style requests, and <foreach>, which loops over an array. <data> is great, because instead of having to submit a form back to the server and receive another VoiceXML file, you can send the data over but remain in the same file. And <foreach> also removes some dependency on server-side logic. However, seven years after the release of the specification, I know of no VoiceXML browser that implements it completely, including the one I’m stuck with (Voice Glue).
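
As a rough sketch of how the two tags fit together, assuming a browser that actually supports them: the file below fetches a list of radios and reads their names back without leaving the document. The radios.xml.php URL and the shape of the XML it returns (a <radios> root with <radio name="..."/> children) are hypothetical, invented for this illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
  <!-- Standard session variable holding the caller's address -->
  <var name="callerId" expr="session.connection.remote.uri"/>

  <form id="listRadios">
    <var name="names" expr="new Array()"/>
    <block>
      <!-- Fetch XML from the backend and stay in this document.
           radios.xml.php is a hypothetical endpoint. -->
      <data name="radioData" src="radios.xml.php"
            namelist="callerId" method="get"/>

      <!-- radioData is exposed as a read-only DOM; copy the names
           into an ECMAScript array that <foreach> can walk. -->
      <script><![CDATA[
        var nodes = radioData.documentElement.getElementsByTagName('radio');
        for (var i = 0; i < nodes.length; i++) {
          names.push(nodes.item(i).getAttribute('name'));
        }
      ]]></script>

      <prompt>
        Available radios:
        <foreach item="name" array="names">
          <value expr="name"/> <break/>
        </foreach>
      </prompt>
    </block>
  </form>
</vxml>
```

With VoiceXML 2.0 only, the same result needs a <submit> and a second PHP-generated file whose prompts are unrolled on the server.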

Are things going to improve? Are implementations going to catch up, especially FOSS ones? Unlikely, because VoiceXML is dying. I’ll write about it, and the present and future of voice applications, in another entry.


And now the ugly code (which is not too bad, actually, but you can see how it quickly gets much uglier). Nothing but code-generating code; imagine the debugging, especially when all the error reporting you have from the VoiceXML interpreter is a message on the phone saying “A serious error has occurred. Exiting.”

<?php
// authorization: get callerId, try and match it against the user list
// if it checks, go ahead. If it doesn't, create a new user
// input variables: callerId

require_once('log.php');
require_once('i18n.php');
require_once('radio-platform.php');
require_once("ivr-platform.php");

Log::write("starting auth-callerId");
Log::write($_SERVER['REQUEST_URI']);

if (isset($_REQUEST['callerId'])) {
  $callerId = $_REQUEST['callerId'];
} else {
  $callerId = 'unknown';
}

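// fall back to an empty string instead of raising a notice when sessionId is missing
$sessionId = isset($_REQUEST['sessionId']) ? $_REQUEST['sessionId'] : '';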

// fetch user list
$users = RadioPlatform::getUsers();

// search user with correct callerId
$userFound = false;
foreach ($users as $user) {
  if (phoneNumbersMatch($user['phone'], $callerId)) {
    $userFound = $user;
    $userId = $user['id'];
    $userRadioId = $userFound['radios'][0];
    break;
  }
}

if ($userFound) {
  $userLang = $userFound['lang'][0];
  Log::write("User: $userId");
} else {
  Log::write("No user found.");
}

header('Content-Type: application/voicexml+xml; charset=utf-8');
print('<?xml version="1.0" encoding="utf-8"?>');
?>

<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
  <property name="inputmodes" value="dtmf"/>
  <var name="sessionId" expr="'<?php echo $sessionId ?>'"/>

<?php
if($userFound) {
  $radios = RadioPlatform::getRadios();
?>
<form>
  <var name="userId" expr="'<?php echo $userId ?>'"/>
  <var name="userRadioId" expr="'<?php echo $userRadioId ?>'"/>
  <var name="userLang" expr="'<?php echo $userLang ?>'"/>
  <block>
<?php prompt($userLang, 'welcome') ?>
    <audio src="<?php echo $radios[$userRadioId]['audio']?>"/>
    <submit next="main-menu.vxml.php" method="get" namelist="userLang userId userRadioId sessionId"/>
  </block>
</form>

<?php
} else { // No user found through callerID. Create new user.
?>

<form>
  <block>
    <?php prompt('bam','welcome'); ?>
    <?php prompt('fr','welcome'); ?>
  </block>
  <field name="userLang">
    <?php prompt('bam','select_bam_1'); ?>
    <?php prompt('fr','select_fr_2'); ?>
    <option dtmf="1" value="bam">Bambara</option>
    <option dtmf="2" value="fr">French</option>
    <noinput><reprompt/></noinput>
    <nomatch><reprompt/></nomatch>
    <filled>
      <var name="callerId" expr="'<?php echo $callerId ?>'"/>
      <submit next="auth-new.vxml.php" namelist="userLang callerId sessionId"/>
    </filled>
  </field>
</form>

<?php } ?>
</vxml>

<?php
// tries to fix bad callerIds: strips surrounding whitespace, then a leading '+' and leading zeros
function clean_phone_id($caller_id) {
  $ph=trim($caller_id);
  $ph=preg_replace('/^\+/','',$ph);
  $ph=preg_replace('/^0*/','',$ph);
  return $ph;
}
// returns true if both numbers match
function phoneNumbersMatch($n1, $n2) {
  if ($n1 === $n2) return true;
  return clean_phone_id($n1) === clean_phone_id($n2);
}
function prompt($lang,$msg) {
  $xmllang = IvrPlatform::xmllang($lang);
  echo "<prompt xml:lang='$xmllang'>".I18N::say($lang,$msg)."</prompt>\n";
}
?>

3 Responses to The trouble with VoiceXML (part 1)

  1. Jim rush says:

    Yes, VoiceXML has its horrors and never lived up to its promise. The comment about it dying is a bit more complicated as I question if it was really alive. It was a standard driven by analysts and enterprises looking for a silver bullet for a set of complex problems. If you just want to build an IVR application, there are better choices. VoiceXML is the choice if you want to minimize vendor lock-in.

    As for the code, you’ve picked, in my opinion, the worst way to approach VoiceXML. Most of the large, successful projects I’ve been in or observed take one of two approaches. The first is server-centric, where your server-side code generates VoiceXML. In this model, your structure and flow are all in your server logic and the VoiceXML plays output and does minimal input validation. The alternative is client-centric, where all of the logic is in VoiceXML with heavy amounts of JavaScript, and the Data element is used for data collection. Server-side solutions, in my experience, significantly outnumber client-side ones. Both of these approaches tend to be cleaner than the PHP, JSP, and ASP-like approaches.

    I have seen one clever approach that used a higher-level language that generated static VoiceXML plus JavaScript for logic. That allowed the developer to focus on larger constructs and reuse. The static VoiceXML was generated like a compile step and deployed.

  2. site admin says:

    Thanks Jim for the insightful comment. I was part of the W3C Working Group that designed VoiceXML for a few years, and given the sheer size of the group and the companies there, I’m fairly confident that it was, at some point, the one standard for IVR applications, endorsed by just about everybody. (I may be wrong, though. The IVR world is so shrouded in mystery and NDAs that it’s difficult to prove anything.) But that’s not so important now, and I agree that there are better choices these days, but I’ll expand on that in part 2.

    I’m not sure how my approach differs from what you call server-centric. Or are you suggesting that the generated VoiceXML would be absolutely minimal, while the real logic is done server-side (in PHP in this case)? So you just keep the prompts, options, submits but remove all ifs and other avoidable constructs? Interesting. That’s pretty much the way I went when I realised that Voice Glue was missing so many features, but didn’t think of going all the way. Possible issues with lag due to a lot of going back and forth between voice browser and server come to mind, though, but it would be fun to try.

    In any case, my original design was all client-centric, with a single VoiceXML 2.1 source file, data tags to talk to the backend (and maybe some PHP to do near-static things like language selection in prompts or for, as you mention, larger constructs). It would indeed have been much cleaner, I reckon.

  3. Pingback: The trouble with VoiceXML (part 2) | Riviera Blog
