JavaScript tutorial: Add speech recognition to your web app

Harnessing the power of voice commands in a React map explorer app with the annyang JavaScript library


While browsers are marching toward supporting speech recognition and more futuristic capabilities, web application developers are typically constrained to the keyboard and mouse. But what if we could augment keyboard and mouse interactions with other modes of interaction, like voice commands or hand positions?

In this series of posts, we'll build a basic map explorer with multimodal interactions. First up: voice commands. Before we can incorporate any commands, though, we'll need to lay out the structure of our app.

Our app, bootstrapped with create-react-app, will be a full-screen map powered by the React components for Leaflet.js. After running create-react-app, yarn add leaflet, and yarn add react-leaflet, we’ll open up our App component and define our Map component:

import React, { Component } from 'react';
import { Map, TileLayer } from 'react-leaflet';
import './App.css';
class App extends Component {
  state = {
    center: [41.878099, -87.648116],
    zoom: 12,
  };
  updateViewport = (viewport) => {
    this.setState({
      center: viewport.center,
      zoom: viewport.zoom,
    });
  };
  render() {
    const {
      center,
      zoom,
    } = this.state;
    return (
      <div className="App">
        <Map
          style={{height: '100%', width: '100%'}}
          center={center}
          zoom={zoom}
          onViewportChange={this.updateViewport}>
          <TileLayer
            url="https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png"
            attribution="&copy; <a href=&quot;http://osm.org/copyright&quot;>OpenStreetMap</a> contributors"
          />
        </Map>
      </div>
    )
  }
}
export default App;

The App component is a stateful component that keeps track of the center and zoom properties, passing them into the Map component. When the user interacts with the map via the built-in mouse and keyboard interactions, the onViewportChange handler fires and we update our state with the new center and zoom levels.

With a full-screen component defined, our app looks like the following:

[Screenshot: the full-screen JavaScript map app]

Out of the box, we get the typical modes of interaction including mouse, keyboard, and touch on devices that support them. With our basic interactions and user interface defined, let’s add voice commands to zoom in and out.

There are many libraries available for performing speech recognition in the browser. There is even a base SpeechRecognition API supported by Chrome. We'll use annyang, a popular and simple JavaScript speech recognition library. With annyang, you define commands and their handlers in a JavaScript object, like so:

const commands = {
  'in': () => console.log('in command received'),
  'out' : () => console.log('out command received'),
};

Then, all you have to do is pass that object into annyang's addCommands method and call annyang.start(). A full example looks like this:

import annyang from 'annyang';
const commands = {
  'in': () => console.log('in command received'),
  'out' : () => console.log('out command received'),
};
annyang.addCommands(commands);
annyang.start();

This is super simple, but out of context doesn’t make much sense, so let’s incorporate it into our React component. Within the componentDidMount hook, we’ll add our commands and start listening, but instead of logging to the console, we’ll call two methods that update the zoom level in our state:

  zoomIn = () => {
    this.setState({
      zoom: this.state.zoom + 1
    });
  };
  zoomOut = () => {
    this.setState({
      zoom: this.state.zoom - 1
    });
  };
  componentDidMount() {
    annyang.addCommands({
      'in': this.zoomIn,
      'out': this.zoomOut,
    });
    annyang.start();
  }

When our page refreshes, the browser asks us for permission to use the microphone. If you say yes, you’ll now be able to use the voice commands “in” and “out” to zoom in and out. Want more? The annyang library supports placeholders and regular expressions, too. To support zooming directly to a particular level, we can define a command like so:

    annyang.addCommands({
      /* existing commands */
      'zoom level :level': {regexp: /^zoom level (\d+)/, callback: this.zoomTo},
    });

The :level that is part of the key is the standard way of defining a single-word placeholder. (If we wanted a multi-word placeholder, we could use *level instead; there's a quick illustration of that after the next snippet.) By default, the word captured by the placeholder is passed to the handler function as a string argument. But if we define the handler as an object with regexp and callback keys, we can further constrain what the placeholder can match. In this case, we're limiting the placeholder to digits only. That placeholder will still be passed in as a string, so we'll need to coerce it to a number when we set the zoom level:

  zoomTo = (zoomLevel) => {
    this.setState({
      zoom: +zoomLevel,
    });
  }
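
As promised, here is what a multi-word splat placeholder might look like. This is a hypothetical standalone example rather than part of our map app; the splat captures the rest of the spoken phrase and passes it to the handler as a single string:

annyang.addCommands({
  // Saying "search for coffee shops near me" logs "coffee shops near me".
  'search for *query': (query) => console.log(query),
});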

And that’s it! We can now zoom in or out one level at a time, or we can skip directly to a level by saying its number. If you’re playing around with this at home, you’ll notice that it takes a few seconds for annyang to recognize a command, and sometimes commands aren’t recognized at all. Speech recognition isn’t perfect. If you’re building speech recognition into a production system, you’ll want to incorporate real-time feedback mechanisms for errors or to identify when the library is actively listening.
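
If you want to experiment with that kind of feedback, annyang exposes event hooks through its addCallback method. The sketch below would sit alongside our existing componentDidMount code; it assumes two hypothetical component methods, showListeningIndicator and showNoMatchMessage, that update whatever feedback UI you choose:

    // 'start' fires when annyang begins listening; 'resultNoMatch' fires when
    // speech was heard but didn't match any of our commands. Both handlers are
    // hypothetical methods that would update the app's feedback UI.
    annyang.addCallback('start', this.showListeningIndicator);
    annyang.addCallback('resultNoMatch', this.showNoMatchMessage);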

If you want to play around with the code, you can find it on GitHub. Feel free to reach out on Twitter to share your own multimodal interfaces: @freethejazz.

Copyright © 2019 IDG Communications, Inc.