Voice Processing: Are We Near New Speech Recognition Apps?

by Justin Stoltzfus

Expect Labs CEO Tim Tuttle has a vision for voice. In a talk at the DATAVERSITY® Smart Data 2015 Conference, Tuttle details the recent history of voice processing and how the field has advanced at warp speed.

With experience at MIT and Bell Labs, Tuttle has seen the rise of voice firsthand. Just a few years ago, he says, the kinds of tech put into Siri and Cortana were clumsy at best; even years after Knight Rider explored the very concept of a disembodied, electronic voice carrying on a conversation with a human being, people hadn’t really figured out how to make these voices “intelligent” in a practical way.

“Back then, the problems seemed very much out of reach,” Tuttle says. “(Voice processing) cost an arm and a leg, and the results were lackluster.”

Now, it’s a different story. Noting that voice recognition is now in every major mobile operating system, Tuttle has a number of predictions for voice: a significant one claims that within just a year and a half from now, computers will be “better” at understanding speech than humans, and the 2017 equivalents of IBM’s Watson won’t just be able to mine trivia – they’ll be chatting it up with us.

“Artificial intelligence has cracked the code on voice,” Tuttle says, adding that by 2018, experts expect to see over three billion devices equipped with microphones, and only around 5% of them equipped with a keyboard. Already, Tuttle says, surveys of online interactions are finding that voice accounts for 10% of all search, with Apple reporting over one billion voice queries per week through its Siri platform.

That, says Tuttle, will lead to major changes in how developers and engineers work. Touting a “frictionless voice experience” that will become the holy grail of app development, Tuttle says it’s not just about being “faster than typing,” although empirically, voice is faster. By breaking voice recognition out of the “hands-free” niche, and promoting it as a superior form of tech interaction, Tuttle believes we will collectively embrace voice recognition as the default for most services.

“We see it living on every device, inside every app, and in every home and office,” Tuttle says.

Harder than it Sounds

So what’s stopping companies from jumping on the bandwagon and getting such state-of-the-art, next-generation voice solutions in place? According to Tuttle, there are a number of big challenges that companies have to face when venturing into voice for the first time. One is creating a customized knowledge graph that holds all of those vital pieces of data related to products, brands, services and more, a massive data set that has to be applied to the framework for the voice engine. Beyond that, developers have to create natural language understanding models that can take all of that data in and work out how to apply it by manipulating elements of the provided data sets. As if that’s not enough, there’s also the issue of finding the right answers for the questions that users ask, which, Tuttle says, boils down to an “information retrieval problem.” All of this, ideally, needs to be achieved with extremely low latency, which only adds pressure on development teams.

Low latency is a particular problem because of the agility that has to be achieved. As Tuttle points out, showing comparison tests of individual users doing both voice and text searches, voice is supposed to be quick, and that’s a major part of its appeal; every bottleneck and every stutter adds extra time to the mad sprint that’s supposed to happen in the wake of a user event. It might seem unfair to the developers who have to stand on the front lines of these projects that, in addition to mastering the vast troves of meaningful data that constitute question-and-answer applications, they also have to deliver at roughly the speed of light. But in today’s market, that’s the way it is.

Most companies aren’t going to have the staff hours, technical expertise, and general resources to handle all of this, which is why Tuttle sees his company’s MindMeld platform as providing a critical “middleman” experience that will put voice solutions within the reach of the average smaller firm. By providing economies of scale and specialized investments, these third parties would theoretically be able to introduce clients to the wide world of Natural Language Processing, decreasing the price of this golden ticket.

Five Steps to Voice Success

In relating how the MindMeld service works, Tuttle mentions five critical steps for achieving a voice solution. The first is the creation of the knowledge graph, which Tuttle says can be done partially through data mining, with things like advanced crawling technologies and link-ups to internal databases. The data store, Tuttle points out, is essential: it is the content on which the “operations” of the voice model are going to be working, the real stuff that is getting talked about, and the types of particular information that users are going to be asking about.
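To make the idea of a knowledge graph concrete, here is a minimal sketch that stores facts as subject, predicate, object triples and merges two sources, one standing in for crawled data and one for an internal database. The `KnowledgeGraph` class, its methods, and the sample records are all hypothetical illustrations; the talk does not describe how MindMeld actually stores its data.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy triple store (hypothetical, not MindMeld's design)."""

    def __init__(self):
        # (subject, predicate) -> set of objects; sets drop duplicate facts.
        self._edges = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._edges[(subject, predicate)].add(obj)

    def load(self, records):
        """Merge an iterable of (subject, predicate, object) triples."""
        for subject, predicate, obj in records:
            self.add(subject, predicate, obj)

    def query(self, subject, predicate):
        """Return the known objects in sorted order, or an empty list."""
        return sorted(self._edges.get((subject, predicate), set()))


# Invented sample data: one list stands in for crawler output,
# the other for rows pulled from an internal database.
crawled = [("Pancakes", "ingredient", "flour"), ("Pancakes", "ingredient", "eggs")]
internal = [("Pancakes", "ingredient", "milk"), ("Pancakes", "ingredient", "eggs")]

graph = KnowledgeGraph()
graph.load(crawled)
graph.load(internal)
ingredients = graph.query("Pancakes", "ingredient")
```

Because the objects are held in sets, a fact that appears in both sources is stored only once, which is one simple way to reconcile overlapping feeds.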

Then there’s the job of building an accurate Natural Language Processing model, which Tuttle says is served best by “large-scale Machine Learning systems” that will automatically get better at language processing as they are used. When developers enter a range of example questions into the interface, the technology maps cues to items in the knowledge graph, to eventually get smarter about serving users.

“It makes it possible for non-expert human annotators to create the necessary data sets…using a simple web-based tool,” Tuttle says, highlighting the idea that such an interface really helps to bridge the gap for someone who is not deep into the field of speech processing.
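The mapping from example questions to knowledge graph items can be sketched in a few lines. The version below is a deliberately simple stand-in: it matches a new utterance to the annotated example that shares the most words, whereas the talk refers to large-scale Machine Learning systems. The entity IDs, the example questions, and the `resolve` and `add_example` helpers are all hypothetical.

```python
import re

# Hypothetical mini knowledge graph: entity id -> attributes.
KNOWLEDGE_GRAPH = {
    "show:knight_rider": {"type": "tv_show", "title": "Knight Rider"},
    "recipe:pancakes": {"type": "recipe", "title": "Pancakes"},
}

# Annotated example questions, each mapped to a knowledge graph entry.
EXAMPLES = [
    ("play knight rider", "show:knight_rider"),
    ("find the show knight rider", "show:knight_rider"),
    ("how do i make pancakes", "recipe:pancakes"),
    ("what goes in pancakes", "recipe:pancakes"),
]


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def resolve(utterance):
    """Return the entity id of the example sharing the most words, or None."""
    words = tokens(utterance)
    best_id, best_overlap = None, 0
    for example, entity_id in EXAMPLES:
        overlap = len(words & tokens(example))
        if overlap > best_overlap:
            best_id, best_overlap = entity_id, overlap
    return best_id


def add_example(utterance, entity_id):
    """Record a new annotated question; the entity must already exist."""
    if entity_id not in KNOWLEDGE_GRAPH:
        raise KeyError(entity_id)
    EXAMPLES.append((utterance, entity_id))


first_guess = resolve("put on knight rider please")
before = resolve("flapjacks")
add_example("flapjacks recipe", "recipe:pancakes")
after = resolve("flapjacks")
```

The last three lines show the "gets smarter" effect in miniature: an unknown word resolves to nothing until an annotator adds one example that ties it to an existing entry.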

As for finding the right answer, according to Tuttle, the solution for this step essentially resembles search engines. Technologies need to create lists of millions of possible answers, and score them according to relevance, to pick the best ones every time. This alone requires its own significant processing power, along with algorithmic triage systems that have to be carefully built and maintained.
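The score-and-rank step described above can be illustrated with a small sketch. It scores each candidate answer by word overlap with the query (Jaccard similarity) and keeps the top results. This scoring function is an assumption chosen for brevity; a production system ranking millions of candidates would use an index and a far richer relevance model. The function names and sample candidates are hypothetical.

```python
import heapq
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query, candidate):
    """Jaccard similarity between the word sets of query and candidate."""
    q, c = tokenize(query), tokenize(candidate)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


def top_answers(query, candidates, k=3):
    """Return the k candidates with the highest relevance score."""
    return heapq.nlargest(k, candidates, key=lambda c: score(query, c))


# Invented candidate answers for illustration only.
CANDIDATES = [
    "Substitute for buttermilk: milk plus lemon juice",
    "Tonight's TV schedule for comedy shows",
    "Substitute for eggs in baking: applesauce",
]

best = top_answers("what is a substitute for buttermilk", CANDIDATES, k=1)
```

Using `heapq.nlargest` avoids sorting the full candidate list when only a handful of top answers are needed, which matters more as the list grows.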

In addition to all of the above, companies will need to plug their solution into a mobile interface, so that users can simply talk into their smartphone’s microphone and have that data flow into and out of the app that’s using it. For this step, Tuttle recommends using available software development kits for individual platforms.

The last step is building the user interface according to its purpose. Going over the likeliest uses for the new voice solutions that we’ll see in the near future, Tuttle talks about smart television and smart home systems, where topic-specific voice apps will let users navigate a virtual world to get the exact information that they need, right from their refrigerators, televisions or toasters.

Voice solutions can help with missing ingredients in a recipe, help you find a particular TV show or movie to watch, or bring more functionality to tiny wearable devices such as fitness trackers.

“These technologies are working remarkably well these days,” Tuttle says. “Expect really great voice to come to apps near you.”

The promise of this new functionality is going to impact many things. It can change the face of the Internet of Things, put pressure on software companies, and create some of those epic product rollout events that people camp out for. Look for more voice to bounce onto the horizon sometime soon.
