Click to learn more about author Jan Pešán.
Enterprise software use has skyrocketed since the start of the coronavirus pandemic. We’ve learned to replace meetings with Zoom, conduct interviews via webcam, and hold our watercooler chats over Teams, but we’ve also discovered the many ways that our technologies can still improve. As with most black swan events, some sectors faced declining or plateauing growth, while others saw unprecedented advancement due to a rise in need. One of these fields was speech technology, which saw a marked increase in enterprise adoption (68% of respondents recently reported that their company has a voice technology strategy, up 18% from last year). This demand means the technology has massive potential to evolve in 2021. Here’s where it might go.
Lost in Transcription
Even if two people are speaking the same language, they can still struggle to understand each other. After all, Americans often ask an Irish or Scottish speaker to repeat themselves, a French person and a Quebecois will have occasional misunderstandings, and Mexican Spanish won’t always match a Spaniard’s.
Far less familiar are the linguistic nuances of most speech recognition engines. Most providers require you to choose an accent-specific model to transcribe your voice or conversation, so what happens when you have multiple accents within one audio file? Anyone who speaks a dialect or accent other than the “standard” version arbitrarily assigned by most providers must either change their voice to suit the engine or accept a poor transcription. But poor dialect comprehension isn’t the only challenge facing speech recognition. Languages with fewer speakers, such as native dialects in smaller countries or distant regions (so-called low-resource languages), may have no transcription systems available at all.
As more and more people around the world go online and enter tomorrow’s global digital marketplaces, the need for reliable, fluent, and efficient accent-agnostic and any-context speech recognition technology will become ever more apparent. How, after all, can you serve a market whose language you cannot understand? In 2021 and the years to come, this technology needs to work for everyone – and the first to market in underserved regions like Southeast Asia or Central Africa will have a significant business advantage.
New Opportunities in the New Normal
In a business context, effective automatic speech recognition holds significant promise. A primitive tool isn’t much help if you want to transcribe a meeting or call, since cleaning up the generated text may be too time-consuming and labor-intensive. The latest AI-powered voice recognition solutions, however, are far more effective at generating the correct word output. Aside from their clerical potential, AI-generated transcripts can form the core of a searchable institutional knowledge base by capturing spoken but otherwise unrecorded data. It isn’t just about the word output, however. Businesses are looking for added value, more insights, and increased knowledge from every captured interaction or conversation.
The artificial intelligence behind automatic speech recognition has been improving for decades; the input that the engine receives is also improving. Every year, computers and phones improve their microphones and sound quality, while high-bandwidth connections mean that there’s less need to compress transmitted sound. In the wake of coronavirus, quality videoconferencing, including good sound quality, will be more of a selling point than ever. Hardware makers have every reason to improve their offerings, speech recognition engines will benefit from better sound quality, and ultimately businesses will generate richer and cleaner data sources.
Intention and Inflection
Even the most people-oriented businesses are currently working to minimize human contact. Sometime in 2021, we hope the plexiglass barriers will come down and the masks will be hung up, but not every business will go back to the old ways of doing things. For reasons of efficiency and liability, some companies will continue to minimize direct human interactions. Take contact centers as an example. Chatbots can often resolve problems more quickly than a human can, provided the bots are able to correctly interpret what the human customers say. While no chatbots are likely to pass the Turing test next year, speech recognition technology will increasingly be concerned with understanding what is meant, not just what is said. This brings together the fields of natural language processing (NLP), natural language understanding (NLU) and speech recognition. For an accurate estimation of speaker intent, AI algorithms need to have not only precise transcription, but also context clues such as timing, repetitions, filler words, and hesitations.
When you or I speak, we convey meaning with more than words. Is that “thanks” an earnest expression of gratitude or a sarcastic jibe? Is the pause between sentences intended to underline a point before beginning a new sentence, or is the speaker waiting for our response? These are shades of meaning that future speech recognition engines will look to detect and interpret in the text-based output.
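As a rough illustration of the context clues mentioned above (filler words, repetitions, and pauses), here is a minimal Python sketch. It assumes a transcript already exists as a list of words with start and end times. The `Word` structure, the filler list, and the one-second pause threshold are all hypothetical choices for this example, not the output format or settings of any particular engine.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; a real system would tune these.
FILLERS = {"um", "uh", "er", "hmm"}
PAUSE_SECONDS = 1.0


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float


def context_clues(words):
    """Count fillers and immediate repetitions, and list long pauses."""
    tokens = [w.text.lower().strip(".,!?") for w in words]
    fillers = sum(t in FILLERS for t in tokens)
    # A repetition here means the same token twice in a row ("I I need").
    repetitions = sum(a == b for a, b in zip(tokens, tokens[1:]))
    # A long pause is a gap between consecutive words at or above the threshold.
    long_pauses = [
        round(nxt.start - cur.end, 2)
        for cur, nxt in zip(words, words[1:])
        if nxt.start - cur.end >= PAUSE_SECONDS
    ]
    return {"fillers": fillers, "repetitions": repetitions, "long_pauses": long_pauses}


sample = [
    Word("Um", 0.0, 0.3),
    Word("I", 0.4, 0.5),
    Word("I", 0.6, 0.7),
    Word("need", 0.8, 1.0),
    Word("help", 2.5, 2.9),
]
print(context_clues(sample))
# {'fillers': 1, 'repetitions': 1, 'long_pauses': [1.5]}
```

Counts like these would be only one input among many. Telling an earnest “thanks” from a sarcastic one would likely also need acoustic cues that a text transcript does not carry.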
What will the field of automatic speech recognition technology look like at the end of next year? At the end of the next decade? Predicting the exact time or nature of a technological improvement is just about impossible. On the evolutionary side, though, it seems clear to us that any-context speech recognition technology will keep getting better, as it has for the past decade. On the revolutionary side, we will see a clear blending of the different AI fields, moving slowly towards a universal artificial intelligence. Maybe by the end of the next decade, callers to helplines will no longer dread the conversations but enjoy them, knowing that their problems will be quickly resolved. Perhaps automatically transcribing meetings will be standard procedure at the majority of businesses. Almost certainly, speech recognition technology will be available to millions or billions of people who currently lack access. Whatever may happen, the future looks bright.