Smart AI Means Smart Data Prep

Every company wants to put artificial intelligence (AI) to work. Its potential seems limitless. Big business benefits at the snap of a finger.

But then reality hits: the value that AI can deliver doesn’t come easily. Even IBM, a pioneer in the early age of AI (or AI’s rebirth, if you trace its origins back to the 1940s and ’50s), has had its struggles with its Watson AI platform. The most widely known is the technology’s failure in the healthcare sector, in efforts to improve cancer care. There seems to be fairly broad consensus that AI projects are hard to get right:

  • Most organizations fail in some aspects of their AI projects, with a quarter of them reporting a failure rate of up to 50 percent, according to a recent IDC survey. Lack of skilled staff and unrealistic expectations were identified as the top reasons for failure.
  • Forrester Research has pointed to data quality issues as among the biggest AI project challenges, noting that there is generally a lack of understanding about what data is needed for machine-learning models and how to prepare that data.
  • A survey Gartner did late last year shows that AI is now the most-mentioned technology by CIOs, but VP and analyst Andy Rowsell-Jones notes that they may be subject to “irrational exuberance.” In its report AI and ML Development Strategies, Gartner said that the top challenges hindering respondents’ adoption of AI were lack of skills (56 percent), understanding AI use cases (42 percent) and concerns with data scope or quality (34 percent).

At the Wall Street Journal’s Future of Everything Festival, Arvind Krishna, IBM Senior VP, said that about 80 percent of the work in an AI project is collecting and preparing data. Some companies, he said, just aren’t prepared for the cost and work associated with that.

“In the world of IT in general, about 50 percent of projects run either late, over budget or get halted. I’m going to guess that AI is not dramatically different.” 

Rahul Singhal, Chief Product Officer at Innodata, a data extraction, machine learning, and data enrichment vendor, knows the challenges businesses face. Businesses, he said, have underestimated the need for clean, annotated data. That is reflected in the growth of the market for data preparation. It was valued at $1.78 billion in 2017 and is expected to reach $6.06 billion by 2023.
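To put those two market figures in perspective, here is a rough sketch of the annual growth rate they imply. It assumes steady compound growth over the six years from 2017 to 2023, which is an assumption of this illustration and not something the forecast itself states. The function name `cagr` is ours.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted above, in billions of dollars.
# Assumption: growth compounds once a year over 2023 - 2017 = 6 years.
rate = cagr(1.78, 6.06, 2023 - 2017)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the market would be growing by roughly 23 percent a year.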

Content Expertise for Data Quality

“It’s a very large market opportunity,” says Singhal. Innodata is one of the vendors in a space that also includes Amazon Mechanical Turk, Appen, Figure Eight, and Lionbridge. Innodata has been in the business of annotating unstructured content across a variety of domains for 25 years and has subject matter experts (lawyers, pharmacists, etc.) on staff to work on projects in healthcare, pharma, financial services, and B2B publishing.

“When you are creating digital products for your customers, you are going through the lifecycle of understanding and annotating the content,” he says. “You need the expertise to succeed in building AI applications.”

Companies don’t necessarily get that expertise from data prep providers that rely on a crowdsourcing model, he argues. That model depends on companies having their own stringent process flows and quality controls in place to reduce the risks from poorly annotated data. “We don’t use the crowd.”
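As an illustration of the kind of quality control a crowdsourced workflow leans on, here is a minimal sketch of one common technique: collect several annotators’ labels for the same item, keep the majority label only when agreement is high enough, and route everything else to an expert. This is not Innodata’s process (the company says it does not use the crowd). The function name, the 0.7 threshold, the labels, and the document IDs are all hypothetical.

```python
from collections import Counter

def aggregate_labels(votes, min_agreement=0.7):
    """Majority-vote one item's labels.

    Returns (label, agreement). The label is None when the share of
    annotators backing the top label falls below min_agreement.
    """
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    return (label if agreement >= min_agreement else None), agreement

# Hypothetical votes from three annotators per document.
items = {
    "doc-1": ["adverse_event", "adverse_event", "adverse_event"],
    "doc-2": ["adverse_event", "no_event", "dosage"],
}

for item_id, votes in items.items():
    label, agreement = aggregate_labels(votes)
    status = label if label is not None else "send to expert review"
    print(item_id, status, f"agreement={agreement:.2f}")
```

The threshold is the lever: set it higher and more items go to expert review, which costs more but lets fewer doubtful labels into the training set.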

Teaching the Machine

A robust ontology and a lot of training data are required for accurate predictions. “You have to teach the machine and an algorithm to understand the content and context,” Singhal says.

To be able to build and deploy true AI applications, companies need managed service AI applications that are continuously looking at feedback coming from the machine.

“It’s correcting it. It’s giving that retracted feedback loop to the machines and that allows you to then have the machine learning model improvement,” he says. “It will take years for it to be able to automate a lot of these processes and it all starts by having amazing, good-quality annotated ground-truth data.”
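The loop Singhal describes (the machine predicts, an expert corrects, the correction goes back into the training data, the model is retrained) can be sketched in a few lines. This is a deliberately tiny toy, not a description of Innodata’s systems. The `KeywordModel` class, the labels, and the example sentences are all invented for illustration, and a real project would use a proper machine learning library in place of word counting.

```python
from collections import Counter, defaultdict

class KeywordModel:
    """Toy classifier: scores each label by how often the input's words
    appeared in that label's training examples."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def train(self, examples):
        # Retrain from scratch on the full set of annotated examples.
        self.word_counts.clear()
        for text, label in examples:
            self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        words = text.lower().split()
        scores = {label: sum(counts[w] for w in words)
                  for label, counts in self.word_counts.items()}
        return max(scores, key=scores.get)

# Hypothetical starting ground truth: (text, label) pairs.
ground_truth = [
    ("patient reported nausea", "adverse_event"),
    ("take two tablets daily", "dosage"),
]
# Stand-in for a subject matter expert who knows the correct label.
expert_labels = {"two patients reported severe headache daily": "adverse_event"}

model = KeywordModel()
model.train(ground_truth)

text = "two patients reported severe headache daily"
before = model.predict(text)              # the machine's first attempt
corrected = expert_labels[text]           # the expert reviews it
if corrected != before:
    ground_truth.append((text, corrected))  # feed the correction back
    model.train(ground_truth)               # retrain on the improved data
after = model.predict(text)

print("before:", before, "| after:", after)
```

In this toy run the model first mislabels the sentence as `dosage`, and gets it right once the expert’s correction has been added to the ground-truth data.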

There’s no such thing as a one-size-fits-all “workbench” annotation tool, he says. What a company needs for annotating an SCC (Special Conditions of Contract) legal document is very different from what is needed to annotate an image. As an example, one of Innodata’s customers wanted to annotate a large number of license plates, so Innodata had to work with video images. Its engineers had to build a workbench that could take in 3,000 images at the same time, which meant designing it for high scalability and rapid image loading.

Innodata is pursuing the market for annotating complex documents for tasks such as pharmacovigilance, the monitoring of the effects of medical drugs after they have been licensed for use. In the financial services space, it supports clients with metadata extraction needs for contracts. For life insurance, it is applying machine learning models to look at healthcare data.

“We are also doing a lot of work in the reg-tech space,” says Singhal. “We have legal experts looking at different types of regulations, like FINRA and FCC 30, and tagging that content. Those require high-level expertise and frankly a higher quality of ‘ground-truth’ data that could then be applied in production use cases.”

Companies have been pouring money into proofs of concept for AI applications, says Singhal. He’s seen projections that put the AI product and services market as a whole at around $200 billion by 2024. Companies are increasingly aware that managed data prep services can be vital to those investments seeing real payoffs, he believes.

Innodata is also partnering with systems integrator Persistent Systems, which builds AI applications, to do the front-end work on aggregating and annotating the ground-truth data that can then be applied in these apps.

“I think that we are at the cusp of something,” Singhal says. “I think you will see more and more organizations are looking for that synergy.”
