What do big companies have that most emerging businesses don’t have to help them get value from Big Data? Well, to start with, there’s lots of money and a ton of technology resources.
Never fear. At the upcoming Semantic Tech & Business conference in Berlin, Christopher Testa, CTO of startup WhiteBox Inc., plans to give companies with considerably fewer resources than giants like Google and IBM insight into how to use Big Data as a small, lean startup. His guidance will draw from his own past experiences at Google training AdSense; lessons learned studying the development of IBM’s Watson; and his current efforts to apply Big Data principles to create an expert system for amateur radio operator license exams at his own startup, with limited engineering resources. Most recently Testa was head of engineering at Ad.ly, and that will factor into advice about how to run a data center with free and open source solutions, too.
“The problem I am trying to solve is to design, train and use an expert system that is similar caliber to what I was working on when I was a Google employee,” Testa says. “I started to think about mimicking the technology stack and data access I had while at Google, but with limited computing resources and mostly with free and open source software. My question was, how can I do something of the caliber of the Google search engine without spending much money.”
How indeed. A QA expert system has to draw on a lot of data, of course; it also needs a lot of compute resources and benefits from having a very direct problem to solve, he says. From studying the IBM Watson project and talking with people associated with it, he learned about the vendor’s successful execution model, which combines natural language processing (NLP) over Big Data sets with machine-learning techniques. And he pondered how to apply this approach to a new, smaller problem set at less cost – in his case, questions around amateur radio operator licensing. “But ideally you can apply this to any set of questions as long as you have the training set,” he says.
One resource that helped in applying the approach was the open source tool OpenEphyra for QA, which Testa says is a precursor to the code base from which Watson hails. “It has a bunch of sample question data sets. I just dropped in my questions and then started to go through the motions of finding what it gets right and what wrong,” he says. “There are certain subsets that are easy to improve on and those that will be more difficult.”
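The loop Testa describes, running a question set through a QA system and sorting out what it gets right and wrong, can be sketched roughly as follows. This is only an illustration, not his code and not OpenEphyra's API: the `Question` type, the `accuracy_by_subset` helper, the subset labels, the placeholder questions and the toy answerer are all hypothetical names made up for the example. A real setup would pass in a function that calls the actual QA system.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Question(NamedTuple):
    subset: str    # hypothetical topic label used to group questions
    text: str      # the question as posed to the QA system
    expected: str  # the known correct answer from the training set


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not misses."""
    return " ".join(answer.lower().split())


def accuracy_by_subset(
    questions: Iterable[Question],
    answer_fn: Callable[[str], str],
) -> Dict[str, float]:
    """Return the fraction of correctly answered questions in each subset."""
    right = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q.subset] += 1
        if normalize(answer_fn(q.text)) == normalize(q.expected):
            right[q.subset] += 1
    return {subset: right[subset] / total[subset] for subset in total}


# Placeholder data: made-up stand-ins, not real exam questions.
SAMPLE = [
    Question("regulations", "placeholder question A", "answer a"),
    Question("regulations", "placeholder question B", "answer b"),
    Question("electronics", "placeholder question C", "answer c"),
]


def toy_answerer(text: str) -> str:
    """Stand-in for a real QA system; returns canned answers only."""
    canned = {
        "placeholder question A": "Answer A",
        "placeholder question C": "wrong",
    }
    return canned.get(text, "")


report = accuracy_by_subset(SAMPLE, toy_answerer)

# List the weakest subsets first, since those need the most work.
for subset, score in sorted(report.items(), key=lambda kv: kv[1]):
    print(f"{subset}: {score:.0%}")
```

Grouping the scores by subset is one simple way to see which kinds of questions are easy to improve and which will be harder, which is the distinction Testa draws.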
Testa says he likes to think of his work as creating the poor man’s Watson. But why should emerging and small companies want to pay attention to leveraging Big Data for QA systems? “I think Q&A is just like how every web page now has a search box on it, whether it’s using Google search and its custom site restrict search, or its own search engine for its product. Q&A will be as or more valuable than search engines have been over the coming years,” Testa says. “So, being able to understand and apply this stuff to real world problems is important. People are always trying to find information as quickly and efficiently as possible.”