Wavii made waves when it launched its automated, algorithm-driven news aggregation service in April, which has been billed as making Facebook out of Google. But what makes Wavii work? When The Semantic Web Blog found a picture in Wavii.com’s Flickr photostream featuring the word “predicates” on a white board, it was time to discover whether Semantic Web standards had anything to do with its engine.
Turns out, RDF is not at play here. But natural language processing certainly factors in, albeit from the perspective of information extraction and being almost entirely machine-learning based rather than deep-parsing oriented. The service’s technology is influenced by the expert machine-reading NLP work being done at the University of Washington, where Wavii advisor Oren Etzioni is a professor of computer science.
But Wavii CEO and founder Adrian Aoun can credit growing up in a household where his father was a linguist, and a student of Noam Chomsky, for originally sparking his interest in how language works. Whenever his dad would have a debate about a language construct with his fellow MIT buddies, he says, “they’d turn to me and say which one sounds better…. The irony is that they’re arguing over the rules, but they acknowledge the right answer is whatever humans do.”
Wavii takes its hypothesis from the principle that humans don’t learn language based on the structure of language, or a syntactic view of it, but on a semantic understanding of language – one concept at a time. The tot that finally put together a sentence like, “I need a glass of milk,” got to that point by discovering patterns in the language being spoken all around him, Aoun says. And probably failing a few times in the process, applying a phrase like “I need a” to the word “happy” before being corrected and understanding that “I need a” requires an object to successfully complete the sentence.
And, “if you think about that in computer science terms, the first thing [the child] did was listen to a lot of things. That’s a big data problem. We need to get a lot of sentences into our system and then we try to discern a pattern,” he says. “That’s a machine-learning problem. Once [the child] discerns patterns, he has a rough sketch, and then the child attempts the pattern.”
Wavii’s approach similarly asserts the pattern, learning a term like “engagement” and looking across the web for all the celebrities and other news-makers who’ve announced they’re tying the knot. But it can make mistakes too, pairing, say, Barack Obama and Mahmoud Ahmadinejad for being “engaged” – but in a heated debate. “Remember how you have to course-correct the kid – we have to do the same thing,” Aoun says. “So the system learns more and more nuances of concepts over time. We call that active learning, which is a machine-learning concept.”
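To make the course-correction loop concrete, here is a minimal Python sketch. It is not Wavii’s code, which Aoun does not describe in detail; the `ConceptLearner` class, its word-count scoring and the example sentences are all assumptions made for illustration. The toy learner asserts a pattern whenever a trigger word appears, then uses human feedback to learn which surrounding words signal a false match.

```python
import re
from collections import Counter


class ConceptLearner:
    """Toy learner for a single concept, keyed on one trigger word (hypothetical)."""

    def __init__(self, trigger):
        self.trigger = trigger
        self.confirmed = Counter()  # context words from examples a human accepted
        self.rejected = Counter()   # context words from examples a human rejected

    def _split(self, sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        context = [w for w in words if w != self.trigger]
        return words, context

    def matches(self, sentence):
        """Assert the pattern: trigger present and context not judged negative."""
        words, context = self._split(sentence)
        if self.trigger not in words:
            return False
        score = sum(self.confirmed[w] - self.rejected[w] for w in context)
        return score >= 0

    def correct(self, sentence, is_example):
        """Human feedback: record the context words of a right or wrong match."""
        _, context = self._split(sentence)
        (self.confirmed if is_example else self.rejected).update(context)


learner = ConceptLearner("engaged")
debate = "Obama and Ahmadinejad engaged in a heated debate"

print(learner.matches(debate))  # True: the untrained system makes the mistake

learner.correct("Kate and William are engaged to be married", is_example=True)
learner.correct(debate, is_example=False)

print(learner.matches(debate))                       # False after correction
print(learner.matches("Alice and Bob are engaged"))  # True
```

A production system would use a trained statistical model rather than raw word counts, and true active learning would also choose which uncertain examples to send to a human. The loop is the same, though: assert, get corrected, refine.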
The engine behind Wavii is constantly refining what it knows, and also learning concepts as it crawls the real-time web. As it discovers new things for which it has no context, it can leverage the input of the Wavii team to help — defining, for example, what a patent acquisition is, and how it should be visualized for users. That can include featuring the price and number of patents in a deal.
“Most of the NLP companies – at least the ones we know – are focused on the linguistic approach, but we’re the big data and machine-learning approach, where we teach the computer the concepts of language, one at a time,” says Aoun. “The beauty is we created a system that learns a lot like you and I do. So it gets smarter and smarter every day.”
When A Predicate Is Not A Predicate
One reason for not using RDF, says Aoun, is that “it is predicated on having one schema for everything, and that doesn’t make sense for our app.” As he explains further, “RDF is optimized for situations in which you don’t clearly understand ahead of time the queries you will be using at run-time. Since we do, it’s more appropriate for us to store the data in ‘data bags’ optimized for our queries.” At the end of the day, many things CAN be done in RDF, but it doesn’t mean it’s the right choice for every scenario, he says.
The Facebook metaphor goes to work here, as Aoun describes that service’s predefined story types, informed by some level of human knowledge around the aspects about them that matter (such as what friends checked into a place together), and how best to visualize them. “So, Facebook basically says, let’s define a schema for each type of event that can occur. But then beyond that, they say when we visualize it, when we show this to the user, let’s use some intelligence to find what is the best visualization.
“We have the equivalent of Facebook. You see that a company buys another company or someone left one job for another. There are thousands and thousands of these [kinds of things], and each one of them has its own schema. So the way we sort our data is using common schema frameworks, so that each thing can say, here is the schema I want to invoke.”
Going back to the romance scenario, moving from engagement to marriage, the service aims to define what Aoun says you might think of as the highlights or plot points around nuptials. For instance, think back to when Kate Middleton married Prince William: “There would be a lot of things we care about when the wedding comes about, and our system would see what it can pull out. Can I get the maid of honor — yes. Or who made the dress…. This comes to how we think of using the word predicate.”
That is, the predicate is the main thing that occurred, in this example, the marriage. The sub-predicate is any of the other details that you aren’t required to have, but that would make things nicer if you did. Those could be things Wavii picked out of the language, or it could be a photo or video or a chart. Wavii, by the way, doesn’t actually use the word predicate in its external communications.
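One way to picture the per-event schema and the predicate/sub-predicate split is the Python sketch below. It is purely illustrative: `EventSchema`, `Event` and the field names are hypothetical and do not come from Wavii. The wedding details are publicly known facts about the 2011 royal wedding, not output from Wavii’s system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventSchema:
    """Schema for one kind of event: what must be known, what is nice to have."""
    predicate: str             # the main thing that occurred
    required: tuple            # details the event cannot do without
    sub_predicates: tuple = () # optional extras (names, photos, charts...)


@dataclass
class Event:
    schema: EventSchema
    details: dict = field(default_factory=dict)

    def is_complete(self):
        """True once every required detail has been extracted."""
        return all(key in self.details for key in self.schema.required)

    def extras(self):
        """Optional details that were found and can enrich the visualization."""
        return {k: v for k, v in self.details.items()
                if k in self.schema.sub_predicates}


# Each event type invokes its own schema (field names are assumptions).
MARRIAGE = EventSchema(
    predicate="marriage",
    required=("spouse_a", "spouse_b"),
    sub_predicates=("maid_of_honor", "dress_designer", "photo"),
)

wedding = Event(MARRIAGE, {
    "spouse_a": "Kate Middleton",
    "spouse_b": "Prince William",
    "maid_of_honor": "Pippa Middleton",
})

print(wedding.is_complete())  # True
print(wedding.extras())       # {'maid_of_honor': 'Pippa Middleton'}
```

The missing dress designer does not stop the story from being shown. It is simply a sub-predicate the system has not pulled out yet.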
All this feeds back into how Wavii does what it does, which basically is to conceptually organize the web around the events that occur. “It is the concept of Facebook in that we try to create in essence the same thing, but for the Internet. So we need to detect when events occur and get all the information we need to about the event. That is the semantic challenge,” he says.
The service is ingesting millions of pieces of content per day, processing each in real time at its own pace. So, something that might start life out as a rumor can be updated 5 minutes later as fact when details have been confirmed. “There’s no batch processing,” he says. “Most NLP is very computationally expensive. The reason for that is they’re parsing language,” using the word predicates in the English grammar context, so to speak. “We said what’s really fast is machine learning. So we can do these things incredibly quickly, and run a smaller infrastructure than most start-ups do, and we still are crawling this real-time content and indexing it all day long,” he says.
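The rumor-to-fact update can be sketched in a few lines of Python. The article gives no implementation details, so the `Story` and `StoryStream` names, the `confirmed` flag and the placeholder company names are all assumptions. The point is only that each item is handled as it arrives and updates the existing story in place, with no batch job.

```python
from dataclasses import dataclass


@dataclass
class Story:
    key: tuple        # e.g. (event type, participants...)
    status: str       # "rumor" or "confirmed"
    mentions: int = 1


class StoryStream:
    """Processes each incoming item immediately (hypothetical sketch)."""

    def __init__(self):
        self.stories = {}

    def ingest(self, key, confirmed):
        story = self.stories.get(key)
        if story is None:
            # First sighting: create the story at whatever certainty we have.
            story = Story(key, "confirmed" if confirmed else "rumor")
            self.stories[key] = story
        else:
            story.mentions += 1
            if confirmed:
                # A later, confirmed report upgrades the rumor to fact.
                story.status = "confirmed"
        return story


stream = StoryStream()
deal = ("acquisition", "Company A", "Company B")  # placeholder names

print(stream.ingest(deal, confirmed=False).status)  # rumor
print(stream.ingest(deal, confirmed=True).status)   # confirmed
```

Once confirmed, a story stays confirmed even if later items repeat the rumor. That is a design choice made for this sketch, not something the article states.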
Aoun won’t be specific about how many concepts Wavii so far knows about, but he will say it’s greater than one and less than one billion. He says the service also isn’t at the point of releasing any satisfaction or other metrics, “but I will tell you that we are very happy with the user response. We do absolutely believe that there is a user need for our product and we are seeing that from our users today.”
Building a Business
Its business model will center around an audience monetization strategy as more traffic comes its way, advertising obviously being a focus. Where Aoun thinks opportunity lies is in marrying context and concept intelligence with user understanding. “The nice thing for us is our understanding of the user is … actually their interest graph, which is valuable, and then we put it in context,” he says. And even if the whole idea of Wavii doesn’t work out, “turns out we have another strategy going,” he says.
That is, unlocking all the data on the web, as Wavii wants to do, so that people can do real semantic search, asking questions and getting answers. That opens the door to enormous power that others will be eager to access, he says. “There are tons of scenarios, finance and business, sports and entertainment. Information is gold.”
Wavii, now that the launch is done, is focusing on feeding more and more data into the system, to deliver more topics and more coverage. “We’re not crawling everything everywhere, but that is the vision,” he says. “We don’t have every concept, but it is the vision.
“And we want to give users experiences around it, too – for instance, if two celebrities are dating, they may want to see who else each person dated. Allowing users to experience this information is really valuable.”