Daniel Tunkelang, Principal Data Scientist at LinkedIn, delivered the final keynote at SemTechBiz in San Francisco this morning, exploring the way in which “semantics emerge when we apply the right analytical techniques to a sufficient quality and quantity of data.”
Daniel began by offering his key takeaways for the presentation:
- Communication trumps knowledge representation.
- Communication is the problem and the solution.
Knowledge representation, and the systems that support it, are possibly overrated. We get too obsessed, Tunkelang suggested, with building systems that are ‘perfect,’ and with setting out to assemble ‘comprehensive’ sets of data.
On the flip side, computation is underrated – machines can do a lot to help us cope with incomplete or messy data, especially at scale.
We have a communication problem.
Daniel went back to the dream of AI, referencing Arthur C. Clarke’s HAL 9000 and Star Trek’s android, Data. Both, he suggested, were “constructed by their authors as intelligent computers.” Importantly, they “supported natural language interfaces” to communicate with humans. Their creators, Tunkelang suggested, believed that the computation and the access to knowledge were the hard part – communication was an ‘easy’ afterthought.
And in the 1980s we reach Cyc. It was loaded with domain-specific knowledge and more, but “this approach did not and will not get us” anywhere particularly useful.
Moving closer to the present, Freebase. “One of the best practical examples of semantic technologies in the semantic web sense… doing relations across a very large triple store… and making the result available in an open way.” But Freebase has problems, and “they are fundamental in nature.” When you’re dealing with structured data acquired from all over the world, it is difficult to ensure consistency or completeness. “We’re unlikely to achieve perfection, so we shouldn’t make perfection a requirement for success.”
Wolfram Alpha, starting from a proprietary collection of knowledge, is enabling reasoning and analysis over a growing collection of data. Wolfram Alpha is very good when it’s good, but extremely weak when it comes to guiding users toward ‘appropriate’ sources; there is a breakdown in communication, and a failure to manage or guide user expectations.
“Today’s knowledge repositories are incomplete, inconsistent, and inscrutable.”
“They are not sustained by economic incentives.”
Computation is underrated. Take IBM’s Deep Blue, for example: a feat of brute-force computation rather than semantics, intelligence or cleverness. “Chess isn’t that hard.”
Also from IBM: Watson and its win at Jeopardy. “A completely different ball of wax to playing chess,” Jeopardy is far more freeform and unpredictable than rules-based chess. That said, Stephen Wolfram’s blog post from 2011 suggests that internet search engines can also do pretty well at identifying Jeopardy answers.
Google’s Alon Halevy, Peter Norvig and Fernando Pereira suggested in 2009 that “more data beats clever algorithms.”
Where can we go from here? “We have a glut of semi-structured data.”
LinkedIn has a lot of semi-structured data from 160 million members, predominantly in the form of:
- free-text descriptive profile text;
- marked-up (but typically incomplete and ambiguous) statements of employment, education, promotion etc.;
- (also typically incomplete) graph data representing the relationships between people and roles.
Semi-structured search is a killer app. The faceted search UI on LinkedIn lets the user explore and follow connections, without the need for data to be entered in accordance with a strict classification system or data model.
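To make the idea concrete, here is a minimal sketch of faceted filtering over incomplete records. It is an illustration only, not a description of how LinkedIn implements search: the `PROFILES` data, the field names, and the `matches` and `facet_counts` helpers are all hypothetical. The point it shows is that records with missing fields still take part in the search; they simply do not contribute to the facets they lack.

```python
from collections import Counter

# Hypothetical member profiles. Fields are optional, as in real
# semi-structured data: some people list a school, some do not.
PROFILES = [
    {"name": "Ana", "title": "Data Scientist", "company": "Acme", "school": "MIT"},
    {"name": "Ben", "title": "Data Scientist", "company": "Globex"},
    {"name": "Chen", "title": "Engineer", "company": "Acme", "school": "MIT"},
    {"name": "Dee", "summary": "Freelance data scientist"},
]


def matches(profile, filters):
    """True if the profile has every filtered field with the requested value."""
    return all(profile.get(field) == value for field, value in filters.items())


def facet_counts(profiles, filters, facet_fields):
    """Apply the selected facet values, then count values of each facet field.

    Profiles that lack a facet field are kept in the result set but are
    skipped when counting that particular facet.
    """
    selected = [p for p in profiles if matches(p, filters)]
    counts = {
        field: Counter(p[field] for p in selected if field in p)
        for field in facet_fields
    }
    return selected, counts


if __name__ == "__main__":
    # The user clicks the "Acme" company facet; the UI would then show
    # the remaining facet values and counts to guide the next step.
    selected, counts = facet_counts(PROFILES, {"company": "Acme"}, ["title", "school"])
    print([p["name"] for p in selected])
    print(counts)
```

The counts returned alongside the results are what a faceted UI would display as clickable refinements, which is how the interface guides the user without requiring a complete or consistent schema.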
There is no perfect schema or vocabulary. And even if there were, not everyone would use it. Knowledge representation only tends to succeed in narrowly scoped areas. Brute force computation can be surprisingly successful.
Machines don’t have to be perfect. Structure doesn’t have to be perfect. We don’t have to be perfect. Communicate with the user. Offer a UI that guides them and helps them explore. Don’t aim for perfection. Offer just enough to help the user move forward.
“More data beats clever algorithms, but better data beats more data.” Computation isn’t the enemy. Make sure ‘better’ data – from the SemTech community and others – is available to these machines and then we’ll see something remarkable.
For more from Daniel, listen to April’s episode of the Semantic Link podcast in which he was our guest.