A couple of weeks ago The Semantic Web Blog asked our audience whether they thought Larry Page taking over as Google’s CEO might result in semantic search efforts gaining more traction at Google. The results of that poll, which you can find here, are that 60 percent of respondents think that will be the case (as of this writing).
Maybe so. Last week the search engine giant closed on its acquisition of travel technology provider ITA Software for $700 million. In completing the deal, Google senior VP Jeff Huber posted the comment that a main reason for the buy of the purveyor of airline-data organizing software was that it would make booking travel arrangements more semantic. OK, he didn’t exactly say that but he came close – his actual comment was: “How cool would it be if you could type ‘flights to somewhere sunny for under $500 in May’ into Google and get not just a set of links but also flight times, fares and a link to sites where you can actually buy tickets quickly and easily?”
But the acquisition might unlock more than the doors to travel. Over at the ITA Lab you’ll find projects that include Needlebase and the Needle data aggregation tool for acquiring, integrating, cleansing, analyzing and publishing data on the web. It’s too soon for ITA’s Glenn McDonald, project manager for Needle and designer, Semantic-Web data exploration, to give any details on where his work might lead now that Google has bought the company, but he can say that the work has been about way more than flight data – or search, for that matter.
If you search for flights in Matrix, the ITA airfare search engine, you will get help from Needle’s event aggregation database, which uses its semantic dedupe and data-cleansing algorithms to handle niceties like combining nine different listings for the same concert into a single record (even after the data is refreshed from the original source).
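To make the idea of folding duplicate listings into one record concrete, here is a minimal Python sketch. It is not Needle’s algorithm, which the article does not describe; it assumes a simple identity rule (same normalized artist, venue and date means same concert), and the field names and sample listings are made up for illustration. Because the key is recomputed from the data itself, re-running the merge on refreshed source data produces the same combined record.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def dedupe_key(listing):
    # Assumed identity rule (hypothetical): same artist, venue and date
    # means the listings describe the same concert.
    return (normalize(listing["artist"]), normalize(listing["venue"]), listing["date"])

def merge_listings(listings):
    """Group listings by key and remember every source that mentioned the event."""
    groups = defaultdict(list)
    for listing in listings:
        groups[dedupe_key(listing)].append(listing)
    merged = []
    for (artist, venue, date), members in groups.items():
        merged.append({
            "artist": artist,
            "venue": venue,
            "date": date,
            "sources": sorted({m["source"] for m in members}),
        })
    return merged

# Made-up listings: the first two describe the same concert with messy spelling.
listings = [
    {"artist": "Example Band", "venue": "Example Hall", "date": "2011-05-01", "source": "site-a"},
    {"artist": "example band!", "venue": "Example  Hall", "date": "2011-05-01", "source": "site-b"},
    {"artist": "Other Act", "venue": "Example Hall", "date": "2011-05-02", "source": "site-a"},
]
merged = merge_listings(listings)
```

A real system would need fuzzier matching than exact keys (misspellings, alternate venue names), but the shape of the problem is the same: decide what makes two listings the same thing, then keep one record per thing.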
So, data curation – whether web-scraped semi-structured content, internal structured data (CSV, XML) that needs to be reconciled, integrated and explored, or directly entered information – is the direction Needle’s been moving in. “The project began with a search angle but it’s evolved more into data curation,” says McDonald. “You are a person in a data-intensive field of knowledge, and you are building a data space to support that.” To that end, the “philosophy is we should let you into your data, to explore and see it,” says McDonald. “The classic database model where you send requests to IT for a report and they send you a spreadsheet to look at it – it’s hard to do anything with that, it’s hard to tell if you’re missing things, it’s hard to know if it’s right.” While your results may look like a report, the underlying data is accessible with a click.
Everything in Needle is always live queries, and because every piece of data is an equal peer, the system can do anything the data logically allows. “You can look at the data from any perspective, whether the original creator thought about it in that direction or not,” he says. “One person’s list of schools by region is another person’s list of regions with their schools,” for example.
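The schools-and-regions example can be sketched in a few lines of Python. This is only an illustration of the idea that one set of relationships can be read from either end; it says nothing about how Needle stores or queries data, and the `group` helper and the school names are invented for the example.

```python
from collections import defaultdict

# Made-up (school, region) pairs; each pair is a single fact with no
# preferred direction.
located_in = [
    ("Adams High", "North"),
    ("Baker Academy", "North"),
    ("Clark School", "South"),
]

def group(pairs, key_index):
    """Group pairs by one side (0 = school, 1 = region), listing the other side."""
    value_index = 1 - key_index
    result = defaultdict(list)
    for pair in pairs:
        result[pair[key_index]].append(pair[value_index])
    return dict(result)

# The same facts, viewed from each direction.
regions_with_schools = group(located_in, key_index=1)
schools_with_regions = group(located_in, key_index=0)
```

Neither view is the “real” one; both are computed on demand from the same pairs, which is the sense in which every piece of data is an equal peer.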
Data extraction is helped by advanced machine-learning techniques. “Machine learning is a big lever for you to do something faster,” he says. Give Needle a handful of examples (fewer for those generated by a behind-the-scenes database, more if templates are the route) about how you want it to navigate through a web site, what data you want out, and how to tag it and arrange the results, and it goes to work. Needle is building its own data sets (heads up: there are job openings for semantic database curators and testers to build demo databases and more), and then there is Google’s acquisition of Freebase and its building of a reference database of entities – “a core data set around which the world could sort of circulate…[it] would be cool to integrate with that,” McDonald says. But that’s just an idea, for now.
Another place the Google acquisition might be an advantage for Needle is the search giant’s mighty server farm. “There’s a tradeoff with any sort of data analysis system between what degree of expressiveness you have and what degree of scaling you have,” he says. “At the moment we can do a lot with small or medium or big data sets, depending on how you think of them, but I was just at the Big Data Conference, and when they say big they mean really big. Orders of magnitude big.” Currently individual data sets are kept in memory so gargantuan isn’t an option for Needle (and McDonald points out that size isn’t the main dimension it’s concerned with, anyway), but it’s also working on a version that can run off disks to help on the size front.
And speaking of expressiveness, it’s the lack of it in RDF and SPARQL that keeps Needle from using them (though users can model a particular data set as triples in Needle if they want to). “The project is, in my opinion, part of the same topic area as everything labeled semantic web and data, but we don’t use RDF and SPARQL because we don’t think they are expressive enough for what people need,” he says.
While RDF gets credit for pushing the conversation about data representation, and its graph structure is better than a relational database table structure, it still takes apart the logical structure of data for the machine’s convenience. “When you’re curating a music poll [which Needle recently did for the Village Voice’s Pazz & Jop Music Critics Poll], you don’t care about assertions but about voters, albums, and artists, and you want to ask questions about patterns in actual logical data – not about triples.”
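The difference between assertions and logical records is easier to see side by side. The Python sketch below uses plain tuples as stand-in triples, not an RDF library, and it does not reflect Needle’s actual data model. The ballots, voters and albums are invented and are not taken from the Pazz & Jop poll.

```python
from collections import Counter

# Each ballot is spread across three (subject, predicate, object) assertions.
triples = [
    ("ballot1", "voter", "Voter A"),
    ("ballot1", "album", "Album X"),
    ("ballot1", "points", 10),
    ("ballot2", "voter", "Voter B"),
    ("ballot2", "album", "Album X"),
    ("ballot2", "points", 5),
    ("ballot3", "voter", "Voter A"),
    ("ballot3", "album", "Album Y"),
    ("ballot3", "points", 8),
]

def records_from_triples(triples):
    """Reassemble the logical records that the triples took apart."""
    records = {}
    for subject, predicate, obj in triples:
        records.setdefault(subject, {})[predicate] = obj
    return list(records.values())

def points_by_album(ballots):
    """A question asked in terms of ballots and albums, not assertions."""
    totals = Counter()
    for ballot in ballots:
        totals[ballot["album"]] += ballot["points"]
    return totals

ballots = records_from_triples(triples)
totals = points_by_album(ballots)
```

The question “how many points did each album get?” only becomes natural once the ballots are whole again, which is roughly the complaint: the triple form makes the curator do the reassembly before any real question can be asked.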
Another way of saying it, he adds, is that for a lot of data purposes we think triples are too low-level an abstraction. And when it comes to SPARQL, those issues go a bit beyond just expressiveness.
“We use our query language behind the scenes to power every aspect of the UI, and thus it’s critical that it be compositional, so that columns and views and things can all be defined as little snippets that can be strung together. SPARQL isn’t as compositional by nature,” McDonald says. “We also believe that a really good query language should not just be for humans communicating with the machines, but for humans communicating with each other about data in a structured way. SPARQL is pretty verbose, and the syntax elements tend to dominate the actual content of queries.”
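What “little snippets that can be strung together” might look like can be sketched with ordinary function composition in Python. The article does not show Needle’s query language, so nothing here is its syntax; `pipeline`, `where`, `column` and `distinct` are hypothetical helpers, and the album data is made up.

```python
from functools import reduce

def pipeline(*steps):
    """String snippets together: the output of each step feeds the next."""
    def run(rows):
        return reduce(lambda acc, step: step(acc), steps, rows)
    return run

def where(predicate):
    """Snippet that keeps only the rows matching a condition."""
    return lambda rows: [r for r in rows if predicate(r)]

def column(name):
    """Snippet that projects a single field out of each row."""
    return lambda rows: [r[name] for r in rows]

def distinct(values):
    """Snippet that drops repeated values while keeping first-seen order."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen

# Made-up album records.
albums = [
    {"artist": "Artist A", "title": "Album X", "year": 2010},
    {"artist": "Artist A", "title": "Album Y", "year": 2010},
    {"artist": "Artist B", "title": "Album Z", "year": 2009},
]

# Small named pieces, each usable on its own ...
from_2010 = where(lambda r: r["year"] == 2010)
artists = pipeline(column("artist"), distinct)

# ... and a larger query built only by stringing them together.
artists_from_2010 = pipeline(from_2010, artists)
result = artists_from_2010(albums)
```

Because a pipeline is itself just another snippet, a column or view defined once can be reused inside any larger query, which is the property a UI driven entirely by queries would depend on.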