Last week The Semantic Web Blog covered the launch of the SindiceTech Assisted SPARQL Editor as an open source project, noting that SparQLed is also part of SindiceTech’s commercial suite for large enterprises building private linked data clouds. This week, we’ll dive a little deeper into SindiceTech and its progress since the founders of the Sindice web of data search engine turned their attention to the commercial application of its technology as a real-time semantic warehousing infrastructure, one that leverages cloud computing to integrate and normalize the massive amounts of data the enterprise must deal with.
As SindiceTech founder and CEO Giovanni Tummarello explains, companies approached his team for help in realizing their visions of using RDF and SPARQL as the best knowledge representation and querying technologies available; what the team could provide was the missing scalability and stability. Sindice.com was evidence that the technology the team had developed could answer these enterprises’ needs: the Sindice.com search engine currently indexes about 700 million semantically marked-up web pages, with a live index of some 80 billion triples that is updated daily. Its database is over 5 terabytes.
Today, SindiceTech counts among its customers Elsevier, the biggest scientific, technical and medical publisher in the world. Other users of its technologies include pharmaceutical companies that have agreed to make part of the outcome of their work with SindiceTech publicly available. These include Lundbeck and Eli Lilly and Co., for which SindiceTech built the dataset and demonstrators currently on its Health Care Life Science portal.
Customers, Tummarello says, “told us they have similar amounts [of data as Sindice.com] and can we help. Can some Sindice technology support us to create a living, real-time updated warehouse where we can look at our data, shape and repurpose and mix it at very large-scale. And, with open data more available we want to leverage that, too.”
One example comes from the pharma companies with which SindiceTech works. The aim is to create a semantic sandbox where large amounts of data (for example, external bioscience datasets and publications) come together in a common space, with the capability of making connections between, say, 1,000 molecules a pharma company is investigating and scientific articles that point to which ones in that set are promising and which ones are toxic.
In Elsevier’s case, SindiceTech is involved in its Smart Content initiative to leverage the value of linked data extracted from its rich annotated data and complex ontologies, helping to power next-generation information products. “The traditional use case for content providers has been the traditional enterprise search approach, by searching for keywords or related concepts and retrieving a list of sources matching the search phrase,” Tummarello says. “The future is when you can not only search the data, but combine the knowledge semantically at large-scale taken from multiple sources to create new value, display mashups, and conduct experiments with a stream of knowledge.”
Move Toward the Enterprise
Last August the team at Sindice.com decided to extract from that search engine the pieces of technology that would be reusable in enterprise settings. “One of the things that is true is there is no single data structure or no single way to process data if you want to create something that scales,” Tummarello says. So SindiceTech proposes infrastructures that create multiple parallel indexes of different kinds, so that each request can be routed to the most suitable one.
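The idea of keeping several parallel indexes and routing each request to the most suitable one can be illustrated with a small sketch. This is only a toy under stated assumptions, not SindiceTech’s implementation: the triples, the class names (`SubjectIndex`, `KeywordIndex`, `Router`) and the two request kinds are all hypothetical, chosen to show the same data indexed twice, with each copy tuned to one kind of lookup.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("drug:aspirin", "rdf:type", "Drug"),
    ("drug:aspirin", "label", "Aspirin pain relief"),
    ("drug:aspirin", "targets", "protein:cox1"),
    ("protein:cox1", "rdf:type", "Protein"),
    ("protein:cox1", "label", "Cyclooxygenase 1"),
]


class SubjectIndex:
    """Index keyed by subject: fast for 'everything about entity X'."""

    def __init__(self, triples):
        self.by_subject = defaultdict(list)
        for s, p, o in triples:
            self.by_subject[s].append((p, o))

    def lookup(self, subject):
        return sorted(self.by_subject.get(subject, []))


class KeywordIndex:
    """Inverted index over label words: fast for keyword search."""

    def __init__(self, triples):
        self.by_word = defaultdict(set)
        for s, p, o in triples:
            if p == "label":
                for word in o.lower().split():
                    self.by_word[word].add(s)

    def lookup(self, word):
        return sorted(self.by_word.get(word.lower(), set()))


class Router:
    """Builds both indexes from the same data and routes by request kind."""

    def __init__(self, triples):
        self.indexes = {
            "entity": SubjectIndex(triples),
            "keyword": KeywordIndex(triples),
        }

    def query(self, kind, value):
        if kind not in self.indexes:
            raise ValueError(f"no index for request kind: {kind!r}")
        return self.indexes[kind].lookup(value)


router = Router(TRIPLES)
print(router.query("keyword", "aspirin"))
print(router.query("entity", "protein:cox1"))
```

Each index answers only one kind of question, but answers it with a single dictionary lookup; the router’s job is simply to pick the right one. A real system would also have to keep the indexes consistent when the underlying data changes, which this sketch ignores.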
The SindiceTech Suite includes:
- Its Cloud Spaces middleware is a workflow engine that coordinates three elements: big data processing capabilities (Hadoop all the way to NoSQL datastores), semantic web knowledge representation tools (RDF used as lingua franca throughout the whole process), and cloud capabilities. SindiceTech creates semantic sandboxes in the cloud (private or Amazon-powered). These are basically data spaces where data from anywhere and in any format can be merged into a single box and then loaded to parallel indexes (or denormalized) for enterprise-speed, so users can experiment and extract value from the data as they go. “You create different versions and each one is very, very fast for one specific task, and the combination of this collectively is just really fast,” he says. The cloud and parallel index design means that users can make changes or improvements to data and then recalculate its whole path in the cloud, allocating more computers as needed and hot swapping so live services continue while calculations are underway.
- SIREn (Semantic Information Retrieval Engine), the engine behind Sindice.com, is the semantic relational plug-in extension to Apache Solr that efficiently indexes and queries RDF, as well as any textual document with an arbitrary number of metadata fields. “Those who know Solr are extremely excited to know they can use it to index data and their relations to other data, for example, as they might have in a relational database,” says Tummarello.
- The other two components of the enterprise suite are SparQLed, for assisting users with writing SPARQL queries (see the article here about it and its release as open source), and PivotBrowser, which debuted at the Semantic Web Technology and Business conference in San Francisco in June. PivotBrowser “is relational faceted browsing enabled by a combination of our technologies, that is being patented,” Tummarello says. With PivotBrowser, users don’t simply restrict a single type of entity as they navigate collections; they can also restrict the things that are connected to those entities. “So SindiceTech is not just indexing, but providing a large-scale way of interacting with data,” he says. “PivotBrowser is not just a ‘front-end,’ ” however, “… [but] becomes possible just as an application of the large-scale transformation and indexing technologies which are in the Suite.”
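To make the notion of relational faceted browsing more concrete, here is a minimal sketch, loosely modeled on the molecules-and-articles example above. It is an assumption-laden illustration and not PivotBrowser’s patented approach: the triples and the helper functions `of_type`, `restrict` and `pivot` are hypothetical, and the data is small enough to scan in memory.

```python
# Hypothetical toy data: (subject, predicate, object) triples.
TRIPLES = [
    ("mol:1", "type", "Molecule"),
    ("mol:2", "type", "Molecule"),
    ("art:a", "type", "Article"),
    ("art:b", "type", "Article"),
    ("art:a", "mentions", "mol:1"),
    ("art:b", "mentions", "mol:2"),
    ("art:a", "finding", "toxic"),
    ("art:b", "finding", "promising"),
]


def of_type(triples, type_name):
    """Start a browsing session with every entity of one type."""
    return {s for s, p, o in triples if p == "type" and o == type_name}


def restrict(entities, triples, predicate, value):
    """Ordinary facet: keep entities whose predicate has the given value."""
    matching = {s for s, p, o in triples if p == predicate and o == value}
    return entities & matching


def pivot(entities, triples, predicate):
    """Relational step: move to the entities linked from the current set."""
    return {o for s, p, o in triples if p == predicate and s in entities}


# Restrict articles by a facet, then pivot to the molecules they mention.
articles = of_type(TRIPLES, "Article")
toxic_articles = restrict(articles, TRIPLES, "finding", "toxic")
flagged_molecules = pivot(toxic_articles, TRIPLES, "mentions")
print(sorted(flagged_molecules))
```

The difference from plain faceted search is the `pivot` step: the restriction applied to articles carries over to a different type of entity, the molecules connected to them. At the scale the article describes, each step would have to be served from precomputed indexes rather than by scanning every triple as this sketch does.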
The SindiceTech site showcases its Linked Data Clouds technology operating on top of non-trivial data sets: the aforementioned health care/life sciences portal and the web of data (which is Sindice.com), with publishing on the way. At the HCLS portal, you can see its Assisted SPARQL query facility and its real-time relational faceted browser for drug discovery running live. Tummarello says that the HCLS Linked Data cloud data set is over 1 billion triples, and the team is contemplating making this an open, shared community project to further expand the cloud.
“We’re thinking of providing the technology free to anyone who wants to contribute to an open Linked Data cloud on health topics,” he says. “We don’t aim to create the ultimate data cloud on this but we’re happy to partner with someone who wants to do that.”
Tummarello says the company also has seen interest from the government sector, and expects that more and more verticals will have the need “for ingesting a lot of different data whose structure is not known and getting value out of that.”