I observed three trends at SemTech 2011 in San Francisco last week. The first was the increased role of native XML databases used in combination with RDF data stores. The second was the large number of natural-language processing tools and vendors at the conference. The third was the role of semantic annotations and standards placed directly in web content. I think these trends are related.
One of the keynote presentations at the SemTech 2011 conference was given by the BBC. They presented their core architecture for managing web content as having two main components: a native XML database (MarkLogic) for content and an RDF triple store for "metadata." These tools were at the core of the architecture for their web sites.
Another presentation was given by the Mayo Clinic. They are also using MarkLogic for web content alongside semantic web technologies. Their diagrams show that there are many ways for these systems to interact.
The presentation by DERI showed that SPARQL queries could be converted to XQuery using the XSPARQL extensions to allow the users to intermingle both XML and RDF queries.
MarkLogic also presented their method of capturing graph structures in XML and running SPARQL benchmarks on these structures. They showed SPARQL benchmark performance very similar to that of plain triple stores.
My presentation also discussed the merits of using NoSQL for metadata management. I showed that simple XPath expressions combined with XQuery templates make it easy for non-programmers to build full CRUDS applications, and to use them to build and maintain complete metadata registries that store enterprise semantic data.
An Emerging Trend: Native XML and RDF together
What is clear from these presentations is that all of them use native XML databases to store content and combine these systems with graph-type queries when the business rules require it. None of the systems used a traditional RDBMS to store web content. The process of serializing and de-serializing web content into RDBMS tables seems to have become far more complex than is justified by the value it brings. Keeping things simple brings huge benefits in terms of content management tools and performance.
Engines and Sidecars
The metaphor I have been using describes modern applications as centered on native XML databases, with very high participation of non-programmers and very high-performance queries on both tree and graph data. Some queries are best expressed in XQuery and some in SPARQL. The key is to pick the right NoSQL solution for the right job. In the long term we expect to see systems that can perform both kinds of queries on similar structures without moving the underlying data. The image I try to capture is the native XML database as the "motorcycle" and the RDF triple store as the "sidecar" for specific inferences.
Motorcycles and Sidecars
Native XML Store and RDF Stores
Native XML stores work best when you can express your data sub-collection as one or more path expressions to the data. If you can express your data collection as a path, you get all the benefits of a true NoSQL system: schema-free, dynamic, and able to handle the very high variability typical of metadata.
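To make the idea of a path expression concrete, here is a minimal sketch in Python using the standard library's limited XPath support. It is not XQuery and not what any of the presenters showed. The registry document, its element names, and the `approved_names` function are all hypothetical, invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata registry. Note that the entries do not share a
# fixed schema: some have a definition, some a steward, some neither.
REGISTRY = """
<registry>
  <data-element status="approved">
    <name>PersonGivenName</name>
    <definition>The first name of a person.</definition>
  </data-element>
  <data-element status="draft">
    <name>PersonBirthDate</name>
    <steward>HR</steward>
  </data-element>
  <data-element status="approved">
    <name>OrganizationName</name>
  </data-element>
</registry>
"""


def approved_names(xml_text: str) -> list[str]:
    """Select a sub-collection with a single path expression."""
    root = ET.fromstring(xml_text)
    path = "./data-element[@status='approved']/name"
    return [element.text for element in root.findall(path)]


print(approved_names(REGISTRY))
```

The sub-collection is defined entirely by the path string, so adding new optional elements to some entries does not require changing the query.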
So why use RDF and SPARQL? Because there are some occasions when you simply cannot easily express your query as a path expression. XQuery has also grown in the number of rule engines that can be used with it. Not only do native XML systems use XML Schema, Schematron and XProc to validate and process XML, they can also be integrated with easy-to-maintain hierarchical decision trees for rule processing and machine learning. Although XQuery has grown dramatically in its ability to handle complex data, and there are now many XQuery extension modules, there are still occasions when the RDF/SPARQL model is a better fit than the tools in the traditional XML stack. This is why these hybrid architectures are appearing.
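As an illustration of a question that does not map naturally onto a single path expression, consider following a relationship through an unknown number of hops. The sketch below is plain Python over a list of triples, not SPARQL and not a real triple store. The triples, the `broader` predicate, and the `ancestors` function are assumptions made up for this example.

```python
from collections import deque

# Hypothetical (subject, predicate, object) triples.
TRIPLES = [
    ("Terrier", "broader", "Dog"),
    ("Dog", "broader", "Mammal"),
    ("Mammal", "broader", "Animal"),
    ("Cat", "broader", "Mammal"),
    ("Dog", "label", "dog"),
]


def ancestors(term, triples, predicate="broader"):
    """Return every node reachable from term by following predicate
    edges any number of times (a breadth-first graph traversal)."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for subject, pred, obj in triples:
            if subject == current and pred == predicate and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


print(sorted(ancestors("Terrier", TRIPLES)))
```

The depth of the chain is not known in advance, which is the kind of graph-shaped question where a triple store and a graph query language tend to be the more natural tool.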
Natural Language and the Rise of Annotations
One of the key changes we are seeing is the increased participation of Natural Language Processing (NLP) tools on the web and in enterprise content. These tools allow systems to extract key entities (people, places, products, etc.) from text and store references to these items directly in the text as annotations. The ability to store annotations directly in XML web content and to query them with XQuery makes finding and managing these annotations easy.
High-quality NLP tools are still mostly commercial products. NLP tools that annotate your content within a native XML data store are still years away from being an integral part of the open source community. But the quality is increasing and the costs are decreasing. NLP tools are increasingly being deployed as cloud-based services, but the need to insert annotations within content is clear.
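A small sketch may help show what inline annotations look like and why they are easy to query. It uses Python's standard library rather than XQuery. The `entity` element, its `type` attribute, and both functions are hypothetical and do not reflect the output format of any particular NLP product.

```python
import xml.etree.ElementTree as ET

# Hypothetical article in which an NLP tool has wrapped each recognized
# entity in an <entity> element, leaving the surrounding text intact.
ARTICLE = """
<article>
  <p><entity type="person">Ada Lovelace</entity> worked with
     <entity type="person">Charles Babbage</entity> in
     <entity type="place">London</entity>.</p>
  <p>The <entity type="product">Analytical Engine</entity> was never built in
     <entity type="place">London</entity>.</p>
</article>
"""


def entities_by_type(xml_text: str) -> dict[str, list[str]]:
    """Group the distinct annotated entities by their type attribute."""
    root = ET.fromstring(xml_text)
    grouped: dict[str, set[str]] = {}
    for entity in root.iter("entity"):
        grouped.setdefault(entity.get("type"), set()).add(entity.text)
    return {kind: sorted(names) for kind, names in grouped.items()}


def plain_text(paragraph: ET.Element) -> str:
    """Recover the readable sentence, ignoring the annotation markup."""
    return " ".join("".join(paragraph.itertext()).split())


print(entities_by_type(ARTICLE))
print(plain_text(ET.fromstring(ARTICLE).find("p")))
```

Because the annotations live inside the content, the same document serves both the reader, who sees the plain sentence, and the query, which sees the entities.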
These developments, combined with the announcements around schema.org by the major search vendors, show that there is a growing need for semantic markup within HTML that can be quickly queried with tools like XQuery. But these queries cannot be done with graphs. Graph stores are ideal for inference, but they were never designed to store web content. Hybrid systems are clearly an architecture for the future.