Paul Wilton was technical and development lead for semantic publishing at BBC News and Sport Online during the 2010 World Cup. He is currently technical architect at Ontoba. In this interview, a supplement to “Dynamic Semantic Publishing for Beginners”, Paul describes the current landscape for dynamic semantic publishing (DSP) as it applies to news organizations.
Q. Are you seeing a wide disparity in the way that news organizations have approached the creation and use of semantically-linked (or annotated) content?
A. Actually, the pattern and often the (general) technical architecture are surprisingly similar. What differs is the applications, the models used and the instance data. This is undoubtedly bleeding-edge technology, and typically the impetus to begin investigating the use of linked data, RDF and semantics in the technology stack has come from within the Information Architecture and R&D teams, not from the offices of the CTO/CIO. Maybe this is starting to change now.
Q. Do many news organizations have the resources (staff and/or Content Management Systems) that are able to publish and use semantic data?
A. Not in our experience, but this shouldn’t be a barrier to integrating semantic technologies and publishing linked data.
The key components to adopting semantic publishing – a semantic repository (triple store); appropriate linked data sets; and the ability to semantically annotate your content – can be built alongside an existing Content Management System.
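As a rough illustration of how these three components relate, the sketch below uses a tiny in-memory stand-in for a triple store. It is not the BBC or Ontoba implementation; the class name and every identifier in it (the `ex:` and `cms:` prefixes, the story number) are assumptions made up for the example. The point is only that reference data and annotations live beside the CMS and refer to CMS content by identifier.

```python
# Illustrative sketch only: an in-memory stand-in for a semantic repository.
# Every identifier below (ex:..., cms:...) is a hypothetical example.

class TinyTripleStore:
    """Holds (subject, predicate, object) triples and answers pattern queries."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples, sorted; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


store = TinyTripleStore()

# 1. Reference (linked) data describing things in the domain.
store.add("ex:team/england", "rdf:type", "ex:Team")
store.add("ex:team/england", "rdfs:label", "England")

# 2. A semantic annotation: a story that stays in the existing CMS
#    is tagged with a reference-data identifier.
store.add("cms:story/1001", "ex:about", "ex:team/england")

# 3. The repository can now answer "what is about England?"
about_england = store.match(predicate="ex:about", obj="ex:team/england")
```

A production system would use a real triple store and full URIs, but the separation is the same: the CMS keeps the content, and the repository keeps the statements about it.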
With our clients, we’ve recommended that they follow Agile software development principles and start by building small-scale prototypes. The last thing a CTO wants to hear is that we need to interfere with their existing publishing system straight away. We advocate building on the periphery of an existing architecture and then integrating iteratively over time.
With most of our clients, we’ve started by selecting a suitable content area to model – one that demonstrates immediate value to that business – and then developed a small-scale prototype on that vertical slice of content. By building lightweight applications that are peripheral add-ons to an existing CMS we have been able to significantly de-risk our projects and deliver them with the minimum amount of interference to existing publishing systems and processes.
This is how the BBC semantic publishing platform was built, as a peripheral enhancement, even through to live production. Tighter integration into the CMS and workflow came later, only once the benefits of the approach had been proven by extending the solution out through live trials.
By designing these peripheral enhancements to the main technical architecture with the goal of delivering an integrated semantic publishing architecture in mind, we have been able to ensure a smooth integration path to a complete solution.
One of the key issues that the publishing organizations we’ve worked with have faced is managing the reference (linked) data. This is probably not something most organizations are prepared for, and it is not as straightforward as it seems. This hasn’t been a barrier to the progress of our projects, as the instance data has been curated and managed by the project team. But in the longer term, organizations will need to look for data managers or librarians with an understanding of linked data and RDF, as getting data management right is one of the factors in realizing the return on investment. In our experience, an organization should look to centralize management of reference data, as this will improve quality, remove duplicated effort and, hopefully in some cases, lead to costly, internally managed vocabularies being replaced with open datasets.
Q. When it comes to publishing and using annotated data, do you think some content management systems preferred by news organizations are better than others? Easier? More robust?
A. In our experience, the type of content management system is not as important as one might think. The technical architecture of the semantic publishing system is what’s important – ensuring that it is designed and built with service APIs that meet the overall requirements and use-cases of the platform so that it can easily be integrated with the existing CMS.
The core challenges of building cohesive, loosely coupled, testable services and keeping clean separation between interfaces and internal implementation details are as important here as in any other type of software architecture. To that extent, the choice of data storage solution or authoring toolsets shouldn’t really matter (as long as they effectively meet the needs of the platform). For example, while the optimal experience would be to integrate a semantic annotation user interface seamlessly into the workflow, if a CMS is not flexible enough then a semantic annotation application can always be built as a peripheral enhancement but still tightly integrated within the existing content creation process.
Q. After putting all of this work into consuming and publishing linked data, is this approach paying off for publishers? How? Are there many examples of dynamic publications (or apps, websites, etc.) that highlight what can be done to use/consume linked data/content put forward by news organizations?
A. One of the big benefits of this approach comes from bringing together reference data from across the enterprise into a single canonical source. This brings efficiency savings and the potential to focus data quality activity, but by far the biggest benefits will come from new opportunities gained from the ability to rapidly create new views of aggregated assets.
So the emphasis of our work is on linking data within the enterprise. This is often about taking data that already exist in the enterprise, though they might be locked away in spreadsheets or limited in their use to driving a single product. We also work with clients to take advantage of existing sources of linked open data.
It is also important to recognize that semantic publishing is not just about publishing linked data and RDF as an output from your organization’s publishing stack, but also about the use of semantics within the stack to aid publishing of one’s assets and reduce overheads. This is where ontological modeling of your domain to meet your publishing use cases pays off. BBC Sport, for example, has gained efficiencies in the journalist workflow by freeing up journalists to do what they do best, which is to author content, and letting the semantic automation choose what content gets onto a page.
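To make the idea of semantic automation choosing page content more concrete, here is a minimal sketch. It assumes a hypothetical domain model with two properties, `ex:about` and `ex:playsFor`, and made-up story and player identifiers; it does not describe how BBC Sport actually implemented this. A team page collects stories tagged with the team and, by following the model, stories tagged with any of its players, so nobody has to place stories on the page by hand.

```python
# Illustrative sketch only: all identifiers and properties are hypothetical.
TRIPLES = [
    ("ex:player/player-a", "ex:playsFor", "ex:team/england"),
    ("cms:story/1001", "ex:about", "ex:team/england"),
    ("cms:story/1002", "ex:about", "ex:player/player-a"),
    ("cms:story/1003", "ex:about", "ex:team/brazil"),
]


def stories_for_team(team, triples):
    """Select stories for a team page using annotations plus the domain model."""
    # Follow the model: which players belong to this team?
    players = {s for s, p, o in triples if p == "ex:playsFor" and o == team}
    topics = players | {team}
    # Any story annotated with the team or one of its players qualifies.
    return sorted(s for s, p, o in triples if p == "ex:about" and o in topics)


england_page = stories_for_team("ex:team/england", TRIPLES)
# england_page == ["cms:story/1001", "cms:story/1002"]
```

The journalist only tags story 1002 with the player; the relationship in the model is what puts it on the team page.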
Q. Will Schema.org eclipse the need for and use of RDF, especially with the development of rNews? What does it mean for publishers that Schema.org and RDF are linked?
A. Not in our opinion. Schema.org focuses more on the transient markup of documents with data such that clients (machine consumers) can do clever things with those documents in a common way (good). However, it would be hard to put all of this into a semantic repository and build a robust enterprise solution with OWL reasoning from schema.org microdata. RDF is a much more appropriate foundation for enterprise solutions. An organization should engineer its solution based on ontology models that best fit its own use cases and domains, thus allowing robust and effective service APIs to be built and contracted to its ontologies. Internally, RDF is best suited for this. Transformation into schema.org markup in documents can occur downstream or post-publication.
As part of our role in the media organizations that we help, we have been involved in contributing to Schema.org. Our clients are reporting benefits from differentiating themselves in search results. For example, anecdotal evidence from BBC Food suggests as much as a 40% boost in click-through rates.
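The downstream transformation can be pictured with a small sketch. It assumes a hypothetical internal ontology (the `http://example.org/ontology/` properties and the story data are invented) and maps it to the schema.org `NewsArticle` type with its `headline`, `datePublished` and `about` properties. For brevity the output is a JSON-LD document rather than the microdata mentioned above; the mapping idea is the same either way.

```python
# Illustrative sketch only: the internal ontology and data are hypothetical.
import json

ONT = "http://example.org/ontology/"
STORY = "http://example.org/story/1001"

TRIPLES = [
    (STORY, ONT + "headline", "England name World Cup squad"),
    (STORY, ONT + "published", "2010-06-01"),
    (STORY, ONT + "about", "http://example.org/id/team/england"),
    (STORY, ONT + "internalDesk", "sport"),  # no public equivalent, so not exported
]

# Internal property -> (schema.org property, value is a reference to a thing?)
PROPERTY_MAP = {
    ONT + "headline": ("headline", False),
    ONT + "published": ("datePublished", False),
    ONT + "about": ("about", True),
}


def to_schema_org(subject, triples):
    """Build a schema.org NewsArticle description from internal triples."""
    doc = {"@context": "https://schema.org", "@type": "NewsArticle", "@id": subject}
    for s, p, o in triples:
        if s == subject and p in PROPERTY_MAP:
            name, is_reference = PROPERTY_MAP[p]
            doc[name] = {"@id": o} if is_reference else o
    return doc


markup = json.dumps(to_schema_org(STORY, TRIPLES), indent=2)
```

Because the mapping sits at the edge of the system, the internal model can change, or carry properties that are never published, without affecting the markup contract.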
Q. What barriers exist to enlisting more publishers in the open/linked data movement? How do you think they could be overcome?
A. Don’t try and sell the open and the linked bit of Linked Open Data all at once. Creating a central canonical source of reference data and the benefits this brings to the enterprise is becoming more apparent to the organizations we work with. Taking that and moving on to sell the benefits of publishing open data is a different and probably harder sell. By focusing on the first, achieving the second is trivial when you decide to publish data.
Q. Any thoughts on how and whether publishers should use Freebase?
A. As with DBpedia, GeoNames and similar data hubs, one of Freebase’s greatest benefits is its role as a common source of identifiers for culturally interesting things. From an enterprise perspective, this means access to new sources of data that can be joined together to improve products. For those publishing data, it means reducing the effort needed for your data to be reused by others.
With the Press Association we have linked Press Association curated entities to Freebase (and DBpedia) URIs whenever possible. However, one of the key challenges with all our clients is effectively domain modeling their business processes and content pipeline. The domain model specifies the internal implementation details which drive the business value, and following the architectural principle of separating implementation and interface, we try to keep that decoupled from the semantic interface of public domain identifiers wherever possible. For this reason, we think that Freebase (and other large public domain datasets) should definitely not be overlooked but are often best used indirectly rather than directly.
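One way to picture using public identifiers indirectly is a lookup layer that keeps internal identifiers as the primary keys and records equivalences to public URIs alongside them, in the spirit of `owl:sameAs` links. The sketch below is an assumption about how such a layer might look, not the Press Association implementation; the internal `example.org` URI is invented, and the DBpedia URI is included only as an example of a public identifier.

```python
# Illustrative sketch only: the internal URI and the mapping are hypothetical.
SAME_AS = {
    "http://example.org/id/team/england": [
        "http://dbpedia.org/resource/England_national_football_team",
    ],
}


def public_identifiers(internal_uri):
    """Public URIs equivalent to an internally curated entity (may be empty)."""
    return list(SAME_AS.get(internal_uri, []))


def internal_identifier(public_uri):
    """Resolve a public URI back to the internal entity, or None if unlinked."""
    for internal_uri, public_uris in SAME_AS.items():
        if public_uri in public_uris:
            return internal_uri
    return None


england = "http://example.org/id/team/england"
links = public_identifiers(england)
```

Internal services work only with the internal identifier, so a change in a public dataset touches the mapping table rather than the domain model.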
Q. Ontoba designs custom semantic publishing platforms. What can Ontoba offer news organizations?
A. Correct, we provide technical architecture and development services specializing in semantic technologies. We do not have a product that we roll out, but we architect and build solutions that are the best fit for an organization’s requirements and use cases. These solutions may in fact integrate technical products, each of which is one piece of the semantic publishing jigsaw.
We can help in both the upstream and downstream sides of the publishing chain. We are architects and developers experienced in building applications that consume RDF as well as solutions to semantically publish content and linked data.
Q. Does Ontoba help publishers produce annotations and consume semantically-tagged multimedia (videos and photos) as well as text?
A. The platforms we build have been used for annotating a large variety of assets. This has included video, images, news stories, research publications, education materials and statistical data. The real benefits come when you start to draw on a common set of reference data to describe a diverse range of assets.
Thank you, Paul!
To learn more about Dynamic Semantic Publishing, follow this topic on SemanticWeb.com, where there will be coverage of DSP presentations at next week’s Semantic Tech and Business conference in San Francisco.