Semantic technology is scoring more goals in the sports world. The BBC, which created the FIFA World Cup 2010 website that leveraged semantic technology, is at it again as London prepares for the 2012 Summer Olympics. Brazil has gotten in on the action, too, with an Internet portal there taking soccer to the semantic web set. At the upcoming SemTech conference in San Francisco, attendees will have an opportunity to hear the latest details about both efforts.
Over at the BBC, the 2012 Olympics site accompanies a completely redesigned BBC Sports site. Both are built on technology that includes Fluid Operations’ Information Workbench, which supports the editorial process for the BBC’s Dynamic Semantic Publishing (DSP) strategy, from authoring and curation to publishing of ontology and instance data following an editorial workflow. Since the World Cup, the BBC environment has also been updated to use the MarkLogic document store for managing rapidly changing statistics, navigation and ultimately all content objects, as lead architect Jem Rayfield described it in this blog posting. Today, the triple store behind the BBC’s past work has been extended to cover every team, athlete, venue, discipline, country and so on, Rayfield told The Semantic Web Blog.
Another new piece to the DSP puzzle is natural language processing that uses a feedback loop from the triple store so that, when a new concept is added to that store, it is picked up and processed for instant updates to the audience.
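The feedback loop described above can be pictured with a small Python sketch. It is an illustration only, not the BBC's implementation: the `ConceptStore` and `TagSuggester` classes, the `rdfs:label` convention and the athlete identifiers are all assumptions made for this example, and simple label matching stands in for real natural language processing.

```python
import re


class ConceptStore:
    """Hypothetical stand-in for a triple store holding (subject, predicate, object)."""

    def __init__(self):
        self.triples = set()
        self._listeners = []

    def subscribe(self, callback):
        # Listeners are notified about every triple added from now on.
        self._listeners.append(callback)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))
        for callback in self._listeners:
            callback(subject, predicate, obj)


class TagSuggester:
    """Toy tagger: matches known concept labels against story text."""

    LABEL = "rdfs:label"  # assumed predicate carrying a concept's display name

    def __init__(self, store):
        self.labels = {}  # lowercase label -> concept identifier
        for s, p, o in store.triples:
            self._on_triple(s, p, o)
        # The feedback loop: new concepts reach the tagger as soon as they are stored.
        store.subscribe(self._on_triple)

    def _on_triple(self, subject, predicate, obj):
        if predicate == self.LABEL:
            self.labels[obj.lower()] = subject

    def suggest(self, text):
        lowered = text.lower()
        found = []
        for label, concept in self.labels.items():
            if re.search(r"\b" + re.escape(label) + r"\b", lowered):
                found.append(concept)
        return sorted(found)


store = ConceptStore()
store.add("athlete:1", "rdfs:label", "Jane Doe")
suggester = TagSuggester(store)

story = "Unknown Runner beats Jane Doe to the gold medal"
before = suggester.suggest(story)   # only the athlete already in the store
store.add("athlete:2", "rdfs:label", "Unknown Runner")
after = suggester.suggest(story)    # the new athlete is now suggested too
```

The point of the design is that the tagger never needs a restart or reload: adding a concept to the store is enough for it to show up in the next round of suggestions.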
As deployed by the BBC, Information Workbench “integrates and interlinks dynamic and semantically enriched data in a central place,” says Peter Haase, Lead Architect R&D at Fluid Operations. “Approved content is then available for automatic publication on the website. The platform seamlessly integrates into already existing editorial processes and automates the creation and delivery of semantically enriched content.”
(One side note: A recent update of Information Workbench includes a new XML provider, which lets users integrate XML data sources for automatic transformation into RDF. The release also has an optimized user interface that lets users without technical skills create their own widgets, such as charts, tables, reports and social media tickers, using simple step-by-step wizards rather than typing in SPARQL queries.)
As the BBC strives for an ever more dynamic semantic publishing (DSP) system, “all sports specifics get pushed through in real time and served dynamically,” Rayfield says. “When a journalist writes a story, an athlete is surfaced in suggestions to tag; when someone no one ever heard of before wins a gold medal, he is immediately identified.”
Why the need to continually enhance its DSP architecture? Consider the ratio of rich content the BBC wants to the number of contributors it has: there are tens of thousands of pages behind the Olympics site, but the journalists number only in the low tens to hundreds, he says. They can’t manually create all the graphic content and services the BBC wants to supply to its audience. “The lessons we learned is content abstraction and NLP help us identify tags and get things correct and quick, and there’s no [negative] impact on the journalist workflow, and so we can create all these pages we couldn’t do before,” he says. Journalists, in turn, are better equipped to create stories, because they can more easily find relevant past content to match to the new stories they’re working on, rather than having it drop out of sight in a static CMS. For readers, there’s an “increased breadth and navigability across the site to find the content they want; it lasts longer and makes it hundreds of times more discoverable, all with the same journalist headcount.” And that keeps people on the site longer, an opportunity not to be missed when the Olympics are expected to drive a peak of ten million impressions per day.
From Britain to Brazil
At Globo.com, the portal and Internet provider owned by the largest media group in Brazil, Daniel Schwabe, Professor at the Department of Informatics, PUC-Rio, has been working with Rafael Pena, product owner of the Sports Data System at Globo.com, on the task of semanticizing its soccer coverage.
Globo.com, which uses semantic descriptions of entities to tag news stories, has for some time had a very extensive database of facts about soccer games, with statistics on players, teams, games, and so on.
From that database schema and their study of the soccer domain, Schwabe and Pena worked on a domain model for the sport to enable a clear definition of all the important aspects of a game that should be or could be recorded. The model also includes a formal definition of what a cliché is made of and how it’s defined, since one idea behind the work was to lay the foundation for a system that would surface to journalists suggestions of data that could help them with their story leads or other important content to include in a piece. It has also paved the way to connecting the dots between gaps in the database, such as recognizing that a foul recorded as perpetrated by one party and a foul recorded as received by another are actually the same event in a game.
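To make the foul example concrete, here is a minimal Python sketch of how two separately recorded rows might be merged into one event. The record layout (`game_id`, `minute`, `role`, `player`) is invented for this illustration and is not Globo.com's schema, and the sketch assumes at most one foul per game per minute, which a real system could not rely on.

```python
from collections import defaultdict


def merge_foul_records(records):
    """Merge 'committed' and 'received' foul rows into single foul events.

    Assumption for this sketch: rows that share a game and a minute describe
    the same foul. The field names are hypothetical.
    """
    grouped = defaultdict(dict)
    for rec in records:
        key = (rec["game_id"], rec["minute"])
        grouped[key][rec["role"]] = rec["player"]

    events = []
    for (game_id, minute), roles in sorted(grouped.items()):
        events.append({
            "game_id": game_id,
            "minute": minute,
            "committed_by": roles.get("committed"),  # None if never recorded
            "received_by": roles.get("received"),    # None if never recorded
        })
    return events


records = [
    {"game_id": 7, "minute": 12, "role": "committed", "player": "Player A"},
    {"game_id": 7, "minute": 12, "role": "received", "player": "Player B"},
    {"game_id": 7, "minute": 40, "role": "committed", "player": "Player C"},
]
events = merge_foul_records(records)
```

Here the two minute-12 rows collapse into one event with both parties filled in, while the minute-40 foul keeps an empty `received_by`, which marks a gap in the data rather than hiding it.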
As Schwabe says, “we tried to identify stereotypes, clichés like ‘come from behind’ or ‘rout’, and then started profiling the stats of these games to identify patterns indicative of each cliché.” Typical of a ‘come from behind’ game is an initial score in favor of one team followed by a reversal, where the difference is above three goals. Once these patterns were validated by journalists, the authoring environment was updated so that, when a journalist sits down to write a story about a game, he receives a number of suggestions about the format the game might fall into, such as a violent game or a rout, along with a few data items to support it. Via a widget, those items can be formatted, for example as a table, and inserted right into the story.
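A rough Python sketch of this kind of pattern matching follows. It is not the validated Globo.com patterns: the function name, the input format (scoring teams listed in order) and the `rout_margin` threshold are assumptions for illustration, and the come-from-behind rule here checks only that the first team to lead did not win.

```python
def classify_game(goals, rout_margin=4):
    """Suggest cliché labels for a game from its ordered list of goals.

    `goals` lists the scoring side ('home' or 'away') for each goal in order.
    The rules and the `rout_margin` default are illustrative assumptions.
    """
    home = away = 0
    first_leader = None
    for team in goals:
        if team == "home":
            home += 1
        else:
            away += 1
        if first_leader is None and home != away:
            first_leader = "home" if home > away else "away"

    if home == away:
        winner = None  # a draw, or a game without goals
    else:
        winner = "home" if home > away else "away"

    cliches = []
    if winner is not None and first_leader != winner:
        cliches.append("come from behind")
    if abs(home - away) >= rout_margin:
        cliches.append("rout")
    return cliches


reversal = classify_game(["away", "home", "home"])
rout = classify_game(["home"] * 5)
both = classify_game(["away"] + ["home"] * 5)
draw = classify_game(["home", "away"])
```

Returning a list rather than a single label reflects the idea that one game can fit several profiles at once, leaving the choice of angle to the journalist.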
For example, the table for a rout story might collect statistics showing the most recent games within a championship in which one team was more than four goals ahead, or that this was the tenth time in five years that such an event occurred.
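A supporting table of that kind could be assembled along the following lines. This is a sketch under assumptions: the result tuples, team names and helper functions are invented for the example, and the plain-text layout merely stands in for whatever the real widget renders.

```python
def rout_rows(results, margin=4):
    """Pick out games won by more than `margin` goals.

    `results` holds (date, home, away, home_goals, away_goals) tuples, a
    made-up layout for this sketch.
    """
    rows = []
    for date, home, away, home_goals, away_goals in results:
        diff = home_goals - away_goals
        if abs(diff) > margin:
            winner, loser = (home, away) if diff > 0 else (away, home)
            score = f"{max(home_goals, away_goals)}-{min(home_goals, away_goals)}"
            rows.append((date, winner, loser, score))
    return rows


def format_table(rows, headers=("Date", "Winner", "Loser", "Score")):
    """Render rows as a fixed-width plain-text table."""
    table = [headers] + [tuple(str(cell) for cell in row) for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join(lines)


results = [
    ("2011-05-01", "TeamA", "TeamB", 5, 0),
    ("2011-05-08", "TeamC", "TeamA", 1, 2),
    ("2011-05-15", "TeamB", "TeamC", 0, 6),
    ("2011-05-22", "TeamA", "TeamC", 4, 0),
]
rows = rout_rows(results)
table_text = format_table(rows)
```

With the default margin, the 4-0 game is left out because it is not more than four goals, matching the wording of the example above.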
“It calls the journalist’s attention to something and they can decide whether they want to use that or not,” he says. Often a story fits more than one profile, so the author can choose the data box that suits the emphasis they want to convey. “The goal isn’t to completely automate things but to make it easier for journalists to do their work, so that they don’t miss significant data points.”
The effort proved successful in tests during the last Brazilian national championships, Schwabe reports, both in the number of stories that wound up featuring the information provided and in that at least one story made it to the site because the technology alerted the author that the game represented a milestone he wasn’t otherwise aware of.
Current plans include leveraging information in the rhetorical structure of clichés, and their associated roles, to suggest interesting or significant links that could be included with the story, while still following established editorial policies. For instance, the system can surface other players with big single-game scoring performances as additional information for that page. “It’s the context to which you refer to an entity which is dependent on the type of story you use to write about it,” he says.
The basic infrastructure, which already has been tested, should be deployed come June in time for the next Brazilian national soccer championships. Some of the more advanced aspects will be tested in the upcoming championships.
Ultimately, Schwabe and Pena hope to measure customer retention on the web site from the use of its linking structure. “That’s what you want, that you are providing interesting enough material that the reader stays on the site by navigating to other interesting information,” Schwabe says. “And we want to confirm anecdotal observations that stories with this additional data are more ‘popular’.”
To hear more about both these initiatives, register for SemTech San Francisco here.