Scratching the Surface: Applying Semantic Technologies at the Tribune Company

May 29, 2009


A little over a year ago, the Tribune Company launched its Topic Galleries, the result of semantic technologies used, in great part, to generate a product.  Other Tribune products are “spinning off” the vocabularies and the underlying logic that intelligently surface the content in the Galleries.  However, to date, the Topic Galleries provide the most comprehensive presentation of a discovery process intended, in part, to plumb the depths of Tribune content well beyond the limited topic coverage conveyed in a given headline or in the first few sentences of an article, in order to surface the new, the unknown, and even the unusual in the news.  All in all, for this controlled-vocabulary and applied-linguistics developer and avid word gamer, working with semantic technology to mine the Tribune’s content has been a great way to spend the last 18 or so months.

But we’ve just scratched the surface, even though the integration of semantics-based systems and the production of the Tribune’s Topic Galleries have given me more anecdotes and insights than I ever thought imaginable during my 10+ years of work in semi-automatically applying metadata to structure corporate, academic, and government enterprise content. Working with news covering all knowledge domains on a daily basis is a constant and humbling reminder of just how vast this Semantic and Intelligent Web creation, in which we’re all engaged, really is. No matter the length or repetition of the development, test, and QA cycles leading up to (drum roll) production, persisting and actually seeing a product come to fruition relies, in great part, on developing a certain Zen approach to accepting the challenges and, frequently, the resulting discoveries of the experience.

This is especially true when experience indicates that no amount or combination of standards, best practices, heuristics, models, surveys, or studies ever outputs a perfected, or even nearly perfected, result for each and every situation. But that’s a given, and the essence of what I’m trying to convey here has more to do with what we do with what we find out about our content or processes when contextually analyzing content from an automated starting point. Some time ago, I was struck by a question posted in a forum that asked, in effect, "But what if you find things in your content that you don’t want to find?" That made me think long and hard, and, eventually, I came to the conclusion that maybe it’s all as simple as either a) remaining, if possible, luxuriously and blissfully unaware of what is going on, or b) bracing oneself and diving in, cognizant of finding things one couldn’t begin to anticipate no matter how many training sets or frequency-based rules have been developed, of course, at an expenditure of resources. As far as I’m concerned, the preferable option is the latter, though one might learn quickly that the former is best for the constitutionally delicate or unprepared.

During these days of the "death of the newspaper," general economic tumult, and “perfect storms” of numerous varieties, it is a certain comfort to some of us that forward-thinking media companies, built on news foundations, do, indeed, "walk the walk" and not just "talk the talk" when integrating and developing semantic applications to drive product deployment. During an age of seemingly unprecedented challenge, it is to the Tribune Company’s credit that it has created something functional, let alone innovative and of continuing value to its customers, in an arena characterized by that which is experimental, sometimes theoretical, and not, by any means, trivial. Perhaps one of the greatest challenges is balancing the promise of unprecedented opportunities (depending on how one is able to “see” opportunity) with managing expectations, knowing full well that “failure” might indicate a lack of experience either in developing certain applications or in managing those expectations.

A year’s perspective on coordinating natural language processing, extracting entities, automatically inline tagging, refining countless arguments, all in the pursuit of surfacing semantically associated news content in order to innovate, of course, poses very real risks in “hot” environments requiring action to meet, as much as semi-automatically possible, the needs and expectations of users or consumers broadly defined to include not just people but machines, too. Poring over and manually indexing even a fraction of news content seems very much a luxury for few these days and just not practical in the interactive media world of welling content, content, and more content. During economic prosperity, yes, one might debate and pore; but discovering the ‘new’ during troubled times has to occur very quickly, if for no other reason than finding out “yesterday” just what distinguishes one’s content from that of competitors. We go about it differently during these times. We don’t have the safety nets that we had just a year ago. We have a heightened awareness of “getting the job done” without the luxurious distractions of “committee-think,” “vaporware solutions,” and the latest press releases leading us down rabbit holes. In other words, we who work on the great experiment of the Intelligent Web now know that more–more faceless people in "call-ins,” more consultant parades through the enterprise, more access to demos–does not equal survival. And as “old school” as it might sound, frequently, it just comes down to the common sense realization that, “time equaling money,” both are in short supply and high demand. So the re-learning begins, again.

As part of an overall intelligent, semantic approach to presenting news to a variety of users–internal and external, human and machine–I do "hand engineer” or “hand curate"1 the majority of the terms that go into producing the Tribune Company’s Topic Gallery index presented across the various Tribune markets. However, the work that I do, after generating or triggering a variety of other processes, becomes fairly invisible to most, resulting, ultimately, in a seemingly intelligent experience had by all. I mention this, too, in order to manage expectations because, with time equaling money, many of us just don’t have time to fantasize about or play at delivering a parthenogenic variation of H.G. Wells’ vision of a “world brain” rising out of little to no resources. “Artificial intelligence” is artifice first and seeming intelligence later.

In the scheme of things, the "disambiguation" of someone like "Chris Brown," notoriously "celebrated" for beating "Rihanna," from all the other "Chris Browns" who either scored touchdowns in high-school football games or starred in direct-to-video movies, in order to present “smart” results to an end-user, is a relatively commonplace and straightforward part of my daily work as compared to declaring and defining an entity like the "Montauk Monster" when the world’s eminent zoologists and cryptozoologists have yet to "classify" it.
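At its simplest, this kind of disambiguation can be sketched as scoring candidate entities by how many of their contextual clues appear in the article. A minimal illustration (the entity identifiers and clue lists below are invented for the example, not drawn from the Tribune’s actual vocabulary or software):

```python
def disambiguate(context_words, candidates):
    """Pick the candidate whose contextual clues overlap most with the article's words."""
    scored = [(len(entity["clues"] & context_words), entity) for entity in candidates]
    score, best = max(scored, key=lambda pair: pair[0])
    # With no overlapping clues at all, refuse to guess.
    return best if score > 0 else None

# Hypothetical entity records for two different "Chris Browns".
CHRIS_BROWNS = [
    {"id": "chris_brown_singer", "clues": {"singer", "rihanna", "r&b", "assault"}},
    {"id": "chris_brown_football", "clues": {"touchdown", "high-school", "quarterback"}},
]

# A bag of words from a hypothetical article.
article = {"chris", "brown", "singer", "rihanna", "charged"}
match = disambiguate(article, CHRIS_BROWNS)
print(match["id"])  # → chris_brown_singer
```

A production system would weight clues and handle ties, but even this toy version shows why context, not the name string itself, does the disambiguating.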

Equally commonplace but not quite as simple are the disambiguation, semantic association, and definition refinement activities triggered when “breaking” news occurs, as it did on January 15, 2009. That was the day an event came to light on the Hudson River involving a U.S. Airways flight. I remember it vividly because, first and foremost, I work with semantic applications that output fairly directly to Tribune internal and external users and consumers. This has many implications, but, most importantly, it means that I do work very hard at “measuring twice and cutting once” in order to avoid the risk of triggering egregious results, such as potentially indexing hundreds of articles with a term that isn’t quite the right term. There’s really only one way to fix that result: un-tag each content piece using time-consuming, semi-manual processes.
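The cleanup pass described above amounts to walking the index and removing the bad term from every piece it touched. A hedged sketch, assuming a simple article-to-terms mapping (the index structure and article IDs are invented for illustration, not the Tribune’s system):

```python
def untag(index, bad_term):
    """Remove a mistakenly applied term from every article; return the affected IDs."""
    affected = []
    for article_id, terms in index.items():
        if bad_term in terms:
            terms.discard(bad_term)
            affected.append(article_id)
    return affected

# A toy index: article ID -> set of applied terms.
index = {
    "a1": {"hudson river", "crash"},
    "a2": {"crash", "us airways"},
    "a3": {"new york"},
}

print(sorted(untag(index, "crash")))  # → ['a1', 'a2']
```

In practice each un-tagging would also trigger re-publication of the affected pages, which is exactly what makes the semi-manual version so time-consuming.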

Certainly this happens, it’s unavoidable, and having gone into production a year ago, initially and necessarily using a top-down approach to indexing content automatically, I’m fully if not painfully aware of the issues surrounding indexing and un-indexing a vast amount of news content composed, naturally, of ambiguous terms. That’s natural language for you, and it’s where the Zen of the experience comes into play.

But in the case of the Hudson River situation, automated indexing took quite a different, even intense turn because, when the news first hit–at least in the part of the Tribune in which I function–the incident was in the process of being described as a "crash." That’s how I got the new "news": a plane crash had occurred in the Hudson River. Simultaneously, because content or information or knowledge management is, after all, ‘work,’ I was receiving, in no uncertain terms, a message summed up as, "Quick! Quick! Get IT into the system!" “IT,” naturally, meaning a variety of things depending on whether you’re a Tribune editor, producer, SEO specialist, or any number of other stakeholders.

Declaring the Hudson River event in the system, I started a certain set of processes identifying, declaring, and defining the germane term(s) necessary for conveying just what the event was followed by a series of uploads, server refreshes, and so on. Thus the process of surfacing the terminology began so that, likewise, the content would surface in numerous views designed for various purposes in order to meet the needs of consumers, advertisers, journalists, social networkers, search engines, and so on.

Luckily, some terms related to the Hudson River situation were already defined, such as "U.S. Airways" and "New York"; however, many others were not. Most importantly, and I’m somewhat sorry to say, prior to that fateful day, Chesley B. Sullenberger III was not defined. After all, he wasn’t exactly considered to be a newsworthy person prior to 1/15/2009. But as the Hudson River ordeal unfolded in the news, I worked in “real time” adding numerous terms and definitions to the system along with the semantic associations relating him to all the other concepts that were suddenly defining or contextualizing his existence in relation to a river, a U.S. state, a plane, a mayor of a major city, and so on.
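Declaring a term and wiring in its semantic associations can be pictured as adding a node to a small graph and linking it, in both directions, to concepts that already exist. A minimal sketch (the data structure and function are assumptions for illustration; the Tribune’s actual vocabulary tooling is not described in this post):

```python
# Toy vocabulary: term name -> definition plus bidirectional associations.
vocabulary = {}

def declare_term(name, definition, related=()):
    """Declare a term and link it both ways to any already-declared concepts."""
    entry = vocabulary.setdefault(name, {"definition": definition, "related": set()})
    for other in related:
        if other in vocabulary:  # only link to concepts that already exist
            entry["related"].add(other)
            vocabulary[other]["related"].add(name)
    return entry

declare_term("U.S. Airways", "airline")
declare_term("Hudson River", "river bordering New York")
declare_term("Chesley B. Sullenberger III",
             "pilot of the January 15, 2009 Hudson River landing",
             related=("U.S. Airways", "Hudson River"))

print(sorted(vocabulary["Chesley B. Sullenberger III"]["related"]))
```

The bidirectional links are the point: once declared, the new term surfaces from every associated concept’s view, not just its own.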

As is so often the case, declaring terms was not necessarily the most challenging part of this particular exercise, but reviewing, testing, and refining their definitions or rules was because, in the end, what was initially making the rounds as being a "crash" turned out to be a “heroic landing effort” managed by Sully. Imagine a degree of embarrassment that would have come from associating “Sully” with a crash when, all along, he was a hero.

Indeed, in the near "real time" world of declaring and defining "news" events, some "thing" may start out as "bad" one moment, and be innocently described as such based on immediately available information, then turn into something "good," or, at least, better, within hours or even minutes.

Also, Sullenberger’s heroism pointed out a number of instances in which the content was–brace yourself–just wrong. In early reports, his name was frequently misspelled across news sources, and no prevailing editorial style rendered his name and its variations the same way twice. On one occasion, even "Michael Bloomberg" wasn’t indexed in a news service’s content, and consequently not incorporated into the larger Tribune semantic picture, because the content referred to "Mayor Bloomberg Bloomberg." So, regardless of all the intelligence I put behind the terms I declared, adding an argument to account for that had not crossed my mind. Stupid me.
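One partial defense against misspelled names is fuzzy matching against the canonical forms already in the vocabulary. A sketch using the standard library’s `difflib` (the canonical list and threshold are assumptions; this is not the approach the Tribune necessarily used, and it would not have caught an oddity like "Mayor Bloomberg Bloomberg" without a separate rule):

```python
import difflib

# Hypothetical canonical name list drawn from the vocabulary.
CANONICAL = ["Chesley Sullenberger", "Michael Bloomberg"]

def normalize(name, threshold=0.8):
    """Map a possibly misspelled name to its closest canonical form, if close enough."""
    matches = difflib.get_close_matches(name, CANONICAL, n=1, cutoff=threshold)
    return matches[0] if matches else None

print(normalize("Chesley Sullenburger"))  # a one-letter misspelling → Chesley Sullenberger
print(normalize("Smith"))                 # too dissimilar → None
```

A real pipeline would layer this under the entity extractor, flagging near-misses for human review rather than silently rewriting them.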

All of which reminds me of the famous story of little Virginia O’Hanlon who, in 1897, wrote her famous letter to Francis P. Church, editor of the New York Sun, asking, "Is there a Santa Claus?" to which Church replied, "Yes, Virginia, there is…" For a couple of generations, that answer sufficed and helped establish the Santa figure with a certain innocent veracity.

There is just no need to debate or drive that point home. But, today, it’s just that sort of "declaration" working within the systematic documentation, discovery, surfacing, and online, interactive presentation of the "news" that makes an editorial response, or whatever the content is, that much more thought provoking. Today, particularly in the sphere of semantic technology, we’re not just trying to convince children that "Santa Claus" exists as much as we’re working to convince a whole host of intelligent agents and their algorithms that he exists, too.

Frequently, one will read that the "promise" of controlled vocabularies and complementary intelligent or semi-automated or "smart" work is not living up to its potential. To keep myself sane when I read such comments, I always remember that the "good" or great results of what I do usually go unnoticed, and that’s just the way it is. After all, that’s what happens when requirements of various kinds imply, "Be invisible." But knowing that invisibility is a certain kind of perfection, the requirement then becomes one of, “Don’t disappoint.” Or “Don’t be a loser.” Knowing that less-than-desirable results will, indeed, occur and will be noticed either by oneself or others, what approach does one take to carrying on and doing “the good work”? Not doing it or ignoring it is not an option. At the end of the day, the error detection must be considered part of the error correction. Thank you, Alan Turing, for saying it all well before the Internet, as we currently know it, came along. It does relieve some of the pressure…but not all of it.

If you consider yourself a seriously engaged professional in this work, you don’t just get in and "play" or "mess around" with semantic technology unless you have the luxury of being blissfully ignorant of what’s at stake and an ATM card that will spit out wads of money to fix a problem that a deep breath and a bit of thought might have avoided. Pretending at being Alan Turing or Ranganathan or any number of Web developers, database administrators, linguists, librarians, and so on, is just "talking the talk" and not "walking the walk." At the least, it trivializes and demoralizes; at the worst, it costs a lot of money to stop "ripple effects" and recover.

Still, as each day passes, it doesn’t necessarily get easier for many organizations to realize the advantages of structuring their data, parsing metadata element values, using a certain intelligence in indexing, and so on. At a certain point, you DO just have to take a risk and know that you’re going to make mistakes, cause problems, raise red flags, etc. With almost every situation, every environment, every business requirement being an experiment, waiting for or demanding the “invisible” or "98% correct" result or solution doesn’t really get you far. Do something. Actively engage in a process of direct, experiential realization. At some level, in some way, with full awareness, make some semantic associations between concepts. If you’ve read this far, you probably don’t need to be told that, but it never hurts to stress it.

Of course, not everyone has a Tribune Company at their disposal, brimming with exceptional talent across its markets and working dedicatedly in the semantic technology realm even during a period of economic downturn. But synonyms are readily available to everyone, and, in terms of semantic associations, they work wonders all their own. If synonyms and a tool to manage them are all you’ve got, make the most of them. But don’t bite off more than you can chew. Seriously, I’ve had to apply the metadata equivalent of the Heimlich maneuver too many times in my career, and managing expectations and keeping “in scope”–remember to invoke “less is more” if necessary–is preferable to choking.
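A synonym ring really is the simplest useful semantic tool: every variant maps to one preferred label, so searches and tags converge on the same concept. A minimal sketch of the idea (the terms and variants here are illustrative assumptions, not any particular product’s data):

```python
# Synonym ring: lowercased variant -> preferred label.
synonyms = {}

def add_synonyms(preferred, variants):
    """Register a preferred term and map every variant (and itself) onto it."""
    for term in [preferred, *variants]:
        synonyms[term.lower()] = preferred

def canonical(term):
    """Resolve any known variant to its preferred label; pass unknowns through."""
    return synonyms.get(term.lower(), term)

add_synonyms("U.S. Airways", ["US Airways", "USAir"])
print(canonical("usair"))  # → U.S. Airways
```

Even this much gets you consistent indexing across sources that spell the same entity three different ways, which is most of the battle.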

Back in the latter half of 2007, I wasn’t sure if the Tribune Company could “walk the walk” of genuine collaboration. Frequently, the walk was dizzying in terms of the coordinated efforts it takes to mirror a fraction of the human intelligence needed to index a vast store of content in order to meet a host of business requirements (it’s a given that it will continue to have its spells of vertigo). Nevertheless, it has left a tracing, and will continue to do so, in the realm of the Semantic Web because it is surfacing its content, and search engines, among others, are finding it, thanks to semantics-based technologies. Now the challenge is staying in the game and creating new ones.

1 Doug Lenat in conversation with Tony Shaw on the topic of Wolfram|Alpha, spring 2009.
