Wikify Your Metadata! Integrating Business Semantics, Metadata Discovery, and Knowledge Management – Part 2

May 8, 2012

by Charles Roe

Part 1 of the article “Wikify Your Metadata! Integrating Business Semantics, Metadata Discovery, and Knowledge Management” discussed the essential problems Harvard Pilgrim Health Care (HPHC) faced in migrating from a legacy, monolithic core system to a new componentized system with an Enterprise Data Warehouse (EDW). This article is based on John O. Biderman and Cameron McLean’s Enterprise Data World 2010 Conference presentation of the same name.

One of the essential problems facing the migration and adoption team was how to deal with the semantic challenges of moving from the legacy system to an entirely new one. The business analysts spoke the legacy system’s native vocabulary, but the EDW semantics would be based on those of the Enterprise Logical Data Model (ELDM), including entirely new business terms independent of any source application, and so users felt a sense of impending doom. The team needed a metadata system that met several primary driving requirements: structured, non-editable content from data SMEs; collaborative content from business users; advanced search capabilities; and a regulated governance structure to hold everything together. After looking at many systems and weighing costs against a short timeline, they decided to build their own Data Dictionary on the open source MediaWiki software package. They already had it in-house; its many add-ons and extensions provided a rich environment they could customize to meet all their driving requirements; it could be implemented quickly; and it would allow the business users to take ownership of the new data definitions. Part 1 of the article discussed those elements in more detail.

Part 2 of this article gives a brief description of what wiki technology is and how HPHC used it, describes certain information architecture solutions the team devised during the project, and concludes with a discussion of their main challenges, successes and future directions for the system after its implementation.

A Brief History of Wiki Technology

The most famous wiki is of course Wikipedia, but the technology of the collective creation of hypertext has been around since the 1960s. When the World Wide Web came into being, the technology spread rapidly, and today it is ubiquitous throughout the Internet and within the intranets of thousands of corporations worldwide. The term wiki was adopted in 1994 by Ward Cunningham, who remembered the Wiki Wiki Shuttle at Honolulu airport; wiki means “quick” in Hawaiian.

Wikitext is essentially a simplified markup language for large-scale, collaborative authoring of hypertext. Take HTML, which is a good markup language in itself, strip out its tags, substitute a really easy syntax and you get wikitext. It doesn’t take years of programming knowledge to create wiki pages, just some practice, so even people who are not technically inclined can create them with a bit of training. Headings are marked with ==, bullets with * and hyperlinks with [[]], and there are many other shortcuts. MediaWiki, the free, open source wiki software package, has many capabilities that the HPHC adoption team found useful for their project, including:

  • Namespaces: They allow the partitioning of wiki pages into different spaces for different uses.
  • Categories: Simple tagging is built in to assign pages to categories.
  • Advanced Search: Namespaces and categories are searchable, which satisfied one of the most important driving requirements for the project.
  • Version History: MediaWiki maintains a version history and has both compare and rollback capabilities.
  • Templates: These allowed the team to standardize layouts, add sections similar to the Wikipedia “disambiguation” statement to their pages and build the necessary non-editable structured content more easily.
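To make the markup concrete, here is a minimal Python sketch that assembles a small wikitext page from the constructs mentioned above (== headings, * bullets, [[ ]] links and category tags). The page content, section names and function names are hypothetical illustrations, not taken from HPHC’s Data Dictionary.

```python
def heading(text, level=2):
    """Return a wikitext heading; level 2 renders as '== text =='."""
    marks = "=" * level
    return f"{marks} {text} {marks}"


def bullets(items):
    """Return one '* item' line per item."""
    return "\n".join(f"* {item}" for item in items)


def link(target, label=None):
    """Return an internal link, optionally with display text."""
    return f"[[{target}|{label}]]" if label else f"[[{target}]]"


def category(name):
    """Return the tag that assigns a page to a category."""
    return f"[[Category:{name}]]"


def build_page(definition, related, categories):
    """Assemble a simple page: a definition, related links, category tags."""
    lines = [
        heading("Definition"),
        definition,
        heading("Related pages"),
        bullets(link(r) for r in related),
    ]
    lines.extend(category(c) for c in categories)
    return "\n".join(lines)


# Hypothetical example content.
page = build_page(
    "Identifier for a health plan member.",
    ["Claims View"],
    ["EDW Column"],
)
print(page)
```

The point of the sketch is only that a complete, categorized page is a few lines of plain text, which is what makes both hand editing and machine generation practical.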

As the team moved forward they discovered they needed more features than the standard bundle provided, including extensions to templates, the ability to pull pages and sections of pages into other pages, and “metadata about metadata” pages. The Semantic MediaWiki extension provided these elements. Semantic MediaWiki is also a free, open source project, and it allows more advanced storing and querying of data within a standard wiki page. Perhaps the most valuable capability for the team was ‘transclusion’, the dynamic inclusion of part or all of the text of one hypertext document into another. Transclusion allowed the team to put structured content into collaborative pages and vice versa, thus meeting both driving requirements at the same time. Semantic MediaWiki also includes:

  • Extended markup notation
  • Semantic properties such as “Member Of,” “Belongs To,” “Has A,” along with synonyms, antonyms and many others, that can be assigned to a page
  • Semantic searches that find pages carrying a given property
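As an illustration of these features, the following Python sketch emits three kinds of markup: a Semantic MediaWiki property annotation ([[Property::value]]), an inline #ask query, and a transclusion call. The property, page and namespace names are hypothetical, and the helper functions are assumptions made for this example rather than anything from the HPHC implementation.

```python
def annotate(prop, value):
    """Return a Semantic MediaWiki annotation, e.g. [[Belongs To::Claims View]]."""
    return f"[[{prop}::{value}]]"


def ask_query(conditions, printouts, fmt="table"):
    """Return an inline #ask query that selects pages matching every condition."""
    selector = "".join(f"[[{c}]]" for c in conditions)
    parts = [selector] + [f"?{p}" for p in printouts] + [f"format={fmt}"]
    return "{{#ask: " + " | ".join(parts) + " }}"


def transclude(page):
    """Return markup that includes another page's text in the current page.

    A leading colon is required for pages in the main namespace; pages
    that already carry a namespace prefix are referenced as-is.
    """
    target = page if ":" in page else ":" + page
    return "{{" + target + "}}"


# Hypothetical names, for illustration only.
print(annotate("Belongs To", "Claims View"))
print(ask_query(["Category:EDW Column", "Belongs To::Claims View"],
                ["Has Definition"]))
print(transclude("EDW:Member ID"))
```

A collaborative page could carry the transclusion call to display the protected definition by reference, while the query lists every column page annotated as belonging to a given view.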

Graphics Three and Four show examples of pages with structured, collaborative and transcluded content included in them:

Graphic Three

Graphic Four

Semantic MediaWiki also gave them dynamic, self-maintaining pages, so that changes made within the system were picked up automatically with each new build. They also had normalized content, so that “article of record” content could be displayed by reference rather than copied into local wikitext. In addition, the query mechanism allows for faceted queries. All in all, it gave them what they needed: the driving requirements were met and they could move forward with adding content.

It’s All in the Vocabulary

The new wiki needed an information architecture, and a shared vocabulary was a must. The team consulted with taxonomy experts and, since they already had an existing taxonomy, leveraged that instead of starting anew with something no one would understand. The ELDM provided a native hierarchy with such elements as Subject Area, Facet, Entity and Attribute, while the EDW’s architecture provided a natural hierarchy with elements like View “Layer”, View and Column. These hierarchies represented the vertical navigation through the metadata. The team wanted horizontal navigation as well. Years before, they had defined a number of analytic topics that represented elements like best practices, health care data warehousing and others. They had mapped those analytic topics to ELDM facets, which are mapped to entities and attributes, which in turn go back to the physical data in the EDW. Through that navigation pathway they could associate each analytic topic with the data that supports it, and therefore seed business taxonomies into the wiki to allow for horizontal navigation.
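The horizontal pathway described here (analytic topic to facet, facet to attribute, attribute to physical column) can be sketched as a chain of lookups. The Python below is a minimal illustration under assumed data: every topic, facet, attribute and column name is invented for the example, and plain dictionaries stand in for whatever repository actually held the mappings.

```python
# Hypothetical mappings; none of these names come from HPHC's models.
TOPIC_TO_FACETS = {
    "Utilization Trends": ["Claim Service"],
}
FACET_TO_ATTRIBUTES = {
    "Claim Service": ["Claim.Service Date", "Claim.Paid Amount"],
}
ATTRIBUTE_TO_COLUMNS = {
    "Claim.Service Date": ["CLAIM_V.SVC_DT"],
    "Claim.Paid Amount": ["CLAIM_V.PAID_AMT"],
}


def columns_for_topic(topic):
    """Follow topic -> facet -> attribute -> column and collect the columns.

    Order of first appearance is preserved and duplicates are dropped.
    """
    columns = []
    for facet in TOPIC_TO_FACETS.get(topic, []):
        for attribute in FACET_TO_ATTRIBUTES.get(facet, []):
            for column in ATTRIBUTE_TO_COLUMNS.get(attribute, []):
                if column not in columns:
                    columns.append(column)
    return columns


print(columns_for_topic("Utilization Trends"))
```

Each hop corresponds to one of the mappings named in the paragraph above, so a topic page can list the physical data that supports it without anyone maintaining that list by hand.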

This approach allowed them to create an information model in which all the navigation pathways and hierarchies were clearly defined, and in which they could show which content pages were generated as structured content and which would be found through queries. They set up a page naming convention and determined the properties for each node, such as namespace, category and property tags. In terms of the wiki structure, each EDW “View Layer” was loaded into its own wiki namespace, which supported filtered searching for users with different authentication levels. The EDW namespaces were protected and each wiki page had a specific ID with its own admin rights. A Business Annotations section could be added to any protected page and edited by any user, while the non-editable structured content stayed safe. All the business data elements in the EDW are mapped to their counterparts in the ELDM, and the Data Dictionary was the first time the physical-to-logical relationships had been systematically exposed to business users.
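The article does not spell out the naming convention itself, so the following Python sketch only shows the general shape such a convention might take: the view layer becomes the wiki namespace, and the view and column names form the page title. The “Layer:View.Column” pattern, the category names and the layer name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageNode:
    """One node in the EDW hierarchy, addressed by layer, view and column."""
    layer: str                    # EDW view layer, used as the wiki namespace
    view: str
    column: Optional[str] = None  # None means the page describes the view

    def title(self):
        """Full page title in the assumed 'Namespace:View.Column' form."""
        name = self.view if self.column is None else f"{self.view}.{self.column}"
        return f"{self.layer}:{name}"

    def category(self):
        """Category assigned to the page (hypothetical category names)."""
        return "EDW View" if self.column is None else "EDW Column"


# Hypothetical layer, view and column names.
node = PageNode("Conformed", "CLAIM_V", "PAID_AMT")
print(node.title(), "->", node.category())
```

Deriving the title, namespace and category from the same node keeps generated pages predictable, which in turn is what makes namespace-filtered search useful.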

The team then developed the page generation process, which included four necessary components:

  • A solid data model, ELDM and data repository. These included materialized views of the CMC, a flattened hierarchy for simpler queries and the ability to get frozen snapshots of the metadata, and they allowed for the generation of changed pages only.
  • A Java program to read the CMC data. It retrieves the content, recurses through the data lineage, creates a tree and allows users to find the source end point for a given target.
  • The Velocity template language, used to generate the wikitext itself.
  • Selenium robots and MediaWiki APIs to post the data into the actual wiki pages.
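The middle steps of that pipeline can be sketched in a few lines. The team’s implementation used Java and Velocity; the Python below is a stand-in for illustration only, with the standard library’s string.Template swapped in for Velocity and a content hash swapped in for whatever mechanism actually detected changed pages. The lineage data, page layout and function names are hypothetical, and the final posting step is left out.

```python
import hashlib
from string import Template

# Hypothetical lineage: each target maps to the sources that feed it.
LINEAGE = {
    "EDW.CLAIM_V.PAID_AMT": ["STAGE.CLAIM.PAID_AMT"],
    "STAGE.CLAIM.PAID_AMT": ["LEGACY.CLM_MASTER.PD_AMT"],
}

# string.Template stands in for a Velocity template here.
PAGE_TEMPLATE = Template("== Lineage ==\n* Target: $target\n$sources\n")


def source_endpoints(target, lineage, _seen=None):
    """Recurse through the lineage and return the elements with no source."""
    seen = set() if _seen is None else _seen
    if target in seen:          # guard against cycles and repeated visits
        return []
    seen.add(target)
    parents = lineage.get(target, [])
    if not parents:
        return [target]
    endpoints = []
    for parent in parents:
        for endpoint in source_endpoints(parent, lineage, seen):
            if endpoint not in endpoints:
                endpoints.append(endpoint)
    return endpoints


def render_page(target, lineage):
    """Fill the template with the target and its source end points."""
    sources = "\n".join(f"* Source: {s}" for s in source_endpoints(target, lineage))
    return PAGE_TEMPLATE.substitute(target=target, sources=sources)


def digest(text):
    """Stable fingerprint of a page's wikitext."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(rendered, previous_hashes):
    """Keep only pages whose text differs from the previous build."""
    return {title: text for title, text in rendered.items()
            if previous_hashes.get(title) != digest(text)}


page = render_page("EDW.CLAIM_V.PAID_AMT", LINEAGE)
print(page)
```

Only the pages returned by changed_pages would need to be handed to the posting step, which reflects the “changed pages only” idea in the first component.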

Graphic Five shows this process:

Graphic Five

Conclusion – Challenges and Directions

Once the system was in place and users began to generate pages, the team ran into a participation challenge: they needed a critical mass of data within the Data Dictionary to generate buzz, demonstrate its value and elicit more user contributions. People did start talking about it, but at the time Knowledge Management and Web 2.0 technologies were still relatively new ideas within the organization. So the team kept adding content, interest was piqued and many people started seeing the capabilities inherent in the Data Dictionary. It took time, but eventually they reached the critical mass needed to draw more user contributions. As of 2010, at the time of their Enterprise Data World presentation, there were 300+ unique users participating within the system and 18,000+ generated pages, the Data Stewardship Committees had taken responsibility for quality data definitions and the entire system had been implemented at about 30-50% of the cost of a commercial package. The Data Dictionary was also a closer fit to their driving requirements than any commercial package.

An actuary using the system wrote: “The Data Dictionary provides clear and consistent EDW business definitions, search capabilities and a platform for user collaboration. The best part? The Data Dictionary is based on technology many users have already experienced with Wikipedia.” A financial analyst said of the system: “I’m quite excited about the ease of collaboration provided by the Data Dictionary. When analysts learn something interesting from their analysis they can now post their findings in the Data Dictionary for others to see. The Data Dictionary will dramatically decrease learning time for data analysts transitioning to the EDW.”

All in all, the system is a success for HPHC, but as with any new system there are improvements on the horizon. The team wants to build in wiki forms for user contributions and to use less wiki markup and more templates, which would allow for easier bulk revisions of page layouts. They need a more detailed taxonomy; elements such as “Analytic Topics” were good for demonstrations, but did not turn out to be all that useful. They want to develop more business vocabulary, along with strategies for relating it to metadata, and to create a conceptual search. Other extended-scope projects include SOA documentation and integration into enterprise Knowledge Management.

Wikify may not be an actual word in an official dictionary, but the creation of the Data Dictionary by HPHC has demonstrated how creativity and a lot of hard work can bring a project with highly specific driving requirements to full fruition and fulfill the needs of all the stakeholders involved in the process.
