Datasets Addition Promising Extension For Schema.Org

A call for comments is out on a proposal for a 'Datasets' addition to schema.org, issued via the W3C’s Web Schemas task force, the group the project uses to collaborate with the wider community.

The proposal, which extends schema.org for describing datasets and data catalogs, introduces three new types with associated properties.
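As a rough illustration of what a description using such a vocabulary might look like, here is a minimal Python sketch that builds a JSON-LD record for one dataset. The type names (Dataset, DataCatalog, DataDownload) and property names follow the published schema.org vocabulary and are assumptions relative to the draft under discussion, which may have differed. The dataset, catalog name, and URL are hypothetical placeholders.

```python
import json


def describe_dataset(name, description, catalog_name, download_url, media_type):
    """Build a JSON-LD style description of one dataset as a plain dict.

    Type and property names are taken from the published schema.org
    vocabulary (an assumption; the draft proposal may have used others).
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # The catalog that lists this dataset.
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
        },
        # One downloadable form of the dataset.
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": download_url,
                "encodingFormat": media_type,
            }
        ],
    }


# Hypothetical example values, not taken from any real catalog.
example = describe_dataset(
    name="Example Rainfall Measurements",
    description="Monthly rainfall totals for an example region.",
    catalog_name="Example Open Data Catalog",
    download_url="https://example.org/data/rainfall.csv",
    media_type="text/csv",
)

print(json.dumps(example, indent=2))
```

The output could be embedded in a page inside a script element of type application/ld+json, which is one common way of publishing schema.org markup.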

Writing at the schema.org blog, Dan Brickley calls it a “small but useful vocabulary,” with particular relevance to open government and public sector data.

He also references this week’s post at Data.gov by Chris Musialek, the chief software architect for the site. Musialek writes that, following a review of the draft proposal, “we are comfortable with the current state of things,” and that any work left to do seems very resolvable.

“We've been watching the datasets schema space for a while now, as [Data.gov] is very interested in adding support for our listing of over 450,000 datasets. We think this will help the major search engines create better relevance rankings of Federal government data, where many searches begin,” Musialek says. He notes later in the post, “We're really excited to see this schema move in the direction of official addition to [schema.org]. We really hope to see it be included in a release soon.”

The Tetherless World Constellation at Rensselaer Polytechnic Institute, where Professor James A. Hendler is now the head of the Department of Computer Science, has a demo available that contains automatically generated dataset descriptions based on TWC's International Dataset Search and that uses the schema.org extension for datasets and data catalogs. A few weeks back, at the Semantic Technology & Business Conference in San Francisco, Hendler told The Semantic Web Blog in an interview that, while a vocabulary for describing datasets and data catalogs was not yet part of schema.org, efforts were underway to make that happen.

In that interview, Hendler also disclosed that the number of open government data sets on the web has hit the million mark. In his blog posting, Brickley says the proposal is exciting because of the “huge number of datasets that have been made public in recent years. While each dataset may ultimately be expressed in detailed, domain-specific form (e.g. using specific scientific or statistical schemas), the Datasets proposal focuses on the high level common characteristics that are shared across thousands of otherwise diverse datasets.”

The proposal includes a table mapping Datasets extension types and properties (including supporting vocabulary) to and from their approximate equivalents in Data Catalog Vocabulary (DCAT), Asset Description Metadata Schema (ADMS), and VoID. The next step for the proposal is to gather feedback from publishers of applicable datasets on whether the extension would be useful to them and whether it is a good fit for their available metadata.