The Legal Information Institute at Cornell University Law School is about making law accessible and understandable, for free. It’s been engaged in that mission since the early ’90s, and semantic web technology today plays a role in furthering that goal.
The organization this month published a new electronic edition of the Code of Federal Regulations (CFR), which contains a bevy of rules across 50 titles that impact nearly all areas of American business. Work underway at LII, dubbed the Linked Legal Data project, seeks to apply Linked Open Data principles to enhance access to the CFR, with capabilities such as being able to search its Title 21 Food and Drugs database using brand names for drugs (such as Tylenol) and receiving the generic name for the drug (acetaminophen) as a suggested term. “You cannot look for regulatory information on Tylenol in the CFR because Tylenol will never be there,” says Dr. Núria Casellas, a visiting scholar at the LII who is spearheading work on the project. “That is a brand name. What you actually want to look for are components, such as acetaminophen.”
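The brand-to-generic suggestion Casellas describes can be pictured with a minimal Python sketch. It is an illustration only, not LII's implementation: the `BRAND_TO_GENERIC` table and the `suggest_terms` function are hypothetical names, and the single entry comes from the Tylenol example above.

```python
# Hypothetical lookup table; a real system would draw on an external drug database.
BRAND_TO_GENERIC = {
    "tylenol": "acetaminophen",
}


def suggest_terms(query: str) -> list[str]:
    """Return generic names to suggest for any brand names found in the query."""
    suggestions = []
    for word in query.lower().split():
        token = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        generic = BRAND_TO_GENERIC.get(token)
        if generic and generic not in suggestions:
            suggestions.append(generic)
    return suggestions


print(suggest_terms("Tylenol labeling requirements"))  # ['acetaminophen']
```

A search interface could show these suggestions next to the original query rather than silently rewriting it, so users can see why results for a different term appear.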
While the general citizenry might find reasons to leverage the fruits of this effort, businesses that must comply with these requirements are a more likely target – not just the lawyers and paralegals, but those responsible for tasks such as storing and caring for products their company exports or imports, including understanding the safety regulations that apply to them. The Tylenol-acetaminophen example, Casellas says, is very interesting because it showcases how using the wrong word or the incorrect approach can keep a company from finding the relevant regulatory or safety information it needs to take into consideration.
In fact, the project’s original main use case has centered on product and industrial information, with a focus on Title 21. “We would like to give the definitions, obligations and vocabularies, and product information to enhance search and retrieval, and also visualization of the information,” says Casellas. So, if a company produces certain materials, for instance, it will be possible to retrieve all the information from the CFR that relates to them.
Making Connections To Other Resources
But there’s also intent to connect with external resources. Casellas speaks, for example, of linking materials from the Drug Bank open data drug and drug target database, which has been transformed into RDF and made available as a SPARQL endpoint, to Title 21 in the CFR, and vice versa. “Our purpose is not only to improve searching our site or the visualization of results or giving more information to the public, but to open this data to outside sources that would like to reuse it,” she says. “That’s fulfilling the idea of free access.”
Another reference to the Tylenol issue helps point out the usefulness of these kinds of connections, because the Drug Bank “can suggest all this information as it has the information regarding the pharmacological components, and all these brand names that are FDA-approved,” she says. “So that is one of the areas that to us is very relevant.”
Title 21, Casellas says, is very long, and quite difficult given the structure, so getting good results here is a good indicator that there will be success correctly parsing the rest of the titles. The team is developing a SKOS-based thesaurus derived from the terms used in the CFR, and extracting definitions and obligations. “We would like to present the user with the definition of the term, maybe because they’re in a section talking about something that is defined elsewhere, and we want to contextualize that so they know they are using the term in the same way,” she says. When it comes to obligations – for instance, what rules a manufacturer of a certain product is required to abide by – “if we could offer some of those answers, just show some of these obligations, that would be an advance because at the moment, you just have to read the whole section and try to discover for yourself.”
The language of definitions and obligations shares some challenges, such as the fact that information extraction must be done from long and complex sentences (this is the law we are talking about!), and parsers have trouble with the wording and organization of words. “Even with implementations of plain language in the legal domain this is a hard area of research for NLP [natural language processing],” she says. “Also, the structure of the XML we work with is not as constrained as it ought to be, so it’s also complicated.”
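For readers unfamiliar with SKOS, the sketch below shows roughly what one thesaurus entry might look like when written out in Turtle. It is a guess at the general shape, not the project's actual data model: the `example.org` namespace, the `Concept` class and the `to_turtle` function are all hypothetical, and treating a brand name as an alternative label is a modeling assumption made for this illustration. Only the SKOS property names (`skos:prefLabel`, `skos:altLabel`, `skos:definition`) come from the SKOS vocabulary itself.

```python
from dataclasses import dataclass, field
from typing import Optional

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/cfr-thesaurus/"  # hypothetical namespace, not LII's


@dataclass
class Concept:
    slug: str
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)
    definition: Optional[str] = None


def to_turtle(concept: Concept) -> str:
    """Serialize one concept as Turtle (labels are assumed to contain no quotes)."""
    statements = ["a skos:Concept", f'skos:prefLabel "{concept.pref_label}"@en']
    statements += [f'skos:altLabel "{alt}"@en' for alt in concept.alt_labels]
    if concept.definition:
        statements.append(f'skos:definition "{concept.definition}"@en')
    body = " ;\n    ".join(statements)
    return f"@prefix skos: <{SKOS_NS}> .\n\n<{BASE}{concept.slug}>\n    {body} .\n"


# Brand name modeled as an alternative label: an assumption made for this sketch.
print(to_turtle(Concept("acetaminophen", "acetaminophen", ["Tylenol"])))
```

A definition extracted from the CFR could be attached through the optional `definition` field, which is one way a site might show users how a term is defined elsewhere in the code.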
The work includes reusing product codes from sources such as the North American Industry Classification System (NAICS) that usually help with tax purposes for buyers’ and sellers’ financial systems. “We thought incorporating those structures could be very helpful, because usually the users know of these codes or they use them in their daily work,” Casellas says. To that end, the team has converted the product codes to RDF, and students involved in the project are now exploiting the structure of RDF to find sections in the CFR that make reference to these codes, she says.
By the end of this month Casellas hopes to have results that will move the project from the first testing phase, which focuses on retrieving definitions and obligations in cases where there is clarity and completeness, to the next step. “The second phase is then to deal with the hard cases.”
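One way to picture how a code hierarchy expressed in RDF can help locate sections is the toy Python sketch below. Everything in it is invented for illustration: the code identifiers, labels, section names and section text are placeholders rather than real NAICS or CFR content, and the plain tuples stand in for a real RDF store. The assumption is that codes are linked to their parents with `skos:broader`, so a query for one code can also pick up sections that mention its narrower codes.

```python
# Triples as (subject, predicate, object). All identifiers are made up.
TRIPLES = [
    ("code:A", "skos:prefLabel", "widget manufacturing"),
    ("code:A1", "skos:prefLabel", "small widget manufacturing"),
    ("code:A1", "skos:broader", "code:A"),
]

# Placeholder section text, not taken from the CFR.
SECTIONS = {
    "section-1": "Requirements for small widget manufacturing facilities.",
    "section-2": "General provisions.",
}


def narrower_closure(code, triples):
    """Return the code plus every code below it in the skos:broader hierarchy."""
    found = {code}
    frontier = [code]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if predicate == "skos:broader" and obj == current and subject not in found:
                found.add(subject)
                frontier.append(subject)
    return found


def sections_for_code(code, triples, sections):
    """Find sections whose text mentions the label of the code or a narrower code."""
    codes = narrower_closure(code, triples)
    labels = [o.lower() for s, p, o in triples if p == "skos:prefLabel" and s in codes]
    return sorted(
        sid for sid, text in sections.items()
        if any(label in text.lower() for label in labels)
    )


print(sections_for_code("code:A", TRIPLES, SECTIONS))  # ['section-1']
```

Plain substring matching is the weakest part of this sketch; the parsing difficulties Casellas describes suggest that real matching against regulatory text would need considerably more care.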
In the announcement of the new electronic edition of the CFR, it’s noted that other near-term enhancements will include searches by United Nations product code, the identification and linking of relevant agency guidance information for each Part and Section, and a wide variety of Linked Data offerings.