The non-obvious ‘second use’ of Excel is that of a universal data recycling bin for Organisational ‘reports’ and ‘data dumps’. Examples exist everywhere; the main reason is that it is just so convenient to do – everyone is an Excel expert when it comes to tidying up, sorting and generally asking questions that can be answered from within any one single spreadsheet. But asking questions that can only be answered by connecting many spreadsheets together is harder to achieve technically and yet the payoff can be large.
• A list of 30,000 Gene names in one spreadsheet has minimal value.
• List the expression activity of those Genes in a second spreadsheet alongside Drugs that may have influenced that activity in a third spreadsheet and valuable new information may be extracted.
People use multiple spreadsheets as closet databases – but spreadsheets do not naturally lend themselves to the kinds of sophisticated questioning that an Oracle Database supports.
The Opportunity and the Problem
The EPA's 2008 report contains stunning insights into the relationships between the quality of Air, Water and Land and the state of Health of the US.
“At the heart of the report are “indicators” that track important trends in environmental quality and human health”, all of which ‘sits on top of’ 327 Excel spreadsheets.
“Tell me about a higher than normal incidence of birth defects in this region where this noxious gas emission was high” – means comparing data across these multiple data sources.
These valuable answers are hard to achieve because of the technical problem of how to ‘join’ just the data sources needed to bring specific rows back that have only the columns needed.
Consider the alternatives facing a researcher with 1,000 questions for this EPA data:
1. Manually cutting and pasting in Excel.
a. Using a familiar analogy; start with the first spreadsheet, sort to find ‘employee name and discover the employee id’. Then go to the second spreadsheet and sort on employee id to discover the job salary, then find the department …
b. No one will ever complete the multiple cut and paste operations for 1000 questions so in practice valuable insights remain undiscovered.
2. Programming the 1000 questions.
a. Leaving aside the presence or absence of budget for this task, the deliverable will typically be 1000 SQL reports.
i. The essence of ‘discovery’ is ‘I am not sure of the second question I want to ask until I see the answers from my first question’.
ii. People want to be in charge of asking their own ad hoc questions and not be constrained by 1000 canned SQL reports.
b. I.T. have tried repeatedly to solve this problem with ‘SQL Join Wizards’ but you will be hard pressed to find many real business people that use these on a daily basis.
i. In fairness, for well-defined domains such as Data Warehouses with a few ‘tables’ there are good tools such as Business Objects.
ii. The EPA has 327 disparate ‘tables’ - a different technical challenge because there are infinitely many ‘paths’ connecting them.
3. Using the Semantic Web to solve the problem.
a. This approach can completely solve the problem in its entirety.
b. The Semantic Web provides one of those rare jumps in productivity that will completely change the effectiveness of information discovery.
Newcomers to the Semantic Web often struggle through complex acronyms in an attempt to find the benefit – “What can I do with the Semantic Web today and why should I care”?
The seldom explained answer is that the Semantic Web solves the problems associated with approaches 1 and 2 above – no further need for 1000 expensive inflexible SQL statements!
Looking at a diagram of all of the world’s airports and the network of flight-path ‘arrows’ connecting them it is obvious that (A) ‘Everything is Connected’ and perhaps less obvious that (B) ‘Knowledge is encoded in the Connections’.
Simply put, you shouldn’t have to tell the Pilot which runway to use (she gets that from ‘B’) or which flight-path to take from London to SF via Washington (she gets that from ‘A’).
Semantic Discovery System
http://www.SemanticDiscoverySystem.com – SDS – is a product developed over 8 years in collaboration with GSK and the Universities of Manchester and Pennsylvania to solve problems such as those faced by the EPA using the new power of the Semantic Web.
Following an invitation from Brand Niemann of the EPA, we used SDS to provide a graphical interface to the 327 spreadsheets to allow point and click composition of any number of questions.
‘Pointing’ SDS at the spreadsheets generated the Semantic Model below automatically
B Donnelly Figure 1
Initially we see terms such as ‘X2_13_Panel_A’ based on the actual Excel filenames and worksheet labels – we need to change these to fewer, meaningful new names.
There may be 5 airports at each of New York and London but it is easier for a high level graph to use an ‘Airport’ abstraction and then drill down through the name of either New York or London to Heathrow, JFK etc.
Using this approach we abstracted the 327 spreadsheets to a much smaller number of sensibly named concepts. SDS then drew arrows between the concepts based on the simple expedient of a user-defined spreadsheet that describes what should be joined to what.
Now we can ask and answer any question by pointing and clicking on any attribute in any concept, then doing the same in another concept a few arrows away.
B Donnelly Figure 2
We believe this SDS approach has wide applicability:
- We are working with Brand Niemann to integrate circa 1500 spreadsheets of the Statistical Abstract of the US
- We will abstract the ‘1500’ down to circa 20 concepts (BUDGET, REGION, INDUSTRY …).
- We will ‘connect the dots’ and link this with other Semantic models to open the door to cross domain queries.
- We are working http://www.cordobasolutions.co.uk in the business domains of Project, Programme and Change Management.
- Programmes worldwide are ‘run on’ Microsoft Project with Excel files logging Issues, Risks and Status.
- Management wants answers that span all plans and spreadsheets and have to resort to time consuming cutting and pasting.
- We are working with Drug R&D companies to provide ‘Semantic Query of Histological Images’ and integration of large Oracle databases.
We believe that there is understandable confusion in the world today about what the Semantic Web is and what immediate benefits it can deliver.
Whilst great benefits will arise from reasoning and vocabulary optimisations, we think there are a host of instantly useful productivity benefits to be had using the approach described in this paper.
We would like to thank Brand Niemann of the EPA for his help and support in compiling this use case.
Note for Technical Readers - more detail at http://www.SemanticDiscoverySystem.com
- The Semantic Web provides a new language called SPARQL that avoids the time-expensive part of SQL programming – writing the joins.
- Given a graph it is simple to create a SPARQL query which will retrieve a sub-graph.
- For those experienced with RDBMS that are coming to the Semantic Web for the first time, SPARQL looks a lot like SQL - without the joins but with the addition of pattern matching.
- When we use SDS to ‘sit on top’ of large Oracle systems, we automatically generate the complete OWL Graph in a similar way to the EPA screenshots but in this case we transform the SPARQL to native Oracle SQL.