6. Approach to internal data management
The linked open data approach described above is a good solution to many of the problems of data discovery and data sharing in an open web environment. Many of the same issues are encountered inside any large organisation and so it is interesting to consider whether open data technology is also applicable to internal data management and data exchange.
Many tasks of researchers and analysts within the government require accessing and combining data from multiple sources. This involves finding what is available, understanding how it was produced and what limitations and constraints it might have, then processing the data into the required form.
Linked data is essentially a data integration technology, well suited to distributed and diverse collections of data, such as those created and used within the Scottish Government. It is becoming more popular for enterprise data integration and business intelligence applications.
In this chapter we consider the implications of applying the linked data approach to internal data management.
The Analytical Leadership Group of the Scottish Government recently considered a paper on 'Strategic Data Management' setting out the main requirements for management and exchange of data within the analytical services divisions.
This document identifies the main objectives of a new solution to strategic data management as:
1. Provide visibility of analytical datasets held across SG to all analysts
2. Enable access to data from the widest possible range of analytical tools
3. Ensure the security and integrity of data
4. Provide metadata on the datasets we hold and the data within them
5. Empower analysts to manage their data effectively within a clear framework
6. Automate as much as possible
We will consider how a linked data approach would relate to each of these main objectives.
1. The issues around discovery of open data are very relevant to this objective. Web technologies in general (corporate intranets for example) are highly applicable to 'inside the firewall' applications where there is a medium to large number of users, particularly when they are distributed across several locations.
Our discussions with government analysts indicated that personal contacts - 'knowing who to ask' - are often the first port of call when trying to find information from outside their own domain. This will always be a useful component of data exchange, but a more systematic approach can help people to discover data more quickly and more reliably. Picking up the phone to the appropriate person may still be necessary or useful for detailed questions, but a better approach to dataset cataloguing and dataset metadata will reduce the workload on individuals and will allow analysts to find data that they may otherwise have missed.
Creating a simple standardised set of dataset metadata and tools to support the creation and maintenance of that metadata (automated where possible) allows the creation of browsing and search tools that will improve data discovery within Analytical Services.
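As an illustration of how a small standardised metadata record can support search, the following is a minimal sketch in Python. The field names (identifier, title, description, owner, keywords) and the catalogue entries are hypothetical and chosen only for illustration; they are not a metadata standard proposed in this report.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One catalogue entry. The fields are an assumed minimal set."""
    identifier: str
    title: str
    description: str
    owner: str
    keywords: list = field(default_factory=list)


def search(catalogue, term):
    """Return records whose title, description or keywords contain the term."""
    term = term.lower()
    return [
        record for record in catalogue
        if term in record.title.lower()
        or term in record.description.lower()
        or any(term in keyword.lower() for keyword in record.keywords)
    ]


# Hypothetical catalogue entries, invented for this example.
catalogue = [
    DatasetRecord("ds-001", "School leaver destinations",
                  "Destinations of school leavers by council area",
                  "Education Analytical Services", ["education", "schools"]),
    DatasetRecord("ds-002", "Housing completions",
                  "New build housing completions by quarter",
                  "Housing Statistics", ["housing", "construction"]),
]

matches = search(catalogue, "housing")
print([record.identifier for record in matches])
```

Because every record has the same fields, the same search function works across all datasets, whichever team maintains them.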
2. Commonly used tools include SAS, Excel, SPSS and SQL Server. There has been a significant investment in the use of these tools, in terms of staff expertise and development of customised workflows and other software. Any new solution to data management must allow these tools to continue to be used, and must be flexible enough to allow the introduction of new analysis tools in future.
If data is to be held as linked data in the underlying data management system, the structure and meaning of the data must be made explicit. This requires effort, but generally only needs to be done once for each dataset or group of related datasets, and the benefit is realised each time that data is subsequently reused.
There are two aspects to consider: how data in linked data form is created and updated; and how linked data is accessed via existing tools.
The majority of the data analysis tools in use are based on essentially tabular data. Statistical linked data consists mainly of n-dimensional data cubes, supported by reference data in list or tree structures. By choosing two dimensions at a time and holding the values of the remaining dimensions fixed, data cube data can be converted easily into a table or series of tables. Some format conversion work will be required to extract data in a form that can be read into tools like Excel and SAS but this is relatively straightforward.
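The conversion from a data cube to a table can be sketched as follows. This is a minimal illustration in Python, not a description of any particular tool: the dimension names (area, period, sex) and all the values are invented, and the observations are held as plain dictionaries rather than retrieved from a triple store.

```python
import csv
import io
from collections import defaultdict

# Hypothetical observations from a three-dimensional cube
# (area x period x sex). All values are invented.
observations = [
    {"area": "Fife", "period": "2012", "sex": "F", "value": 10},
    {"area": "Fife", "period": "2013", "sex": "F", "value": 12},
    {"area": "Angus", "period": "2012", "sex": "F", "value": 7},
    {"area": "Angus", "period": "2013", "sex": "F", "value": 8},
    {"area": "Fife", "period": "2012", "sex": "M", "value": 9},
    {"area": "Fife", "period": "2013", "sex": "M", "value": 11},
    {"area": "Angus", "period": "2012", "sex": "M", "value": 6},
    {"area": "Angus", "period": "2013", "sex": "M", "value": 5},
]


def pivot(observations, row_dim, col_dim, fixed):
    """Build a 2-D table from the slice where the `fixed` dimensions
    take the given values. Returns (column_labels, rows)."""
    table = defaultdict(dict)
    columns = []
    for obs in observations:
        if any(obs[dim] != val for dim, val in fixed.items()):
            continue  # observation lies outside the chosen slice
        col = obs[col_dim]
        if col not in columns:
            columns.append(col)
        table[obs[row_dim]][col] = obs["value"]
    rows = [[label] + [cells.get(c) for c in columns]
            for label, cells in table.items()]
    return columns, rows


def to_csv(row_dim, columns, rows):
    """Serialise the table as CSV text readable by Excel, SAS and similar."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([row_dim] + columns)
    writer.writerows(rows)
    return buffer.getvalue()


columns, rows = pivot(observations, "area", "period", {"sex": "F"})
print(to_csv("area", columns, rows))
```

Repeating the call with a different fixed value (for example `{"sex": "M"}`) yields the next table in the series.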
The task of creating and editing linked data from tabular data tools is more difficult and is the subject of active research and development, for example in the EU funded OpenCube project. It requires a mapping of data from a rows and columns format to the triple-based representation of linked data, involving consideration of the 'Linked Data Cookbook' steps described in Section 5.1.2 of this report. A series of data mappings will need to be developed and maintained, and used to support the integration of data analysis tools with the underlying data management store. This is the most technically complex aspect of using a linked data approach to the data management system, but is also the aspect that brings significant benefits as it enables a rich integration of data from different sources.
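To make the idea of a data mapping concrete, the sketch below turns one table row into triples, using property names from the W3C RDF Data Cube vocabulary and the SDMX dimension and measure vocabularies. It is an illustration under stated assumptions, not the mapping approach of any specific project: the `http://example.org/` base URI, the column names, the URI patterns and the sample row are all hypothetical, and triples are held as plain Python tuples where a real system would use an RDF library and distinguish URIs from typed literals.

```python
QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
BASE = "http://example.org/"  # hypothetical base URI

# The mapping itself: table column -> property URI (assumed column names).
COLUMN_MAP = {
    "area": SDMX_DIM + "refArea",
    "period": SDMX_DIM + "refPeriod",
    "value": SDMX_MEASURE + "obsValue",
}
# Columns whose values identify the observation and point at reference data.
DIMENSION_COLUMNS = ("area", "period")


def row_to_triples(row, dataset_uri):
    """Map one table row to (subject, predicate, object) triples."""
    key = "/".join(row[column] for column in DIMENSION_COLUMNS)
    observation = f"{dataset_uri}/obs/{key}"
    triples = [
        (observation, RDF_TYPE, QB + "Observation"),
        (observation, QB + "dataSet", dataset_uri),
    ]
    for column, prop in COLUMN_MAP.items():
        if column in DIMENSION_COLUMNS:
            # Dimension values become links to reference data resources.
            obj = f"{BASE}id/{column}/{row[column]}"
        else:
            # Measures stay as literal values.
            obj = row[column]
        triples.append((observation, prop, obj))
    return triples


# Invented sample row.
row = {"area": "fife", "period": "2013", "value": 12}
for triple in row_to_triples(row, BASE + "data/population"):
    print(triple)
```

Keeping the mapping in a single table such as `COLUMN_MAP` means it can be maintained separately from the conversion code, which is one way to manage the series of mappings described above.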
3. Data held in the data management system will be subject to access restrictions as some of it will include personal data or other sensitive material. Therefore, not only will the system as a whole need to be secure from unauthorised access, it must also be possible to manage access to datasets within the system. Using currently available linked data technology, there are a number of possible approaches to this issue. The challenge here is that one of the main strengths of linked data, the ability to use SPARQL to query across a number of interlinked datasets, is a feature that must be carefully controlled if access to specific parts of the data is to be controlled.
There are three main levels of data organisation in a triple store: the triple, the named graph and the database (sometimes referred to as the 'dataset' in SPARQL documentation, but this is a different use of the word 'dataset' to the rest of this report, so we will avoid that term). A typical triple store platform can host many databases simultaneously. In most triple store implementations, a SPARQL query can access any data in a single database. Some stores may have extensions that allow finer grained access control for SPARQL queries, but this is not standard, so relying on such extensions would restrict the choice of technology. Access to individual resources or individual named graphs ('buckets' of data within a database) can be controlled in a user interface layer for general data presentation and data browsing, but is hard to manage reliably if SPARQL query access is offered.
A reasonable approach may be to manage access at the database level, which allows a number of reliable security technology options. This means that data should be grouped into databases based on consistent access requirements. This is a similar concept to controlling access to SAS data at the 'library' level.
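Database-level access control can be sketched as a check that runs before a query is forwarded to the triple store. The example below is a minimal illustration in Python: the database names, group names, users and the `execute` callable (standing in for whatever client a given triple store provides) are all hypothetical.

```python
# Hypothetical configuration: which groups may query each database.
DATABASE_READERS = {
    "published-statistics": {"all-analysts"},
    "health-records": {"health-analysts"},
}

# Hypothetical group membership.
USER_GROUPS = {
    "alice": {"all-analysts", "health-analysts"},
    "bob": {"all-analysts"},
}


class AccessDenied(Exception):
    """Raised when a user may not query the requested database."""


def can_query(user, database):
    """True if the user belongs to a group allowed to read the database."""
    allowed_groups = DATABASE_READERS.get(database, set())
    return bool(USER_GROUPS.get(user, set()) & allowed_groups)


def run_sparql(user, database, query, execute):
    """Forward the query only if the user may access the whole database.

    `execute` is any callable taking (database, query); it is a stand-in
    for the store's own client, which is not specified here.
    """
    if not can_query(user, database):
        raise AccessDenied(f"{user} may not query {database}")
    return execute(database, query)


print(can_query("alice", "health-records"))
print(can_query("bob", "health-records"))
```

Because the check applies to the database as a whole, it does not need to inspect the SPARQL query, which is what makes this level of control simpler to enforce reliably than control at the level of named graphs or individual triples.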
4. Providing consistent metadata on datasets and their contents is an important requirement for making the data easy to find and for an analyst to decide if it is suitable for their purpose. This objective has much in common with objective 1 in terms of the technical requirements for meeting it.
5. Empowering analysts to manage data will require good management tools for reviewing the datasets in a database, controlling who has access to them, creating and maintaining metadata, managing versioning and so on. Due attention will need to be paid to providing a user friendly interface to support these functions.
6. The linked data approach incorporates open standards for interfaces and APIs. Consistent APIs and consistent standards for data representation mean that a high degree of automation is possible. In addition to the standard facilities offered by linked data systems, some form of workflow management will be required. There are many existing options for doing this and the choice of technology for workflow management would depend on the complexity of the processes to be automated and the frequency with which they are modified.
Using linked data for internal data management will significantly reduce the additional effort required for open data publishing, as the hard work of analysing and explicitly representing the data structure and meaning will already have been done.
Use of linked data for internal data management, with access control and versioning, is less well established than use of linked data for open data publishing. To learn in more detail about the opportunities and possible pitfalls, we suggest carrying out and reviewing a small scale experimental implementation of the approach.
Email: Sara Grainger