Annex B Identifiers
Although organisations are only advised to aim for 3 star data format as their initial goal in open data publishing, using identifiers in the data can make it easier to move towards 5 star data. In order to see what is involved, it's helpful to first look more closely at the kind of information that is typically expressed in tabular data. As an example, consider the number of births in Glasgow City during a given year. In a spreadsheet, we might see something like this:
|Year||Area||Number of births|
|2014||Glasgow||7,311|
This row of data can be read as a statement: "In 2014, the number of births in Glasgow was 7,311". In order to make a clear statement, we need to be able to refer to things (or more generally, to resources) in an unambiguous way. But the term 'Glasgow' is ambiguous. We could be using it to refer to Glasgow City (pop. about 600,000) or Greater Glasgow (pop. about 1.2 million). Of course, a careful data publisher could try to distinguish consistently between these two senses, but it is hard to eliminate the risk of confusion.
In the context of open data, terms for referring unambiguously to things (both concrete and abstract) are called unique identifiers. The most common approach uses Uniform Resource Identifiers (URIs for short). These look just like familiar web addresses, but are intended primarily for naming resources rather than as web pages to visit in a browser. An example URI is http://statistics.gov.scot/id/statistical-geography/S12000046, which is in fact a unique identifier for Glasgow City. Although the standard framework of 5 star data goes beyond the limitations of tabular data, it is perfectly possible to use URIs in tabular data. For example, we could produce a data record like this:
|Year||Area URI||Area||Number of births|
|2014||http://statistics.gov.scot/id/statistical-geography/S12000046||Glasgow||7,311|
This preserves the informal label for Glasgow while also including a URI.
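As a rough illustration of how such a record might be read by a program, the sketch below parses the row as CSV using Python's standard library. The CSV layout, the variable names and the `read_records` helper are assumptions made for this example; only the column names, the URI and the figures come from the text above.

```python
import csv
import io

# The example record from the text, written as CSV with an "Area URI" column.
# The comma-separated layout is an assumption made for this sketch.
CSV_TEXT = """Year,Area URI,Area,Number of births
2014,http://statistics.gov.scot/id/statistical-geography/S12000046,Glasgow,7311
"""


def read_records(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(CSV_TEXT)
for rec in records:
    # The URI, not the informal label, says which "Glasgow" is meant.
    births = int(rec["Number of births"])
    print(f"In {rec['Year']}, the number of births in <{rec['Area URI']}> "
          f"({rec['Area']}) was {births:,}")
```

The informal label stays available for human readers, while software can rely on the URI column alone.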
URIs are useful as a formal mechanism to clarify what a data record is talking about. They are also useful in another, perhaps more important, way: they allow two different datasets to give information about a common set of resources. For example, dataset A could contain a wide range of administrative data about Glasgow City, while dataset B contains information about health resources, including maternity wards and antenatal classes. As long as the two datasets use the same identifier for Glasgow City, it becomes much easier to combine them automatically and so derive a much richer and more complete set of statements.
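The idea of combining datasets on a shared identifier can be sketched in a few lines of Python. Everything below is illustrative: the field names, the `join_on_uri` helper and the maternity ward count are made up for the example, and only the URI and the births figure are taken from this annex.

```python
GLASGOW_CITY = "http://statistics.gov.scot/id/statistical-geography/S12000046"

# Dataset A: administrative data (the births figure is the one used in this annex).
dataset_a = [
    {"area_uri": GLASGOW_CITY, "year": 2014, "births": 7311},
]

# Dataset B: health resources. The count below is a made-up placeholder,
# not a real statistic.
dataset_b = [
    {"area_uri": GLASGOW_CITY, "maternity_wards": 3},
]


def join_on_uri(left, right, key="area_uri"):
    """Merge rows from two datasets that carry the same identifier."""
    right_by_uri = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_uri.get(row[key])
        if match is not None:
            # Combine the fields of both rows into a single richer record.
            merged.append({**row, **match})
    return merged


combined = join_on_uri(dataset_a, dataset_b)
print(combined)
```

Because both datasets use the same URI, no guesswork about spelling or about which "Glasgow" is meant is needed to match the rows.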
URIs are designed to be processed by machines rather than by humans. Although a URI does not have to correspond to a page that can be viewed in a web browser, it is best practice to have a way of getting from a URI to a human-readable web page, so that any questions about usage and interpretation can be clarified. This step is usually accomplished by something called Content Negotiation, which redirects a web browser from a URI to an associated page; for example, if you point your browser to the URI
http://statistics.gov.scot/id/statistical-geography/S12000046
you will automatically get redirected to
http://statistics.gov.scot/doc/statistical-geography/S12000046
These addresses look exactly the same except that 'id' (for identifier) in the first has been replaced by 'doc' (for document) in the second. The second address takes you to a web page with lots of useful information about the resource.
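The naming pattern described here can be mirrored in a small Python function. This is only a sketch of the 'id' to 'doc' convention as it applies to the Glasgow City URI; the redirect itself is performed by the web server, and the `id_to_doc` helper is a hypothetical name that assumes 'id' is the first segment of the path.

```python
from urllib.parse import urlsplit, urlunsplit


def id_to_doc(uri):
    """Return the document address for an identifier URI by swapping the
    leading 'id' path segment for 'doc'. Raises ValueError if the URI
    does not follow that pattern."""
    parts = urlsplit(uri)
    # A path such as "/id/statistical-geography/S12000046" splits into
    # ["", "id", "statistical-geography", "S12000046"].
    segments = parts.path.split("/")
    if len(segments) < 2 or segments[1] != "id":
        raise ValueError(f"not an 'id' URI: {uri}")
    segments[1] = "doc"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(id_to_doc("http://statistics.gov.scot/id/statistical-geography/S12000046"))
```

Other publishers may use different conventions, so in practice it is safer to follow the redirect than to rewrite the address by hand.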
Unfortunately, there is another level of complexity in using URIs. A URI is unambiguous because it names only one resource. However, there might be a second URI (or indeed many) which also names that resource. As an example, http://dbpedia.org/resource/Glasgow is the URI for Glasgow City provided by DBpedia (and http://dbpedia.org/page/Glasgow is the associated human-readable page). This multiplication of synonymous URIs can be dealt with by adding a piece of data to your datasets which says that two URIs refer to the same thing.
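One simple way to picture that extra piece of data is as a lookup table from each synonymous URI to a preferred one. The sketch below assumes such a table is maintained by hand; the `SAME_AS` mapping and the `canonical` helper are hypothetical names for this example. In linked data the same statement is commonly expressed with the `owl:sameAs` property.

```python
STATS_GLASGOW = "http://statistics.gov.scot/id/statistical-geography/S12000046"
DBPEDIA_GLASGOW = "http://dbpedia.org/resource/Glasgow"

# Each entry says "the key refers to the same thing as the value".
SAME_AS = {DBPEDIA_GLASGOW: STATS_GLASGOW}


def canonical(uri, same_as=SAME_AS):
    """Follow same-as links until reaching a URI with no further mapping."""
    seen = set()
    # The 'seen' set guards against accidental cycles in the mapping.
    while uri in same_as and uri not in seen:
        seen.add(uri)
        uri = same_as[uri]
    return uri


# Both URIs now resolve to a single preferred identifier.
print(canonical(DBPEDIA_GLASGOW))
print(canonical(STATS_GLASGOW))
```

Normalising identifiers in this way before combining datasets lets records that use either URI be matched against each other.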
In summary, including URIs in tabular data is a useful step in increasing the interoperability between multiple datasets and allowing more useful inferences to be drawn from the data. Identifying the right URIs to use, however, is not always straightforward. Although best practice is to re-use existing identifiers, it requires a mixture of experience, effort and luck to search these out. As the open data ecosystem in Scotland evolves, it should become increasingly possible to establish standards for public sector URIs.
Email: Stuart Law, Stuart.Law@gov.scot
Kyle Malcolm, Kyle.Malcolm@gov.scot