Public sector personal data sharing: framework and principles

This report considers frameworks and practices of providing access to personal data by public sector organisations to private organisations.

3. Data Sharing Pathways

Pathway 1: Data Sharing Agreements around the World

This framework for data sharing, where a public interest/public benefit is identified and a data sharing agreement is drawn up, is common practice in the UK, Europe, Australia, Canada, the USA and elsewhere; it constitutes what we call the first pathway for data sharing and is the most common. However, this pathway is currently used predominantly for sharing public sector personal data with trusted research centres for academic analysis. For example, Statistics Denmark provides access to anonymized personal data to researchers working through Danish research environments[5]. Where this data is to be linked to other datasets not held by Statistics Denmark, approval must be sought from The Danish Data Protection Agency (Danish Data Protection Agency, n.d.), the national independent supervisory authority in Denmark, which performs a role similar to that of the Information Commissioner's Office (ICO) in the UK.

In the USA, federal statistical research data centres facilitate the sharing of restricted or non-public data governed under the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) (U.S. Office of the Chief Technology Officer 2018). Individuals must travel to the centres to access the data securely, with any outputs from analysis reviewed before release to protect against misinterpretation or inappropriate disclosures. As in the UK, records are kept of data access and data disclosures. There are currently 31 Federal Statistical Research Data Centers located across the USA, operating in partnership with more than 50 research organisations, such as universities, government agencies and not-for-profit research institutions[6]. In Australia, another example of these research centres is the Secure Unified Research Environment (SURE) (Sax Institute, n.d.). SURE is a national online workspace, used by over 500 researchers and 25 government and health data custodians, that facilitates the analysis and sharing of health and human service data.

In the UK, an equivalent environment is operated by the Office for National Statistics Secure Research Service (SRS), which provides accredited researchers with access to anonymized unpublished data. SRS operates in accordance with the Five Safes framework of safe people, safe projects, safe settings, safe outputs and safe data (UK Office for National Statistics, n.d.a). While many datasets are available via remote access on the SRS, some data may only be accessible in an approved safe setting. While the data provided to researchers is anonymized, accredited processors may carry out data linkage before the data is de-identified and shared (UK Office for National Statistics, n.d.b).

A similar platform operating in Wales is the Secure Anonymised Information Linkage (SAIL) Databank. SAIL enables the robust storage of anonymized, person-based records for the improvement of health and wellbeing and associated services (SAIL Databank, n.d.). It is accessible only for the purposes of research, providing a secure environment for the analysis of population data.

In Scotland, access to health data is managed through the NHS Scotland Data Safe Haven (NHS Scotland, n.d.). Safe Havens are secure environments for the use of electronic NHS data for research. The data is stored in a de-identified form which can be linked to other datasets, including non-health data, by trained staff. The data remains under the control of the NHS, ensuring compliance with NHS policy and legislation. Safe Havens are required to work to a set of seven principles that govern their operation across Scotland (Scottish Government 2015a). Importantly, one of these principles stipulates that personal data cannot be sold by the Safe Haven nor transferred to a commercial organisation, reflecting concerns over public trust.

At a regional level, DataLoch is a data service that allows academic researchers and health service managers to access health and social care data from the South-East of Scotland (DataLoch, n.d.). Potential users of the data must complete an application process along with relevant accredited training (as used in the Safe Havens). Each application is assessed by NHS employees to ensure that only the minimum amount of data needed to answer the specified research question is requested and that requests are in the public interest. This assessment stage is vital, as the data stored within DataLoch is unconsented, patient-level data.

In each of the above examples, personal data is first processed into an anonymized form and then moved to the safe research environment where the researcher can access it for analysis. An alternative to this is seen in the OpenSAFELY platform. OpenSAFELY is an open-source platform facilitating the analysis of electronic health records in England (OpenSAFELY, n.d.). Currently, the platform is used only to support research related to the COVID-19 pandemic, with all activity on the platform publicly logged, including analysis code. Proposals to use the data are reviewed by both NHS England and the OpenSAFELY group to assess their public benefit and to ensure that they come from accredited analysts. Through collaborations with the organisations that supply and maintain electronic health record systems in England (OpenSAFELY-TPP and OpenSAFELY-EMIS), data can be analysed 'in situ'. This means that researchers first develop their analysis on randomly generated data, which shares the same characteristics as the raw patient data. This first step serves to reduce unnecessary disclosure of individual data. Once the analysis code has been developed and tested, it is sent to the 'live' data environment, with researchers seeing only the tables and graphs that result from their analysis and not the original patient records. Patient data thus remains in the original administrative database where it was first gathered, instead of requiring additional processing to create an anonymized or pseudonymized dataset that is then moved to a secure data store (as is needed for other safe environments).
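The two-stage workflow described above can be sketched schematically: analysts develop their code against randomly generated records, and only the code (never the analyst) reaches the live data, which returns only aggregate outputs. The function names, record fields and output check below are hypothetical illustrations of the pattern, not the actual OpenSAFELY API.

```python
import random

def make_dummy_records(n, seed=0):
    """Randomly generated records that share the *shape* of the real data
    (same fields, plausible values) but none of its actual values."""
    rng = random.Random(seed)
    return [{"age": rng.randint(0, 99), "outcome": rng.random() < 0.1}
            for _ in range(n)]

def analysis(records):
    """The analyst's code: returns only aggregate results
    (outcome rates per 20-year age band), never row-level data."""
    bands = {}
    for r in records:
        band = (r["age"] // 20) * 20
        total, events = bands.get(band, (0, 0))
        bands[band] = (total + 1, events + r["outcome"])
    return {band: events / total
            for band, (total, events) in sorted(bands.items())}

def run_in_secure_environment(analysis_fn, live_records):
    """Stage 2: the submitted code runs where the data lives; only the
    aggregate table is released after an (here, crude) output check."""
    results = analysis_fn(live_records)
    assert all(isinstance(v, float) for v in results.values())
    return results

# Stage 1: develop and test against dummy data before submission.
dummy_results = analysis(make_dummy_records(1000))
```

The point of the sketch is the separation of concerns: `analysis` is written and debugged entirely against `make_dummy_records` output, and the raw records are visible only inside `run_in_secure_environment`.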

In sum, across the UK, Europe and other countries around the world, there is a common pathway for sharing personal data which involves the identification of a public interest/public benefit and the subsequent drawing up of a data sharing agreement. This pathway is currently used predominantly to facilitate the sharing of personal data from the public sector to accredited research organisations (mostly hosted in collaboration with universities). While the overall process is broadly the same, the technical approaches used to manage the data sharing vary. In the first case, as seen in DataLoch and the Safe Havens, the personal data is further processed and then stored in a separate safe environment in which researchers can run their analysis. In the second, as seen in the OpenSAFELY platform, data remains in the original database held by the public organisation, with researchers seeing only the results of their analysis returned.

Examples of data sharing agreements with private companies

While at present many of the data centres only allow accredited academic researchers to access and use the data, there is scope for the same pathways to be used to facilitate private sector access to personal data held within the public sector. For example, DataLoch is currently developing governance to allow researchers from third-sector and private sector organisations to access data extracts from August 2022 (DataLoch, n.d.). This would follow a similar process to that currently adopted for academic researchers, with an assessment made of the legitimacy of each request and of its benefits to patients or the wider public interest.

Another example is the early-access release of education data to accredited education analytics suppliers by the Department for Education. Here, data from the National Pupil Database is shared with six accredited suppliers, including charitable and private companies, to allow them to provide data services for schools and local authorities (UK Department for Education 2022).

At a larger scale, in 2016 DeepMind Health formed a partnership with Moorfields Eye Hospital to access a dataset of one million eye scans and related health data, including clinical diagnosis, treatment, model of the eye scanning machine, and patient age (Moorfields Eye Hospital, n.d.). The project aimed to investigate how machine learning could be used to analyse patient eye scans to help improve early detection of eye disease (DeepMind 2016). Early results from the study were published in 2018 (Fauw 2018), demonstrating that the AI technology could match the accuracy of clinical ophthalmologists after being trained on 14,884 eye scans, and that it could be applied to images from a range of eye scanners. It is important to note that the project involved the use of de-personalised data, and so consent from individuals was not necessary.

Outside of health data, the National Data Analytics Solution (NDAS) project provides an example of crime and justice data sharing (The Alan Turing Institute 2017). This project is led by West Midlands Police on behalf of the Home Office to develop a new scalable data analytics capability that would be owned by UK law enforcement agencies. The project aims to apply the resulting analytics capability to explore issues of modern slavery, including prevention, detection, prosecution, and the safeguarding of victims. In addition, the data analytics developed may be applied to other crime and policing issues such as serious violence, organized crime, firearms, domestic abuse, and demand and resourcing. To complete this, West Midlands Police are working with Accenture, who are the data processor for the project. Personal data is shared with the NDAS in accordance with the law enforcement purposes of the Data Protection Act. However, while the private sector is involved in the data processing, the extent to which data is shared with Accenture is unclear.

At present, there are very few examples of personal data held in the public sector being shared with the private sector. Where this has been done, or is being proposed, the pathway to enable it is the same as that currently used for giving access to researchers. That is to say, a public interest condition is first identified and then a data sharing agreement is drawn up between the parties involved. As with researcher access, these proposals will often be approved by a nominated panel who assess the application prior to data being shared. This helps to ensure that the five safes as described by the ONS (safe people, safe projects, safe settings, safe outputs and safe data) are met (UK Office for National Statistics, n.d.a).

Pathway 2: Extra Legislation Surrounding Data Sharing

Given the legal requirements outlined in the GDPR and the DPA 2018, extra legislation is often needed to further facilitate or restrict personal data sharing. For instance, the Serious Crime Act 2007 permits the disclosure of personal data and sensitive data by public authorities to specified anti-fraud organisations for the prevention of fraud (UK Office for National Statistics 2015). Nonetheless, the data sharing must still be in line with the DPA, meaning that considerations must be made over the fairness, transparency, accuracy and security of the data sharing. Furthermore, many of the existing data shares that operate under this legislation are governed by data sharing agreements, so the underlying framework remains broadly similar to that described in the previous section.

In contrast, the Commissioners for Revenue and Customs Act 2005 limits data sharing by HMRC to others (UK Public General Acts 2005). Under this act, data can only be shared outside of HMRC for a restricted set of reasons, such as the fulfilment of HMRC's functions, legal compliance, public interest or where the individual has given their consent (UK HM Revenue & Customs, n.d.). This sharing is further amended by the Digital Economy Act 2017, which permits, under certain conditions, the sharing of non-identifying information (UK Department for Digital, Culture, Media & Sport 2016). Along with facilitating the sharing of limited HMRC data, the Digital Economy Act also sets out conditions for sharing public data for research purposes (UK Public General Acts 2017). However, this still maintains the requirement for compliance with the DPA.

Outside of the UK, in Finland the Act on the Secondary Use of Health and Social Data (Finland Government 2019) provides a separate legal framework for the reuse of health and social data. This act allows the reuse of data for:

  • scientific research
  • statistics
  • development and innovation activities
  • steering and supervision of authorities
  • planning and reporting duties of authorities
  • teaching
  • knowledge management

The act has also amended the pathway for the sharing of personal data held by the public sector (Finland Ministry of Social Affairs and Health, n.d.). As described in the previous section, in the current UK framework data sharing agreements must be made with each data controller, creating an administrative burden across multiple organisations. In the new Finnish framework, a separate permit authority, Findata, will be set up to provide a centralized system for the issuing of data requests and permits. This will allow those who wish to use data from several different bodies, or from Finnish health and social care records, to make a single application. Those seeking to access datasets will also be required to submit a separate data utilisation plan. Findata will also provide and manage a secure environment through which the data can be accessed (Deloitte 2020). Where the data requested is individual-level data, it will be anonymized or pseudonymized. According to our interview with Findata, external organisations apply to Findata, which contacts the relevant data controllers for the datasets; the data controllers securely send the filtered data to Findata, which links and combines the multiple datasets and then provides a secure link to the applicant. The value of the service is rooted in the curating, linking and combining of the data, not in the raw data itself, as the data does not leave Findata.

This Finnish framework is still in the relatively early stages, with permits so far issued only for the secondary use of healthcare data (Zajc 2021). However, it demonstrates an alternative, second pathway for the sharing of personal data, whereby extra legislation is adopted to enable the sharing of personal data held in the public sector. It is worth noting that in Finland, according to our interview with Findata, citizens must opt out of their data being used rather than opt in, so that by default citizen data can be used in this manner.

Pathway 3: Artificial Intelligence (AI) and Data Sharing

The application of AI is creating new demands for larger-scale data sharing, and as such is creating demand for an alternative, third pathway to data sharing. As stated in Norway's AI strategy document, 'access to high-quality datasets is essential for exploiting the potential of AI' (Norwegian Ministry of Local Government and Modernisation 2020, p. 6). However, there are currently very few mechanisms to allow this, given the constraints outlined in earlier sections. The UK AI strategy summarises the challenge:

"Some of the most valuable data – in terms of its potential for enabling innovation, improving services or realising public sector savings – cannot be made open because it contains nationally critical, personal or commercially sensitive information. This includes data which could be used to identify individuals. Organisations looking to access or share data can often face a range of barriers, from trust and cultural concerns to practical and legal obstacles. To address these issues, we are working with industry to pioneer mechanisms for data sharing such as Data Trusts." (UK Department for Business, Energy & Industrial Strategy 2019, n.p.).

Despite the interest in this area, much of the legislation governing the use and application of AI is still in the early stages (European Commission 2022). Where AI technology has been used for the analysis of big data in healthcare, data has either been accessed through an individual data sharing agreement (as in DeepMind's access to eye scan data) or on the basis of consent from the individuals whose data is being accessed. For example, UK Biobank[7] is a large-scale database containing genetic and health information from 500,000 consenting UK participants. The data is anonymized, enabling researchers from around the world to access it.

Another example of big data access can be found in the Sentinel system, led by the US Food and Drug Administration (FDA) (Sentinel, n.d.). Like the OpenSAFELY system in the UK, Sentinel enables the analysis of data that remains in situ, with queries sent to partner organisations, which can opt in to return query results. Data partners thus keep control over their data. As with UK Biobank, directly identifiable patient data is not shared.
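The distributed-query pattern behind Sentinel can be illustrated schematically: a coordinator broadcasts a query, each data partner answers it locally (or declines), and only aggregate counts travel back. The class, field and function names below are hypothetical, intended only to show how custodians retain control of their records.

```python
class DataPartner:
    """A data custodian that holds its records locally and may opt out."""

    def __init__(self, name, records, participates=True):
        self.name = name
        self._records = records        # row-level data never leaves this object
        self.participates = participates

    def answer(self, query):
        """Run the query locally and return only an aggregate count."""
        if not self.participates:
            return None                # partner opted out of this query
        return sum(1 for r in self._records if query(r))


def distribute_query(partners, query):
    """Coordinator sends the query out and pools the aggregate answers."""
    return {p.name: p.answer(query) for p in partners}


partners = [
    DataPartner("site_a", [{"exposed": True}, {"exposed": False}]),
    DataPartner("site_b", [{"exposed": True}], participates=False),
]
counts = distribute_query(partners, lambda r: r["exposed"])
# counts == {"site_a": 1, "site_b": None}
```

The design choice this sketch captures is that the coordinator only ever sees `answer()` results; the `_records` of a partner, including one that opts out, are never transmitted.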

Finally, the Harmony Alliance operates a public-private partnership for the analysis of data for blood cancer research (Harmony Alliance, n.d.). This data is gathered from pharmaceutical companies, biobanks, hospitals, and interventional and non-interventional trials. In interviews, the Harmony Alliance described its approach as a 'de-facto anonymisation process', whereby any data provider submitting to the data lake must exclude all personal data identifiers (names, addresses, IDs, etc.). The data is passed to a third party for processing and standardising, then harmonised and introduced onto the platform. Both public and private organisations submit data to the data sharing platform.

While many current AI projects that use personal data do so using consent-based frameworks, new draft legislation from the EU may provide scope for an alternative model. The proposed AI Act includes details for the development of AI regulatory sandboxes, which would provide:

a controlled environment that facilitates the development, testing and validation of innovative AI systems for a limited time before their placement on the market or putting into service pursuant to a specific plan. This shall take place under the direct supervision and guidance by the competent authorities with a view to ensuring compliance with the requirements of this Regulation and, where relevant, other Union and Member States legislation supervised within the sandbox (European Commission 2021, p. 69, Article 53: 1).

The proposed act provides an article explicitly detailing terms for the re-use of personal data in the sandbox. Article 54 states that:

"the innovative AI systems shall be developed for safeguarding substantial public interest in one or more of the following areas:

(i) The prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties, including the safeguarding against and the prevention of threats to public security, under the control and responsibility of the competent authorities. The processing shall be based on Member State or Union law;

(ii) Public safety and public health, including disease prevention, control and treatment;

(iii) A high level of protection and improvement of the quality of the environment" (ibid., p. 70, Article 54: 1).

The article continues by specifying terms of use, including processes for storage, processing logs, data retention and erasure. It is emphasized that personal data should be re-used only where other data types (anonymized, synthetic, or other non-personal data) would not be suitable. At the time of writing, the AI Act is still under discussion by European member states; nonetheless, it represents a potential separate pathway for the re-use of personal data in AI applications.
