An Introduction to Statistical Disclosure Control

Introduction

HIC is a health data ecosystem. On behalf of Data Controllers (often NHS health boards), we provide access to sensitive data in our Trusted Research Environment (TRE) once approvals are in place. We operate within the Five Safes Framework, which allows us to maintain trust and transparency in secure data management. We apply proportionate risk mitigation in our processes; for TRE Users this can be a learning curve, as you must follow HIC processes for Safe Outputs. This guide provides a brief introduction to Statistical Disclosure Control, which we will refer to as 'Disclosure Control' to keep the definition broad across varying research methodologies.

Background to the Five Safes Framework

The Five Safes Framework is a set of principles which provides a practical and comprehensive approach to handling sensitive data in a responsible and ethical manner, while still enabling research. Each principle is a different (but related) dimension that contributes to the overall safe use of data. Since its introduction by Felix Ritchie and colleagues in 2008, it has been adopted widely in the UK (ONS, NHS England, UK Statistics Authority, etc.) and internationally (Europe, New Zealand, Australia, Brazil, Singapore). It has been written into various legislation (e.g. the Digital Economy Act 2017) and provides a high-level approach to secure data management.

 

We can think of the Five Safes as an equalizer, considering how each principle is balanced individually and as a whole.

  1. Safe People can access the TRE after training, approvals, and agreements are in place. Users are bound by contracts and professional obligations of confidentiality, ensuring that they keep the data secure and private.

  2. Safe Projects are reviewed for potential patient and public benefit; this ensures that data is used for legitimate and beneficial purposes.

  3. Safe Settings (our TRE) is used to access the data on secure technology systems. This provides a secure and controlled space where the data is accessed and analysed.

  4. Safe Data is provided as pseudonymised personal data to protect privacy. In addition to pseudonymisation, data minimisation and impact assessments are also used to mitigate risk. This protects individuals' privacy when researchers access the data extract.

  5. Safe Outputs are ensured because only summary data can be taken out of the TRE, after disclosure control confirms that no individual-level data is released. Results are reviewed by HIC to ensure the information is truly anonymous and there is no risk of re-identifying people from the findings, further protecting privacy.

The Five Safes represented as an equalizer: each of the five principles (people, projects, settings, data, outputs) is shown with a bar in traffic-light colour coding to indicate its risk sensitivity, illustrating that all Five Safes can be balanced individually and as a whole.

Balancing the Five Safes

Understanding the Five Safes as a balance is key: the principles are not targets set individually, but are weighed against each other. In the image above, Safe People is moderate risk, as we expect approved individuals to handle the sensitive information responsibly. Safe Projects should ensure that data are used only for approved purposes; note that HIC does not make this decision, and the appropriate information governance pathways and approvals vary between projects. Both Safe Settings and Safe Data are considered highly safe, as the use of HIC data (pseudonymised and curated by experts) in our TRE (a highly restricted environment) ensures confidential data is not openly available. Lastly, Safe Outputs are moderately safe, as any information taken out has undergone HIC processes.

The Five Safes help us balance the need for data access and analysis, with the imperative to protect individual privacy and confidentiality.


 Disclosure Control

Disclosure control is essential for safeguarding the confidentiality and privacy of sensitive data. It involves minimising the risk of disclosing sensitive information about the people contained within data extracts. Although we provide pseudonymised data, it can still be possible to identify individuals given the right context or knowledge. For example, consider a release request about a patient with an unusual lightning-shaped scar, of white ethnicity, from a school-aged sample in a small geographical region. The more variables available to combine, the easier re-identification becomes, and the greater the likelihood of Harry Potter making it into a data science example.
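To illustrate how combining quasi-identifiers shrinks the pool of matching individuals, here is a minimal sketch; the records, field names, and `matching` helper are hypothetical and invented purely for this example.

```python
# Hypothetical pseudonymised records; the fields and values are invented
# to mirror the Harry Potter example above.
records = [
    {"scar": "lightning", "ethnicity": "white", "age_band": "school", "region": "small"},
    {"scar": "none", "ethnicity": "white", "age_band": "school", "region": "small"},
    {"scar": "none", "ethnicity": "white", "age_band": "adult", "region": "small"},
]

def matching(records, **criteria):
    """Count records that match every supplied quasi-identifier."""
    return sum(all(r[key] == value for key, value in criteria.items()) for r in records)

print(matching(records, ethnicity="white"))                                       # 3
print(matching(records, ethnicity="white", age_band="school"))                    # 2
print(matching(records, ethnicity="white", age_band="school", scar="lightning"))  # 1
```

Each extra variable narrows the matching group; once it reaches one, the "pseudonymised" record points to a single person.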

We operate on a rules-based triage approach to disclosure control, and escalations are made as necessary for more complex output requests. That is, our TRE Support Team start with a simple approach for checking any disclosure risks, and cases are escalated to other technical teams for further scrutiny as required. We will work with you to understand why output can/cannot be released.


HIC Considerations

When finalising your output, know that HIC will check for:

  • Small counts/frequencies: a minimum cell threshold of 5 is expected; solutions include:

    • suppressing (or redacting) small numbers to <5

    • rounding values

    • redesigning outputs so that cells are combined, for example reducing SIMD decile to SIMD quintile.

  • Single figures: these are individual-level data that relate specifically to one person. We will check for this in tables and reports, but also in graphs, where we will assess the risk of any outliers or single points being identifiable.

  • Artificial Intelligence (AI)/Machine Learning (ML) methodologies have a different process, and we continue to develop the criteria for disclosure control of these models, including:

    • AI/ML triage form

    • Model Attack Report

    • Model Risk Assessment questionnaire

    • Other information governance paperwork such as the Data Protection Impact Assessment (DPIA), any contractual agreements (where an external 3rd party is involved), and the ethics and transparency of the TRE user making the request.
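The small-count mitigations listed above (suppression, rounding) can be sketched in code. This is a minimal illustration under the threshold of 5 stated in the guidance; the frequency table and function names are hypothetical, not an HIC tool.

```python
THRESHOLD = 5  # minimum cell threshold stated in the guidance above

def suppress_small_counts(table, threshold=THRESHOLD):
    """Redact any cell count below the threshold, displaying it as e.g. '<5'."""
    return {
        category: (count if count >= threshold else f"<{threshold}")
        for category, count in table.items()
    }

def round_counts(table, base=5):
    """Alternative mitigation: round each count to the nearest multiple of `base`."""
    return {category: base * round(count / base) for category, count in table.items()}

# Hypothetical frequency table of patients by SIMD quintile
counts = {"SIMD 1": 42, "SIMD 2": 3, "SIMD 3": 17, "SIMD 4": 1, "SIMD 5": 28}

print(suppress_small_counts(counts))  # SIMD 2 and SIMD 4 become '<5'
print(round_counts(counts))           # all counts rounded to multiples of 5
```

Note that suppression alone may not be enough if row or column totals are also released, since a single suppressed cell can be recovered by subtraction; that is one reason outputs may instead be redesigned with combined cells.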

 

In general, good disclosure control is consistent with good statistics: many observations, no influential outliers, and well-behaved distributions.


Other resources

 

This article used the developing UK TRE glossary, which includes the DARE UK Drive Projects (2023) and the DataMind glossary at the HDR-UK Hub (both openly available), and the Statistical Disclosure Handbook.
