An Introduction to Statistical Disclosure Control

Introduction

HIC is a health data ecosystem. On behalf of Data Controllers (often NHS health boards), we provide access to sensitive data in our Trusted Research Environment (TRE) once approvals are in place. We operate within the Five Safes Framework, which allows us to maintain trust and transparency in secure data management. We apply proportionate risk mitigation in our processes; for TRE Users this can be a learning curve, as you must follow HIC processes for Safe Outputs. This guide provides a brief introduction to Statistical Disclosure Control, which we will refer to as 'Disclosure Control' to keep the definition broad across varying research methodologies.

Background to the Five Safes Framework

The Five Safes Framework is a set of principles which provides a practical and comprehensive approach to handling sensitive data in a responsible and ethical manner, while still enabling research. Each principle is a different (but related) dimension that contributes to the overall safe use of data. Since its introduction by Felix Ritchie and colleagues in 2008, it has been adopted widely in the UK (ONS, NHS England, UK Statistics Authority, etc.) and internationally (Europe, New Zealand, Australia, Brazil, Singapore). It has been written into various legislation (e.g. the Digital Economy Act 2017) and provides a high-level approach to secure data management.

 

We can think of the Five Safes as an equalizer, considering how each principle is balanced individually and as a whole.

  1. Safe People can access the TRE after training, approvals, and agreements are in place. Users are bound by contracts and professional obligations of confidentiality, ensuring that they keep the data secure and private.

  2. Safe Projects are reviewed for potential patient and public benefit; this ensures that data is used for legitimate and beneficial purposes.

  3. Safe Settings (our TRE) is used to access the data on secure technology systems. This provides a secure and controlled space where the data is accessed and analysed.

  4. Safe Data is provided as pseudonymised personal data to protect privacy. In addition to pseudonymisation, data minimisation and impact assessments are also used to mitigate risk. This protects individuals' privacy when researchers access the data extract.

  5. Safe Outputs are ensured because only summary data can be taken out of the TRE, after disclosure control confirms that no individual-level data is released. Results are reviewed by HIC to ensure the information is truly anonymous and there is no risk of re-identifying people from the findings, further protecting privacy.

The Five Safes represented as an equalizer: each of the five principles (people, projects, settings, data, outputs) is shown with a bar in traffic-light colour coding to indicate its risk sensitivity, illustrating that all Five Safes can be balanced individually and as a whole.

Balancing the Five Safes

Understanding the Five Safes as a balance is key: the principles are not targets set individually, but are weighed against each other. In the image above, Safe People is moderate risk, as we expect approved individuals to handle the sensitive information responsibly. Safe Projects should ensure that data are used only for approved purposes; note that HIC does not make this decision, and the appropriate information governance pathways and approvals vary between projects. Both Safe Settings and Safe Data are considered highly safe, as the use of HIC data (pseudonymised and curated by experts) in our TRE (a highly restricted environment) ensures confidential data is not openly available. Lastly, Safe Outputs are moderately safe, as any information taken out has undergone HIC processes.

The Five Safes help us balance the need for data access and analysis, with the imperative to protect individual privacy and confidentiality.


 Disclosure Control

Disclosure control is essential for safeguarding the confidentiality and privacy of sensitive data. It involves minimising the risk of disclosing sensitive information about the people contained within data extracts. Although we provide pseudonymised data, it can still be possible to identify individuals given the right context or knowledge. For example, consider a release request about a patient with an unusual lightning-shaped scar, of white ethnicity, from a school-aged sample in a small geographical region. The more variables available to combine, the easier re-identification becomes, and the greater the likelihood of Harry Potter making it into a data science example.
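To illustrate how combining quasi-identifiers shrinks the pool of matching individuals, here is a minimal sketch; the records, field names, and `matching` helper are hypothetical and invented purely for this example.

```python
# Hypothetical pseudonymised records; the fields and values are invented
# to mirror the Harry Potter example above.
records = [
    {"scar": "lightning", "ethnicity": "white", "age_band": "school", "region": "small"},
    {"scar": "none", "ethnicity": "white", "age_band": "school", "region": "small"},
    {"scar": "none", "ethnicity": "white", "age_band": "adult", "region": "small"},
]

def matching(records, **criteria):
    """Count records that match every supplied quasi-identifier."""
    return sum(all(r[key] == value for key, value in criteria.items()) for r in records)

print(matching(records, ethnicity="white"))                                       # 3
print(matching(records, ethnicity="white", age_band="school"))                    # 2
print(matching(records, ethnicity="white", age_band="school", scar="lightning"))  # 1
```

Each extra variable narrows the matching group; once it reaches one, the "pseudonymised" record points to a single person.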

We operate on a rules-based triage approach to disclosure control, and escalations are made as necessary for more complex output requests. That is, our TRE Support Team start with a simple approach for checking any disclosure risks, and cases are escalated to other technical teams for further scrutiny as required. We will work with you to understand why output can/cannot be released.


HIC Considerations

When finalising your output, know that HIC will check for:

  • Small counts/frequencies: a minimum cell threshold of 5 is expected; solutions include:

    • suppressing (or redacting) small numbers to <5

    • rounding values

    • redesigning outputs so that cells are combined, for example reducing SIMD decile to SIMD quintile.

  • Single figures: these are individual-level data that relate specifically to one person. We will check for this in tables and reports, but also in graphs, where we will assess the risk of any outliers or single points being identifiable.

  • Artificial Intelligence (AI)/Machine Learning (ML) methodologies have a different process, and we continue to develop the criteria for disclosure control of these models, including:

    • AI/ML triage form

    • Model Attack Report

    • Model Risk Assessment questionnaire

    • Other information governance paperwork such as the Data Protection Impact Assessment (DPIA), any contractual agreements (where an external 3rd party is involved), and the ethics and transparency of the TRE user making the request.
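The small-count mitigations listed above (suppression, rounding) can be sketched in code. This is a minimal illustration under the threshold of 5 stated in the guidance; the frequency table and function names are hypothetical, not an HIC tool.

```python
THRESHOLD = 5  # minimum cell threshold stated in the guidance above

def suppress_small_counts(table, threshold=THRESHOLD):
    """Redact any cell count below the threshold, displaying it as e.g. '<5'."""
    return {
        category: (count if count >= threshold else f"<{threshold}")
        for category, count in table.items()
    }

def round_counts(table, base=5):
    """Alternative mitigation: round each count to the nearest multiple of `base`."""
    return {category: base * round(count / base) for category, count in table.items()}

# Hypothetical frequency table of patients by SIMD quintile
counts = {"SIMD 1": 42, "SIMD 2": 3, "SIMD 3": 17, "SIMD 4": 1, "SIMD 5": 28}

print(suppress_small_counts(counts))  # SIMD 2 and SIMD 4 become '<5'
print(round_counts(counts))           # all counts rounded to multiples of 5
```

Note that suppression alone may not be enough if row or column totals are also released, since a single suppressed cell can be recovered by subtraction; that is one reason outputs may instead be redesigned with combined cells.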

 

In general, good disclosure control is consistent with good statistics: many observations, no influential outliers, and well-behaved distributions.


Other resources

 

This article used the developing UK TRE glossary, which includes the DARE UK Drive Projects (2023) and the DataMind glossary at the HDR-UK Hub (both openly available), and the Statistical Disclosure Handbook.
