Cohort
Instructions
What is a cohort
The RDMP defines a cohort as a list of identifiers that uniquely identify subjects and can be used for dataset linkage, anonymisation and extraction. This chapter describes how you arrive at your final project cohort identifier list and how RDMP supports this activity.
Cohort identification is one of the most complicated parts of meeting projects’ extraction needs. It is also very sensitive and has considerable risk potential (for example incorrectly contacting patients about conditions they don’t actually have).
Cohort Lifecycle
Cohorts are central to the extraction functionality provided by RDMP. This means that while Cohorts can be considered a piece of functionality in their own right, they are tightly integrated with the following satellite functionality.
Each Cohort starts life either as an ‘External Identifier List’ or a ‘Cohort Identification Configuration’. An ‘External Identifier List’ is the easy use case, where the researcher knows exactly which private identifiers they want data extracted for (e.g. from a patient case note review). More often, however, you have a set of ‘Identification Criteria’ (e.g. ‘all patients with Type 2 diabetes currently living in Tayside on drug X’). Identification Criteria are built in RDMP in a ‘Cohort Identification Configuration’ (See Cohort Identification Criteria) by dragging and dropping curated Filters and Datasets and arranging them in SET operation containers (e.g. inclusion / exclusion criteria). Each part of this configuration can be tested individually (See Figure 41) to ensure the configuration accurately reflects the researcher’s requirements.
Once you have finalised your ‘Cohort Identification Configuration’ it is executed on the live data repository and the resulting identifiers are treated exactly like an ‘External Identifier List’. At this point the Cohort Identification Configuration can optionally be frozen so that it cannot be edited or executed again (although it can be cloned). The primary reason to freeze the configuration is to preserve reproducibility: although the final identifier list is snapshotted and versioned, you might need to go back and debug your configuration for errors or refinements, and you don’t want it to have been modified since the project extract was done.
Once you have an ‘Identifier List’ (either from a Cohort Identification Configuration or directly from an external source) you can import the private identifiers into the Cohort Database. This database contains all the private and project-specific release identifiers for the patients whose data you will extract in your Project Extraction. The identifiers are stored as a static immutable list so that you can always reproduce a Project Extract exactly, even years later. If you identify a problem with a cohort list, the preferred method of refinement after data extraction is to commit a new version of the cohort (V2, V3 etc.).
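The ‘static immutable list’ idea above can be sketched in a few lines. This is only an illustration of the versioning behaviour; the real Cohort Database in RDMP has its own schema, and the store, function and identifier names here are assumptions made for the sketch.

```python
# Illustrative sketch of immutable, versioned cohort lists; not RDMP's
# actual Cohort Database schema.
cohort_store = {}  # (project, version) -> frozen tuple of private identifiers

def commit_cohort(project, identifiers):
    """Save a new, immutable version of a project's identifier list."""
    version = 1 + max((v for (p, v) in cohort_store if p == project), default=0)
    cohort_store[(project, version)] = tuple(sorted(identifiers))  # snapshot
    return version

v1 = commit_cohort("PRJ-001", ["CHI100", "CHI200", "CHI300"])
# A refinement after extraction becomes V2; V1 is never edited, so the
# original Project Extract stays reproducible.
v2 = commit_cohort("PRJ-001", ["CHI100", "CHI300"])
```

The key design point mirrored here is that refinement never mutates an existing version: a corrected list is committed as a new version alongside the old one.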
The RDMP is designed to allow maximum flexibility in how you allocate your release identifiers and how you treat your cohorts (See Functionality – Release Identifier Allocation). The Cohort Sources RDMP offers out of the box generate GUIDs, but you can modify them or point RDMP directly at your own custom database. This means that you can manually delete identifier lists or assign extraction identifiers yourself if you choose (for example if there is a governance problem around holding an incorrect identifier list).
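The GUID allocation performed by the out-of-the-box Cohort Sources amounts to mapping each private identifier to a freshly generated opaque identifier. A minimal sketch, assuming hypothetical function and identifier names (only the use of GUIDs comes from the text above):

```python
import uuid

def allocate_release_identifiers(private_ids):
    """Map each private identifier to a new opaque release identifier (GUID)."""
    return {pid: str(uuid.uuid4()) for pid in private_ids}

# Extracted datasets carry only the GUID; the private->release mapping
# stays in the cohort database so linkage remains possible.
mapping = allocate_release_identifiers(["CHI100", "CHI200"])
```

Because the GUIDs carry no information about the subject, an extracted dataset can be released without exposing private identifiers, while the stored mapping still allows the extract to be linked back if governance permits.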
Cohort Identification Criteria
Background
RDMP supports cohort identification by dividing complex identification requirements into small, self-contained, testable sets. This reduces ambiguity and supports transparency. Consider a simple cohort query as might be requested by a researcher:
“I want all patients who have been prescribed Diazepam (1) for the first time after 2000 (2) and who are still alive today (3)”
We begin by identifying each numbered set. These sets are combined to produce the final distinct patient list. Criterion 3 (‘still alive today’) will be based on the demography dataset, while the other two will be based on the prescribing dataset.
Each set is built by selecting filters from the Data Catalogue. Let us assume there are some useful filters already set up in the Catalogue:
Dataset | Available Filter | Implementation |
Prescribing | Drug name = @X | name like '%DIAZEPAM%' |
Prescribing | Prescribed before @date | prescribed_date < '2000-01-01' |
Demography | Patient is Dead | date_of_death is not null |
In this simple case we can use a single filter per set but in more complicated cases you might need to combine filters (e.g. people who died in Tayside).
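The three sets above can be combined with ordinary SQL set operations. The following is a minimal sketch using an in-memory SQLite database; the table and column names are illustrative assumptions, not RDMP’s actual schema, and the filter implementations are taken from the table above.

```python
import sqlite3

# Illustrative tables; names are assumptions for this sketch.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE prescribing (patient_id TEXT, name TEXT, prescribed_date TEXT);
CREATE TABLE demography  (patient_id TEXT, date_of_death TEXT);

INSERT INTO prescribing VALUES
  ('P1', 'DIAZEPAM 5MG', '2005-03-01'),  -- first prescribed after 2000, alive
  ('P2', 'DIAZEPAM 2MG', '1998-07-12'),  -- first prescribed before 2000
  ('P2', 'DIAZEPAM 2MG', '2003-01-20'),
  ('P3', 'DIAZEPAM 5MG', '2010-09-30'),  -- prescribed after 2000 but deceased
  ('P4', 'PARACETAMOL',  '2004-05-05');  -- never prescribed diazepam

INSERT INTO demography VALUES
  ('P1', NULL), ('P2', NULL), ('P3', '2015-01-01'), ('P4', NULL);
""")

# Set 1: ever prescribed diazepam.
# Set 2: prescribed diazepam before 2000 (excluding it leaves
#        'first prescribed after 2000').
# Set 3: dead patients (excluding it leaves 'still alive today').
cohort = con.execute("""
SELECT DISTINCT patient_id FROM prescribing
  WHERE name LIKE '%DIAZEPAM%'
EXCEPT
SELECT DISTINCT patient_id FROM prescribing
  WHERE name LIKE '%DIAZEPAM%' AND prescribed_date < '2000-01-01'
EXCEPT
SELECT patient_id FROM demography
  WHERE date_of_death IS NOT NULL
""").fetchall()
# Only P1 satisfies all three criteria.
```

Each `SELECT` is one of the numbered sets and can be run (and tested) on its own before being combined, which is exactly the workflow the SET operation containers support.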
The use of set theory gives us several advantages:
· Testability. It is easier to test that your SET of dead patient identifiers is correct than to try to test the entire configuration as a whole.
· Reusability. A well-tested SET can be reused later in different cohort generation tasks.
· Performance. Relational database engines are built on set theory, so set-based queries execute very efficiently.