/
Configuring the load

Configuring the load

All loads in RDMP are composed of 1 or more ProcessTask. Each ProcessTask operates on a specific stage of the load pipeline and does a single activity (de-duplicate records, load files from CSV etc).

There are 5 categories of ProcessTask

Category

Purpose

Category

Purpose

Executable

Runs a specified .exe file

SQLFile

Runs a specified block of SQL on it's associated load stage database (e.g. RAW)

DataProvider

Moves files around e.g. downloading from an FTP server

Attacher

Loads records into the RAW database e.g. from CSV or remote table

MutilateDataTable

Modifies table data in RAW/STAGING to resolve specific issues (primary key collisions etc)

The simplest load is one that simply has a CSV attacher. Add the 'AnySeparatorFileAttacher' to the 'Mounting' stage of the load:

Command Line: This can be done from the CLI using:

./rdmp CreateNewClassBasedProcessTask LoadMetadata:*Biochemistry Mounting AnySeparatorFileAttacher

Set the following properties:

Property

Value

Property

Value

Separator

,

FilePattern

*.csv

TableName

Biochemistry

Command Line: This can be done from the CLI using:

./rdmp SetArgument "ProcessTask:AnySeparatorFileAttacher" Separator "," ./rdmp SetArgument "ProcessTask:AnySeparatorFileAttacher" FilePattern "*.csv" ./rdmp SetArgument "ProcessTask:AnySeparatorFileAttacher" TableName "Biochemistry"

Running the load

Place the Biochemistry.csv file generated by BadMedicine into the ForLoading directory of your load (e.g. C:\temp\biochem\Data\ForLoading).

Right click the [LoadMetadata] and select 'Check and Execute'

Command Line: This can be done from the CLI using:

./rdmp.exe dle -l "LoadMetadata:*Biochemistry" --command check

If the checks propose changes such as creating the archive trigger you should accept them. If you are unsure if it worked you can run the checks again.

When running from CLI check fixes are automatically accepted.

Run the data load engine by pressing the green go button.

Command Line: This can be done from the CLI using:

./rdmp.exe dle -l "LoadMetadata:*Biochemistry" --command run

If your load fails and you are able to fix the problem then you must run the checks again before trying to execute the load again (this also applies to CLI use).

Failed loads are deliberately left in the state in which they crash so that the RAW/STAGING databases can be inspected to assist in diagnosis.

Resizing a column

One common issue you may encounter is when new data is wider than the existing schema. If you encounter this you can resize the column in RDMP. You can use Ctrl+F to quickly find a column by name and jump to it in the tree collection.

Command Line: This can be done from the CLI using:

./rdmp.exe AlterColumnType ColumnInfo:*QuantityUnit* "varchar(20)"

Archiving

After a successful load all files in 'ForLoading' are zipped up and put into 'ForArchiving'. The ID of the load is the file name of the zip. This allows you to track an entry in the load audit logs to the file that was loaded at the time.

Data Load Engine Logging

All messages created during the load are stored in the hierarchical logging database along with counts of the number of

Command Line: This can be done from the CLI using:

./rdmp.exe ViewLogs LoadMetadata:*Biochemistry

Cohort Builder

Cohorts are lists of identifiers which uniquely identify subjects and can be used to for dataset linkage, anonymisation and extraction.

Cohort identification is often the most complicated parts of meeting projects' extraction needs. It is a sensitive task with considerable risk potential (for example incorrectly contacting patients about conditions they don't actually have).

There are 2 stages to creating a cohort before it can be used to generate an extraction.

Stage

Purpose

Stage

Purpose

Identify Cohort

Build a query or locate a file that contains unique person identifiers

Commit Cohort

Create a permenant record of the results of the query / contents of the file and allocate any anonymous mappings

Identify Cohort

Each Cohort starts life either as a list of identifiers in a file or a CohortIdentificationConfiguration (Query Builder).

If identifiers are in a file then they can be directly committed into the Saved Cohorts store. If additional restrictions must be placed on the identifier list (e.g. removing Opt outs, people with specific conditions) then the identifiers should be imported as a new Catalogue and a CohortIdentificationConfiguration should be built using it. If an identifier list is imported as a Catalogue you can mark as 'Project Specific' which will tie it to a single Project and ensure it is not mistakenly used in other projects/extracts.

If you need to assemble a query to identify the cohort then you will need to create a CohortIdentificationConfiguration in the Cohort Builder.

Command Line: This can be done from the CLI using:

./rdmp CreateNewCohortIdentificationConfiguration "My First Cohort"

Add the Biochemistry dataset to the 'Inclusion Container'.

Command Line: This can be done from the CLI using:

./rdmp AddCatalogueToCohortIdentificationSetContainer CohortAggregateContainer:6 Catalogue:Biochemistry chi

The above query uses the ID of the container (6). We cannot use the name (unless running in a script) because there are likely to be multiple containers that share that name

You can disable the red 'problem' indicators by disabling them in user settings. This is worth doing as they will prevent the cohort from being committed.

Command Line: This can be done from the CLI using:

./rdmp SetUserSetting StrictValidationForCohortBuilderContainers false

Add a filter to the dataset. Either an existing one (if you have any Catalogue level Filters) or a new empty one (if empty provide some WHERE sql).

Command Line: This can be done from the CLI using:

./rdmp CreateNewFilter "AggregateConfiguration:People in Biochemistry" null "Creatinine" "TestCode like '%CRE%'"

Execute the cohort. You can access various view options from the right click context menu:

Command Line: This can be done from the CLI using:

./rdmp ViewData "AggregateConfiguration:People in Biochemistry" TOP_100 ./rdmp ViewData "AggregateConfiguration:People in Biochemistry" All

Cohort Containers

New CohortIdentificationConfigurations are created with 3 containers:

  • A Root Container

    • Inclusion Criteria

    • Exclusion Criteria

The first container is an EXCEPT while the other two are UNION. For more information on how these work see the Cohort Builder FAQ or Cohort Creation.

Commit Cohort

Cohort Committing is the process of 'finalising' a list of identifiers. This is important for reproducibility and anonymisation.

To commit a cohort you will need

Saved Cohorts

Identifier lists are shown in the 'Saved Cohorts' collection accessible from the top toolbar of RDMP

Command Line: This can be done from the CLI using:

./rdmp describe ExtractableCohort

Assuming you created example datasets when installing RDMP you should see the ExternalCohortTable RDMP_ExampleData. If you open this store you can see that it stores lists where the 'private identifier' is chi and the anonymous identifier is ReleaseId.

Command Line: This can be done from the CLI using:

./rdmp describe ExternalCohortTable:RDMP_ExampleData

Each identifier format that you use for linkage requires its own ExternalCohortTable. For example if you perform cohort building / linkage with both NHSNumber and SocialSecurityNumber then you would need 2 ExternalCohortTable.

You can create a new ExternalCohortTable with the wizard

Command Line: This can be done from the CLI using:

./rdmp CreateNewCohortStore "DatabaseType:MicrosoftSQLServer:Name:RDMP_ExampleCohorts2:Server=(localdb)\MSSQLLocalDB;Integrated Security=true" false chi "varchar(10)"

Cohort Pipeline

To put the identifiers into the ExternalCohortTable you will need to run a 'Cohort Creation' Pipeline.

A list of all Pipelines can be seen in the in the 'Tables (Advanced)' collection:

Command Line: This can be done from the CLI using:

./rdmp ls Pipeline

The easiest place to commit a cohort from is the 'Project' collection New button:

Choose your CohortIdentificationConfiguration and the 'RDMP_ExampleData' destination.

When prompted to choose a Project pick the existing example Project 'Lung Cancer Project'. Select 'New Cohort' and give it a name.

Preview and then Run the Pipeline. Ensure that you have the correct Pipeline for the task (e.g. if uploading a file, use the 'From CSV File').

If you get an error and are using a CohortIdentificationConfiguration, open it in Cohort Builder and ensure that it runs correctly there and that there are no lines highlighted red. You may need to disable StrictValidationForCohortBuilderContainers (see Identify Cohorts).

Command Line: This can be done from the CLI using:

./rdmp CreateNewCohortByExecutingACohortIdentificationConfiguration "CohortIdentificationConfiguration:my new cic" ExternalCohortTable:RDMP_ExampleData "My test cohort" null "Pipeline:CREATE COHORT*Cohort Identification Configuration"

After the pipeline completes you should see your new cohort under 'Saved Cohorts'.

Release Identifier Allocation

If you view the data in an ExtractableCohort you can see two columns. The first will be whatever column in your dataset you use for linkage (e.g. chi) and the second will be the anonymous representation (e.g. 'ReleaseId').

Command Line: This can be done from the CLI using:

./rdmp ViewData ExtractableCohort:1 ALL

You can tailor how these identifiers are allocated in several ways. For example

Do Not Allocate Release Identifiers

If your data is already anonymised or is not sensitive then you may not want seperate release identifiers at all. Alternatively you may have a seperate process for allocating identifiers (e.g. a find/replace script).

In this case you can disable release identifier allocation by editing the ExternalCohortTable such that the 'Private Identifier Field' is the same as the 'Release Identifier Field':

You will also need to update your User Settings to allow for this:

Finally you will need to update the schema of the Cohort table to allow nulls. The code will vary but this is Sql Server code for the default example ExternalCohortTable created on install:

ALTER TABLE RDMP_ExampleData..Cohort ALTER COLUMN ReleaseId varchar(300)

Command Line: This can be done from the CLI using:

./rdmp set ExternalCohortTable:RDMP_ExampleData ReleaseIdentifierField "[RDMP_ExampleData]..[Cohort].[chi]" ./rdmp SetUserSetting R004 Success

Note that UserSettings are not shared between CLI and Windows application

Allocate Release Identifiers Manually

You can leave identifiers blank in the [ExternalDatabaseServer] table by editing the Pipelines and changing the ReleaseIdentifierAllocation property on the destination component to NullAllocateReleaseIdentifiers.

As before you will need to ensure that the database allows nulls in the release identifier column (See above).

Allocate Release Identifiers With Plugin

RDMP supports both [Plugin] destination components and allocation methods. If you simply want to change what identifiers are allocated then define a new class implementing IAllocateReleaseIdentifiers.

If you have something more complicated in mind you can create a new class implementing IPluginCohortDestination.

In both cases you will need to package your code as a plugin, see Plugin Writing.

Pipelines

Pipelines provide flexibility when moving data from place to place (e.g. when creating a cohort or uploading a file as a new table.

A Pipeline consists of a source component generating data (e.g. reading a file) and a series of components operating on the data then a destination component (e.g. writing final data into a database).

In some situations the source and/or destination may be fixed by the use case. When this happens the Pipeline cannot specify it's own.

Pipelines are usually configured once and then reused for the task from then on. You may find that you define multiple Pipelines for the same task (e.g. File Import) that are configured slightly differently (e.g. one for comma separated and another for tab separated).

Each PipelineComponent has its own arguments that can be configured. For example the BULK INSERT Pipeline source has settings like Separator and ThrowOnEmptyFiles.

When you edit a Pipeline you are able to select and delete components as well as drag new components into the middle (or source/destination) to perform specific activities. For example if you need to replace values in the pipeline with a mapping you can use ColumnSwapper.

Defining your own custom components is fun and easy and covered in PluginWriting.md.

Extraction

Data export in RDMP allows you to link a cohort to one or more dataset(s). This will extract only those records in a dataset that belong to people in your cohort. This linkage is done through the user interface by creating a Project and one or more Catalogue. A Project can have one or more ExtractionConfiguration (e.g. cases and controls).

In order to be included in an ExtractionConfiguration a Catalogue must be marked extractable and have a column marked IsExtractionIdentifier. When added you can define Filters to further constrain what records are extracted (e.g. prescriptions for a specific drug group only).

RDMP preserves all ExtractionConfiguration seperate from Catalogue definitions. This ensures reproducibility of extracts even at a far later date (e.g. years).

ExtractionConfiguration can be frozen after release to prevent accidental changes. They can even be cloned for later reuse (e.g. in a data refresh after 6 months).

Configuring an Extraction

Once you have created a cohort and Project you are ready to create a new ExtractionConfiguration. This can be done from the right click context menu on any Project or the 'Extraction Configurations' node.

Provide a name for the ExtractionConfiguration then pick your cohort and initial Catalogues.

Command Line: This can be done from the CLI using:

./rdmp CreateNewExtractionConfigurationForProject "Project:Lung Cancer Project" "My test config" ./rdmp ChooseCohort "ExtractionConfiguration:My test config" "ExtractableCohort:MyCohortVersion1" ./rdmp AddDatasetsToConfiguration ExtractableDataset:Biochemistry "ExtractionConfiguration:My test config" ./rdmp AddDatasetsToConfiguration ExtractableDataset:Prescribing "ExtractionConfiguration:My test config"

Now check and execute the extraction:

Command Line: This can be done from the CLI using:

./rdmp extract -e "ExtractionConfiguration:My test config" -p "Pipeline:DATA EXPORT*CSV" --command check

./rdmp extract -e "ExtractionConfiguration:My test config" -p "Pipeline:DATA EXPORT*CSV" --command run

 

Check then execute an ExtractionConfiguration. Note that when specifying the Pipeline name we use the wild card '*'. In this case it is to avoid the ':' which cannot appear in name pattern on CLI

Verification

You can confirm that an extraction was succesful by opening the extraction location on disk or by running the Extraction SQL.

Command Line: This can be done from the CLI using:

./rdmp ViewExtractionSql "ExtractionConfiguration:My test config" Catalogue:Prescribing

Note that this will view both the SQL query and all the data returned

Adding filters

You can add or import Filters to reduce the scope of the extract. For example if a study only requires specific types of records (e.g. Diabetic Drug Prescriptions). The cohort ensures that only records for specific people are extracted. Filters further reduce that according to WHERE SQL in them.

Command Line: This can be done from the CLI using:

./rdmp CreateNewFilter "SelectedDatasets:22" null "Diabetic Drugs" "LEFT ( [FormattedBnfCode] , 3) = '6.1'"

Note that this will view both the SQL query and all the data returned

Scaleability and Adaptability

Since data repositories can contain hundreds of millions of records and cohorts can be equally large, RDMP data export is designed for scalability. Datasets are extraced in parallel and results are streamed in batches (to reduce RAM overhead or out of memory errors).

As seen above the CLI can be used to execute ExtractionConfigurations. This allows for a workflow where data analysts build and check the configuration but the running is then done by a powerful automation server overnight.

Extractions are executed through a Pipeline (See Pipelines chapter). This allows for extraction to database or any other destination for which a component exists. Plugin components can be written and added to extraction pipelines e.g. to perform further identifier substitutions (see Plugin Writing).

Release

An ExtractionConfiguration can have many Catalogues. These can be extracted at different times. Some may aquire modifications over time (adjusting filters etc). Some may fail to extract (e.g. bad SQL in filter) or have extracted files lost (e.g. disk backup restore).

The Release process of RDMP confirms that all expected extraction artifacts are in place and that the extraction audit matches the current live configuration of the ExtractionConfiguration (no changes have been made but not run).

Releases use the Pipeline system which allows for customization. For example a [Plugin] could enable releases to an SFTP server etc.

You can start a release of one or more ExtractionConfigurations via the Project context menu.

Command Line: This can be done from the CLI using:

./rdmp release -c "ExtractionConfiguration?Project:Lung Cancer Project" -p "Pipeline:RELEASE*To Directory" --command check --skipreleased ./rdmp release -c "ExtractionConfiguration?Project:Lung Cancer Project" -p "Pipeline:RELEASE*To Directory" --command run --skipreleased

The question mark after ExtractionConfiguration means match all where the Project matches the pattern ('Lung Cancer Project')

Related content

Creating a Data Load Configuration
Creating a Data Load Configuration
More like this
Create a Catalogue from file
Create a Catalogue from file
More like this
Breaking RDMP into its component parts
Breaking RDMP into its component parts
More like this
Create a Catalogue from table
Create a Catalogue from table
More like this