Introduction to SQL

Structured Query Language (SQL) is a language for interacting with relational databases, by means of joins, conditions, groups, and more. This allows manipulation and querying of large datasets in a fast and efficient manner. At HIC, we typically provide data releases in the form of an SQL database, but have found that a common way of interacting with this is to load all of the data into RAM and process it in the user’s language of choice. During the June 2024 HUG, a comparison was given between these two approaches, querying and linking via SQL vs. querying and linking via R. While the demonstration was only working with 100,000 rows per relation, the R script still took nearly 20 minutes, while the SQL, which produced the same result, completed in less than a second.

The demonstration showed how fast and efficient SQL can be, but didn’t go into any details on the querying language itself. This post aims to be a ‘getting started guide’, and will link to more advanced topics as they come up later. We’ll also focus on the kind of data HIC provide, and try to include examples which are relevant to as many TRE users as possible.

How to Play Along

Whether your TRE workspace is Windows or Ubuntu there are software tools that you can use for SQL. If you have a Windows workspace, we recommend using Microsoft SQL Management Studio - this should be installed by default on all Windows TRE workspaces. For Ubuntu workspaces, Azure Data Studio should be installed.

If you can’t find SQL Management Studio or Azure Data Studio on your workspace, please contact HICSupport@dundee.ac.uk and we will install it for you.

 

We have a guide on how to connect to the TRE SQL Server, which covers both of these tools: How do I connect to my study's SQL database? Once you’ve logged in, you’ll be able to run some queries by selecting your database and clicking “New Query” from the menu. This will open an editor which you can type your query into. Towards the top, there will be a button to execute the query. A results pane (or error message, if the query has an issue) will be displayed below the editor.

Whilst we demonstrated this with synthetic data generated by HIC Data, focusing on a Prescribing dataset, you will have to edit table and variable names specific to your project dataset.

Returning Results with SELECT

The SELECT query will return the values, columns, or aggregates you request. To start, let’s return 100 entries from the Prescribing table.

SELECT TOP(100) * FROM Prescribing

After clicking Execute, the first 100 results will be returned from the Prescribing table. Note that while keywords are not case sensitive, the names of tables and columns are. You can try changing Prescribing to prescribing which will produce the error “Invalid object name prescribing”, but changing SELECT to select would work fine.

Returning all results (e.g removing TOP(100) from the query) is usually possible, but is often unlikely to facilitate your understanding and layout of the schema - instead, it’ll just take longer (sometimes ages!) to load the results. If you do this and it’s taking ages, click on the editor window and then click on the icon, which will stop the query.

We can also reduce the number of columns we want to return, since in most cases we won’t actually need everything. To do this, we simply replace the * with the names of the columns we want, separated by commas. For example, to get the prescribed date and quantity, we can change the above query to this:

SELECT TOP(100) prescribed_date, quantity FROM Prescribing

If you are loading many results into RAM, this is often very sensible, since it’ll reduce the amount of memory required to cache the results. If you’re working with the data in something like R or Python, selecting exactly what you need will reduce the likelihood of small bugs in your code leading to erroneous results later on.

Linking Tables with JOIN

JOIN allows you to find a related row in another table, for each row the current table - it joins the columns together based on a provided condition. There are quite a few types of JOIN, but learning just a couple of them will probably work for you 90% of the time. In this example, we’ll stick with the example above and select the first 100 results of prescribed date and quantity, but we’ll link it to the Demography table and return the patient’s sex.

SELECT TOP(100) prescribed_date, quantity, sex FROM Prescribing LEFT JOIN Demography_Current ON (Prescribing.PROCHI = Demography_Current.PROCHI)

The condition specified in the brackets can be expanded if necessary. Some tables might use several columns as their primary key (this is called a composite key).

A “LEFT” Join? What would a “RIGHT” join do?

If you think of the main table we’re querying (Prescribing) as the table on the left, then the table on the right would be Demography_Current. This means that for every row in Prescribing, we’ll find a relevant row in Demography_Current based on the value of the PROCHI column. You can replace LEFT JOIN with a RIGHT JOIN, for this example there probably won’t be any changes.

The difference is that now every row in the right table (Demography_Current) is being linked to every row in the left table (Prescribing). Whether this makes a difference or not will depend on whether it is a one-to-one mapping, a many-to-one mapping or a one-to-many mapping. For example, there might be multiple rows for the same PROCHI in the demography table, which would make it appear as though there are more rows in Prescribing than there really are.

We’ll talk about some other kind of joins later on in this post.

Table Aliases

Working with SQL can feel a bit verbose sometimes - typing “Demography_Current” with the correct case can get tiresome. Instead, SQL supports something called table aliasing. This allows the table to be referenced by a shorter name. The example above can be rewritten to make it more compact (and sometimes, more readable!).

In this example, we’ve given the Prescribing table the alias of P and the Demography table the alias of D. This allows us to refer to those tables as P and D in the JOIN condition.

Filtering Results

SQL supports quite advanced conditional statements, allowing unnecessary results to be filtered out very quickly. The conditions can be combined, on the same or across multiple tables. For example, with the query above, we could apply it only to females.

We could also exclude results where the quantity is between a range, for example:

SQL conditions work on pretty much any datatype, be it numbers, strings and dates. For partial string matching, a condition called LIKE is available. For example, we could run the same query but only for people whose postcode starts with “DD”

In this example, the percent symbol (%) is used as a wildcard for any string. An underscore (_) can also be used to match any single character. If running LIKE queries is helpful, you might want to see more information about it from the Microsoft documentation.

Producing Aggregates

SQL supports a number of aggregate functions which allow you to quickly summarise information from a table - counting rows, obtaining the min, max, standard deviation, or average of a column, or value derived from several columns. Here are a few examples.

Find the smallest prescription issued from Prescribing:

Find the average age from Demography_Current:

If you just want to know how many rows in a table there are in total, you can use COUNT

Producing Grouped Aggregates

The result of an aggregate can be grouped by another condition or value. For example, we might want to retrieve a count of prescriptions by age (probably not useful, but it’s an example ). For this, we will need to make use of a JOIN to link the Prescribing table to the Demography_Current table. We’ll also need to use an aggregate function (COUNT in this case), and a GROUP BY clause. We’ll also add in table aliasing, since we covered those earlier too.

The query may take some time to run (about 20-30 seconds), but it would take much (much) longer in something like R.

More Advanced JOINs

We’ve covered LEFT and RIGHT joins - these will return all rows from either the left or the right table, and try to join them to the other, based on some condition. This can result in some null values, since not every row will have a matching row in the other table, but still be returned. To avoid these null values from appearing, it’s possible to use an INNER join. For example,

This will only return results from Prescribing if there is a corresponding Demography_Current record for that PROCHI. The opposite of this is a FULL join, which will include rows from either table, even if there isn’t a matching row in Prescribing (or the other way around). A SELF join allows a table to be joined to itself which is useful under certain circumstances (A “Person” could be the parent of another “Person”, for example). A CARTESIAN join is an expensive operation which joins every row in one table to every row in the other table - be careful doing this, it’s likely you’re meaning to do something else.

Aside from the different types of join operation, there is also more advanced usage. For example, you might want to join multiple tables. If you’re doing this, be mindful of what is going on when you are doing a LEFT or a RIGHT join - it’s very likely you’ll end up with lots of partial duplicates, unless all the involved relations are one-to-one.

Indexing

One of the most important features of SQL databases, like Microsoft SQL Server, is the ability to add indexes to columns. This builds a cache of values which enable quicker retrieval from the table by value. Some columns will be indexed by default, this is always the case for the primary key (e.g. PROCHI in the Prescribing table). Additional indexes can be added by HIC Data team if your queries are running slower than expected. We’re unable to allow everyone to add indexes as databases are provisioned in the TRE read-only. This avoids accidental modification or deletion of data. We offer a read/write database suffixed with “_Working” which you can use for temporary tables or duplicating the data to add indexes, if necessary.