1 Introduction

This is a short vignette to show some tips and tricks for efficiently pulling data from the database of Metawards runs, using Ensemble Zero for demonstrative purposes. I for no moment profess to be an expert at SQL or relational databases so any further additions or corrections to this vignette are welcome.

2 Loading and inspection

Let us first load the packages that I’m going to use. dplyr is part of the tidyverse set of packages that enables a clean interface to data management. dbplyr translates dplyr notation into SQL, allowing us to communicate to the database. RSQLite which embeds the SQL database engine in R. Finally, DBIhosts a bunch of tools to communicate between R and our database.

library(dplyr)
library(dbplyr)
library(RSQLite)
library(DBI)

First we must establish a connection to the database.

con <- DBI::dbConnect(
  RSQLite::SQLite(), 
  "../data/uberStages.db"
)

Once we establish this connection we can have a look at what tables are in the database. Note, if you download the ensemble zero uberStages.db you will only have compact and not ward_to_trust (this is a table that I’ve added locally).

src_dbi(con)

## src:  sqlite 3.29.0 [/Users/lachlan/Documents/PostDoc/20200710_covid_plots/data/uberStages.db]
## tbls: compact, ward_to_trust

Then we link to the table that we’re interested in. As the code stands at the moment all the good stuff lies in the compact table. When we print the connection, metawards, it looks more or less like a regular tibble, only with a database connection added and ?? number of rows. When we see the ??, dbplyr is being smart/lazy and intentionally only considering the top few rows of data to show us what it looks like. This is nice for databases as they are generally of a size where they’ll max out our RAM and then no one has a good time.

metawards <- tbl(con, "compact")
print(metawards)

## # Source:   table<compact> [?? x 7]
## # Database: sqlite 3.29.0
## #   [/Users/lachlan/Documents/PostDoc/20200710_covid_plots/data/uberStages.db]
##    ward  week  Hprev Cprev Deaths output  replicate
##    <chr> <chr> <dbl> <dbl>  <int> <chr>       <int>
##  1 1     11     0        0      0 Ens0000         1
##  2 2     11     0        0      0 Ens0000         1
##  3 3     11     1        0      1 Ens0000         1
##  4 4     11     0        0      0 Ens0000         1
##  5 5     11     0        0      0 Ens0000         1
##  6 6     11     0        0      0 Ens0000         1
##  7 7     11     0        0      0 Ens0000         1
##  8 8     11     1.86     0      1 Ens0000         1
##  9 9     11     0        0      0 Ens0000         1
## 10 10    11     0        0      0 Ens0000         1
## # ... with more rows

From here we can utilise the magic of dbplyr to subset and pull in parts of the database as a tibble to work on. All we have to add is the collect call to explicitly tell dbplyr to go get our data. Do not run this before reading the next section.

data_tmp <- metawards %>%
  filter(output == "Ens002q", replicate == 4) %>%
  collect()

3 Indexing

Indexing is a method used to more quickly retrieve subsets of data from the database. Indexing a database is akin to creating a glossary for different column values so when we want to filter and pull data (for example, ward == 19) it can look up all rows that match the criterion and pull them, as opposed to considering all of the rows in the database. Unfortunately indexing also requires it’s fair share of memory and can make database compression less efficient. As it stands the databases hosted online will be un-indexed, and so you will have to do this on your local copy.

This can be done inside of R with the following style command

dbSendStatement(con, "CREATE INDEX *index_name* ON *tbl* (*column_name*);")

For instance, say you were particularly interested in pulling different model outputs. You could create an index on the output column with the command

dbSendStatement(con, "CREATE INDEX output_idx ON compact (output);")

What if you were interested in pulling unique combinations of columns, say, outputs and replicates. Creating two separate indexes will only help for the first filtering criteria as indexes are binary search trees, and so once you’ve filtered on the first criteria then the lookup information won’t help for the second criteria (thanks to TJ for pointing this out to me). We can, however, create an index on multiple columns where data are filtered by multiple criteria. For the example with outputs and replicates, we can write

dbSendStatement(con, "CREATE INDEX run_idx ON compact (output, replicate);")

Now, you’ve created some indexes, run your analyses, and your stuck with this monolithic database. How do we get rid of the indexes to conserve memory? Easy. Just use the IF EXISTS and DROP INDEX drop index commands,

dbSendStatement(con, "DROP INDEX IF EXISTS *idx_name* ON *tbl*;")

Can’t remember what indexes you’ve created? Try

dbGetQuery(con, "PRAGMA index_list('compact')")

Note that here we are using the command dbGetQuery as we’re wanting something back from the database rather than just sending commands to it.

As a final note, just be prepared that creating indexes can take some time. My local machine requires about 10 minutes to create a single column index. Despite this, the savings from indexing are massive. I highly recommend that you use them.

4 A short example

I’ve included a short example from some code I’ve been using for plots (plot vignette coming soon). The goal here is to pull some data, and map it to the trust level. We must go through an intermediary stage of first mapping it to ward names; although, there has been discussion of including trust names in the next iteration of the database so this step may soon be redundant.

First let us read in our ward and trust data

library(readr)

ward_lookup <- read_csv("../data/ward_lookup.csv") %>%
  dplyr::select(WD11CD, WD11NM, LAD11CD, LAD11NM, ward = FID)

to_trust <- read_csv("../data/WD11ToAcuteTrustIncWalesHB.csv") %>%
  as_tibble()

Now let us filter a single model output from the database. Note that wards and weeks are stored as a character vector, and so I convert them at the end to numeric. To demonstrate how fast indexing is, I’ve also called bench::mark around this call.

bench::mark({
  data_tmp <- metawards %>%
    filter(output == "Ens002q", replicate == 4) %>%
    collect() %>%
    mutate(
      ward = as.numeric(ward),
      week = as.numeric(week)
    )
})[,-1]

## # A tibble: 1 x 5
##        min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1    548ms    548ms      1.82    19.2MB        0

Half a second to filter a single model run from over half a trillion rows of data: not bad! Let’s look at our model, ward, and trust data.

print(data_tmp)

## # A tibble: 137,408 x 7
##     ward  week Hprev Cprev Deaths output  replicate
##    <dbl> <dbl> <dbl> <dbl>  <int> <chr>       <int>
##  1     1    11 0.714 1.29       0 Ens002q         4
##  2     2    11 0     0          0 Ens002q         4
##  3     3    11 1.43  1.14       1 Ens002q         4
##  4     4    11 1     0.857      1 Ens002q         4
##  5     5    11 0.286 1.57       1 Ens002q         4
##  6     6    11 4     0.857      1 Ens002q         4
##  7     7    11 5.71  1.14       3 Ens002q         4
##  8     8    11 5.43  5.43       1 Ens002q         4
##  9     9    11 3.57  2.14       0 Ens002q         4
## 10    10    11 2.14  2.71       2 Ens002q         4
## # ... with 137,398 more rows

print(ward_lookup)

## # A tibble: 8,588 x 5
##    WD11CD    WD11NM               LAD11CD   LAD11NM  ward
##    <chr>     <chr>                <chr>     <chr>   <dbl>
##  1 E05002337 Central              E06000039 Slough   2001
##  2 E05002338 Chalvey              E06000039 Slough   2002
##  3 E05002339 Cippenham Green      E06000039 Slough   2003
##  4 E05002340 Cippenham Meadows    E06000039 Slough   2004
##  5 E05002341 Colnbrook with Poyle E06000039 Slough   2005
##  6 E05002342 Farnham              E06000039 Slough   2006
##  7 E05002343 Foxborough           E06000039 Slough   2007
##  8 E05002344 Haymill              E06000039 Slough   2008
##  9 E05002345 Kedermister          E06000039 Slough   2009
## 10 E05002346 Langley St Mary's    E06000039 Slough   2010
## # ... with 8,578 more rows

print(to_trust)

## # A tibble: 8,532 x 4
##    WD11CD    WD11NM         trustId trustName                                   
##    <chr>     <chr>          <chr>   <chr>                                       
##  1 E05000658 Harper Green   RMC     Bolton NHS Foundation Trust                 
##  2 E05003149 Hawcoat        RTX     University Hospitals Of Morecambe Bay NHS F~
##  3 E05006853 Yeovil Central RA4     Yeovil District Hospital NHS Foundation Tru~
##  4 E05000861 Wigan Central  RRF     Wrightington, Wigan and Leigh NHS Foundatio~
##  5 E05001536 Hardwick       RVW     North Tees and Hartlepool NHS Foundation Tr~
##  6 E05006341 Woodlands      RCB     York Teaching Hospital NHS Foundation Trust 
##  7 E05007696 Central        RYR     Western Sussex Hospitals NHS Foundation Tru~
##  8 E05000326 Brunel         RAS     The Hillingdon Hospitals NHS Foundation Tru~
##  9 E05000018 Farringdon Wi~ R1H     Barts Health NHS Trust                      
## 10 E05001992 Horfield       RVJ     North Bristol NHS Trust                     
## # ... with 8,522 more rows

To get from the ward identifier in our model output to our trust ID we’ll have to first join to the ward_lookup tibble to find the ward code, and then to the to_trust tibble to find the trust ID. This can be done fairly simply in a dplyr pipe command. There a many different join commands that can be used. Personally, I prefer left_join as should any model location be unidentified with a ward it will return NA values and notify us that something is maybe wrong somewhere. Other commands, such as inner_join will not do this and quietly throw out the unidentified data.

data_tmp %>%
  left_join(ward_lookup, by = "ward") %>%
  left_join(to_trust, by = "WD11CD") %>%
  dplyr::select(
    ward, week, Hprev, Cprev, Deaths, output, trustId
  )

## # A tibble: 137,408 x 7
##     ward  week Hprev Cprev Deaths output  trustId
##    <dbl> <dbl> <dbl> <dbl>  <int> <chr>   <chr>  
##  1     1    11 0.714 1.29       0 Ens002q RAL    
##  2     2    11 0     0          0 Ens002q R1H    
##  3     3    11 1.43  1.14       1 Ens002q RYJ    
##  4     4    11 1     0.857      1 Ens002q RAP    
##  5     5    11 0.286 1.57       1 Ens002q RAL    
##  6     6    11 4     0.857      1 Ens002q RAL    
##  7     7    11 5.71  1.14       3 Ens002q RWG    
##  8     8    11 5.43  5.43       1 Ens002q RAL    
##  9     9    11 3.57  2.14       0 Ens002q RWH    
## 10    10    11 2.14  2.71       2 Ens002q RAL    
## # ... with 137,398 more rows

Database Interaction in R

Lachlan Astfalck

2020-07-21

1 Introduction

2 Loading and inspection

3 Indexing

4 A short example