This currently requires the MetaWards development branch.
Please download files for running the model from here.
The basic model structure is:
Each of these classes will be split into \(J\) age-classes, with interactions between the classes described in detail below.
The model structure above allows for different progression pathways. We also split the population up into different age demographics, to allow for variable mixing between the age-groups. We define a customised mover
function that is used to move individuals through different pathways, and mixer
functions that scale the force-of-infection (FOI) terms between the different age demographics (explained later). There are currently five pathways, which will be described below, but can be summarised as:
We are calibrating to hospitalised deaths, and so to simplify the model and aid calibration, we assume that the \(H\) class is non-infectious, but split the \(I\) class into two stages, \(I_1\) and \(I_2\). Therefore, on average individuals are infectious for longer if they do not go to hospital. Some evidence suggests that individuals can remain symptomatic for much longer than they remain infectious, but from a transmission perspective, once individuals cease to be infectious they are equivalent to “recovered” individuals. As such, the \(R_I\) class corresponds to recovered and non-infectious symptomatic individuals, and \(T_P + T_{I_1} + T_{I_2}\) to the average infectious period (not average symptomatic period). The incubation period is thus \(T_E + T_P\), and the time between symptom onset and admission to hospital is \(T_{I_1}\). We also make the simplifying assumption that asymptomatic infections last the same time as symptomatic infections.
We need to specify initial proportions of individuals in each age-class, which are provided in the data/populationByAge/Pop by CTRY.csv
file from Rob.
Note that the total number of individuals in the table below is different to the number of individuals in the commuter data that is used in MetaWards, since the commuter data come from the 2011 census, whereas the age-distributions above come from more recent population counts. Hence we use the more recent data to figure out the proportions of individuals in each age-category, which we pass to the MetaWards model.
However, we also need to specify the proportions of the workers and players that are in each age category. In this case we assume that all individuals \(<18\) years old or \(\geq 70\) years old are players only, and furthermore that school-age children don’t move from their home LADs, essentially assuming that most children go to school in their home LAD. (Note that the play movements of the \(<5\) year olds reflect the movements of their parents/carers.) To implement this in MetaWards we can set the proportions of workers/players in each age-class as described below, and then set an age-specific parameter static_play_at_home
for each age-class that we wish to restrict movements for. This latter part is done in the user input file here.
For the worker/player proportions, consider that if \(p_j\) is the proportion of individuals in age-class \(j\) in the population, then the proportion of workers in age-class \(j\) is: \[ p^W_j = \left\{\begin{array}{ll} 0 & \mbox{if $a_j < 18$},\\ \frac{p_j}{\sum_{\{k : 18 \leq a_k < 70\}} p_k} & \mbox{if $j$ s.t. $18 \leq a_j < 70$},\\ 0 & \mbox{if $a_j \geq 70$}. \end{array}\right. \] To generate the proportions of players in each age-class, we need to use the fact that the sum of workers and players in each age-class must equal the number in the population. Hence we require that \[\begin{align*} p^P_jN^P + p^W_j N^W &= p_j \left(N^P + N^W\right)\\ \Rightarrow p^P_j &= \frac{p_j \left(N^P + N^W\right) - p^W_j N^W}{N^P}, \end{align*}\] where \(N^W = 22,469,005\) and \(N^P = 33,613,072\) are the total number of workers and players in the population.
Age | Count | Prop. Pop. | Prop. Workers | Prop. Players |
---|---|---|---|---|
<5 | 3,346,727 | 0.060 | 0.000 | 0.1000 |
5-17 | 8,607,891 | 0.154 | 0.000 | 0.2570 |
18-29 | 8,615,247 | 0.154 | 0.235 | 0.0999 |
30-39 | 7,505,080 | 0.134 | 0.205 | 0.0865 |
40-49 | 7,189,826 | 0.128 | 0.195 | 0.0832 |
50-59 | 7,488,780 | 0.134 | 0.205 | 0.0865 |
60-69 | 5,866,967 | 0.105 | 0.160 | 0.0682 |
70+ | 7,356,660 | 0.131 | 0.000 | 0.2190 |
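The derivation above can be reproduced numerically. The following sketch hard-codes the population proportions and age-class lower bounds from the table, and recovers the worker and player proportions (\(N^W\) and \(N^P\) are the totals quoted above):

```python
import numpy as np

# proportions of the population in each age class (from the table above)
p = np.array([0.060, 0.154, 0.154, 0.134, 0.128, 0.134, 0.105, 0.131])
age_lo = np.array([0, 5, 18, 30, 40, 50, 60, 70])  # lower bound of each class

NW = 22469005  # total number of workers
NP = 33613072  # total number of players

# workers: zero outside 18 <= age < 70, renormalised within the working ages
working = (age_lo >= 18) & (age_lo < 70)
pW = np.where(working, p, 0.0)
pW = pW / pW.sum()

# players: chosen so that pP_j * NP + pW_j * NW = p_j * (NP + NW)
pP = (p * (NP + NW) - pW * NW) / NP
```

The resulting `pW` and `pP` vectors match the "Prop. Workers" and "Prop. Players" columns of the table (up to rounding).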
The demographics in MetaWards are set up using the demographics.json
file.
{"demographics" : ["age1", "age2", "age3", "age4", "age5", "age6", "age7", "age8"],
"work_ratios" : [0, 0, 0.235, 0.205, 0.195, 0.205, 0.16, 0],
"play_ratios" : [0.100108, 0.256943, 0.099855, 0.086539, 0.083213, 0.086539, 0.068235, 0.218568],
"diseases" : ["ncov", "ncov", "ncov", "ncov", "ncov", "ncov", "ncov", "ncov"]
}
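Since the work and play ratios are proportions across the demographics, each list should sum to one by construction. A quick sanity check, re-keying the JSON above, is:

```python
import json

# the contents of demographics.json, as shown above
demographics = json.loads("""
{"demographics" : ["age1", "age2", "age3", "age4", "age5", "age6", "age7", "age8"],
 "work_ratios" : [0, 0, 0.235, 0.205, 0.195, 0.205, 0.16, 0],
 "play_ratios" : [0.100108, 0.256943, 0.099855, 0.086539, 0.083213, 0.086539, 0.068235, 0.218568],
 "diseases" : ["ncov", "ncov", "ncov", "ncov", "ncov", "ncov", "ncov", "ncov"]}
""")

work_total = sum(demographics["work_ratios"])
play_total = sum(demographics["play_ratios"])
```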
The stages can be set with the MetaWards disease file, called ncov.json
here.
{ "name" : "SARS-Cov-2 (genpop)",
"stage" : ["E", "P", "I1", "I2", "RI", "DI", "A", "RA", "H", "RH", "DH"],
"beta" : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
"progress" : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
"too_ill_to_move" : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
"contrib_foi" : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
}
We will discuss these choices in more detail in the subsequent sections. Note that the progress
parameters are all set to zero here, since all movements will be controlled via the custom mover
function, and the beta
parameters are all set to one for infectious classes, but different values for different infectious classes will be passed in when the model is run, and the mixing between the age-classes will be controlled by a custom mixer
function.
Various functions require additional user inputs. These are stored in the inputs/user_inputs.txt
file. This contains information on the dates of lockdown and other parameters that are necessary to govern various custom functions as described in more detail in various sections of this vignette. For example, the input .nage
controls the number of age-classes in the model as described above.
# Normal status
.can_work[0] = True
# Lockdown 1
.can_work[1] = False
# Lockdown 2
.can_work[2] = True

# Lockdown dates
.lockdown_date_1_year = 2020
.lockdown_date_1_month = 3
.lockdown_date_1_day = 21
.lockdown_date_2_year = 2020
.lockdown_date_2_month = 5
.lockdown_date_2_day = 13

# contact matrix
.contact_matrix1_filename = "inputs/POLYMOD_matrix.csv"
.contact_matrix2_filename = "inputs/coMix_matrix.csv"

# number of age classes
.nage = 8

# names of input files for seeding
.age_seed_filename = "inputs/age_seeds.csv"
.ini_states_filename = "inputs/seeds.csv"

# turn off movements in school-age children
age2:static_play_at_home = 1.0
We seed infections proportionately to the age-structure of the population (described above). These are stored in a file called inputs/age_seeds.csv
, containing the probabilities below:
1,0.06
2,0.154
3,0.154
4,0.134
5,0.128
6,0.134
7,0.105
8,0.131
Here the first column is the age demographic and the second is the probability of seeding into that demographic. Since we do not know when or where initial importations of infection happened, we need some way to inform this. We want to avoid treating the number and timing of initial importations as additional unknown parameters in the model, since this vastly increases the volume of parameter space to search. Instead we have developed a data-driven prior for the states of the system at a given point in time, which depends on some data and also the parameters of the model.
The general approach is to take deaths observed in each Lower Tier Local Authority (LTLA) region on each day up to some fixed time point early on in the epidemic. Here we choose the 13th March 2020. We then fit a simplified version of the MetaWards model to each LTLA independently, where we assume a single population of individuals in each LTLA, and we remove age-structure from the model (taking a weighted average of the transition parameters across all the age-classes as inputs).
In addition to the transmission process, we also introduce a fixed parameter \(p_{\text{ini}}\), which corresponds to the probability of an introduction of infection on a given day, independent of the transmission process. For a fixed set of parameters, we fit the model from 1st January 2020–13th March 2020 using an individual forwards-filtering backwards sampling (iFFBS) algorithm1. Here the unknowns are the states of the system and the model is calibrated to the observed number of deaths in each LTLA. This provides a posterior distribution for the states of the system at each time point between 1st January 2020 and 13th March 2020, conditional on the data. We take random samples from this distribution to provide initial states for our MetaWards model runs. We use truncated sampling to ensure that at least one individual is either infectious or can progress to infectiousness, so that there is a non-zero probability of an outbreak in a given LTLA region. By separating the data used to generate the seeds from the data used to calibrate the model, we try to avoid the pitfalls of using the data twice.
We then take multinomial samples to decide how individuals are seeded into each age-class. Details and instructions can be found in data/seedDeaths/README.md
. This creates a file inputs/seedInfo.rds
, which contains the data necessary to fit the iFFBS model for a given set of parameters. The file R_tools/simulateSeedStates.R
will fit the iFFBS models according to each set of design points and generate a file called inputs/seeds.csv
that contains initial states for use in MetaWards. The iFFBS code is contained in the R_tools/iFFBS.cpp
file.
Note: we have made a few extensions to the standard iFFBS algorithm to make it more efficient for large populations with low numbers of infections.
Note: the user inputs file must contain the relative path names to the age and seed files—see here.
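Before turning to the implementation, the seeding logic described above can be sketched in miniature: sample an initial state vector for an LTLA (truncated so that at least one individual is, or can become, infectious), then assign the seeded individuals to age-classes with a multinomial draw. The state counts below are made up for illustration; the state ordering follows ncov.json:

```python
import numpy as np

rng = np.random.default_rng(1)

state_names = ["E", "P", "I1", "I2", "RI", "DI", "A", "RA", "H", "RH", "DH"]
can_transmit = {"E", "P", "I1", "I2", "A"}  # infectious, or may become so

# toy posterior samples of state counts for one LTLA (columns follow state_names)
posterior_samples = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # would be rejected by the truncation
    [2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
])

# truncated sampling: redraw until the sample contains at least one
# individual who is, or can become, infectious
idx = [i for i, s in enumerate(state_names) if s in can_transmit]
while True:
    sample = posterior_samples[rng.integers(len(posterior_samples))]
    if sample[idx].sum() > 0:
        break

# age seeding probabilities from inputs/age_seeds.csv
age_probs = np.array([0.060, 0.154, 0.154, 0.134, 0.128, 0.134, 0.105, 0.131])

# assign the seeded individuals across the eight age demographics
age_seeds = rng.multinomial(sample.sum(), age_probs)
```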
To implement this probabilistic seeding, we need a custom iterator
function. The contents of the iterator.py
file are shown below; this includes code to initiate seeds and to deal with lockdown (see here).
from metawards.utils import Console
from metawards.iterators import advance_infprob
from metawards.iterators import advance_play
from metawards.iterators import advance_fixed
from metawards.iterators import advance_foi
from metawards.iterators import advance_recovery
from metawards.iterators import advance_foi_work_to_play
from metawards.iterators import advance_work_to_play
from metawards.movers import MoveGenerator, go_ward
from datetime import datetime
import numpy as np
import re

# use caching to store files read in from filename
seed_file_cache = {}

# read in age lookup and/or return from cache
def read_age_file(filename):
    global seed_file_cache
    if filename not in seed_file_cache:
        with open(filename, "r") as FILE:
            age_probs = [[num for num in line.split(',')] for line in FILE]
        Console.debug("Age-level seeding probabilities:", variables = [age_probs])

        # convert to correct format
        age_probs = np.array(age_probs)
        age_probs_ind = age_probs[:, 0].astype(int)
        age_probs = age_probs[:, 1].astype(float)

        # save in cache
        seed_file_cache[filename] = "STORED"
        seed_file_cache["age_probs_ind"] = age_probs_ind
        seed_file_cache["age_probs"] = age_probs

    return seed_file_cache["age_probs_ind"], seed_file_cache["age_probs"]

# read initial seeding states from file and/or return from cache
def read_states_file(filename):
    global seed_file_cache
    if filename not in seed_file_cache:
        with open(filename, "r") as FILE:
            ini_states = [[num for num in line.split(',')] for line in FILE]
        Console.debug("Initial states:", variables = [ini_states])

        # convert to correct format
        ini_states = np.array(ini_states)
        ini_states_output = ini_states[:, 0]
        ini_states_LAD = ini_states[:, 1].astype(int)
        ini_states = ini_states[:, 2:].astype(int)

        # save in cache
        seed_file_cache[filename] = "STORED"
        seed_file_cache["ini_states_output"] = ini_states_output
        seed_file_cache["ini_states_LAD"] = ini_states_LAD
        seed_file_cache["ini_states"] = ini_states

    return (seed_file_cache["ini_states_output"],
            seed_file_cache["ini_states_LAD"],
            seed_file_cache["ini_states"])

# determine the lockdown status based on the population and current network
def get_lock_down_vars(network, population):
    date = population.date
    params = network.params.user_params

    y1 = int(params["lockdown_date_1_year"])
    m1 = int(params["lockdown_date_1_month"])
    d1 = int(params["lockdown_date_1_day"])

    y2 = int(params["lockdown_date_2_year"])
    m2 = int(params["lockdown_date_2_month"])
    d2 = int(params["lockdown_date_2_day"])

    # lockdown dates
    lock_1 = datetime(y1, m1, d1).date()
    lock_2 = datetime(y2, m2, d2).date()

    state = 0
    rate = 1.0
    if date >= lock_1:
        state += 1
        rate = params["lock_1_restrict"]
    if date >= lock_2:
        state += 1
        rate = (1.0 - ((1.0 - params["lock_1_restrict"]) * params["lock_2_release"]))

    can_work = params["can_work"][state]
    return state, rate, can_work

# advance in a lockdown state on a weekday
def advance_lockdown_week(network, population, **kwargs):
    state, rate, can_work = get_lock_down_vars(network = network, population = population)

    advance_infprob(scale_rate = rate, network = network, population = population, **kwargs)
    advance_play(network = network, population = population, **kwargs)
    if can_work:
        advance_fixed(network = network, population = population, **kwargs)

# advance in a lockdown state at the weekend
def advance_lockdown_weekend(network, population, **kwargs):
    state, rate, can_work = get_lock_down_vars(network = network, population = population)

    ## update dyn_play_at_home
    dyn_play_at_home = []
    for i in range(0, len(network.subnets)):
        dyn_play_at_home.append(network.subnets[i].params.dyn_play_at_home)
        network.subnets[i].params.dyn_play_at_home = \
            network.subnets[i].params.user_params["p_home_weekend"]

    advance_infprob(scale_rate = rate, network = network, population = population, **kwargs)
    advance_work_to_play(network = network, population = population, **kwargs)

    ## reset dyn_play_at_home
    for i in range(0, len(network.subnets)):
        network.subnets[i].params.dyn_play_at_home = dyn_play_at_home[i]

# advance FOI at the weekend
def advance_foi_work_to_play_weekend(network, population, **kwargs):
    ## update dyn_play_at_home
    dyn_play_at_home = []
    for i in range(0, len(network.subnets)):
        dyn_play_at_home.append(network.subnets[i].params.dyn_play_at_home)
        network.subnets[i].params.dyn_play_at_home = \
            network.subnets[i].params.user_params["p_home_weekend"]

    advance_foi_work_to_play(network = network, population = population, **kwargs)

    ## reset dyn_play_at_home
    for i in range(0, len(network.subnets)):
        network.subnets[i].params.dyn_play_at_home = dyn_play_at_home[i]

# custom advance function to set up initial seeds
def advance_initial_seeds(output_dir, network, population, infections, profiler, rngs, **kwargs):
    # extract user parameters
    params = network.params

    # extract file names for initial seeding probabilities
    age_seed_filename = params.user_params["age_seed_filename"]
    ini_states_filename = params.user_params["ini_states_filename"]

    # start profiler
    p = profiler.start("additional_seeds")

    # set up lookups or read from cache
    age_probs_ind, age_probs = read_age_file(age_seed_filename)
    ini_states_output, ini_states_LAD, ini_states = read_states_file(ini_states_filename)

    # get output identifier for this run of MetaWards
    output = re.search(r"Ens([0-9a-z]*)", output_dir.output_dir())
    output = output.group(0)

    # set up vector of state names
    state_names = np.array(["E", "P", "I1", "I2", "RI", "DI", "A", "RA", "H", "RH", "DH"])
    state_nums = np.array(range(len(state_names)))

    # extract states for this run
    filter_states = [i == output for i in ini_states_output]
    if sum(filter_states) == 0:
        raise Exception(f"Cannot find {output} seeding information.")

    # filter seeding information relating to this run
    ini_states_LAD = ini_states_LAD[filter_states]
    ini_states = ini_states[filter_states, :]

    # loop over LADs
    for k in range(len(ini_states_LAD)):
        # extract states to seed in this LAD
        filter_LAD = [i == ini_states_LAD[k] for i in ini_states_LAD]
        ini_states_curr = ini_states[filter_LAD, :][0]

        # extract number of affected individuals
        nind = ini_states_curr.sum()

        if nind > 0:
            # select seeds in age-classes at random according to initial probabilities
            age_seeds = np.random.multinomial(nind, age_probs)

            # run over each age demographic
            for demographic in range(len(age_seeds)):
                # check if any seeding done in demographic
                while age_seeds[demographic] > 0:
                    # select a state for the given seed
                    state_probs = ini_states_curr / ini_states_curr.sum()
                    states = np.random.multinomial(1, state_probs)
                    filter_state = [i == 1 for i in states]
                    state = state_names[filter_state]
                    staten = state_nums[filter_state]

                    # remove selected state from remaining states
                    ini_states_curr[staten] -= 1

                    # generate move from S to the selected state
                    move = MoveGenerator(from_demographic = f'age{demographic + 1}',
                                         to_demographic = f'age{demographic + 1}',
                                         from_ward = ini_states_LAD[k],
                                         to_ward = ini_states_LAD[k],
                                         from_stage = "S",
                                         to_stage = state[0],
                                         number = 1)

                    go_ward(generator = move, output_dir = output_dir, network = network,
                            population = population, infections = infections,
                            profiler = profiler, rngs = rngs, **kwargs)

                    # update counter
                    age_seeds[demographic] -= 1

    # end profiler
    p.stop()

# custom iterator function
def iterate_lockdown(stage: str, population, **kwargs):
    # is this a weekday or a weekend?
    if population.date is None:
        # have to guess from the day - assume day zero was a Monday
        day = population.day % 7
        # 0-4 is a weekday, 5 and 6 are weekend
        is_weekend = (day >= 5)
    else:
        day = population.date.weekday()
        # 0-4 is a weekday, 5 and 6 are weekend
        is_weekend = (day >= 5)

    # need custom advance functions to scale rates after lockdown;
    # in the period before lockdown this treats workers as players
    # at weekends
    if population.day == 1:
        if stage == "setup":
            return [advance_initial_seeds]
        else:
            return []
    else:
        if stage == "foi":
            if is_weekend:
                return [advance_foi_work_to_play_weekend, advance_recovery]
            else:
                return [advance_foi, advance_recovery]
        elif stage == "infect":
            if is_weekend:
                return [advance_lockdown_weekend]
            else:
                return [advance_lockdown_week]
        else:
            # we don't do anything at the "analyse" or "finalise" stages
            return []
For generic states \(X\) and \(Y\) say, the progress of an individual in age-class \(j\) (\(j = 1, \dots, J\)) from state \(X\) to state \(Y\) in a given day is governed by probability \(q^j_{XY}\), where:
We also require inputs for \(R_0\), a parameter \(\nu_A\) which scales the force-of-infection between asymptomatics and the general population, and further scaling parameters that we use to model various lockdown measures (described below). We also need a parameter \(p_{\text{home/weekend}}\) which scales the number of movements at weekends.
From the model structure above, in a single population, the force-of-infection, \(\lambda^j\), acting on an individual in age-class \(j\) is: \[\begin{align*} \lambda^j &= \sum_{k=1}^J \frac{\beta^{kj}}{N^k}\left(P^k + I_1^k + I_2^k + \nu_A A^k\right) \end{align*}\] In practice we model the transmission parameters as: \[ \beta^{kj} = \nu c^{kj}, \] where \(0 < \nu < 1\) is a probability of infection per contact, and \(c^{kj}\) is the population contact rate that an individual in class \(j\) has with individuals in class \(k\). We use the population contact matrix \(C = \{c^{kj}; k = 1, \dots, J; j = 1, \dots, J\}\) from the POLYMOD survey in the early stages of the outbreak, and then the CoMix survey after the first lockdown.
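As a concrete illustration of the FOI calculation (all numbers below are placeholders, with two age classes for brevity and the convention `C[k, j]` \(= c^{kj}\)):

```python
import numpy as np

# placeholder values for a two-class illustration
nu, nu_A = 0.05, 0.5
C = np.array([[8.0, 3.0],       # C[k, j] = c^{kj}: contact rate with class k
              [3.0, 5.0]])      # for an individual in class j
N = np.array([1000.0, 2000.0])  # class sizes
P = np.array([5.0, 2.0]); I1 = np.array([3.0, 1.0])
I2 = np.array([1.0, 0.0]); A = np.array([4.0, 6.0])

# lambda^j = sum_k beta^{kj} / N^k * (P^k + I1^k + I2^k + nu_A * A^k),
# with beta^{kj} = nu * c^{kj}
pressure = (P + I1 + I2 + nu_A * A) / N
lam = nu * C.T @ pressure
```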
A challenge is that we parameterise our model in terms of \(R_0\), not \(\nu\), and thus we need a way to derive \(\nu\) from \(R_0\) for a given run. To this end, \(R_0\) can be derived as the maximum eigenvalue of the next-generation matrix (NGM). Firstly, let \(F_i(x)\) be the rate of new infections into infected class \(i\), and \(V_i(x)\) be the rate of transitions from infected class \(i\). Then define matrices: \[ \mathbf{F} = \left[\frac{\partial F_i\left(x_0\right)}{\partial x_j}\right] \quad \mbox{and} \quad \mathbf{V} = \left[\frac{\partial V_i\left(x_0\right)}{\partial x_j}\right] \quad \mbox{for}~1 \leq i, j \leq J, \] where \(x_0\) corresponds to the states at the disease-free equilibrium. Then, the NGM, \(\mathbf{K}\), is: \[ \mathbf{K} = \mathbf{FV}^{-1}, \] and \(R_0\) is the maximum eigenvalue of \(\mathbf{K}\).
From the model structure above, in a single population with age-structure, the infected classes are \(E^j\), \(A^j\), \(P^j\), \(I_1^j\) and \(I_2^j\) for age-classes \(j = 1, \dots, J\). In a deterministic model these would have rates-of-change of: \[\begin{align*} \frac{dE^j}{dt} &= S^j\left[\sum_{k=1}^J \frac{\beta^{kj}}{N^k}\left(P^k + I_1^k + I_2^k + \nu_A A^k\right) \right] - \gamma_E E^j\\ \frac{dA^j}{dt} &= \left(1 - p_{EP}^j\right)\gamma_E E^j - \gamma_A A^j\\ \frac{dP^j}{dt} &= p_{EP}^j\gamma_E E^j - \gamma_P P^j\\ \frac{dI_1^j}{dt} &= \gamma_P P^j - \gamma_{I_1} I_1^j\\ \frac{dI_2^j}{dt} &= \left(1 - p_{I_1H}^j - p_{I_1D}^j\right)\gamma_{I_1} I_1^j - \gamma_{I_2} I_2^j\\ \end{align*}\] Here \(\beta^{kj}\) is the transmission rate from class \(k\) to class \(j\) and \(N^k\) is the total number of individuals in class \(k\). The parameter \(\nu_A\) scales the force-of-infection from the \(A\) class relative to the other infectious classes.
The non-zero components needed to derive the NGM are: \[\begin{align*} \frac{\partial F_{E^i}\left(x_0\right)}{\partial E^j} &= 0 &\qquad \forall i, j,\\ \frac{\partial F_{E^i}\left(x_0\right)}{\partial A^j} &= \frac{\beta^{ji}\nu_A S^i}{N^j} &\qquad \forall i, j\\ \frac{\partial F_{E^i}\left(x_0\right)}{\partial P^j} &= \frac{\beta^{ji}S^i}{N^j} &\qquad \forall i, j\\ \frac{\partial F_{E^i}\left(x_0\right)}{\partial I_1^j} &= \frac{\beta^{ji}S^i}{N^j} &\qquad \forall i, j\\ \frac{\partial F_{E^i}\left(x_0\right)}{\partial I_2^j} &= \frac{\beta^{ji}S^i}{N^j} &\qquad \forall i, j\\ \end{align*}\] and \[\begin{align*} \frac{\partial V_{E^i}\left(x_0\right)}{\partial E^j} &= \left\{\begin{array}{cc} \gamma_E & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{A^i}\left(x_0\right)}{\partial E^j} &= \left\{\begin{array}{cc} -\left(1 - p_{EP}^i\right)\gamma_E & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{A^i}\left(x_0\right)}{\partial A^j} &= \left\{\begin{array}{cc} \gamma_A & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{P^i}\left(x_0\right)}{\partial E^j} &= \left\{\begin{array}{cc} -p_{EP}^i\gamma_E & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{P^i}\left(x_0\right)}{\partial P^j} &= \left\{\begin{array}{cc} \gamma_P & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{I_1^i}\left(x_0\right)}{\partial P^j} &= \left\{\begin{array}{cc} -\gamma_P & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{I_1^i}\left(x_0\right)}{\partial I_1^j} &= \left\{\begin{array}{cc} \gamma_{I_1} & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. 
&\\ \frac{\partial V_{I_2^i}\left(x_0\right)}{\partial I_1^j} &= \left\{\begin{array}{cc} -\left(1 - p_{I_1H}^i - p_{I_1D}^i\right)\gamma_{I_1} & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise,} \end{array}\right. &\\ \frac{\partial V_{I_2^i}\left(x_0\right)}{\partial I_2^j} &= \left\{\begin{array}{cc} \gamma_{I_2} & \mathrm{for}~i = j,\\ 0 & \mathrm{otherwise.} \end{array}\right. &\\ \end{align*}\] Since the non-zero components of \(\mathbf{F}\) all contain \(\beta^{ji}\), where \[ \beta^{ji} = \nu c^{ji}, \] we can thus write \[ \mathbf{K} = \nu\mathbf{G}\mathbf{V}^{-1} \] where \(\mathbf{G}\) is equivalent to replacing \(\beta^{ji}\) by \(c^{ji}\) in \(\mathbf{F}\). As such: \[\begin{align*} R_0 &= \nu~\mathrm{eig}_M\left(\mathbf{G}\mathbf{V}^{-1}\right)\\ \Rightarrow \nu &= \frac{R_0}{\mathrm{eig}_M\left(\mathbf{G}\mathbf{V}^{-1}\right)} \end{align*}\] where \(\mathrm{eig}_M(\mathbf{K})\) denotes the maximum eigenvalue of \(\mathbf{K}\).
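The derivation above can be checked numerically. The sketch below builds \(\mathbf{G}\) and \(\mathbf{V}\) for a two-class toy example (the contact matrix, class sizes, rates and pathway probabilities are all placeholder values, not calibrated inputs) and recovers \(\nu = R_0 / \mathrm{eig}_M(\mathbf{G}\mathbf{V}^{-1})\):

```python
import numpy as np

# placeholder inputs for J = 2 age classes
J = 2
C = np.array([[8.0, 3.0], [3.0, 5.0]])  # C[k, j] = c^{kj}
N = np.array([1000.0, 2000.0])
S0 = N.copy()                  # disease-free equilibrium: everyone susceptible
nu_A = 0.5
p_EP = np.array([0.6, 0.7])
p_I1H = np.array([0.05, 0.10]); p_I1D = np.array([0.01, 0.02])
g_E, g_A, g_P, g_I1, g_I2 = 1/2.0, 1/5.0, 1/2.0, 1/3.5, 1/0.5

# infected classes ordered E, A, P, I1, I2, each with J age classes
def idx(stage, j):
    return stage * J + j  # stage: 0=E, 1=A, 2=P, 3=I1, 4=I2

G = np.zeros((5 * J, 5 * J))
V = np.zeros((5 * J, 5 * J))
for i in range(J):
    for j in range(J):
        # dF_{E^i}/dA^j etc., with beta^{ji} replaced by c^{ji}
        G[idx(0, i), idx(1, j)] = C[j, i] * nu_A * S0[i] / N[j]
        for s in (2, 3, 4):  # P, I1, I2 contribute without the nu_A scaling
            G[idx(0, i), idx(s, j)] = C[j, i] * S0[i] / N[j]
for i in range(J):
    V[idx(0, i), idx(0, i)] = g_E
    V[idx(1, i), idx(0, i)] = -(1 - p_EP[i]) * g_E
    V[idx(1, i), idx(1, i)] = g_A
    V[idx(2, i), idx(0, i)] = -p_EP[i] * g_E
    V[idx(2, i), idx(2, i)] = g_P
    V[idx(3, i), idx(2, i)] = -g_P
    V[idx(3, i), idx(3, i)] = g_I1
    V[idx(4, i), idx(3, i)] = -(1 - p_I1H[i] - p_I1D[i]) * g_I1
    V[idx(4, i), idx(4, i)] = g_I2

R0 = 3.0
eig_M = max(abs(np.linalg.eigvals(G @ np.linalg.inv(V))))  # spectral radius
nu = R0 / eig_M
```

By construction, scaling \(\mathbf{G}\mathbf{V}^{-1}\) by the resulting `nu` gives an NGM whose maximum eigenvalue is the chosen \(R_0\).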
When using a mixture of contact matrices, we parameterise \(\nu\) using the NGM evaluated at the initial contact structure, which represents the contact matrix expected in a completely susceptible population at the start of the outbreak.
We have derived plausible parameter bounds from data and the literature. We note that the analyses below are not intended to produce accurate estimates of these parameters, but rather to establish plausible ranges.
\[\begin{align} \mbox{$R_0$}&: (2, 4.5) \qquad \mbox{from Leon and Rob}\\ T_E&: (0.1, 2)\\ T_P&: (1.2, 3)\\ T_{I_1}&: (2.8, 4.5)\\ T_{I_2}&: (0.0001, 0.5)\\ \alpha_{EP}, \alpha_{I_1H}, \alpha_{I_1D}, \alpha_{HD}, \eta&: \mbox{drawn from distribution as described below}\\ \alpha_{T_H}, \eta_{T_H}&: \mbox{drawn from distribution as described below}\\ \nu_A&: (0, 1)\\ p_{\text{home/weekend}}&: (0, 1) \end{align}\]
For the mean latent period, \(T_E\), we set the upper bound such that the latent period distribution has an upper tail probability that is consistent with the generation interval distribution from Challen et al. (2020). This provides a conservative upper bound, since the generation time is equivalent to \(T_E + T_P\) in our model. We choose a lower bound such that it is possible to have a short, but non-zero latent period. The figure below shows the latent period distributions at the lower and upper bounds of the parameter ranges for \(T_E\).
To generate ranges for the length of the pre-symptomatic period, \(T_P\), we note that the incubation period is equivalent to \(T_E + T_P\) in our model, and hence through simulation we generated incubation period distributions at the lower and upper bounds of \(T_E\), and chose bounds for \(T_P\) such that at the lower bounds of \(T_E\) and \(T_P\) the incubation period distribution looks like the lower-mean incubation period distribution in Challen et al. (2020), and at the upper bounds of \(T_E\) and \(T_P\) the incubation period distribution looks like the higher-mean incubation period distribution in Challen et al. (2020). The figure below shows the incubation period distributions at the lower and upper bounds of the parameter ranges for \(T_E\) and \(T_P\), against the reference distributions in Challen et al. (2020).
To generate ranges for the length of the infectious period, \(T_{I_1}\), we note that this is equivalent to the time-from-onset-to-admission in our model, and hence through simulation we chose bounds for \(T_{I_1}\) such that the distribution of \(T_{I_1}\) was approximately consistent with the time-from-onset-to-admission in Challen et al. (2020), and also that the time-from-infection-to-admission, \(T_E + T_P + T_{I_1}\), was consistent with the time-from-infection-to-admission in Challen et al. (2020), at both the lower and upper bounds for \(T_E\) and \(T_P\). The figure below shows these distributions at the lower and upper bounds of the parameter ranges, against the reference distributions in Challen et al. (2020).
For the remainder of infectivity time, \(T_{I_2}\), we noted that Cevik et al. (2020) found no live virus beyond day 9 of symptom onset. As such, we used simulation to find upper and lower bounds for \(T_{I_2}\). The upper bound of \(T_{I_2}\) was chosen so that when \(T_{I_1}\) was at its lower bound, the probability that the infectivity time post onset (\(T_{I_1} + T_{I_2}\)) was > 9 days was \(\approx 0.05\). Conversely, if \(T_{I_1}\) is set to its upper bound, then even for very low values of \(T_{I_2}\) we were unable to get upper tail probabilities for \(T_{I_1} + T_{I_2}\) of less than \(\approx 0.13\), hence we set a lower bound for \(T_{I_2}\) of 0.0001. At the upper bound for \(T_{I_1}\) and \(T_{I_2}\) these tail probabilities were \(\approx 0.15\).
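These tail probabilities can be reproduced with a short simulation, here assuming (for illustration) exponentially distributed stage lengths with the stated means:

```python
import numpy as np

rng = np.random.default_rng(42)

def tail_prob(T_I1, T_I2, threshold=9.0, n=200_000):
    """P(infectious time post onset exceeds `threshold` days), assuming
    exponential stage lengths with means T_I1 and T_I2."""
    post_onset = rng.exponential(T_I1, n) + rng.exponential(T_I2, n)
    return float((post_onset > threshold).mean())

# lower bound of T_I1 paired with the chosen upper bound of T_I2
p_lo = tail_prob(2.8, 0.5)
# upper bound of T_I1 paired with a negligible T_I2
p_hi = tail_prob(4.5, 0.0001)
```

Under these assumptions the simulated tail probabilities land close to the \(\approx 0.05\) and \(\approx 0.13\) values quoted above.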
For the hospital stay times we used data from the CHESS study which is not publicly available. Below we plot the distribution of stay times by age, along with the best fitting exponential model in each case.
If we plot the log(mean hospital stay time) from the fitted exponential models against the midpoints of each age-category, then we get an approximate increasing linear relationship with age. To capture estimation uncertainties more robustly, we built a hierarchical Bayesian model (fitted using MCMC implemented in NIMBLE) to estimate both the mean hospital stay times in each age category, and then simultaneously the slope and intercept from Gaussian linear regression model fitted to the log(mean hospital stay time) from the exponential models. Full details can be found in the data/hospitalStays/
folder.
As a comparison, below we show the log(mean hospital stay time) estimates from the non-Bayesian models (fitted independently to each age-distribution) against the fitted line from the hierarchical Bayesian model, with 99% credible intervals.
The intercept and slope of this line provide the parameters we use to determine the age-structured hospital stay times, and we’ve decided to draw from the posterior distribution using a space-filling design for these parameters. We do this by generating large numbers of samples from the posterior, and then sub-sampling these using a maximin design to get appropriate coverage. To avoid having to run the MCMC each time we want more design points, we approximate the posterior distribution using a truncated finite Gaussian mixture model, fitted using the mclust
package in R. The posterior distributions from the MCMC and the approximating FMM are shown below.
For the probabilities of transitioning along different pathways by age, we used two sources of information to generate plausible ranges to use as initial inputs into the calibration. We use model estimates of the infection fatality risk and the probability of hospitalisation given infection from Verity et al. (2020), and also some data from the CDC.
The estimates of the infection fatality risk and the probability of hospitalisation given infection from the CDC are obtained in the following way.
These estimates are provided in the data/pathways/CasesHospDeathInHospForUS_Update.xlsx
file.
If we plot the log(probability of hospitalisation given infection) and the log(infection fatality risk) against the midpoints of each age-category, then we get an approximately linear increasing relationship with age in all cases. After discussion, we decided to model the probabilities of transition along certain pathways as log-linear in age, with a common slope parameter, constrained such that the risk of more severe outcomes always increases with age. We allow the intercepts to vary between the different transition pathways. Hence we have five parameters to calibrate: the intercepts \(\alpha_{EP}\), \(\alpha_{I_1D}\), \(\alpha_{I_1H}\), \(\alpha_{HD}\) and the common slope \(\eta\).
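A minimal sketch of the shared-slope log-linear form, using two of the pathways (the intercepts and slope below are placeholder values, not calibrated estimates):

```python
import numpy as np

# midpoints of the eight age categories used above
age_mid = np.array([2.5, 11.5, 24.0, 34.5, 44.5, 54.5, 64.5, 75.0])

eta = 0.05          # common slope (placeholder)
alpha_I1H = -5.0    # intercept for the I1 -> H pathway (placeholder)
alpha_I1D = -7.0    # intercept for the I1 -> D pathway (placeholder)

# log-linear in age, shared slope across pathways
p_I1H = np.exp(alpha_I1H + eta * age_mid)
p_I1D = np.exp(alpha_I1D + eta * age_mid)

# the calibration must reject parameter combinations where the slope is
# non-positive or any transition probability exceeds one
valid = bool(np.all(p_I1H + p_I1D <= 1.0)) and eta > 0
```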
To derive plausible ranges for these parameters, we used an ad-hoc MCMC approach based on simulated data sets. Firstly, we took the infection fatality risks and probabilities of hospitalisation estimates from Verity et al. (2020), and for each age-class we generated a pseudo-data set by randomly sampling from a binomial distribution with 100 trials and a probability of success given by the relevant estimate. This gave us age-specific counts for hospitalisations (\(y_{iH}\)) and death given infection (\(y_{iD}\)) given \(n_i = 100\) “trials” for age-classes \(i\). We then fitted a model to this pseudo-data set, targeting a posterior of the form: \[\begin{align} & \pi\left(\alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta \mid \mathbf{y}, \mathbf{n}\right) \propto\\ & \qquad \pi\left(\mathbf{y} \mid \alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta, \mathbf{n}\right) \pi\left(\alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta\right). \end{align}\] The likelihood component, \(\pi\left(\mathbf{y} \mid \alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta, \mathbf{n}\right)\), depends on a set of latent variables, \(\mathbf{x}\), and can be written as: \[ \pi\left(\mathbf{y} \mid \alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta, \mathbf{n}\right) = \int_{\mathcal{X}}\pi\left(\mathbf{y} \mid \mathbf{x}, \mathbf{n}\right)\pi\left(\mathbf{x} \mid \alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta\right) d \mathcal{X}, \] where \(\mathcal{X}\) is the (multidimensional) parameter space for the latent states. Here the latent states correspond to the individuals transitioning down the different pathways in the model, and \(\pi\left(\mathbf{y} \mid \mathbf{x}, \mathbf{n}\right)\) is a binomial probability mass function based on probabilities derived from \(\mathbf{x}\) (here we assume that given \(\mathbf{x}\), the data \(y_{iH}\) and \(y_{iD}\) are independent). 
To approximate the integral we used Monte Carlo simulation, such that \[ \hat{\pi}\left(\mathbf{y} \mid \alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta, \mathbf{n}\right) = \frac{1}{M}\sum_{j=1}^M \pi\left(\mathbf{y} \mid \mathbf{x}_j, \mathbf{n}\right), \] where (with a slight abuse of notation) \(\mathbf{x}_j \sim \pi\left(\mathbf{x} \mid \alpha_{EP}, \alpha_{I_1D}, \alpha_{I_1H}, \alpha_{HD}, \eta\right)\) gives the numbers of individuals out of 100,000 who end up running through each pathway, and for age-class \(i\) the count \(y_i\) is modelled as \(y_i \mid x_i, n_i \sim \mbox{Bin}\left(100, x_i / 100000\right)\), with \(x_i\) the relevant simulated count. We used this estimator in place of the true likelihood in a Metropolis-Hastings algorithm to obtain an approximate “posterior”. We found \(M = 1\) simulation was sufficient, but we had to run long chains to get good mixing.
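This pseudo-marginal scheme can be sketched as follows, with a deliberately simplified stand-in for the latent pathway simulation (two independent binomial pathways rather than the full model; all names here are hypothetical):

```python
import random
from math import comb

def binom_pmf(y, n, p):
    """Binomial probability mass function."""
    return comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def simulate_pathway_counts(p_hosp, p_death, size=100_000, rng=random):
    """Hypothetical stand-in for the pathway simulator: draw the latent
    counts x (hospitalisations and deaths) out of `size` infections."""
    x_h = sum(rng.random() < p_hosp for _ in range(size))
    x_d = sum(rng.random() < p_death for _ in range(size))
    return x_h, x_d

def likelihood_hat(y_h, y_d, n, p_hosp, p_death, M=1, rng=random):
    """Monte Carlo estimate of the likelihood: average the binomial pmf
    of the data over M simulated latent counts (M = 1 in the text)."""
    total = 0.0
    for _ in range(M):
        x_h, x_d = simulate_pathway_counts(p_hosp, p_death, rng=rng)
        total += binom_pmf(y_h, n, x_h / 100_000) * binom_pmf(y_d, n, x_d / 100_000)
    return total / M
```

Since the average is an unbiased estimate of the integrated likelihood, plugging it into a Metropolis-Hastings acceptance ratio still targets the correct posterior, which is why \(M = 1\) can suffice at the cost of slower mixing.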
We repeated this approach for 10 pseudo-data sets derived from the Verity et al. (2020) estimates, and also 10 pseudo-data sets derived from the CDC data. The union of these posterior samples across all 20 data sets provided our plausible ranges. Full details can be found in the data/pathways/
folder.
As a comparison, below we show the probability of hospitalisation given infection estimates from both the Verity and CDC data against the 99% prediction intervals generated from our union posterior samples.
As a comparison, below we show the probability of death given infection estimates from both the Verity and CDC data against the 99% prediction intervals generated from our union posterior samples.
For our input designs we draw from the union posterior distributions using a space-filling maximin sampling design over a large set of posterior samples. To avoid having to run the MCMC each time we want more design points, we approximate the posterior distribution using a truncated finite Gaussian mixture model, fitted using the mclust
package in R. The posterior distributions from the MCMC and the approximating FMM are shown below. In practice we require checks that \(\eta > 0\) and that for any combination of parameters the probabilities of transition in all age-classes are bounded by 1.
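These validity checks amount to a rejection step over candidate draws from the fitted mixture. A minimal sketch (where `draw_fn` is a hypothetical stand-in for sampling from the truncated FMM, and the age midpoints are illustrative):

```python
import math
import random

def valid_draw(alphas, eta, age_midpoints):
    """A draw is valid if the common slope is positive and every implied
    transition probability exp(alpha + eta * age) is bounded by 1."""
    if eta <= 0:
        return False
    return all(math.exp(a + eta * age) <= 1.0
               for a in alphas for age in age_midpoints)

def rejection_sample(draw_fn, n, age_midpoints, max_tries=100_000):
    """Keep drawing candidate parameter sets until n valid ones are found."""
    kept = []
    for _ in range(max_tries):
        alphas, eta = draw_fn()
        if valid_draw(alphas, eta, age_midpoints):
            kept.append((alphas, eta))
            if len(kept) == n:
                break
    return kept

# hypothetical sampler standing in for draws from the fitted mixture
def draw_fn():
    alphas = [random.uniform(-10.0, -4.0) for _ in range(4)]
    eta = random.uniform(-0.02, 0.05)
    return alphas, eta
```

The same filtering idea is what the limitFn argument to FMMmaximin() implements later on for the R design-generation code.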
MetaWards has a specific structure in terms of how it progresses movements between the stages. To get the correct splits between the pathways specified above we do all non-infection movements through a custom mover
function specified below. Note that in the ncov.json
file specified above, we set all progress
parameters to be 0. Thus, all transition probabilities other than new infections are driven by user-defined parameters that are passed to the mover
function, which we call move_pathways.py
here.
The mover function applies movements in order, and so it is important to get the order correct. In particular, we need to process the stage movements in reverse order (e.g. do movements out of the \(H\) stage before movements out of the \(I_1\) stage). This ensures that individuals who move from, say, \(I_1 \to H\) can’t then immediately move out of \(H\) in the same time step. The file move_pathways.py contains the code below.
Additional note: the functions in the mover file operate in turn, so the movement probabilities must be adjusted between each function in order to move the correct proportions. For example, suppose we have \(n\) individuals in the \(I_1\) class and we want to move a proportion \(p_{I_1}p_{I_1H}\) from \(I_1 \to H\), a proportion \(p_{I_1}p_{I_1D}\) from \(I_1 \to D_I\), and a proportion \(p_{I_1}\left(1 - p_{I_1H} - p_{I_1D}\right)\) from \(I_1 \to I_2\), where \(p_{I_1} = 1 - e^{-\gamma_{I_1}}\). In this case the first mover function takes a random binomial sample from the \(n\) individuals with probability \(p_{I_1}p_{I_1H}\) as requested, resulting in \(n_{I_1H}\) moves. However, the second mover function then operates on the remaining \(n - n_{I_1H}\) individuals, so its sampling probability must be adjusted to compensate: it needs to sample from the \(n - n_{I_1H}\) individuals with probability \(\frac{p_{I_1}p_{I_1D}}{1 - p_{I_1}p_{I_1H}}\) in order to generate the correct expected proportion of moves, resulting in \(n_{I_1D}\) moves. Similarly, the third mover function operates on the \(n - n_{I_1H} - n_{I_1D}\) remaining individuals, and thus the sampling probability must be adjusted to \(\frac{p_{I_1}\left(1 - p_{I_1H} - p_{I_1D}\right)}{1 - p_{I_1}\left(p_{I_1H} + p_{I_1D}\right)}\). Any individuals that are not moved remain in \(I_1\).
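This adjustment can be verified numerically. The sketch below (hypothetical helper names, not part of the MetaWards API) applies the three conditional fractions in sequence and recovers the intended unconditional proportions:

```python
def sequential_fractions(pI1, pI1H, pI1D):
    """Conditional fractions applied by each successive mover function."""
    f1 = pI1 * pI1H                               # I1 -> H (of all n)
    f2 = pI1 * pI1D / (1.0 - f1)                  # I1 -> DI (of remainder)
    f3 = (pI1 * (1.0 - pI1H - pI1D)
          / (1.0 - pI1 * (pI1H + pI1D)))          # I1 -> I2 (of remainder)
    return f1, f2, f3

def unconditional_proportions(f1, f2, f3):
    """Apply the fractions in turn to a unit mass of individuals."""
    m1 = f1
    m2 = (1.0 - m1) * f2
    m3 = (1.0 - m1 - m2) * f3
    return m1, m2, m3
```

For example, with \(p_{I_1} = 0.4\), \(p_{I_1H} = 0.2\) and \(p_{I_1D} = 0.1\), the sequential scheme moves exactly \(0.4 \times 0.2 = 0.08\) of the class to \(H\), \(0.4 \times 0.1 = 0.04\) to \(D_I\) and \(0.4 \times 0.7 = 0.28\) to \(I_2\), as required.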
from metawards.movers import go_stage
from metawards.utils import Console


def move_pathways(network, **kwargs):

    # extract user defined parameters
    params = network.params

    ## extract number of age-classes
    nage = params.user_params["nage"]

    ## moves out of E class
    pE = []
    pEP = []
    for j in range(nage):
        pE.append(params.user_params[f'pE_{j + 1}'])
        pEP.append(params.user_params[f'pEP_{j + 1}'])

    ## moves out of A class
    pA = []
    for j in range(nage):
        pA.append(params.user_params[f'pA_{j + 1}'])

    ## moves out of P class
    pP = []
    for j in range(nage):
        pP.append(params.user_params[f'pP_{j + 1}'])

    ## moves out of I1 class
    pI1 = []
    pI1H = []
    pI1D = []
    for j in range(nage):
        pI1.append(params.user_params[f'pI1_{j + 1}'])
        pI1H.append(params.user_params[f'pI1H_{j + 1}'])
        pI1D.append(params.user_params[f'pI1D_{j + 1}'])

    ## moves out of I2 class
    pI2 = []
    for j in range(nage):
        pI2.append(params.user_params[f'pI2_{j + 1}'])

    ## moves out of H class
    pH = []
    pHD = []
    for j in range(nage):
        pH.append(params.user_params[f'pH_{j + 1}'])
        pHD.append(params.user_params[f'pHD_{j + 1}'])

    func = []

    ## moves in reverse order through the stages to
    ## ensure correct mapping
    ## NOTE: looping over the age ranges doesn't work unless you use a
    ## default parameter k = j in the lambda - solution derived from:
    ## https://stackoverflow.com/questions/10452770/python-lambdas-binding-to-local-values

    #######################################################
    #########             H MOVES                 #########
    #######################################################

    ## move H genpop to DH genpop
    ## (clamping is a correction in case of rounding error)
    tpHD = []
    for j in range(nage):
        tpHD.append(pH[j] * pHD[j])
        tpHD[j] = 1.0 if tpHD[j] > 1.0 else tpHD[j]
        tpHD[j] = 0.0 if tpHD[j] < 0.0 else tpHD[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="H",
                                                     to_stage="DH",
                                                     fraction=tpHD[k],
                                                     **kwargs))

    ## move H genpop to R genpop
    ## (denominator adjustment is due to operating on remainder
    ## as described in the vignette, also includes correction
    ## in case of rounding error)
    tpHR = []
    for j in range(nage):
        tpHR.append(pH[j] * (1.0 - pHD[j]) / (1.0 - tpHD[j]))
        tpHR[j] = 0.0 if tpHD[j] == 1.0 else tpHR[j]
        tpHR[j] = 1.0 if tpHR[j] > 1.0 else tpHR[j]
        tpHR[j] = 0.0 if tpHR[j] < 0.0 else tpHR[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="H",
                                                     to_stage="RH",
                                                     fraction=tpHR[k],
                                                     **kwargs))

    #######################################################
    #########             I2 MOVES                #########
    #######################################################

    ## move I2 genpop to RI genpop
    tpI2R = []
    for j in range(nage):
        tpI2R.append(pI2[j])
        tpI2R[j] = 1.0 if tpI2R[j] > 1.0 else tpI2R[j]
        tpI2R[j] = 0.0 if tpI2R[j] < 0.0 else tpI2R[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="I2",
                                                     to_stage="RI",
                                                     fraction=tpI2R[k],
                                                     **kwargs))

    #######################################################
    #########             I1 MOVES                #########
    #######################################################

    ## move I1 genpop to H genpop
    tpI1H = []
    for j in range(nage):
        tpI1H.append(pI1[j] * pI1H[j])
        tpI1H[j] = 1.0 if tpI1H[j] > 1.0 else tpI1H[j]
        tpI1H[j] = 0.0 if tpI1H[j] < 0.0 else tpI1H[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="I1",
                                                     to_stage="H",
                                                     fraction=tpI1H[k],
                                                     **kwargs))

    ## move I1 genpop to DI genpop
    ## (denominator adjustment is due to operating on remainder
    ## as described in the vignette, also includes correction
    ## in case of rounding error)
    tpI1D = []
    for j in range(nage):
        tpI1D.append(pI1[j] * pI1D[j] / (1.0 - tpI1H[j]))
        tpI1D[j] = 0.0 if tpI1H[j] == 1.0 else tpI1D[j]
        tpI1D[j] = 1.0 if tpI1D[j] > 1.0 else tpI1D[j]
        tpI1D[j] = 0.0 if tpI1D[j] < 0.0 else tpI1D[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="I1",
                                                     to_stage="DI",
                                                     fraction=tpI1D[k],
                                                     **kwargs))

    ## move I1 genpop to I2 genpop
    ## (denominator adjustment is due to operating on remainder
    ## as described in the vignette, also includes correction
    ## in case of rounding error)
    tpI1I2 = []
    for j in range(nage):
        tpI1I2.append(pI1[j] * (1.0 - pI1H[j] - pI1D[j]) / (1.0 - pI1[j] * (pI1H[j] + pI1D[j])))
        tpI1I2[j] = 0.0 if (pI1[j] * (pI1H[j] + pI1D[j])) == 1.0 else tpI1I2[j]
        tpI1I2[j] = 1.0 if tpI1I2[j] > 1.0 else tpI1I2[j]
        tpI1I2[j] = 0.0 if tpI1I2[j] < 0.0 else tpI1I2[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="I1",
                                                     to_stage="I2",
                                                     fraction=tpI1I2[k],
                                                     **kwargs))

    #######################################################
    #########             P MOVES                 #########
    #######################################################

    ## move P genpop to I1 genpop
    tpPI1 = []
    for j in range(nage):
        tpPI1.append(pP[j])
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="P",
                                                     to_stage="I1",
                                                     fraction=tpPI1[k],
                                                     **kwargs))

    #######################################################
    #########             A MOVES                 #########
    #######################################################

    ## move A genpop to RA genpop
    tpAR = []
    for j in range(nage):
        tpAR.append(pA[j])
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="A",
                                                     to_stage="RA",
                                                     fraction=tpAR[k],
                                                     **kwargs))

    #######################################################
    #########             E MOVES                 #########
    #######################################################

    ## move E genpop to A genpop
    tpEA = []
    for j in range(nage):
        tpEA.append(pE[j] * (1.0 - pEP[j]))
        tpEA[j] = 1.0 if tpEA[j] > 1.0 else tpEA[j]
        tpEA[j] = 0.0 if tpEA[j] < 0.0 else tpEA[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="E",
                                                     to_stage="A",
                                                     fraction=tpEA[k],
                                                     **kwargs))

    ## move E genpop to P genpop
    ## (denominator adjustment is due to operating on remainder
    ## as described in the vignette, also includes correction
    ## in case of rounding error)
    tpEP = []
    for j in range(nage):
        tpEP.append(pE[j] * pEP[j] / (1.0 - tpEA[j]))
        tpEP[j] = 0.0 if tpEA[j] == 1.0 else tpEP[j]
        tpEP[j] = 1.0 if tpEP[j] > 1.0 else tpEP[j]
        tpEP[j] = 0.0 if tpEP[j] < 0.0 else tpEP[j]
        func.append(lambda k = j, **kwargs: go_stage(go_from=f'age{k + 1}',
                                                     go_to=f'age{k + 1}',
                                                     from_stage="E",
                                                     to_stage="P",
                                                     fraction=tpEP[k],
                                                     **kwargs))

    return func
We will use an interaction matrix to input the contact matrices that scale the force-of-infection (FOI) from each age-class to the other age-classes (see here). When using interaction matrices in this way, the MetaWards beta
parameters correspond to the probability of transmission given an infected contact, \(\nu\), and for asymptomatics we scale beta
by \(0 < \nu_A < 1\). Full details are given here.
The mix_pathways.py
file contains the custom mixer code. Note that we pass in the relative paths to the contact matrix filenames through the user inputs (see here), and we start with the POLYMOD contact matrix and then switch to the CoMix contact matrix after the first lockdown. The merge_matrix_multi_population
merge function is required to get the correct frequency-dependent mixing (see MetaWards tutorial section here).
from metawards.mixers import merge_matrix_multi_population
from metawards.utils import Console
from datetime import datetime

# use caching to store matrix read in from filename
file_cache = {}


# function to read matrix from file and/or return from cache
def read_file(filename, nage):
    global file_cache
    if filename not in file_cache:
        with open(filename, "r") as FILE:
            contact_matrix = [[num for num in line.split(',')] for line in FILE]
        # set up interaction matrix
        matrix = [[0.0 for i in range(nage)] for j in range(nage)]
        for i in range(nage):
            for j in range(nage):
                matrix[i][j] = float(contact_matrix[i][j])
        Console.debug("Interaction matrix", variables = [matrix])
        file_cache[filename] = matrix
    return file_cache[filename]


# determine the lock-down status based on the population and current network
def get_lock_down_status(network, population):
    date = population.date
    params = network.params.user_params

    y1 = int(params["lockdown_date_1_year"])
    m1 = int(params["lockdown_date_1_month"])
    d1 = int(params["lockdown_date_1_day"])

    y2 = int(params["lockdown_date_2_year"])
    m2 = int(params["lockdown_date_2_month"])
    d2 = int(params["lockdown_date_2_day"])

    # Lock down dates
    lock_1 = datetime(y1, m1, d1).date()
    lock_2 = datetime(y2, m2, d2).date()

    state = 0
    if date >= lock_1:
        state += 1
    if date >= lock_2:
        state += 1
    return state


# mixing matrix
def mix_pathways(network, population, **kwargs):

    # extract user parameters
    params = network.params

    # get lockdown status
    state = get_lock_down_status(network = network, population = population)

    # extract contact matrix and scaling information
    nage = params.user_params["nage"]
    if state == 0:
        contact_matrix_filename = params.user_params["contact_matrix1_filename"]
    else:
        contact_matrix_filename = params.user_params["contact_matrix2_filename"]

    # set up interaction matrix in cache or read from cache
    matrix = read_file(contact_matrix_filename, nage)

    network.demographics.interaction_matrix = matrix

    return [merge_matrix_multi_population]
Lockdown scales the FOI terms over different time periods, each of which represents a different stage of interventions. Furthermore, during full lockdown we switch off all worker movements. When partial relaxation of lockdown is introduced, we allow worker movements to begin again, but still scale the FOI (albeit less stringently than during full lockdown). We store this information in a custom iterator function, with the corresponding parameters in inputs/user_inputs.txt.
See here for full iterator function.
We also have a custom extractor function that saves the outputs as a compressed SQL database for each age-class, called age*.db.bz2 (where * varies by age-class). Each database for a given run contains a single table called compact. For efficiency we:
This also means that LADs and/or age-classes that are not infected do not return any results.
This trick hugely reduces the size of the output data, but means that we have to do some post-processing in order to extract quantities of interest. Since the data are stored as an SQL database, we can either query the database directly, or use some of the tools in e.g. R (see below) to interface with it and to reconstruct time-series counts if required.
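For example, a cumulative time-series for a single LAD can be reconstructed directly from the daily incidence with Python's built-in sqlite3 module (a sketch: cumulative_counts is a hypothetical helper, and it assumes the relevant database has already been decompressed from its .bz2 archive):

```python
import sqlite3

def cumulative_counts(db_path, lad, column="Einc"):
    """Sum a daily incidence column from the `compact` table to recover
    a cumulative time-series for one LAD."""
    assert column.isidentifier()  # column names can't be bound as parameters
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        f"select day, {column} from compact where LAD = ? order by day",
        (lad,)).fetchall()
    conn.close()
    out, total = [], 0
    for day, inc in rows:
        total += inc
        out.append((day, total))
    return out
```

Days with no rows for a given LAD simply contribute nothing to the running total, which is consistent with the sparse storage scheme (absent rows mean zero incidence).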
So, to clarify, the age*.db databases each contain a table called compact with entries: day, LAD, Einc, Pinc, I1inc, I2inc, RIinc, DIinc, Ainc, RAinc, Hinc, RHinc, DHinc.
We save this custom extractor
function in the file extractor.py
.
from metawards.extractors import extract_default
from metawards import Networks, OutputFiles, Population, Workspace
from metawards.utils import Console
from copy import deepcopy

_zero_crossings = {}


def create_tables(network: Networks):
    ## return a function that creates the tables
    ## for the specified number of disease classes
    ## Primary key needs to be composite here

    ## set specific column names
    col_names = ["Einc", "Pinc", "I1inc", "I2inc", "RIinc", "DIinc", "Ainc", "RAinc", "Hinc", "RHinc", "DHinc"]
    col_names = [f"{i} int" for i in col_names]
    col_str = ','.join(col_names)

    ## set specific column names
    col_names_ini = ["E", "P", "I1", "I2", "RI", "DI", "A", "RA", "H", "RH", "DH"]
    col_names_ini = [f"{i} int" for i in col_names_ini]
    col_str_ini = ','.join(col_names_ini)

    def initialise(conn):
        c = conn.cursor()
        c.execute(f"create table compact(day int not null, LAD int not null, {col_str}, "
                  f"primary key (day, LAD))")
        c.execute(f"create table compact_ini(day int not null, LAD int not null, {col_str_ini}, "
                  f"primary key (day, LAD))")
        conn.commit()

    return initialise


def output_db(population: Population, network: Networks,
              workspace: Workspace, output_dir: OutputFiles, **kwargs):
    Console.print(f"Calling output_db for a {network.__class__} object")

    ## extract number of age-classes
    nage = network.params.user_params["nage"]

    ## setup marker for previous day
    for i, subnet in enumerate(network.subnets):
        ## if first day, then create a copy of the ward data
        ## this should produce incidences of zero on day 1
        if not hasattr(workspace.subspaces[i], "output_previous"):
            workspace.subspaces[i].output_previous = deepcopy(workspace.subspaces[i].ward_inf_tot)

    if population.day == 1:
        ## open connection to databases
        conn = []
        c = []
        for i in range(nage):
            conn.append(output_dir.open_db(f"age{i + 1}.db", initialise = create_tables(network)))
            c.append(conn[i].cursor())

            ## get data
            ward_inf_tot = workspace.subspaces[i].ward_inf_tot

            ## extract initial counts
            E = ward_inf_tot[0]
            P = ward_inf_tot[1]
            I1 = ward_inf_tot[2]
            I2 = ward_inf_tot[3]
            RI = ward_inf_tot[4]
            DI = ward_inf_tot[5]
            A = ward_inf_tot[6]
            RA = ward_inf_tot[7]
            H = ward_inf_tot[8]
            RH = ward_inf_tot[9]
            DH = ward_inf_tot[10]

            ## loop over wards and write to file
            wards = range(0, workspace.subspaces[0].nnodes + 1)
            day = [population.day] * len(wards)

            ## set column names
            col_names = ["day", "LAD", "E", "P", "I1", "I2", "RI", "DI", "A", "RA", "H", "RH", "DH"]
            col_str = ','.join(col_names)

            ## write to file
            for day, ward, E, P, I1, I2, RI, DI, A, RA, H, RH, DH in\
                zip(day, wards, E, P, I1, I2, RI, DI, A, RA,\
                H, RH, DH):
                if ward not in _zero_crossings:
                    _zero_crossings[ward] = False

                ## set up list of changes
                val = [day, ward, E, P, I1, I2, RI, DI, A, RA, H, RH, DH]
                keeps_str = ",".join([str(v) for v in val])
                qstring = f"insert into compact_ini ({col_str}) values ({keeps_str}) "

                ## try to fudge a marker for first infections
                if any([v > 0 for v in val[2:]]) and _zero_crossings[ward] is False and ward != 0:
                    _zero_crossings[ward] = True
                    Console.print(f"Got first infection in LAD {ward}")

                ## check for any changes in ward
                if _zero_crossings[ward] is True and any([v > 0 for v in val[2:]]):
                    c[i].execute(qstring)
            conn[i].commit()

        ## save today's data so that it can be used tomorrow
        for i, subnet in enumerate(network.subnets):
            workspace.subspaces[i].output_previous = deepcopy(workspace.subspaces[i].ward_inf_tot)
    else:
        ## open connection to databases
        conn = []
        c = []
        for i in range(nage):
            conn.append(output_dir.open_db(f"age{i + 1}.db", initialise = create_tables(network)))
            c.append(conn[i].cursor())

        ## NEED TO DO FOLLOWING CALCULATIONS IN ORDER

        ## loop over age classes
        for j in range(nage):
            ## get data
            ward_inf_previous = workspace.subspaces[j].output_previous
            ward_inf_tot = workspace.subspaces[j].ward_inf_tot

            ## ASYMPTOMATICS
            ## calculate incidence
            RAprime = []
            for old, new in zip(ward_inf_previous[7], ward_inf_tot[7]):
                RAprime.append(new - old)
            Aprime = []
            for old, new, Rinc in zip(ward_inf_previous[6], ward_inf_tot[6], RAprime):
                Aprime.append(new - old + Rinc)

            ## HOSPITAL
            ## calculate incidence
            RHprime = []
            for old, new in zip(ward_inf_previous[9], ward_inf_tot[9]):
                RHprime.append(new - old)
            DHprime = []
            for old, new in zip(ward_inf_previous[10], ward_inf_tot[10]):
                DHprime.append(new - old)
            Hprime = []
            for old, new, Rinc, Dinc in zip(ward_inf_previous[8], ward_inf_tot[8], RHprime, DHprime):
                Hprime.append(new - old + Rinc + Dinc)

            ## GENPOP
            ## calculate incidence
            RIprime = []
            for old, new in zip(ward_inf_previous[4], ward_inf_tot[4]):
                RIprime.append(new - old)
            DIprime = []
            for old, new in zip(ward_inf_previous[5], ward_inf_tot[5]):
                DIprime.append(new - old)
            I2prime = []
            for old, new, Rinc in zip(ward_inf_previous[3], ward_inf_tot[3], RIprime):
                I2prime.append(new - old + Rinc)
            I1prime = []
            for old, new, I2inc, Dinc, Hinc in zip(ward_inf_previous[2], ward_inf_tot[2], I2prime, DIprime, Hprime):
                I1prime.append(new - old + I2inc + Dinc + Hinc)
            Pprime = []
            for old, new, I1inc in zip(ward_inf_previous[1], ward_inf_tot[1], I1prime):
                Pprime.append(new - old + I1inc)
            Eprime = []
            for old, new, Pinc, Ainc in zip(ward_inf_previous[0], ward_inf_tot[0], Pprime, Aprime):
                Eprime.append(new - old + Pinc + Ainc)

            ## loop over wards and write to file
            wards = range(0, workspace.subspaces[0].nnodes + 1)
            day = [population.day] * len(wards)

            ## set column names
            col_names = ["day", "LAD", "Einc", "Pinc", "I1inc", "I2inc", "RIinc", "DIinc", "Ainc", "RAinc", "Hinc", "RHinc", "DHinc"]
            col_str = ','.join(col_names)

            ## write to file
            for day, ward, Einc, Pinc, I1inc, I2inc, RIinc, DIinc, Ainc, RAinc, Hinc, RHinc, DHinc in\
                zip(day, wards, Eprime, Pprime, I1prime, I2prime, RIprime, DIprime, Aprime, RAprime,\
                Hprime, RHprime, DHprime):
                if ward not in _zero_crossings:
                    _zero_crossings[ward] = False

                ## try to fudge a marker for first infections
                if Einc != 0 and _zero_crossings[ward] is False and ward != 0:
                    _zero_crossings[ward] = True
                    Console.print(f"Got first infection in LAD {ward}")

                ## set up list of changes
                val = [day, ward, Einc, Pinc, I1inc, I2inc, RIinc, DIinc, Ainc, RAinc, Hinc, RHinc, DHinc]
                keeps_str = ",".join([str(v) for v in val])
                qstring = f"insert into compact ({col_str}) values ({keeps_str}) "

                ## check for any changes in ward
                if _zero_crossings[ward] is True and any([v > 0 for v in val[2:]]):
                    c[j].execute(qstring)
            conn[j].commit()

        ## save today's data so that it can be used tomorrow
        for i, subnet in enumerate(network.subnets):
            workspace.subspaces[i].output_previous = deepcopy(workspace.subspaces[i].ward_inf_tot)


def extract_db(**kwargs):
    funcs = []
    funcs.append(output_db)
    return funcs
To run designs, we need to generate a disease.csv
file containing different parameters to use for different runs. For consistency, we will define three spaces:
The input and design spaces are fairly trivial to convert between, but some more work has to be done to convert between the input space and the disease space. Thus we have parameter ranges as described above.
In R we can set up the input parameter ranges as follows:
## set up parameter ranges for uniform ranges
parRanges <- data.frame(
    parameter = c("R0", "TE", "TP", "TI1", "TI2",
                  "nuA", "lock_1_restrict", "lock_2_release", "p_home_weekend"),
    lower = c(2, 0.1, 1.2, 2.8, 0.0001, 0, 0, 0, 0),
    upper = c(4.5, 2, 3, 4.5, 0.5, 1, 1, 1, 1),
    stringsAsFactors = FALSE
)
Firstly we want a function to convert between the design and input spaces. A short R function called convertDesignToInput()
which does this is given below and can be found in the R_tools/dataTools.R
script file. This requires a design
data frame with columns denoting each input parameter in parRanges
and rows corresponding to design points. There should be an additional column called output
that defines a unique identifier for each design point, and a column called repeats
that contains the number of repeats for each design point. The convertDesignToInput()
function also requires the parRanges
data frame (defined above). We use the scale
argument to define whether the design is on the \((0, 1)\) (scale = "zero_one"
) or \((-1, 1)\) (scale = "negone_one"
) space.
## function to convert from design to input space
convertDesignToInput <- function(design, parRanges, scale = c("zero_one", "negone_one")) {
    
    require(tidyverse)
    
    ## convert from design space to input space
    input <- mutate(design, ind = 1:n()) %>%
        gather(parameter, value, -ind) %>%
        left_join(parRanges, by = "parameter")
    if(scale[1] == "negone_one") {
        input <- mutate(input, value = value / 2 + 1 / 2)
    }
    input <- mutate(input, value = value * (upper - lower) + lower) %>%
        dplyr::select(ind, parameter, value) %>%
        spread(parameter, value) %>%
        arrange(ind) %>%
        dplyr::select(-ind)
    
    ## return inputs
    input
}
Next we have to generate design points for the age-dependent parameters using the process described above. To do this we have generated two finite mixture models, stored in the files inputs/pathways.rds and inputs/hospStays.rds, and we use these to generate sets of design points, which we sub-sample using a maximin space-filling design. This approach is implemented in the FMMmaximin() function shown below. This takes an mclust model object containing the finite mixture model, a required number of design points (nsamp), and a number of points to pass to the maximin sampler (nseed). In some cases it is also useful to pass a limitFn argument to FMMmaximin(), which provides a function that assesses whether a set of inputs will convert to valid probabilities for MetaWards. We do not need this for hospital stay lengths, since these are rates that by design are always \(> 0\). However, for the pathway relationships we have a log-linear relationship between age and the probabilities of transition, and for some inputs this may result in probabilities that are \(> 1\). Passing a limitFn argument allows us to remove these inputs from the point cloud that the maximin design is taken over, ensuring valid inputs.
## function to generate maximin samples given an arbitrary FMM
FMMmaximin <- function(model, nsamp, limits, limitFn = NULL, nseed = 10000, ...) {
    
    ## check inputs and dependencies
    require(mclust)
    require(fields)
    stopifnot(!missing(model) & !missing(nsamp) & !missing(limits))
    stopifnot(class(model)[1] == "densityMclust")
    stopifnot(is.numeric(nsamp) & round(nsamp) == nsamp & nsamp > 1)
    stopifnot(is.numeric(nseed) & round(nseed) == nseed & nseed > 1 & nsamp < nseed)
    stopifnot(is.matrix(limits))
    stopifnot(ncol(limits) == 2 & nrow(limits) == nrow(model$parameters$mean))
    stopifnot(all(limits[, 1] < limits[, 2]))
    if(!is.null(limitFn)) {
        stopifnot(class(limitFn) == "function")
        stopifnot(formalArgs(limitFn)[1] == "x")
        cat("When using 'limitFn' ensure that function takes matrix as first argument 'x'
            and returns vector of logicals for inclusion of length 'nrow(x)'\n")
    }
    
    ## produce large number of samples from model and ensure they are
    ## consistent with limits
    sims <- matrix(NA, 1, 1)
    while(nrow(sims) < nseed) {
        ## sample from FMM
        sims1 <- sim(model$modelName, model$parameters, nseed)[, -1]
        
        ## check against limits
        for(i in 1:nrow(limits)) {
            sims1 <- sims1[sims1[, i] >= limits[i, 1], ]
            sims1 <- sims1[sims1[, i] <= limits[i, 2], ]
        }
        if(!is.null(limitFn)) {
            sims1 <- sims1[limitFn(sims1, ...), ]
        }
        if(nrow(sims1) > 0) {
            if(nrow(sims) == 1) {
                sims <- sims1
            } else {
                sims <- rbind(sims, sims1)
            }
        }
    }
    sims <- sims[1:nseed, ]
    
    ## rescale
    simsnorm <- apply(sims, 2, function(x) (x - min(x)) / diff(range(x)))
    
    ## sample initial choice at random
    simind <- sample.int(nrow(sims), 1)
    simrem <- c(1:nrow(sims))[-simind]
    
    ## choose other points using maximin
    while(length(simind) < nsamp) {
        dists <- rdist(simsnorm[simind, , drop = FALSE], simsnorm[-simind, , drop = FALSE])
        dists <- apply(dists, 2, min)
        dists <- which(dists == max(dists))
        simind <- c(simind, simrem[dists[1]])
        simrem <- simrem[-dists[1]]
    }
    
    ## return design
    sims[simind, ]
}
Note that we do not have to run the convertDesignToInput()
function on these design points, since they are already in the input space.
Once we have generated design points in this way, we need to transform from the input space to the disease space for MetaWards. A convertInputToDisease()
R function is given below and can be found in the R_tools/dataTools.R
script file. This requires an input
data frame, with columns denoting each input parameter (R0
, TE
, TP
, TI1
, TI2
, alphaEP
, alphaI1H
, alphaI1D
, alphaHD
, eta
, alphaTH
, etaTH
, nuA
, lock_1_restrict
, lock_2_release
) and rows corresponding to each input point, a number of repeats
and a column of unique identifiers (output
).
## function to convert from input to disease space
<- function(input, C, N, S0, ages) {
convertInputToDisease
require(tidyverse)
require(magrittr)
stopifnot(all(c("R0", "nuA", "TE", "TP", "TI1", "TI2", "alphaEP",
"alphaI1H", "alphaI1D", "alphaHD", "eta",
"alphaTH", "etaTH", "lock_1_restrict",
"lock_2_release", "p_home_weekend", "repeats", "output") %in% colnames(input)))
## check unique ID
stopifnot(length(unique(input$output)) == length(input$output))
## scaling for asymptomatics
<- select(input, nuA, output)
disease
## progressions out of the E class
for(j in 1:length(ages)) {
<- mutate(disease, ".pE_{j}" := 1 - exp(-1 / input$TE))
disease
}<- mutate(disease,
disease temp = map2(input$alphaEP, input$eta, function(alpha, eta, ages) {
exp(alpha + eta * ages) %>%
matrix(nrow = 1) %>%
set_colnames(paste0(".pEP_", 1:length(ages))) %>%
as_tibble()
ages = ages)) %>%
}, unnest(cols = temp)
## progressions out of the A class
for(j in 1:length(ages)) {
<- mutate(disease, ".pA_{j}" := 1 - exp(-1 / (input$TP + input$TI1 + input$TI2)))
disease
}
## progressions out of the P class
for(j in 1:length(ages)) {
<- mutate(disease, ".pP_{j}" := 1 - exp(-1 / input$TP))
disease
}
## progressions out of the I1 class
for(j in 1:length(ages)) {
<- mutate(disease, ".pI1_{j}" := 1 - exp(-1 / input$TI1))
disease
}<- mutate(disease,
disease temp = map2(input$alphaI1H, input$eta, function(alpha, eta, ages) {
exp(alpha + eta * ages) %>%
matrix(nrow = 1) %>%
set_colnames(paste0(".pI1H_", 1:length(ages))) %>%
as_tibble()
ages = ages)) %>%
}, unnest(cols = temp)
<- mutate(disease,
disease temp = map2(input$alphaI1D, input$eta, function(alpha, eta, ages) {
exp(alpha + eta * ages) %>%
matrix(nrow = 1) %>%
set_colnames(paste0(".pI1D_", 1:length(ages))) %>%
as_tibble()
ages = ages)) %>%
}, unnest(cols = temp)
## check for multinomial validity
<- select(disease, starts_with(".pI1")) %>%
temp select(-starts_with(".pI1_")) %>%
mutate(ind = 1:n()) %>%
gather(par, prob, -ind) %>%
separate(par, c("par", "age"), sep = "_") %>%
group_by(age, ind) %>%
summarise(prob = sum(prob)) %>%
pluck("prob")
if(any(temp < 0) | any(temp > 1)) {
stop("Some multinomial pI1* probs invalid")
}
## progressions out of the I2 class
for(j in 1:length(ages)) {
<- mutate(disease, ".pI2_{j}" := 1 - exp(-1 / input$TI2))
disease
}
## progressions out of the H class
<- mutate(disease,
disease temp = map2(input$alphaTH, input$etaTH, function(alpha, eta, ages) {
exp(alpha + eta * ages) %>%
1 - exp(-1 / .)} %>%
{matrix(nrow = 1) %>%
set_colnames(paste0(".pH_", 1:length(ages))) %>%
as_tibble()
ages = ages)) %>%
}, unnest(cols = temp)
<- mutate(disease,
disease temp = map2(input$alphaHD, input$eta, function(alpha, eta, ages) {
exp(alpha + eta * ages) %>%
matrix(nrow = 1) %>%
set_colnames(paste0(".pHD_", 1:length(ages))) %>%
as_tibble()
ages = ages)) %>%
}, unnest(cols = temp)
## lockdown scalings
$`.lock_1_restrict` <- input$lock_1_restrict
disease$`.lock_2_release` <- input$lock_2_release
disease
## add weekend movement scaling
$`.p_home_weekend` <- input$p_home_weekend
disease
## remove any invalid inputs
stopifnot(any(disease$nuA >= 0 | disease$nuA <= 1))
stopifnot(
select(disease, starts_with(".p")) %>%
summarise_all(~{all(. >= 0 & . <= 1)}) %>%
apply(1, all)
)
## set up data for calculating transmission parameter
<- select(disease, output, starts_with(".pEP"), starts_with(".pI1H"), starts_with(".pI1D")) %>%
temp gather(prob, value, -output) %>%
separate(prob, c("prob", "age"), sep = "_") %>%
spread(age, value) %>%
group_by(prob, output) %>%
nest() %>%
ungroup() %>%
spread(prob, data) %>%
mutate_at(vars(-output), ~purrr::map(., as.data.frame)) %>%
mutate_at(vars(-output), ~purrr::map(., unlist))
colnames(temp) <- gsub("\\.", "", colnames(temp))
<- select(input, output, R0, nuA, TE, TP, TI1, TI2) %>%
temp inner_join(temp, by = "output")
temp <- mutate(temp, nu = pmap_dbl(as.list(temp)[-1], function(R0, nuA, TE, TP, TI1, TI2, pEP, pI1H, pI1D, C, N, S0) {
    ## transformations
    gammaE <- rep(1 / TE, length(N))
    gammaP <- rep(1 / TP, length(N))
    gammaI1 <- rep(1 / TI1, length(N))
    gammaI2 <- rep(1 / TI2, length(N))
    gammaA <- rep(1 / (TP + TI1 + TI2), length(N))

    ## calculate nu from NGM
    NGM(R0 = R0, nu = NA, C, S0, N, nuA, gammaE, pEP, gammaA, gammaP, gammaI1,
        pI1H, pI1D, gammaI2)$nu
  }, C = C, N = N, S0 = S0))
disease <- inner_join(disease, select(temp, nu, output), by = "output")
## checks on nu
stopifnot(all(disease$nu > 0 & disease$nu < 1))
## finalise data set
disease <- mutate(disease, nuA = nuA * nu) %>%
  mutate(`beta[1]` = nu) %>%
  mutate(`beta[2]` = nu) %>%
  rename(`beta[3]` = nu) %>%
  rename(`beta[6]` = nuA) %>%
  inner_join(select(input, repeats, output), by = "output")
stopifnot(all(disease$`beta[6]` < disease$`beta[1]`))
print(paste0(nrow(input) - nrow(disease), " invalid inputs removed"))
print(paste0(nrow(disease), " samples remaining"))

## return disease file
disease
}
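Several of the pathway probabilities above take the form exp(alpha + eta * age), so that risk scales with age on the log scale. As a quick illustration of how these behave (the alpha and eta values below are made up for illustration, not calibrated ones):

```r
## illustrative (uncalibrated) values for an age-dependent
## probability of the form exp(alpha + eta * age)
ages <- c(2.5, 11, 23.5, 34.5, 44.5, 55.5, 65.5, 75.5)
alpha <- -5
eta <- 0.05
p <- exp(alpha + eta * ages)
round(p, 3)
## probabilities increase with age, and must remain in [0, 1]:
## this is exactly the condition the stopifnot() checks guard against
stopifnot(all(p >= 0 & p <= 1))
```

Because exp() is monotonic, a positive eta always gives probabilities that increase with age; the validity checks are needed because large alpha or eta can push the values above one.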
Also in R_tools/dataTools.R
is a function ensembleIDGen()
that creates unique IDs for each design point. An example of a quick LHS design, with a single replicate per design point, is given by the convertDesign.R
script file:
## load libraries
library(lhs)
library(mclust)
library(tidyverse)
## source dataTools
source("R_tools/dataTools.R")
## set up parameter ranges for uniform ranges
parRanges <- data.frame(
  parameter = c("R0", "TE", "TP", "TI1", "TI2",
                "nuA", "lock_1_restrict", "lock_2_release", "p_home_weekend"),
  lower = c(2, 0.1, 1.2, 2.8, 0.0001, 0, 0, 0, 0),
  upper = c(4.5, 2, 3, 4.5, 0.5, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)
## read in contact matrix to use for NGM
C <- as.matrix(read.csv("inputs/POLYMOD_matrix.csv", header = FALSE))
## generate LHS design
ndesign <- 100
design <- randomLHS(ndesign, nrow(parRanges))
colnames(design) <- parRanges$parameter
design <- as_tibble(design)
## convert to input space
inputs <- convertDesignToInput(design, parRanges, "zero_one")
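convertDesignToInput() (in R_tools/dataTools.R) maps the [0, 1] LHS columns onto the ranges in parRanges. The core of that mapping amounts to a simple linear rescaling, sketched here for clarity (scaleToRange() is a hypothetical helper for illustration, not a function from dataTools.R):

```r
## hypothetical helper illustrating the [0, 1] -> [lower, upper] rescaling
scaleToRange <- function(u, lower, upper) {
  lower + u * (upper - lower)
}

## e.g. a unit-scale draw of 0.5 for R0 (range [2, 4.5]) maps to the midpoint
scaleToRange(0.5, 2, 4.5)   # 3.25
```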
## generate space-filling design for other parameters
## load FMM objects
hospStays <- readRDS("inputs/hospStays.rds")
pathways <- readRDS("inputs/pathways.rds")
## generate design points for hospital stay lengths
hospStaysInput <- FMMmaximin(
    hospStays,
    ndesign,
    matrix(c(-Inf, Inf, 0, Inf), ncol = 2, byrow = TRUE)
  ) %>%
  as_tibble() %>%
  rename(alphaTH = x1, etaTH = x2)
## generate design points for other transition probabilities
## this function checks validity of inputs
pathwaysLimitFn <- function(x, ages) {
  singleProbs <- apply(x, 1, function(x, ages) {
    eta <- x[5]
    alphas <- x[-5]
    y <- sapply(alphas, function(a, eta, ages) {
      y <- exp(a + eta * ages)
      all(y >= 0 & y <= 1)
    }, eta = eta, ages = ages)
    all(y)
  }, ages = ages)
  multiProbs <- apply(x[, -c(1, 3)], 1, function(x, ages) {
    alphaI1D <- x[1]
    alphaI1H <- x[2]
    eta <- x[3]
    pI1D <- exp(alphaI1D + eta * ages)
    pI1H <- exp(alphaI1H + eta * ages)
    p <- pI1D + pI1H
    all(p >= 0 & p <= 1)
  }, ages = ages)
  multiProbs & singleProbs
}
## produces design points subject to constraints
pathwaysInput <- FMMmaximin(
    pathways,
    ndesign,
    matrix(c(rep(c(-20, 0), times = 4), 0, 1), ncol = 2, byrow = TRUE),
    pathwaysLimitFn,
    ages = c(2.5, 11, 23.5, 34.5, 44.5, 55.5, 65.5, 75.5)
  ) %>%
  as_tibble() %>%
  rename(alphaEP = x1, alphaI1D = x2, alphaHD = x3, alphaI1H = x4, eta = x5)
## bind to design
inputs <- cbind(inputs, hospStaysInput, pathwaysInput)
## add unique hash identifier
## (at the moment don't use "a0" type ensembleID, because MetaWards
## parses to dates)
inputs$output <- ensembleIDGen(ensembleID = "Ens0", nrow(inputs))
inputs$repeats <- 1
## solution to round numbers preserving sum
## adapted from:
## https://stackoverflow.com/questions/32544646/round-vector-of-numerics-to-integer-while-preserving-their-sum
smart_round <- function(x) {
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}
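For example, naive flooring of the vector below loses one unit from the total, whereas smart_round() allocates the leftover units to the entries with the largest fractional parts, so the rounded vector still sums to the rounded total:

```r
## using smart_round() as defined above
x <- c(1.4, 1.4, 1.2)   # sums to 4, but floor(x) gives 1 + 1 + 1 = 3
y <- smart_round(x)
y        # one entry with the largest remainder is bumped up by 1
sum(y)   # 4, matching round(sum(x))
```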
## set up number of initial individuals in each age-class
N <- smart_round(read_csv("inputs/age_seeds.csv", col_names = FALSE)$X2 * 56082077)
S0 <- N - smart_round(read_csv("inputs/age_seeds.csv", col_names = FALSE)$X2 * 100)
ages <- c(2.5, 11, 23.5, 34.5, 44.5, 55.5, 65.5, 75.5)
ages
## convert input to disease
disease <- convertInputToDisease(inputs, C, N, S0, ages)
stopifnot(nrow(disease) == ndesign)
# ## plot inputs
# library(GGally)
# p <- select(inputs, -output, -repeats) %>%
# ggpairs(upper = "blank")
# ggsave("design.pdf", p, width = 10, height = 10)
## reorder samples
inputs <- arrange(inputs, output)
disease <- arrange(disease, output)
disease
## write text file for MetaWards
write.table(disease, "inputs/disease.dat", row.names = FALSE, sep = " ", quote = FALSE)
## save inputs for data for post-simulation runs
saveRDS(inputs, "inputs/inputs.rds")
saveRDS(parRanges, "inputs/parRanges.rds")
This produces a file inputs/disease.dat
that can be passed to MetaWards to run the model. The bash
script below provides the command line instructions needed to the model using these inputs for the start of the epidemic to the first lockdown. If you don’t run Linux, then the file should give you an idea of how to run the model on your own system.
#!/bin/bash
## set up outputs folder
mkdir -p raw_outputs
## run design code
R CMD BATCH --no-restore --slave --no-save convertDesign.R
# ## run seeding code
# cd data/seedDeaths
# R CMD BATCH --no-restore --slave --no-save '--args 2020-02-01 2020-03-11' seedDeaths.R
# cd ../..
## run seeding code
R CMD BATCH --no-restore --slave --no-save '--args 2020-02-01 2020-03-06' R_tools/simulateSeedStates.R
## set path to MetaWardsData repository
export METAWARDSDATA=$HOME/Documents/covid/MetaWardsData
## run model
metawards -d model_code/ncov.json -m 2019LADData\
-D model_code/demographics.json\
--mixer model_code/mix_pathways\
--mover model_code/move_pathways\
--input inputs/disease.dat\
--iterator model_code/iterator\
-u inputs/user_inputs.txt -o raw_outputs --force-overwrite-output \
--extractor model_code/extractor\
--start-date 2020/03/06 --theme simple --nsteps 15
As described above, each model run produces a series of files called age1.db.bz2
, age2.db.bz2
etc., which are compressed databases containing the outputs for each age-class. In practice, it is time-consuming to unzip and extract all of these outputs for all design points, and extract and collate the various quantities we need for our emulation and history matching process. To this end we have been kindly offered a processing and data storage solution on JASMIN.
The simulation outputs from our model are stored at the public repository:
https://gws-access.jasmin.ac.uk/public/covid19/
These are stored in the format:
wave
+-- inputs
+-- raw_outputs
| +-- EnsID
| | +-- age*.db
| | +-- weeksums_*.csv
| +-- EnsIDx002
| | +-- age*.db
| | +-- weeksums_*.csv
| +-- EnsIDx003
| | +-- age*.db
| | +-- weeksums_*.csv
| +-- ...
The age*.db
files are the unzipped raw outputs for that design point (described above). The weeksums_*.csv
files are weekly summaries of mean hospital prevalence (Hprev
column) and total deaths (Deaths
column) on a weekly basis for each LAD and week within each age-class. For a given ensemble ID (Ens0000
say), replicates are appended with the replicate number e.g. Ens0000x001
, Ens0000x002
and so on.
If you’ve run the model yourselves, or want to use the data differently, then as an example of how to manipulate the raw outputs, migrate to an output folder containing e.g. age3.db.bz2
(or download one from the repo above). If this is a zipped file, then you will have to unzip it. In Linux this can be done on the command line e.g.
bzip2 -dkf age3.db.bz2
You will notice that the unzipped age3.db
file is larger than the compressed version. As such, you might need to remove age3.db
at the end if you have limited hard drive space (note that the -k flag in the bzip2 -dkf
call above ensures that the original compressed file is not deleted during decompression).
The database contains a single table called compact
. To store the outputs efficiently, we have introduced various tricks, described above. To clarify, each age*.db
database contains a table called compact
with entries:
day, LAD, Einc, Pinc, I1inc, I2inc, RIinc, DIinc, Ainc, RAinc, Hinc, RHinc, DHinc
If you’re happy with SQL, you can query these directly with e.g. SQLite. If you are an R user, then the dplyr
package (or more specifically the dbplyr
package) provides some useful R tools for querying SQL databases using tidyverse
-type notation. More details about these tools can be found here.
If required, the script R_tools/dataTools.R
provides a function called reconstruct()
that is written using Rcpp
and can be used to reconstruct the time-series counts from the incidence data.
## write Rcpp function to reconstruct counts from incidence
## first element of Einc etc. must be initial states and the following
## must be the incidence at a given time
library(Rcpp)
cppFunction('IntegerMatrix reconstruct(IntegerVector Einc, IntegerVector Pinc,
IntegerVector I1inc, IntegerVector I2inc, IntegerVector RIinc, IntegerVector DIinc,
IntegerVector Ainc, IntegerVector RAinc,
IntegerVector Hinc, IntegerVector RHinc, IntegerVector DHinc) {
// extract sizes
int n = Einc.size();
// set up output matrix
IntegerMatrix output(n, 17);
// set up initial counts
output(0, 0) = 0;
output(0, 1) = Einc[0];
output(0, 2) = 0;
output(0, 3) = Pinc[0];
output(0, 4) = 0;
output(0, 5) = I1inc[0];
output(0, 6) = 0;
output(0, 7) = I2inc[0];
output(0, 8) = RIinc[0];
output(0, 9) = DIinc[0];
output(0, 10) = 0;
output(0, 11) = Ainc[0];
output(0, 12) = RAinc[0];
output(0, 13) = 0;
output(0, 14) = Hinc[0];
output(0, 15) = RHinc[0];
output(0, 16) = DHinc[0];
int E = Einc[0];
int P = Pinc[0];
int I1 = I1inc[0];
int I2 = I2inc[0];
int RI = RIinc[0];
int DI = DIinc[0];
int A = Ainc[0];
int RA = RAinc[0];
int H = Hinc[0];
int RH = RHinc[0];
int DH = DHinc[0];
// reconstruct counts
for(int i = 1; i < n; i++) {
E += Einc[i] - Pinc[i] - Ainc[i];
output(i, 0) = Einc[i];
output(i, 1) = E;
P += Pinc[i] - I1inc[i];
output(i, 2) = Pinc[i];
output(i, 3) = P;
I1 += I1inc[i] - I2inc[i] - Hinc[i] - DIinc[i];
output(i, 4) = I1inc[i];
output(i, 5) = I1;
I2 += I2inc[i] - RIinc[i];
output(i, 6) = I2inc[i];
output(i, 7) = I2;
RI += RIinc[i];
output(i, 8) = RI;
DI += DIinc[i];
output(i, 9) = DI;
A += Ainc[i] - RAinc[i];
output(i, 10) = Ainc[i];
output(i, 11) = A;
RA += RAinc[i];
output(i, 12) = RA;
H += Hinc[i] - RHinc[i] - DHinc[i];
output(i, 13) = Hinc[i];
output(i, 14) = H;
RH += RHinc[i];
output(i, 15) = RH;
DH += DHinc[i];
output(i, 16) = DH;
}
// return counts
return(output);
}')
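To see what reconstruct() is doing, here is the same recurrence written in plain R for the hospital compartment alone (a sketch for illustration only; the Rcpp version above handles all compartments at once). H[1] is the initial count, and thereafter H increases by new admissions and decreases by recoveries and deaths:

```r
## plain-R sketch of the H-compartment recurrence inside reconstruct():
## H[i] = H[i - 1] + Hinc[i] - RHinc[i] - DHinc[i], with H[1] = Hinc[1]
reconstructH <- function(Hinc, RHinc, DHinc) {
  H <- numeric(length(Hinc))
  H[1] <- Hinc[1]
  for (i in seq_along(Hinc)[-1]) {
    H[i] <- H[i - 1] + Hinc[i] - RHinc[i] - DHinc[i]
  }
  H
}

## two initial admissions, then three more (with one recovery),
## then one more (with one death)
reconstructH(c(2, 3, 1), c(0, 1, 0), c(0, 0, 1))   # 2 4 4
```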
As a quick example, imagine that we want to extract the cumulative hospital cases on say day 20 from R for the third age-class. (Note that we do not need the time-series counts for this, only the incidence data.) Here we will need to extract the new hospital cases from day 1–20, and then sum them up for each LAD. Therefore we need to extract day
, LAD
and Hinc
from the compact
table.
## load library
## (you might also need to install the 'RSQLite'
## and 'dbplyr' packages, which 'dplyr' calls)
library(dplyr)

## establish connection to database
con <- DBI::dbConnect(RSQLite::SQLite(), "age3.db")
## connect to the 'compact' table
compact <- tbl(con, "compact")
## examine
compact
## # Source: table<compact> [?? x 13]
## # Database: sqlite 3.33.0
## # [/home/tj/Documents/covid/uq4covid.github.io/vignettes/data/MetaWards/vignette/age3.db]
## day LAD Einc Pinc I1inc I2inc RIinc DIinc Ainc RAinc Hinc RHinc DHinc
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2 10 1 0 0 0 0 0 0 0 0 0 0
## 2 2 12 1 0 0 0 0 0 1 1 0 0 0
## 3 2 18 0 0 0 0 0 0 0 1 0 0 0
## 4 2 23 0 0 0 0 0 0 1 0 0 0 0
## 5 2 27 1 0 0 0 0 0 0 0 0 0 0
## 6 2 28 1 0 0 0 0 0 0 0 0 0 0
## 7 2 29 0 0 0 0 0 0 0 1 0 0 0
## 8 2 35 0 0 0 0 0 0 1 0 0 0 0
## 9 2 38 1 0 0 0 0 0 0 0 0 0 0
## 10 2 39 1 0 0 0 0 0 0 0 0 0 0
## # … with more rows
By default, the package only pulls enough data from the database to produce a summary on the screen (notice that it prints the dimensions as ?? x 13
). If the database is small enough, then the collect()
function can be used to import tables directly into R as a tibble
. Alternatively, the dbplyr
package can convert tidyverse
-style commands into SQL and run these directly within the database. For large databases this is likely to be much more efficient, in terms of speed and memory usage. You can then collect()
the results of the query into R.
Note: if you need to reconstruct the time-series counts from the incidence data, then you will need to
collect()
the necessary information to R first, and then run the reconstruct()
function as shown below.
Querying databases
As an example of this latter approach, we will set up a query that sums the new hospital cases over the first 20 days in each LAD. Remember: for each LAD the database only contains days when some event happens. For cumulative incidence counts this is fine, since the missing data will be zero in each case, so we just need to filter first to remove all time points \(> 20\). To set up the query:
## Hinc contains the new cases, so sum these
## over each LAD for days 1--20
hosp_db <- filter(compact, day <= 20) %>%
  select(LAD, Hinc) %>%
  group_by(LAD) %>%
  summarise(Hcum = sum(Hinc))
This hasn’t run any commands yet; rather, hosp_db
contains a parsed SQL query that can be run through the database connection. If you like, you can view the generated SQL query using:
## view query
show_query(hosp_db)
## Warning: Missing values are always removed in SQL.
## Use `SUM(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## <SQL>
## SELECT `LAD`, SUM(`Hinc`) AS `Hcum`
## FROM (SELECT `LAD`, `Hinc`
## FROM `compact`
## WHERE (`day` <= 20.0))
## GROUP BY `LAD`
Now let’s run the query and return the results to R by passing hosp_db
to the collect()
function:
## run query and pull to R
hosp <- collect(hosp_db)
## print to screen
hosp
## # A tibble: 336 × 2
## LAD Hcum
## <int> <int>
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 0
## 5 5 0
## 6 6 0
## 7 7 0
## 8 8 0
## 9 9 0
## 10 10 0
## # … with 326 more rows
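The same aggregation can also be issued as raw SQL through DBI if you prefer. The snippet below demonstrates this on a small in-memory SQLite database standing in for age3.db (the table contents are made up purely for illustration):

```r
library(DBI)

## in-memory SQLite database with a toy 'compact' table
## (toy values purely for illustration)
con2 <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con2, "compact",
  data.frame(day = c(1, 2, 25), LAD = c(1, 1, 1), Hinc = c(2, 3, 4)))

## sum new hospital cases over the first 20 days in each LAD
hosp_sql <- dbGetQuery(con2,
  "SELECT LAD, SUM(Hinc) AS Hcum FROM compact WHERE day <= 20 GROUP BY LAD")
hosp_sql   ## the day-25 row is excluded, so Hcum = 5 for LAD 1

dbDisconnect(con2)
```

This is exactly the SQL that show_query() generated above, written by hand; for anything more involved, composing the query with dplyr verbs and letting dbplyr translate is usually less error-prone.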
Now you can play with hosp
as much as you like. Note that hosp
here only contains information about LADs that have some infections, and only from the time since initial infection. Hence for calibration you might need to expand to fill in the missing LADs. Fortunately, R (especially tidyverse
) has lots of tools for doing this, and an example is given in the next section.
Collecting and reconstructing counts in R
Imagine now that we wish to extract the number of asymptomatic infections on each of the first 20 days in each LAD. Remember: for each LAD the database only contains days when some event happens, and only contains incidence counts, so we will have to collect the data into R, and then run the reconstruct()
function provided above on each LAD.
Note: the
reconstruct()
function requires that the data are in time order, but does not require time points to be expanded (since no events happen on days that are not present in the data set). For multiple LADs the reconstruct function has to run on each LAD separately; hence the code below first orders the data, nests it by LAD, runs the reconstruct function within each LAD, and then removes the nesting.
To set up the query:
## extract information for days 0-20 and collect to R
output <- filter(compact, day <= 20) %>%
  collect()
## now sort by day, group by LAD and run reconstruct()
output <- arrange(output, day) %>%
  group_by(LAD) %>%
  nest() %>%
  mutate(data = map(data, ~{
    output <- reconstruct(
        .$Einc, .$Pinc, .$I1inc, .$I2inc, .$RIinc,
        .$DIinc, .$Ainc, .$RAinc,
        .$Hinc, .$RHinc, .$DHinc
      ) %>%
      magrittr::set_colnames(c(
        "Einc", "E", "Pinc", "P", "I1inc", "I1", "I2inc", "I2",
        "RI", "DI",
        "Ainc", "A", "RA",
        "Hinc", "H", "RH", "DH"
      )) %>%
      as_tibble()
    output$day <- .$day
    output
  })) %>%
  unnest(cols = data) %>%
  ungroup() %>%
  select(day, LAD, everything())
## examine data frame (reordered to put A columns first for later comparison)
select(output, day, LAD, Ainc, A, RA, everything()) %>%
filter(LAD == 317) %>%
print(n = Inf)
## # A tibble: 10 × 19
## day LAD Ainc A RA Einc E Pinc P I1inc I1 I2inc I2
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 6 317 0 0 0 0 1 0 0 0 0 0 0
## 2 7 317 0 0 0 8 9 0 0 0 0 0 0
## 3 8 317 7 7 0 6 8 0 0 0 0 0 0
## 4 9 317 6 11 2 6 8 0 0 0 0 0 0
## 5 10 317 6 15 4 15 17 0 0 0 0 0 0
## 6 11 317 13 27 5 19 22 1 1 0 0 0 0
## 7 12 317 19 41 10 28 30 1 1 1 1 0 0
## 8 13 317 29 65 15 44 45 0 1 0 1 0 0
## 9 14 317 33 89 24 72 82 2 3 0 0 1 1
## 10 15 317 64 136 41 14 26 6 8 1 1 0 0
## # … with 6 more variables: RI <int>, DI <int>, Hinc <int>, H <int>, RH <int>,
## # DH <int>
At this point we have a set of time-series counts, which we will now expand to fill in the missing LAD / day combinations. (This can be done with complete()
, but we’ve found it’s generally faster by joining to a lookup table as below.)
## create lookup to expand LAD / time combinations
lookup <- expand.grid(LAD = 1:339, day = 0:20) %>%
  as_tibble()
## join to lookup, and fill in missing information correctly
asymp <- select(output, day, LAD, A) %>%
  full_join(lookup, by = c("day", "LAD")) %>%
  arrange(LAD, day) %>%
  mutate(A = ifelse(day == 0 & is.na(A), 0, A)) %>%
  group_by(LAD) %>%
  fill(A) %>%
  ungroup()
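For comparison, the complete() route mentioned above looks like this on a toy data frame (toy values for illustration; it is more concise, but we have found the lookup-join approach faster on large outputs):

```r
library(tidyr)
library(dplyr)

## toy reconstructed counts for a single LAD, with days 1 and 3 missing
toy <- tibble(LAD = 1L, day = c(0L, 2L), A = c(0, 5))
toy2 <- complete(toy, LAD = 1L, day = 0:3) %>%
  group_by(LAD) %>%
  fill(A) %>%
  ungroup()
toy2$A   # 0 0 5 5 after filling forwards
```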
## examine output
filter(asymp, LAD == 317) %>%
print(n = Inf)
## # A tibble: 21 × 3
## day LAD A
## <int> <int> <dbl>
## 1 0 317 0
## 2 1 317 0
## 3 2 317 0
## 4 3 317 0
## 5 4 317 0
## 6 5 317 0
## 7 6 317 0
## 8 7 317 0
## 9 8 317 7
## 10 9 317 11
## 11 10 317 15
## 12 11 317 27
## 13 12 317 41
## 14 13 317 65
## 15 14 317 89
## 16 15 317 136
## 17 16 317 136
## 18 17 317 136
## 19 18 317 136
## 20 19 317 136
## 21 20 317 136
Once you do not need access to the SQLite database any more, you should disconnect:
## disconnect from database
DBI::dbDisconnect(con)
You can very easily wrap these ideas into an R function that can scroll through the design IDs, extract relevant outputs and bind to the inputs.