Overview

This vignette provides all of the analysis for case study 2 in the accompanying package and paper. Briefly, the aim is to infer antibody kinetics and historical attack rates using cross-sectional haemagglutination inhibition titre data on a panel of recent and historical A/H3N2 strains. All of the functions used here are well documented and have many tunable arguments, and we therefore encourage users to refer to the helps files.

This vignette demonstrates only how to reproduce the MCMC chains, simulate data, assess model fits and assess chain convergence. Code to reproduce figures from the main text in the accompanying paper can be found in the inst/extdata/scripts folder of the package.

Setup

Installation and requirements

serosolver may be installed from github using the devtools package. There are a number of additional packages that we need for this analysis.

# Required to run serosolver
devtools::install_github("seroanalytics/serosolver")
library(serosolver)
library(plyr)
library(data.table)

## Required for this analysis
library(reshape2)
library(foreach)
library(doParallel)
library(bayesplot)
library(coda)
library(ggplot2)
library(viridis)

# set up cluster
set.seed(1234)
cl <- makeCluster(5)
registerDoParallel(cl)

## Note that this vignette was generated on a Windows machine,
## and the setup for parallelisation is different on a Linux machine
## for Linux machine:
#  library(doMC)
#  library(doRNG)
#  registerDoMC(cores=5)

Assumptions

In this analysis, all serological samples were taken in 2009 and therefore all time variables are relative to this year. We are interested in inferring infections and attack rates at an annual resolution, and therefore set resolution to 1. Our primary outcome of interest is to infer unbiased historical attack rates, and we therefore use the version of the code with a Beta prior on per-time attack rates, prior_version=2. Furthermore, we have found that in the situation where the number of possible infections to infer is large but the amount of data is relatively sparse, identifiability is poor when using reference priors (eg. uniform Beta or Jeffrey’s prior). We instead opted to use a weakly informative prior for annual attack rates with mode = 0.15 but with high variance, corresponding to prior observations of annual influenza attack rates. We set these parameters at the start of the analysis.

filename <- "case_study_2"
resolution <- 1 ## eg. this would be set to 12 for monthly resolution
sample_year <- 2009

serosolver::describe_priors()
#> Which version to use in serosolver? The following text describes the proposal step for updating infection histories.
#> Version 1: Beta prior on per time attack rates. Explicit FOI on each epoch using probability of infection term. Proposal performs N `flip` proposals at random locations in an individual's infection history, switching 1->0 or 0->1. Otherwise, swaps the contents of two random locations
#> Version 2: Beta prior on per time attack rates. Gibbs sampling of infection histories as in Indian Buffet Process papers, integrating out each probability of infection term.
#> Version 3: Beta prior on probability of infection for an individual, assuming independence between individuals. Samples from a beta binomial with alpha and beta specified by the par_tab input. Proposes nInfs moves at a time for add/remove, or when swapping, swaps locations up to moveSize time steps away
#> Version 4: Beta prior on probability of any infection. Gibbs sampling of infection histories using total number of infections across all times and all individuals as the prior
prior_version <- 2

Preparing the data

The data used in this analysis are haemagglutination inhibition (HI) titres against a number of A/H3N2 that have circulated since 1968. The raw data are in a wide format, providing the highest two-fold dilution of serum at which haemagglutination is inhibited. The first step of the analysis is therefore to clean the titre data and convert the data frame to the long format, as described in the quickstart vignette.

## Read in data
raw_dat_path <- system.file("extdata", "Fluscape_HI_data.csv", package = "serosolver")
raw_dat <- read.csv(file = raw_dat_path, stringsAsFactors = FALSE)
print(head(raw_dat))
#>   Age HI.H3N2.1968 HI.H3N2.1975 HI.H3N2.1979 HI.H3N2.1989 HI.H3N2.1995
#> 1  75           80           40           40           80          160
#> 2  35           20           80          160           40           80
#> 3  71           80           40           20           20           40
#> 4  65           80           40           40           20           40
#> 5  64          160           80           40           10           40
#> 6  33           40           20          160           80           80
#>   HI.H3N2.2002 HI.H3N2.2003 HI.H3N2.2005 HI.H3N2.2008
#> 1          160           40           80           40
#> 2           20           10            0            0
#> 3           80           20           10            0
#> 4           20            0            0            0
#> 5           40            0           20           20
#> 6          160           40           40           20

## Add indexing column for each individual
raw_dat$individual <- 1:nrow(raw_dat)

## Convert data to long format
melted_dat <- reshape2::melt(raw_dat, id.vars=c("individual","Age"),stringsAsFactors=FALSE)

## Modify column names to meet serosolver's expectations
colnames(melted_dat) <- c("individual","DOB","virus","titre")
melted_dat$virus <- as.character(melted_dat$virus)

## Extract circulation years for each virus code, which will be used 
## by serosolver as the circulation time
melted_dat$virus <- as.numeric(sapply(melted_dat$virus, function(x) strsplit(x,split = "HI.H3N2.")[[1]][2]))

## Clean and log transform the data
melted_dat <- melted_dat[complete.cases(melted_dat),]
melted_dat[melted_dat$titre == 0,"titre"] <- 5
melted_dat$titre <- log2(melted_dat$titre/5)

## Convert ages to DOB
melted_dat$DOB <- sample_year - melted_dat$DOB

## All samples taken at the same time
melted_dat$samples <- sample_year

## Add column for titre repeats, enumerating for each measurement for the same virus/sample/individual
melted_dat <- plyr::ddply(melted_dat,.(individual,virus,samples),function(x) cbind(x,"run"=1:nrow(x),"group"=1))

## Rename to data expected by serosolver
titre_dat <- melted_dat
print(head(titre_dat))
#>   individual  DOB virus titre samples run group
#> 1          1 1934  1968     4    2009   1     1
#> 2          1 1934  1975     3    2009   1     1
#> 3          1 1934  1979     3    2009   1     1
#> 4          1 1934  1989     4    2009   1     1
#> 5          1 1934  1995     5    2009   1     1
#> 6          1 1934  2002     5    2009   1     1

Given that this analysis uses titres from multiple, antigenically related viruses, it is necessary to define an antigenic map describing the antigenic distance between all of the viruses here. We use coordinates based on the antigenic map created by Fonville et al. Generating the antigenic map involves fitting a smoothing spline through provided coordinates to give a representative virus for each time point (in this case, each year) that an individual could be infected. This process also inputs antigenic coordinates for time points that we do not have a measured virus.

## Read in raw coordinates
antigenic_coords_path <- system.file("extdata", "fonville_map_approx.csv", package = "serosolver")
antigenic_coords <- read.csv(file = antigenic_coords_path, stringsAsFactors=FALSE)
print(head(antigenic_coords))
#>   Strain    X    Y
#> 1   HK68  1.8  2.4
#> 2   EN72  2.7  4.9
#> 3   VI75  7.6  6.3
#> 4   TX77  7.9  8.8
#> 5   BK79  9.6 11.0
#> 6   SI87 15.0  6.8

## Convert to form expected by serosolver
antigenic_map <- generate_antigenic_map(antigenic_coords, resolution)
print(head(antigenic_map))
#>       x_coord   y_coord inf_times
#> 1 -0.09718111 0.5021363      1968
#> 2  0.80502804 1.6816917      1969
#> 3  1.70723718 2.8612472      1970
#> 4  2.60944633 3.9976902      1971
#> 5  3.51165548 4.8709093      1972
#> 6  4.41386463 5.5057039      1973

## More flexible version of the above function
virus_key <- c(
    "HK68" = 1968, "EN72" = 1972, "VI75" = 1975, "TX77" = 1977, 
    "BK79" = 1979, "SI87" = 1987, "BE89" = 1989, "BJ89" = 1989,
    "BE92" = 1992, "WU95" = 1995, "SY97" = 1997, "FU02" = 2002, 
    "CA04" = 2004, "WI05" = 2005, "PE06" = 2006
  )
antigenic_coords$Strain <- virus_key[antigenic_coords$Strain]
antigenic_map <- generate_antigenic_map_flexible(antigenic_coords)

## Restrict entries to years of interest. Entries in antigenic_map determine
## the times that individual can be infected ie. the dimensions of the infection
## history matrix.
antigenic_map <- antigenic_map[antigenic_map$inf_times >= 1968 & antigenic_map$inf_times <= sample_year,]
strain_isolation_times <- unique(antigenic_map$inf_times)

NOTE: generate_antigenic_map expects the provided file fonville_map_approx.csv. Users should refer to generate_antigenic_map_flexible for more generic antigenic map generation.

Finally, we must specify the par_tab data frame, which controls which parameters are included in the model, which are fixed, and their uniform prior ranges. Given that we are integrating out the probability of infection terms under prior version 2, we must remove these parameters from par_tab. Furthermore, given that we are interested in long-term dynamics with relatively sparse data, we remove parameters relating to the short-term antibody kinetics phase to avoid identifiability issues. We set alpha and beta of the beta prior to give a mode of 0.15 assuming that our prior belief has the equivalent weighting to 4 observed individuals.

par_tab_path <- system.file("extdata", "par_tab_base.csv", package = "serosolver")
par_tab <- read.csv(file = par_tab_path, stringsAsFactors=FALSE)

## Set parameters for Beta prior on infection histories
beta_pars <- find_beta_prior_mode(0.15,4)
par_tab[par_tab$names == "alpha","values"] <- beta_pars$alpha
par_tab[par_tab$names == "beta","values"] <- beta_pars$beta
## Maximum recordable log titre in these data is 8
par_tab[par_tab$names == "MAX_TITRE","values"] <- 8

## Remove phi parameters, as these are integrated out under prior version 2
par_tab <- par_tab[par_tab$names != "phi",]

## Fix all short term parameters to 0
par_tab[par_tab$names %in% c("mu_short","sigma2","wane"),"fixed"] <- 1 # mu_short, waning and sigma2 are fixed
par_tab[par_tab$names %in% c("mu_short","sigma2","wane"),"values"] <- 0 # set these values to 0

Summary

Running the MCMC

We are now ready to fit our model. We will fit multiple chains in parallel, though the below analysis could easily be replicated by running chains sequentially. Starting conditions for the MCMC chain must be generated that return a finite likelihood. The user may modify many of the MCMC control parameters, though the defaults are fine for most purposes. We have made some minor tweaks in this case study to improve convergence on infection history estimates. Step sizes for parameters in par_tab are tuned automatically, and some automated tuning of the infection history proposals takes place for prior version 3. However, for other attack rate priors, it is necessary for the user to do some manual tuning of a) the number of individuals sampled at each step proposal_inf_hist_indiv_prop; b) the number of time points sampled at each step proposal_inf_hist_time_prop; c) the frequency of individual infection history swapping steps (ie. for an individual, choose two time points and swap their contents)proposal_inf_hist_indiv_swap_ratio; d) proportion of infection history sampling steps which should be the alternative swapping step, where the contents of infection histories at two time points are swapped proposal_inf_hist_group_swap_ratio; e) proportion of infection histories to swap with each alternative swapping step proposal_inf_hist_group_swap_prop. For example, in this case study, we attack rates are likely to be highly correlated in adjacent years (as we have limited data to distinguish between infections in years close in time), and we therefore increase the frequency of the alternative infection history swapping step with proposal_inf_hist_group_swap_prop.

Changing the number of iterations and the length of the adaptive period are often desirable. More crucially, the amount of chain thinning should be specified to ensure that users are not saving a large number of MCMC iterations (as this will rapidly fill disk space!). Thinning should be set such that at least 1000 iterations are saved (ie. iterations/thin and thin_inf_hist). Users are encouraged to pay extra attention to thin_inf_hist, which dictates the thinning of the infection history chain, and can generate a very large file if left unchecked.

## Distinct filename for each chain
no_chains <- 5
filenames <- paste0(filename, "_",1:no_chains)
chain_path <- sub("par_tab_base.csv","",par_tab_path)
chain_path_real <- paste0(chain_path, "cs2_real/")
chain_path_sim <- paste0(chain_path, "cs2_sim/")

## Create the posterior solving function that will be used in the MCMC framework 
model_func <- create_posterior_func(par_tab=par_tab,
                                titre_dat=titre_dat,
                                antigenic_map=antigenic_map,
                                version=prior_version) # function in posteriors.R
#> Creating posterior solving function...
#>

## Generate results in parallel
res <- foreach(x = filenames, .packages = c('serosolver','data.table','plyr')) %dopar% {
  ## Not all random starting conditions return finite likelihood, so for each chain generate random
  ## conditions until we get one with a finite likelihood
  start_prob <- -Inf
  while(!is.finite(start_prob)){
    ## Generating starting antibody kinetics parameters
    start_tab <- generate_start_tab(par_tab)
    
    ## Generate starting infection history
    start_inf <- setup_infection_histories_titre(titre_dat, strain_isolation_times, 
                                                 space=3,titre_cutoff=4)
    start_prob <- sum(model_func(start_tab$values, start_inf)[[1]])
  }
  
  
  res <- serosolver(par_tab = start_tab, 
                  titre_dat = titre_dat,
                  antigenic_map = antigenic_map,
                  start_inf_hist = start_inf, 
                  mcmc_pars = c("iterations"=500000,"adaptive_iterations"=100000,"thin"=1000,
                                "thin_inf_hist"=1000,"save_block"=1000,
                                "proposal_inf_hist_time_prop"=1, "proposal_inf_hist_indiv_prop"=1,
                                "proposal_inf_hist_group_swap_ratio"=0.8, "proposal_inf_hist_group_swap_prop"=1),
                  filename = paste0(chain_path_real,x), 
                  CREATE_POSTERIOR_FUNC = create_posterior_func, 
                  version = prior_version)
}

Post-run analyses

Once the MCMC chains are run, serosolver provides a number of simple functions to generate standard outputs and MCMC diagnostics. The saved MCMC chains are compatible with the coda and bayesplot packages, and users are encouraged to use these. First, read in the MCMC chains. The below function distinguishes between posterior samples for the infection history matrix and for the process parameters. The function searches for all files with the filenames generated by serosolver in the specified directory, and returns data structures with these concatenated and seperated in a list.

## Read in the MCMC chains
## Note that `thin` here is in addition to any thinning done during the fitting
all_chains <- load_mcmc_chains(location=chain_path_real,thin=1,burnin=100000,
                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=TRUE)
#> Chains detected:     5Highest MCMC sample interations: 
#> Chains detected: 
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_1_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_2_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_3_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_4_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_5_infection_histories.csv
#> [[1]]
#> [1] 643851
#> 
#> [[2]]
#> [1] 650759
#> 
#> [[3]]
#> [1] 640438
#> 
#> [[4]]
#> [1] 645224
#> 
#> [[5]]
#> [1] 648570

## Alternative, load the included MCMC chains rather than re-running
## data(cs2_chains_real)
## all_chains <- cs2_chains_real

print(summary(all_chains))
#>                   Length Class      Mode   
#> theta_chain       65130  mcmc       numeric
#> inf_chain             5  data.table list   
#> theta_list_chains     5  -none-     list   
#> inf_list_chains       5  -none-     list

Chains should then be checked for the usual MCMC diagnostics: $\hat{R}$ and effective sample size. First, looking at the antibody kinetics process parameters:

## Get the MCMC chains as a list
list_chains <- all_chains$theta_list_chains
## Look at diagnostics for the free parameters
list_chains1 <- lapply(list_chains, function(x) x[,c("mu","sigma1","error",
                                                     "tau","total_infections",
                                                     "lnlike","prior_prob")])

## Gelman-Rubin diagnostics to assess between-chain convergence for each parameter
print(gelman.diag(as.mcmc.list(list_chains1)))
#> Potential scale reduction factors:
#> 
#>                  Point est. Upper C.I.
#> mu                     1.02       1.03
#> sigma1                 1.01       1.03
#> error                  1.00       1.00
#> tau                    1.01       1.02
#> total_infections       1.02       1.03
#> lnlike                 1.01       1.01
#> prior_prob             1.01       1.01
#> 
#> Multivariate psrf
#> 
#> 1.02
gelman.plot(as.mcmc.list(list_chains1))


## Effective sample size for each parameter
print(effectiveSize(as.mcmc.list(list_chains1)))
#>               mu           sigma1            error              tau 
#>         481.6208         727.3621        2435.7809        1171.5251 
#> total_infections           lnlike       prior_prob 
#>         497.1942         590.3697         607.8823

## Posterior estimates for each parameter
print(summary(as.mcmc.list(list_chains1)))
#> 
#> Iterations = 1:501
#> Thinning interval = 1 
#> Number of chains = 5 
#> Sample size per chain = 501 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>                        Mean        SD  Naive SE Time-series SE
#> mu                   2.2193 1.475e-01 0.0029480      0.0073858
#> sigma1               0.1046 4.409e-03 0.0000881      0.0001667
#> error                1.1624 3.751e-02 0.0007494      0.0007630
#> tau                  0.0306 5.330e-03 0.0001065      0.0001663
#> total_infections  1288.9589 9.359e+01 1.8699544      4.5226816
#> lnlike           -3990.8712 1.532e+02 3.0611018      6.8715326
#> prior_prob       -2081.5167 1.449e+02 2.8955201      6.4436385
#> 
#> 2. Quantiles for each variable:
#> 
#>                        2.5%        25%        50%        75%      97.5%
#> mu                1.947e+00  2.119e+00  2.221e+00  2.314e+00  2.509e+00
#> sigma1            9.579e-02  1.015e-01  1.048e-01  1.076e-01  1.130e-01
#> error             1.090e+00  1.138e+00  1.162e+00  1.186e+00  1.238e+00
#> tau               2.043e-02  2.692e-02  3.053e-02  3.411e-02  4.144e-02
#> total_infections  1.117e+03  1.226e+03  1.288e+03  1.350e+03  1.476e+03
#> lnlike           -4.291e+03 -4.090e+03 -3.993e+03 -3.887e+03 -3.696e+03
#> prior_prob       -2.364e+03 -2.178e+03 -2.084e+03 -1.982e+03 -1.801e+03


## Plot the MCMC trace using the `bayesplot` package
color_scheme_set("viridis")
p_theta_trace <- mcmc_trace(list_chains1)
print(p_theta_trace)

and at the infection histories:

## Extract infection history chain
inf_chain <- all_chains$inf_chain

## Look at inferred attack rates
p_ar <- plot_attack_rates(inf_chain, titre_dat, strain_isolation_times, pad_chain=TRUE,
                          plot_den = TRUE,prior_pars=list(prior_version=prior_version, 
                                            alpha=par_tab[par_tab$names=="alpha","values"],
                                            beta=par_tab[par_tab$names=="beta","values"])) 
print(p_ar)
#> Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
#> 'x' values

#> Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
#> 'x' values


## Calculate convergence diagnostics and summary statistics on infection histories
## Important to scale all infection estimates by number alive from titre_dat
n_alive <- get_n_alive_group(titre_dat, strain_isolation_times,melt=TRUE)

## This function generates a number of MCMC outputs
ps_infhist <- plot_posteriors_infhist(inf_chain=inf_chain, 
                                      years=strain_isolation_times, 
                                      n_alive=n_alive)
#> Padding inf chain...
#> Done
#> Calculating by time summaries...
#> Done
#> Calculating by individual summaries...
#> Done


## Posterior mean, median, 95% credible intervals and effective sample size
## on per time attack rates
print(head(ps_infhist[["estimates"]]$by_year))
#>       j group      mean     median lower_quantile upper_quantile effective_size
#> 1: 1968     1 0.9358901 0.95238095     0.82142857      1.0000000       597.5393
#> 2: 1969     1 0.2590169 0.06976744     0.00000000      0.9651163       458.7168
#> 3: 1970     1 0.6503526 0.81111111     0.02222222      0.9777778       629.5773
#> 4: 1971     1 0.2597718 0.18478261     0.01086957      0.8478261       971.8555
#> 5: 1972     1 0.1865070 0.14736842     0.01052632      0.5894737      1465.3556
#> 6: 1973     1 0.1419553 0.12371134     0.01030928      0.4020619      1811.9475
#>    gelman_point gelman_upper
#> 1:     1.121062     1.173689
#> 2:     1.013099     1.035579
#> 3:     1.007056     1.021404
#> 4:     1.008536     1.021709
#> 5:     1.004837     1.009290
#> 6:     1.006230     1.015901

## Posterior mean, median, 95% credible intervals and effective sample size
## on per individual total number of infections
print(head(ps_infhist[["estimates"]]$by_indiv))
#>    i      mean median lower_quantile upper_quantile effective_size
#> 1: 1 12.360479     12              9             16       1169.470
#> 2: 2  7.790419      8              6             10       1391.668
#> 3: 3  9.282635      9              7             12       1454.059
#> 4: 4  8.488623      8              6             11       1880.095
#> 5: 5  9.657086     10              7             13       1078.471
#> 6: 6  9.235130      9              7             12       1415.614

## Check convergence of infection history summary statistics
## MCMC trace plots of attack rates
## Each subplot shows one year
print(ps_infhist[["by_time_trace"]][[1]])


## MCMC trace plots of total number of infections per individual
## Each subplot shows one individual
print(ps_infhist[["by_indiv_trace"]][[1]])


## Distribution of total number of infections
print(ps_infhist[["indiv_infections"]])


## Check for agreement between inferred cumulative infection histories 
## for some individuals
p_indiv_inf_hists <- generate_cumulative_inf_plots(inf_chain,indivs=1:9,pad_chain=FALSE,
                                                  strain_isolation_times = strain_isolation_times,
                                                  number_col=3)
## Each subplot shows one individual
print(p_indiv_inf_hists[[1]])


## Posterior probability that infections occured at given times per individual
## Each subplot shows one individual
print(p_indiv_inf_hists[[2]])

Mixing can sometimes be very poor for per-time attack rates when adjacent times are highly correlated. This is often the case when the amount of data relatively poor. A cruder time resolution (eg. per two years) may be advisable, and mixing may benefit from increasing the proposal_inf_hist_group_swap_ratio and proposal_inf_hist_indiv_prop parameters in the mcmc_pars list in serosolver. proposal_inf_hist_indiv_prop determines how frequently the MCMC sampler uses a proposal step that swaps the a proportion proposal_inf_hist_group_swap_ratio of individual’s infection states between two time points.

Users may also easily check the inferred antibody landscapes at the time each sample was taken. Black dots show observations, shaded regions and black line show 95%, 50% credible intervals and posterior median.

## get_titre_predictions expects only a single MCMC chain, so
## subset for only one chain
chain <- as.data.frame(all_chains$theta_chain)
chain1 <- chain[chain$chain_no == 1,]
inf_chain1 <- inf_chain[inf_chain$chain_no == 1,]

titre_preds <- get_titre_predictions(chain = chain1, 
                                     infection_histories = inf_chain1, 
                                     titre_dat = titre_dat, 
                                     individuals = unique(titre_dat$individual),
                                     antigenic_map = antigenic_map, 
                                     par_tab = par_tab,expand_titredat=FALSE)
#> Creating model solving function...
#> 
to_use <- titre_preds$predictions
print(head(to_use))
#>   individual  DOB virus titre samples run group    lower lower_50   median
#> 1          1 1934  1968     4    2009   1     1 3.041685 3.788346 4.280591
#> 2          1 1934  1975     3    2009   1     1 1.902959 2.595233 3.184981
#> 3          1 1934  1979     3    2009   1     1 2.139347 3.222580 3.664448
#> 4          1 1934  1989     4    2009   1     1 2.213891 3.295432 3.932731
#> 5          1 1934  1995     5    2009   1     1 3.638016 4.492440 5.190986
#> 6          1 1934  2002     5    2009   1     1 4.241020 5.049454 5.483356
#>   upper_50    upper      max
#> 1 4.687300 5.598159 3.830762
#> 2 3.745539 4.557163 2.357291
#> 3 4.231587 5.067237 3.223135
#> 4 4.626296 5.620150 3.446563
#> 5 5.635953 6.581105 4.795906
#> 6 5.834227 6.569361 4.615596

## Using ggplot
## Shaded regions show 95% and 50% credible intervals, 
## line shows posterior median.
## Each suplot shows one individual
titre_pred_p <- ggplot(to_use[to_use$individual %in% 1:9,])+
  geom_ribbon(aes(x=virus,ymin=lower, ymax=upper),fill="gray90")+
  geom_ribbon(aes(x=virus,ymin=lower_50, ymax=upper_50),fill="gray70")+
  geom_line(aes(x=virus, y=median))+
  geom_point(aes(x=virus, y=titre))+
  coord_cartesian(ylim=c(0,8))+
  ylab("log titre") +
  xlab("Time of virus circulation") +
  theme_classic() +
  facet_wrap(~individual)
titre_pred_p

Further analyses

Figures in the main text can be readily generated from the MCMC output from above. The source code to generate these figures has been hidden, but can be found in the original .Rmd file for this vignette.

First, we are interested in calculating the number of infections experienced by individuals over time as a function of their age. We see that individuals are infected less frequently as they become older.

Given the sparsity of data here, the default attack rate plot is difficult to interpret. Below is an alternative visualisation of the attack rate, with the 95% and 50% credible intervals shown in red, the posterior median shown in black and the posterior maximum likelihood estimate shown as a dashed green line.

## Find samples that were in both theta and inf hist chains
chain <- as.data.frame(all_chains$theta_chain)
intersect_samps <- intersect(unique(inf_chain$samp_no), unique(chain$samp_no))
chain <- chain[chain$samp_no %in% intersect_samps,]

## Find the parameter values that gave the highest posterior probability
which_mle <- chain[which.max(chain$lnlike),c("samp_no","chain_no")]

## Take subset of chain for computational speed, as do not need all samples
samps <- unique(inf_chain[,c("samp_no","chain_no")])
n_samps <- sample(1:nrow(samps), 100)
samps <- samps[n_samps,]
samps <- rbind(samps, which_mle) ## Plus MLE estimate
## Append the MLE estimate, note that this is max(samp_no)

## Create new index variables for simplicity
samps$samp_no1 <- 1:nrow(samps)
samps$chain_no1 <- 1

## Inner join to return only our subset of samples
## Reformat samp_no and chain_no identifiers so that code
## sees samples as coming from one chain
inf_chain <- merge(inf_chain, samps, by=c("samp_no","chain_no"))
inf_chain <- inf_chain[,c("samp_no1","chain_no1","i","j","x")]
colnames(inf_chain)[1:2] <- c("samp_no","chain_no")
inf_chain <- pad_inf_chain(inf_chain)
## Rename columns to be more informative

## Column names expected by code below
colnames(inf_chain) <- c("samp_no","chain_no","individual","year","infected","group")

## Data on which strains belong to which cluster
cluster_path <- system.file("extdata", "fonville_clusters.csv", package = "serosolver")
clusters <- read.csv(file = cluster_path, stringsAsFactors=FALSE)
clusters <- clusters[clusters$year <= sample_year,]

## j=1 corresponds to the year 1968
inf_chain$year <- inf_chain$year + 1967

## Merge cluster data and infection history data
inf_chain <- merge(inf_chain, clusters[,c("year","cluster1")],by="year")

## Calculate ages and age groups of all individuals
titre_dat$age <- max(strain_isolation_times) - titre_dat$DOB
titre_dat$age_group <- cut(titre_dat$age,breaks=c(0,20,100),include.lowest=TRUE)
ages <- unique(titre_dat[,c("individual","age_group","DOB","age")])

## Merge infection histories with individual data
inf_chain<- merge(inf_chain, data.table(ages), by=c("individual"))

## Alive status for each individual for each time,
## only interested in individuals that were alive 
## when a virus circulated
inf_chain$alive <- inf_chain$DOB <= inf_chain$year
inf_chain <- inf_chain[inf_chain$alive,]

Finally, inferring individual infection histories allows us to investigate age-specific patterns of incidence. Here, we show the proportion of individuals that were infected at least once within a single antigenic cluster, finding that clusters that circulate for longer tend to infect a far higher proportion of the population. Furthermore, we see that a far higher proportion of the younger age group is infected in more recent years.

Simulation recovery

We finish the vignette by presenting a simulation-recovery experiment to test the ability of the framework to recover known infection histories and antibody kinetics parameters using simulated data that matches the real dataset.

Extract MLE parameters from fits

We simulate infection histories and antibody titre data based on the “real” parameters inferred from fitting the model above. First, we extract the maximum posterior probability antibody kinetics parameters and attack rates.

## Read in MCMC chains from fitting
all_chains <- load_mcmc_chains(location=chain_path_real,thin=1,burnin=100000,
                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=FALSE)
#> Chains detected:     5Highest MCMC sample interations: 
#> Chains detected: 
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_1_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_2_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_3_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_4_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_real//case_study_2_5_infection_histories.csv
#> [[1]]
#> [1] 643851
#> 
#> [[2]]
#> [1] 650759
#> 
#> [[3]]
#> [1] 640438
#> 
#> [[4]]
#> [1] 645224
#> 
#> [[5]]
#> [1] 648570

## Alternative, load the included MCMC chains rather than re-running
## data(cs2_chains_real_b)
## all_chains <- cs2_chains_real_b

## Find samples that were in both theta and inf hist chains
chain <- all_chains$theta_chain
inf_chain <- all_chains$inf_chain
intersect_samps <- intersect(unique(inf_chain$samp_no), unique(chain$samp_no))
chain <- chain[chain$samp_no %in% intersect_samps,]

## Find the parameter values that gave the highest posterior probability
which_mle <- chain[which.max(chain$lnlike),c("samp_no","chain_no")]
mle_theta_pars <- chain[chain$samp_no == which_mle$samp_no & chain$chain_no == which_mle$chain_no,]

## Store total infections to compare later
mle_total_infs <- mle_theta_pars[,"total_infections"]
mle_theta_pars <- mle_theta_pars[,par_tab$names]
mle_inf_hist <- inf_chain[inf_chain$samp_no == which_mle$samp_no & inf_chain$chain_no == which_mle$chain_no,]

## Generate full infection history matrix using provided function
mle_inf_hist <- expand_summary_inf_chain(mle_inf_hist[,c("samp_no","j","i","x")])
## Find number of infections per year from this infection history
no_infs <- colSums(mle_inf_hist[,3:ncol(mle_inf_hist)])

## If missing time points in simulated attack rates
if(length(no_infs) < length(strain_isolation_times)){
  diff_lengths <- length(strain_isolation_times) - length(no_infs)
  no_infs <- c(no_infs, rep(0, diff_lengths))
}

## Find attack rate per year
n_alive <- get_n_alive(titre_dat, strain_isolation_times)
attack_rates <- no_infs/n_alive

Functions are provided to simulate antibody titre data under a given serosurvey design. The antibody kinetics parameters and attack rates estimated above are used to simulate titres from the model. The simulate_data function is well documented, and users should refer to the help file to customise the simulated serosurvey design.

set.seed(1234)

sim_par_tab <- par_tab
sim_par_tab$values <- as.numeric(mle_theta_pars)
sim_par_tab[sim_par_tab$names %in% c("alpha","beta"),"values"] <- c(1/3,1/3)

age_min <- 2009 - max(titre_dat$DOB)
age_max <- 2009 - min(titre_dat$DOB)
n_indiv <- length(unique(titre_dat$individual))
dat <- simulate_data(par_tab=sim_par_tab,
                     n_indiv=n_indiv, 
                     buckets=resolution, 
                     strain_isolation_times=strain_isolation_times,
                     sampling_times=2009, 
                     nsamps=1, 
                     antigenic_map=antigenic_map, 
                     age_min=age_min,
                     age_max=age_max,
                     attack_rates=attack_rates,
                     repeats=1)
#> Simulating data


## Inspect simulated antibody titre data and infection histories
sim_titre_dat <- dat[["data"]]
sim_infection_histories <- dat[["infection_histories"]]

## Store total infections to compare later
actual_total_infections <- sum(sim_infection_histories)

plot_data(sim_titre_dat, sim_infection_histories, 
          strain_isolation_times,n_indivs = 5)


## Use titres only against same viruses tested in real data
viruses <- unique(titre_dat$virus)
sim_titre_dat <- sim_titre_dat[sim_titre_dat$virus %in% viruses, ]
sim_ages <- dat[["ages"]]
sim_titre_dat <- merge(sim_titre_dat, sim_ages)
sim_ar <- dat[["attack_rates"]]

Simulation fitting

Once these simulated data have been generated, the work flow becomes exactly the same as with the real data above.

filename <- "case_study_2_sim"

## Distinct filename for each chain
no_chains <- 5
filenames <- paste0(filename, "_",1:no_chains)

## Create the posterior solving function that will be used in the MCMC framework 
model_func <- create_posterior_func(par_tab=sim_par_tab,
                                titre_dat=sim_titre_dat,
                                antigenic_map=antigenic_map,
                                version=prior_version) # function in posteriors.R
#> Creating posterior solving function...
#>

## Generate results in parallel
res <- foreach(x = filenames, .packages = c('serosolver','data.table','plyr')) %dopar% {
  ## Not all random starting conditions return finite likelihood, so for each chain generate random
  ## conditions until we get one with a finite likelihood
  start_prob <- -Inf
  while(!is.finite(start_prob)){
    ## Generate starting values for theta
    start_tab <- generate_start_tab(par_tab)
    ## Generate starting infection history
    start_inf <- setup_infection_histories_titre(sim_titre_dat, strain_isolation_times, 
                                                 space=3,titre_cutoff=4)
    start_prob <- sum(model_func(start_tab$values, start_inf)[[1]])
  }
  
  res <- serosolver(par_tab = start_tab, 
                  titre_dat = sim_titre_dat,
                  antigenic_map = antigenic_map,
                  start_inf_hist = start_inf, 
                  mcmc_pars = c("iterations"=500000,"adaptive_iterations"=100000,"thin"=1000,
                                "thin_inf_hist"=1000,"save_block"=1000,
                                "proposal_inf_hist_group_swap_ratio"=0.8, "proposal_inf_hist_indiv_prop"=1),
                  filename = paste0(chain_path_sim,x), 
                  CREATE_POSTERIOR_FUNC = create_posterior_func, 
                  version = prior_version)
}

Simulation analysis

MCMC chains should be checked for convergence under the usual diagnostics. We also compare the inferred posterior distributions to the known true parameter values. We see that convergence and between-chain agreement is good and that the model recovers reasonably unbiased estimates for some parameters. However, under this sampling strategy the model slightly underestimates the amount of long term antibody boosting elicited by a single infection and overestimates the total number of infections. This is driven by the contribution of the attack rate prior relative to the contribution of the likelihood (the data). Increasing the number of measured titres (for example, measure titres against 40 viruses rather than 9) or using a more informative attack rate prior would help reduce this bias.

## Read in the MCMC chains
## Note that `thin` here is in addition to any thinning done during the fitting
sim_all_chains <- load_mcmc_chains(location=chain_path_sim,thin=1,burnin=100000,
                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=TRUE)
#> Chains detected:     5Highest MCMC sample interations: 
#> Chains detected: 
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_sim//case_study_2_sim_1_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_sim//case_study_2_sim_2_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_sim//case_study_2_sim_3_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_sim//case_study_2_sim_4_infection_histories.csv
#> X:/Program Files/R/R-3.6.2/library/serosolver/extdata/cs2_sim//case_study_2_sim_5_infection_histories.csv
#> [[1]]
#> [1] 641001
#> 
#> [[2]]
#> [1] 645313
#> 
#> [[3]]
#> [1] 647117
#> 
#> [[4]]
#> [1] 639237
#> 
#> [[5]]
#> [1] 640481

## Alternative, load the included MCMC chains rather than re-running
## data(cs2_chains_sim)
## sim_all_chains <- cs2_chains_sim

theta_chain <- sim_all_chains$theta_chain
## Get the MCMC chains as a list
list_chains <- sim_all_chains$theta_list_chains
## Look at diagnostics for the free parameters
list_chains1 <- lapply(list_chains, function(x) x[,c("mu","sigma1","error",
                                                     "tau","total_infections",
                                                     "lnlike","prior_prob")])

## Gelman-Rubin diagnostics and effective sample size
print(gelman.diag(as.mcmc.list(list_chains1)))
#> Potential scale reduction factors:
#> 
#>                  Point est. Upper C.I.
#> mu                     1.01       1.02
#> sigma1                 1.00       1.00
#> error                  1.00       1.00
#> tau                    1.00       1.00
#> total_infections       1.02       1.06
#> lnlike                 1.01       1.04
#> prior_prob             1.01       1.04
#> 
#> Multivariate psrf
#> 
#> 1.03
print(effectiveSize(as.mcmc.list(list_chains1)))
#>               mu           sigma1            error              tau 
#>         815.2872        1413.2702        2555.2807        1102.1962 
#> total_infections           lnlike       prior_prob 
#>         658.8086         844.4075         779.8205

melted_theta_chain <- reshape2::melt(as.data.frame(theta_chain), id.vars=c("samp_no","chain_no"))
estimated_pars <- c(sim_par_tab[sim_par_tab$fixed == 0,"names"],"total_infections")
melted_theta_chain <- melted_theta_chain[melted_theta_chain$variable %in% estimated_pars,]
colnames(melted_theta_chain)[3] <- "names"

add_row <- data.frame("total_infections",actual_total_infections,0,0.1,0,10000,0,0,1)
colnames(add_row) <- colnames(sim_par_tab)
sim_par_tab1 <- rbind(sim_par_tab, add_row)

ggplot(melted_theta_chain) + 
  geom_density(aes(x=value,fill=as.factor(chain_no)),alpha=0.5) +
  geom_vline(data=sim_par_tab1[sim_par_tab1$fixed == 0,],aes(xintercept=values),linetype="dashed") +
  facet_wrap(~names,scales="free") + 
  theme_classic() +
  theme(legend.position="bottom")

Recovery of known attack rates is also reasonably accurate, though the constraint of the posterior distibution is quite low for many years where identifiability is poor. Again, more titre data or more individuals would improve inferential power. One particularly reassuring plot is the comparison of known individual cumulative infection histories (the cumulative sum of infections over time for an individual) against the estimated posterior distribution of cumulative infection histories. We see that the 95% credible intervals capture the true cumulative infection histories in almost all cases.

## Extract infection history chain
inf_chain <- sim_all_chains$inf_chain

## Look at inferred attack rates
p_ar <- plot_attack_rates(inf_chain, sim_titre_dat, strain_isolation_times, pad_chain=FALSE,
                            plot_den = TRUE,prior_pars=list(prior_version=prior_version, 
                                                          alpha=par_tab[par_tab$names=="alpha","values"],
                                                          beta=par_tab[par_tab$names=="beta","values"]))  +
  geom_point(data=sim_ar,aes(x=year,y=AR),col="purple")
print(p_ar)


## Calculate convergence diagnostics and summary statistics on infection histories
## Important to scale all infection estimates by number alive from titre_dat
sim_n_alive <- get_n_alive_group(sim_titre_dat, strain_isolation_times,melt=TRUE)

## This function generates a number of MCMC outputs
ps_infhist <- plot_posteriors_infhist(inf_chain=inf_chain, 
                                      years=strain_isolation_times, 
                                      n_alive=sim_n_alive)
#> Padding inf chain...
#> Done
#> Calculating by time summaries...
#> Done
#> Calculating by individual summaries...
#> Done

## Check convergence of infection history summary statistics
## MCMC trace plots of attack rates
print(ps_infhist[["by_time_trace"]][[1]])

## MCMC trace plots of total number of infections per individual
print(ps_infhist[["by_indiv_trace"]][[1]])


## Check for agreement between inferred cumulative infection histories 
## for some individuals
p_indiv_inf_hists <- generate_cumulative_inf_plots(inf_chain,indivs=1:9,pad_chain=FALSE,
                                                   real_inf_hist=sim_infection_histories,
                                                  strain_isolation_times = strain_isolation_times,
                                                  number_col=3)
## Each subplot shows one individual
print(p_indiv_inf_hists[[1]])

## Posterior probability that infections occured at given times per individual
## Each subplot shows one individual
print(p_indiv_inf_hists[[2]])

Paper case study 2: serosolver