# Daily longitudinal sampling of SARS-CoV-2 infection reveals substantial heterogeneity in infectiousness – Nature Microbiology

This study was approved by the Western Institutional Review Board, and all participants provided informed consent.

### Participants

All on-campus students and employees of the University of Illinois at Urbana-Champaign are required to submit saliva for RT–qPCR testing every 2–4 days as part of the SHIELD campus surveillance testing programme. Individuals testing positive were instructed to isolate and were eligible to enrol in this study for a period of 24 h following receipt of their positive test result. Close contacts of individuals who test positive (particularly those co-housed with them) are instructed to quarantine and were eligible to enrol for up to 5 days after their last known exposure to an infected individual. All participants were also required to have received a negative saliva RT–qPCR result 7 days before enrolment.

Individuals were recruited via either a link shared in an automated text message providing isolation information sent within 30 min of a positive test result, a call from a study recruiter or a link shared by an enroled study participant or included in information provided to all quarantining close contacts. In addition, signs were used at each testing location and a website was available to inform the community about the study.

Participants were required to be at least 18 years of age, have a valid university ID, speak English, have Internet access and live within 8 miles of the university campus. After enrolment and consent, participants completed an initial survey to collect information on demographics and health history and were provided with sample collection supplies. Participants who tested positive before enrolment or during quarantine were followed for up to 14 days. Quarantining participants who continued to test negative by saliva RT–qPCR were followed for up to 7 days after their last exposure. All participants’ data and survey responses were collected in the Eureka digital study platform. All study participants were asked whether they had previously tested positive for SARS-CoV-2 or been vaccinated against SARS-CoV-2. All participants included in this cohort reported no previous SARS-CoV-2 infection and were unvaccinated at the time of enrolment.

### Sample collection

Each day, participants were remotely observed by trained study staff, who collected the following samples.

1. (1)

Saliva (2 ml), into a 50-ml conical tube

2. (2)

One nasal swab from a single nostril using a foam-tipped swab that was placed within a dry collection tube

3. (3)

One nasal swab from the other nostril using a flocked swab that was subsequently placed in a collection vial containing 3 ml of viral transport medium (VTM). Swab and VTM manufacturer were not changed throughout the study.

The order of nostrils (left versus right) used for the two different swabs was randomized. For nasal swabs, participants were instructed to insert the soft tip of the swab at least 1 cm into the indicated nostril until they encountered mild resistance, rotate the swab around the nostril five times and leave it in place for 10–15 s. After daily sample collection, participants completed a symptom survey. A courier collected all participant samples within 1 h of sampling using a no-contact pickup protocol designed to minimize courier exposure to infected participants.

### Saliva RT–qPCR

After collection, saliva samples were stored at room temperature and RT–qPCR was run within 12 h of initial collection in a Clinical Laboratory Improvement Amendments (CLIA)-certified diagnostic laboratory. The protocol for the covidSHIELD direct saliva-to-RT–qPCR assay used has been detailed previously24. In brief, saliva samples were heated at 95 °C for 30 min followed by the addition of 2× Tris/Borate/EDTA buffer (TBE) at a 1:1 ratio (final concentration 1× TBE) and Tween-20 to a final concentration of 0.5%. Samples were assayed using the Thermo Taqpath COVID-19 assay.

### Antigen testing

Foam-tipped nasal swabs were placed in collection tubes, transported in cold packs and stored at 4 °C overnight based on guidance from the manufacturer. The morning after collection, swabs were run through the Sofia SARS antigen FIA on Sofia devices according to the manufacturer’s protocol.

### Nasal swab RT–qPCR

Collection tubes containing VTM and flocked nasal swabs were stored at −80 °C after collection and were subsequently shipped to Johns Hopkins University for RT–qPCR and virus culture testing. After thawing, VTM was aliquoted for RT–qPCR and infectivity assays. One millilitre of VTM from the nasal swab was assayed on the Abbott Alinity, according to the manufacturer’s instructions, in a College of American Pathologist and CLIA-certified laboratory.

### Virus culture from nasal swabs

Vero-TMPRSS2 cells were grown in complete medium (CM) consisting of DMEM with 10% foetal bovine serum (Gibco), 1 mM glutamine (Invitrogen), 1 mM sodium pyruvate (Invitrogen), 100 U ml–1 penicillin (Invitrogen) and 100 μg ml–1 streptomycin (Invitrogen)57. Viral infectivity was assessed on Vero-TMPRSS2 cells as previously described using infection medium (identical to CM except that FBS is reduced to 2.5%)26. When a cytopathic effect was visible in >50% of cells in a given well, the supernatant was harvested. The presence of SARS-CoV-2 was confirmed through RT–qPCR, as described previously, by extracting RNA from the cell culture supernatant using the Qiagen viral RNA isolation kit and performing RT–qPCR using N1 and N2 SARS-CoV-2-specific primers and probes, in addition to primers and probes for the human RNaseP gene with the CDC research-use-only 2019-Novel Coronavirus (2019-nCoV) Real-time RT–PCR primer and probes sequences, and utilizing synthetic RNA target sequences to establish a standard curve58.

### Viral genome sequencing and analysis

Viral RNA was extracted from 140 µl of heat-inactivated (30 min at 95 °C, as part of the protocol detailed in ref. 24) saliva samples using the QIAamp viral RNA mini kit (Qiagen); 100 ng of viral RNA was used to generate complementary DNA using the SuperScript IV first strand synthesis kit (Invitrogen). Viral cDNA was then used to generate sequencing libraries utilizing the Swift SNAP Amplicon SARS CoV2 kit with additional coverage panel and unique dual indexing (Swift Biosciences), which were sequenced on an Illumina Novaseq SP lane. Data were run through the nf-core/viralrecon workflow (https://nf-co.re/viralrecon/1.1.0) using the Wuhan-Hu-1 reference genome (NCBI accession NC_045512.2). Swift v.2 primer sequences were trimmed before variant analysis from iVar v.1.3.1 (https://doi.org/10.1186/s13059-018-1618-7), retaining all calls with a minimum allele frequency of 0.01 and higher. Viral lineages were called using the Pangolin tool (https://github.com/cov-lineages/pangolin) v.2.4.2, pango v.1.2.6 and the 5/19/21 version of the pangoLEARN model based on the nomenclature system described in ref. 59.

### Statistics and reproducibility

Details of statistical analysis methods are given below. No statistical method was used to predetermine sample size. For some analyses, a small number of individuals were excluded for reasons detailed above, where relevant. Experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

### Statistical analyses

The difference in the distribution of a parameter of interest between the non-B.1.1.7 and B.1.1.7 infection groups was assessed using univariate analysis, and P values calculated using the Wilcoxon rank-sum test. Comparison of infectious virus shedding between the two groups was performed using multivariate analysis with age as an additional variate. Levels of infectious viral shedding, after adjusting for age, were predicted by assuming an age of 28 years—that, is the median age of the cohort (Fig. 4c).

### Generation of figures

All figures, except for Fig. 2a, were generated using RStudio. Figure 2a was generated using Microsoft Powerpoint.

### Overview of model construction and parameter estimation

The goal of quantitative analyses is to use mathematical models to characterize viral shedding dynamics based on both viral genome loads (as measured by RT–qPCR) and the presence or absence of infectious virus (as measured by viral culture assay). Analysing the model results, we quantify individual-level heterogeneity in both viral genome shedding dynamics and individual infectiousness. See Extended Data Fig. 6 for an overview of the analysis workflow.

First, we performed experiments to derive the calibration curves for transformation of Ct/CN values from RT–qPCR to viral genome loads (Viral genome load calibration from Ct/CN values). Note that, due to the nature of RT–qPCR assays and sampling noise, viral genome loads derived using calibration curves represent a proxy for the actual quantities. Nonetheless, this approach is the best available to derive viral genome loads for the purpose of viral dynamic modelling, and is widely used in understanding SARS-CoV-2 dynamics21,60.

Second, we constructed viral dynamic models and fit these to viral genome loads (Viral dynamics models). We estimated key parameters governing infection processes in the nasal- and the saliva-associated compartments, such as viral exponential growth rate before peak viral genome load and viral clearance rate. This allows us to characterize individual-level heterogeneity in infection kinetics.

Third, we constructed mathematical models to describe how the amount of infectious virus shed relates to changes in viral genome load, as measured by RT–qPCR (Modelling infectiousness of an individual). We fit the models to viral culture assay data. Using the best model and predicted viral genome load kinetics from the viral dynamics model, we predicted the extent of infectious virus shedding—that is the infectiousness, for each individual—and thus quantified the individual-level heterogeneity in infectiousness.

### Viral genome load calibration from Ct/CN values

#### Viral genome load calibration: nasal samples

To calculate viral genome loads from CN values reported for nasal samples, we performed calibration curve experiments to empirically define the relationship between CN values obtained from the RT–qPCR assay used on nasal swab samples, and absolute viral genome loads within samples, as quantified by ddPCR. We quantified viral genome loads for 62 nasal samples with CN values ranging between 17 and 38. For each sample, absolute copy numbers of viral genomes were measured using two different N-gene-specific primer sets (N1 and N2). To account for technical noise between samples, we also determined the concentration of the host RNAse P (RP) transcript as a control (Supplementary Table 10). We then normalized copy numbers of N1 and N2 targets by dividing by their corresponding RP target numbers, then multiplied the mean of RP concentration across all samples. Note that the unit of these measurements is per millilitre: this is because nasal swab samples were each collected in 3 ml of VTM.

Plotting the logarithm of normalized viral genome loads against the associated CN values shows a clear linear relationship, justifying the use of linear regression below. Linear regression lines with similar coefficients were used as calibration curves in other studies21,60. We also note that the noise in genome viral loads is high when CN values are high (for example, >33), probably a reflection of increased noise when the signal is low26. However, this high level of variation at high CN values will not impact on the conclusion of our study, because the range of viral loads relevant to transmission is much higher (>106 copies ml–1; Fig. 3d).

We then performed linear regression on measured CN values and log10 viral genome loads (Extended Data Fig. 9). This led to the following formula for the relationship between CN values and viral genome load:

$$\log _{10}V = 11.35 – 0.25{\mathrm{CN}}$$

where V and CN denote the viral genome load and CN value, respectively. Note that, because of the high number of data points measured, the level of uncertainty in the regression line is minimal (Extended Data Fig. 9).

#### Viral genome load calibration: saliva samples

Unlike for nasal samples, we were unable to measure the calibration curve using saliva samples taken from participants. To quantify the efficiency of the RT–qPCR assay used on saliva samples, we used data from calibration experiments in which saliva samples obtained from healthy donors were spiked with SARS-CoV-2 genomic RNA. More specifically, 0.9 ml of saliva from a healthy donor was spiked with 0.1 ml of 1.8 × 108, 5.4 × 105 or 6.0 × 104 RNA copies ml–1. For samples spiked with 1.8 × 108 RNA copies ml–1, tenfold serial dilutions were performed to a final concentration of 1.8 × 104 RNA copies ml–1. A total of 24 samples were collected and Ct values of the N gene then measured (Supplementary Table 11).

As above, we plotted the logarithm of viral loads against Ct values (Extended Data Fig. 10). The plot shows a clear linear relationship, justifying the use of linear regression below. We then performed linear regression on measured CN values and log10 viral genome loads (Extended Data Fig. 10). This led to the following formula for the relationship between CN values and viral genome load:

$$\log _{10}V = 14.24 – 0.28{\mathrm{Ct}}$$

where V and Ct denote viral genome load and Ct value, respectively. In regard to the nasal calibration curve, the level of uncertainties in the regression line is minimal (Extended Data Fig. 10).

Note that a major difference between samples spiked with viral genomes and those taken from infected individuals is that the latter are likely to be noisier because of variation in the sample collection process. However, the two approaches should not differ substantially in assessing the efficiency of the RT–PCR protocol. The impact of noise in the nasal sample can be minimized by taking a large number of samples over a wide range of CN values, as we did for the nasal samples. Therefore, the calibration curves derived above represent an accurate translation of Ct/CN values to viral load.

### Viral dynamics models

We constructed viral dynamics models to describe the dynamic changes in viral genome load. The viral genome load patterns in nasal and saliva samples are distinct from each other in many individuals, suggesting compartmentalization of infection dynamics in these two sample sites. Therefore, we use the models below to describe data collected from these two compartments separately. See Fig. 2a and Extended Data Fig. 4 for schematics of these models.

#### The target-cell-limited model

We first constructed a within-host model based on the target-cell-limited (TCL) model used for other respiratory viruses such as influenza61 and, more recently, SARS-CoV-2 (refs. 27,29,62). We keep track of the total numbers of target cells (T), cells in the eclipse phase of infection (E)—that is, infected cells not yet producing virus, productively infected cells (I) and viruses (V). The ordinary differential equations are:

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}T}}{{{\mathrm{d}}t}}} \hfill & = \hfill & { – \beta VT} \hfill \\ {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta VT – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta I} \hfill \\ {\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\pi I – cV} \hfill \end{array}$$

(1)

In this model, target cells are infected by virus with rate constant β, cells in the eclipse phase become productively infected cells at per-capita rate k and productively infected cells die at per-capita rate δ. We use V to describe viruses measured in nasal or saliva samples, representing a proportion of the total virus in the compartment under consideration. Therefore, rate π is the product of viral production rate per infected cell and the proportion of virus that is sampled (see Ke et al.27 for a detailed derivation). Viruses are cleared at per-capita rate c.

#### Refractory cell model

We extend the TCL model by including an early innate response—that is the type-I/III interferon response, where interferons are secreted from infected cells and bind to receptors on uninfected target cells, stimulating an antiviral response that renders them refractory to viral infection. Note that this is the best model to describe the viral genome load dynamics as measured by RT–qPCR from nasal samples.

We keep track of interferon (F) and cells refractory to infection (R), in addition to other quantities in the TCL model. The full ordinary differential equations (ODEs) for target cells, refractory cells and interferon are

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}T}}{{{\mathrm{d}}t}}} \hfill & = \hfill & { – \beta VT – \phi FT + \rho R} \hfill \\ {\frac{{{\mathrm{d}}R}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\phi FT – \rho R} \hfill \\ {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta VT – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta I} \hfill \\ {\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\pi I – cV} \hfill \\ {\frac{{{\mathrm{d}}F}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {sI – \mu F} \hfill \end{array}$$

(2)

In this model, the impact of the innate immune response is to convert target cells into refractory cells at rate ϕFT where ϕ is a rate constant. Refractory cells can become target cells again at rate ρ. Interferon is produced and cleared at rates s and μ, respectively.

For simplicity, and due to a lack of empirical data on interferon responses in our study, we simplify the model by making the quasi-steady-state assumption that the interferon dynamics are much faster than the dynamics of infected cells and assume that $$\frac{{{\mathrm{d}}F}}{{{\mathrm{d}}t}} = 0$$. Thus $$sI = \mu F$$ or $$F = \frac{s}{\mu }I$$.

Let $${\Phi} = \phi \frac{s}{\mu }$$, so that the ODEs for the innate immunity model become:

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}T}}{{{\mathrm{d}}t}}} \hfill & = \hfill & { – \beta VT – {\Phi}IT + \rho R} \hfill \\ {\frac{{{\mathrm{d}}R}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {{\Phi}IT – \rho R} \hfill \\ {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta VT – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta I} \hfill \\ {\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\pi I – cV} \hfill \end{array}$$

(3)

#### Viral production reduction model

In addition to making target cells refractory to infection, the impact of interferons may include reducing virus production from infected cells. We include this action of interferons in the viral production reduction model. As above, we make the quasi-steady-state assumption that interferon dynamics are much faster than those of infected cells and assume that F is proportional to I. The ODEs for the model are:

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}T}}{{{\mathrm{d}}t}}} \hfill & = \hfill & { – \beta VT} \hfill \\ {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta VT – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta I} \hfill \\ {\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\frac{\pi }{{1 + \gamma I}}I – cV} \hfill \end{array}$$

(4)

where γ is a constant representing the effect of interferon in reducing viral production.

#### Immune effector cell model

Over the course of infection, immune effector cells are activated and recruited to kill infected cells. These immune effector cells include innate immune cells such as macrophages and natural killer cells, as well as cells developed during the adaptive immune response such as cytotoxic T lymphocytes and antibody-secreting B cells. To consider the impact of these immune effector cells, we develop a model—the effector cell model—based on a previous model for influenza infection28. In this model, we assume that the death rate of infected cells is δ1 at the beginning of the infection. This may reflect the cytotoxic effects of viral infection. After time t1, the death rate of infected cells increases by δ2, where δ2 models the killing of infected cells by immune effector cells. The ODEs for the model are:

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}T}}{{{\mathrm{d}}t}}} \hfill & = \hfill & { – \beta VT} \hfill \\ {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta VT – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta (t)I} \hfill \\ {\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\pi I – cV} \hfill \\ {\delta \left( t \right)} \hfill & = \hfill & {\left\{ {\begin{array}{*{20}{l}} {\delta _1} \hfill & {t < t_1} \hfill \\ {\delta _1 + \delta _2} \hfill & {t \ge t_1} \hfill \end{array}} \right.} \hfill \end{array}$$

(5)

Note that this is the best model to describe the viral genome load dynamics as measured by RT–qPCR from saliva samples.

#### Combined model

In the full model, we combine the refractory cell model and immune effector cell model to consider both the immediate interferon response and immune effector response. The ODEs for the model are:

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}T}}{{{\mathrm{d}}t}}} \hfill & = \hfill & { – \beta VT – {\Phi}IT + \rho R} \hfill \\ {\frac{{{\mathrm{d}}R}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {{\Phi}IT – \rho R} \hfill \\ {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta VT – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta (t)I} \hfill \\ {\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\pi I – cV} \hfill \\ {\delta \left( t \right)} \hfill & = \hfill & {\left\{ {\begin{array}{*{20}{l}} {\delta _1} \hfill & {t < t_1} \hfill \\ {\delta _1 + \delta _2} \hfill & {t \ge t_1} \hfill \end{array}} \right.} \hfill \end{array}$$

(6)

### Total target cell numbers

We calculate the total numbers of target cells in the nasal and saliva compartments by multiplying the total number of epithelial cells in these two compartments by the fraction of epithelial cells expected to be targets for SARS-CoV-2 infection.

For the total number of epithelial cells in the nasal compartment, we use the estimate from Baccam et al.61, 4 × 108 cells. This is calculated from the estimate that the surface area of the nasal turbinates is 160 cm2 (ref. 63) and the surface area per epithelial cell is 2 × 10−11 to 4 × 10−11 m2 per cell (ref. 61). For the saliva compartment, the total surface area of the mouth was estimated to be 214.7 cm2 (ref. 64). Therefore, we estimate that the total number of epithelial cells in the mouth is approximately 4 × 108 × 214.7/160 = 5.4 × 108.

Hou et al. estimated that the fraction of cells expressing angiotensin-converting enzyme 2—that is, the receptor for SARS-CoV-2 entry—on the cell surface is approximately 20% in the upper respiratory tract65. Therefore, in our model, the initial numbers of target cells in the nasal and saliva compartments are calculated as 4 × 108 × 20% = 8 × 107 and 5.4 × 108 × 20% = 1.08 × 108, respectively.

Note that these estimates are approximations using available best estimates in the literature. For a standard viral dynamics model, the number of initial target cells and virus production rate are unidentifiable and only their product is identifiable66. Thus, if the actual number of target cells differs from that estimated here, an increase in the initial number of target cells will lead to a corresponding decrease in the estimate of virus production rate, and vice versa.

### Initial number of infected cells

We assume that one cell in the compartment of interest is infected at the start of infection, E0 = one cell, consistent with refs. 27,67. The small number of infected cells is also consistent with a recent work which estimated from sequencing data that the transmission bottleneck is small for SARS-CoV-2 and that there are probably between one and three infected cells at the initiation of infection68,69,70. Note that, in an earlier work, we showed that changes in the number of initially infected cells of between one and five in the model do not substaintially change the inference results27.

#### Initial viral growth rate, r

For all models above, the initial growth of the viral population before peak viral genome load is dominated by viral infection. This means that the immune responses considered in our models act to change the viral growth trajectory substantially only at later time points71. Thus, we derive an approximation to the initial viral growth rate using the TCL model only (equation (1)). This approximation also represents a good approximation for other models.

We first make two simplifying assumptions commonly used in analysis of the initial dynamics of viral dynamic models72,73. First, because at the initial stage of infection the number of infected cells is orders of magnitude lower than the number of target cells, we assume that the number of target cells is at a constant level, T0. Second, the dynamics of viruses are much faster than those of infected cells. For example, the rate of viral clearance is in the time scale of minutes and hours whereas the death of productively infected cells is in days. Therefore, we make the quasi-steady-state assumption, $$\frac{{{\mathrm{d}}V}}{{{\mathrm{d}}t}} \approx 0$$, such that the concentrations of viruses are always in proportion to the concentration of productively infected cells—that is, $$\pi I \approx cV$$. This gives $$V \approx \frac{\pi }{c}I$$.

With these two assumptions, equation (1) becomes a system of linear ODEs with two variables, E and I:

$$\begin{array}{*{20}{l}} {\frac{{{\mathrm{d}}E}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {\beta \frac{\pi }{c}IT_0 – kE} \hfill \\ {\frac{{{\mathrm{d}}I}}{{{\mathrm{d}}t}}} \hfill & = \hfill & {kE – \delta I} \hfill \end{array}$$

(7)

The Jacobian matrix, J, for this system of ODEs is:

$$J = \left[ {\begin{array}{*{20}{c}} { – k} & {\beta \frac{\pi }{c}T_0} \\ k & { – \delta } \end{array}} \right]$$

The initial growth rate, r, is the leading eigenvalue of the Jacobian matrix of the ODE system. We calculate the eigenvalues, λ, for the Jacobian matrix above from $$\left| {J – \lambda I} \right| = 0$$, where I is the identity matrix, and get:

$$\lambda = \frac{1}{2}\left[ { – \left( {k + \delta } \right) \pm \sqrt {\left( {k + \delta } \right)^2 + 4k\delta \left( {R_0 – 1} \right)} } \right]$$, where $$R_0 = \frac{{\beta \pi }}{{\delta c}}T_0$$.

Then, the leading eigenvalue—that is, the initial growth rate r— is:

$$r = \frac{1}{2}\left[ { – \left( {k + \delta } \right) + \sqrt {\left( {k + \delta } \right)^2 + 4k\delta \left( {R_0 – 1} \right)} } \right].$$

(8)

### Fitting viral dynamic models to viral genome load data

We took a non-linear mixed-effect modelling approach to fit the viral dynamic models to viral genome load data from all individuals simultaneously. All estimations were performed using Monolix (Monolix Suite 2019R2, Lixoft: https://lixoft.com/products/monolix/). We allowed random effects on the fitted parameters (unless specified otherwise). All population parameters, except for the starting time of simulation, t0, are positive and therefore we assume that they follow log-normal distributions. For t0 we assume a normal distribution because t0 can be positive or negative.

The parameters β and π in the viral dynamic models strongly correlate with each other when the models are fitted to viral genome load data66. We tested three choices in handling this correlation in fitting all five viral dynamic models: (1) a correlation is assumed between parameter β and π in Monolix; (2) parameter β has a fixed effect only (that is, its value is set to be the same across all individuals); and (3) parameter π has a fixed effect only.

To test whether the age of the individuals and/or the infecting viral genotype (categorized as either non-B.1.1.7 or B.1.1.7) explains the heterogenous patterns in viral genome load trajectories across the cohort, we tested whether they covary with any of the fitted parameters in the model by setting the two variables as a continuous and a categorical covariate, respectively, in Monolix.

The assumptions on parameters β and π and the choice of parameters that covariate with age or viral strain of infection led to a large number of model choices for fitting. Therefore, we took the following strategy to ensure that we identified the best model and parameter combinations to describe the data.

• First, we tested the three assumptions about parameters β and π in the five viral dynamic models without any covariate and selected the best assumption for further analysis based on their corrected Akaike information criterion (AICc) scores.

• Second, using the best assumption, we tested the model by including the age of the individuals as a continuous covariate of all fitted parameter values with a random effect first. We then took an iterative approach to test whether the covariate should be removed from any of the parameters in the model using Pearson’s correlation test in Monolix. The parameter(s) that has a non-significant P value (P > 0.05) or with the lowest P value is removed from next round of parameter fitting. We iterated the process until all parameters were removed.

• The best model variant with the lowest AICc score was then selected for analysis on whether parameter estimates differed in individuals infected by different viral strains. As before, we took an iterative approach. We first set the viral strain—that is, non-B.1.1.7 or B.1.1.7—as a categorical covariate of all fitted parameter values with a random effect in the model. We then tested whether the covariate should be removed from any of the parameters in the model using the analysis of variance in Monolix. The parameter(s) that has a non-significant P value (P > 0.05) or with the lowest P value is removed from the next round of parameter fitting. We iterated the process until all parameters were removed.

• Finally, the model variant with the lowest AICc score was selected as the best model.

#### Prediction of viral genome load trajectories for non-B.1.1.7 and B.1.1.7 strains

We randomly sampled 5,000 sets of parameter combinations from the distribution specified by the best-fit population parameters (Supplementary Table 4). For the effector cell model for the saliva compartment, β and π are strongly correlated. We thus applied formulations such that correlations between the two parameter values are preserved in the random sampling in accordance with the estimated correlation coefficient. We simulated the best-fit model using the 5,000 sets of parameter combinations for each of the strain. The median and the fifth and 95th quantilse of viral genome loads at each time points are reported.

### Modelling infectiousness of an individual

We model how infectiousness depends on the viral genome load in an individual, similarly to the framework proposed in Ke et al.27. Specifically, we first use the viral culture data collected in this study to infer how the level of infectious virus shed relates to viral genome loads as measured by RT–qPCR. From this model, we predict how the level of infectious virus shedding changes over time in each individual and how the overall infectiousness of the infection varies among participants.

#### Relationship between viral genome load and infectious viruses

We first consider three alternative models describing how the amount of infectious virus in a sample is related to viral genome load (derived from the CN values): the ‘linear’ model, ‘power-law’ model and ‘saturation’ model. In these models, due to the nature of stochasticity in sampling, we assume the number of infectious viruses that was in the sample for cell culture experiment to be a random variable, Y, that follows a Poisson distribution, with Vinf representing the expected number of infectious viruses—that is, $$V_{{\mathrm{inf}}} = E(Y)$$.

1. (1)

The linear model:

We assume that Vinf, is proportional to the viral genome load, V, in the sample:

$$V_{{\mathrm{inf }}} = E(Y) = AV$$

(9)

where A is a constant.

2. (2)

The power-law model:

We assume that Vinf is related to the viral genome load, V, by a power function:

$$V_{{\mathrm{inf}}} = E(Y) = BV\,^h$$

(10)

where B and h are constants.

3. (3)

The saturation model:

We assume that Vinf is related to the viral genome load, V, by a Hill function:

$$V_{{\mathrm{inf}}} = E(Y) = V_m\frac{{V^h}}{{V^h + K_m^h}}$$

(11)

where Vm and Km are constants and h is the Hill coefficient.

#### Probability of cell culture being positive

If each infectious virus has a probability $${\it{\varrho }}$$ to establish infection such that the cell culture becomes positive, the number of viruses that successfully establish an infection in cell culture is Poisson distributed with parameter $$\lambda = E\left( Y \right){\it{\varrho }} = V_{{\mathrm{inf}}}{\it{\varrho }}$$. Thus, the probability of one or more viruses successfully infecting the culture so that it tests positive is

$$p_{\mathrm{positive}} = 1 – \exp \left( { – \lambda } \right) = 1 – {{{\mathrm{exp}}}}( – V_{{\mathrm{inf}}}{\it{\varrho }})$$

(12)

Substituting the expressions of Vinf from the three models above, we get the following expressions for ppositive from the three models (note that we use the subscripts ‘1’, ’2’ and ‘3’ to denote the three models for Vinf):

$$p_{{\mathrm{positive}},1} = 1 – \exp \left( { – V_{\mathrm{inf}}{\it{\varrho }}} \right) = 1 – \exp \left( { – DV} \right)$$

(13)

where $$D = A{\it{\varrho }}$$.

$$p_{{\mathrm{positive}},2} = 1 – \exp \left( { – V_{{\mathrm{inf}}}{\it{\varrho }}} \right) = 1 – \exp \left( { – GV\,^h} \right)$$

(14)

where $$G = B{\it{\varrho }}$$.

$$p_{{\mathrm{positive}},3} = 1 – \exp \left( { – V_{{\mathrm{inf}}}{\it{\varrho }}} \right) = 1 – \exp \left( { – J\frac{{V^h}}{{V^h + K_m^h}}} \right)$$

(15)

where $$J = V_m{\it{\varrho }}$$.

Note that, from the expressions above, it becomes clear that we will not be able to estimate parameters A, B and Vm in the three models because they appear as products with the unknown parameter $${\it{\varrho }}$$ in the equations. This means that the viral culture data do not allow us to estimate the absolute number of infectious viruses in a sample or provide a viral genome load; instead, we are able to estimate a quantity that is a constant proportion of the actual number of infectious viruses over time and across individuals. Therefore, we report estimations of infectious viruses in arbitrary units. These estimates represent a relative measure of infectiousness. Two estimates measured at different time points and/or from different individuals can be compared using this method.

#### Model fitting using a population effect modelling approach

For each sample, viral genome load and cell culture positivity were measured. Using these data, we estimate parameter values in the three models by minimizing the negative log-likelihood of the data.

More specifically, the likelihood of the mth observation being positive or negative in cell culture is calculated as:

$$p_{i,m} = \left\{ {\begin{array}{*{20}{l}} {p_{{\mathrm{positive}},i}(V_m),} \hfill & {{\mathrm{if}}\,{\mathrm{the}}\,{\it{k}}{\mathrm{th}}\,{\mathrm{observation}}\,{\mathrm{is}}\,{\mathrm{positive}}} \hfill \\ {1 – p_{{\mathrm{positive}},i}\left( {V_m} \right),} \hfill & {{\mathrm{if}}\,{\mathrm{the}}\,{\it{k}}{\mathrm{th}}\,{\mathrm{observation}}\,{\mathrm{is}}\,{\mathrm{negative}}} \hfill \end{array}} \right.$$

(16)

where Vm is the viral genome load of the mth observation.

Because we have the paired nasal RT–qPCR and viral culture data for each individual, we fit the three mathematical models using a nonlinear mixed-effect modelling approach. Again, all estimations were performed using Monolix. We allowed random effects on the fitted parameters (unless specified otherwise). All population parameters with a random effect are assumed to follow log-normal distributions.

To find the best model explaining the data, we tested models with different combinations of parameters either with or without a random effect (Supplementary Table 7). The model with the lowest AIC score was selected as the best model.

Note that, for each of the three models, we tested a model variation where all parameters in the models have fixed effects only—that is, a single set of parameters is used to explain viral culture data from every individual. In this case, there is no heterogeneity in parameter values across individuals. The resulting AIC scores are significantly worse than the best-fit model assuming random effects on parameters (Supplementary Table 7). This indicates that there is a substantial level of individual heterogeneity in the relationship between infectious virus shedding and viral genome loads (as shown in Fig. 3d).

#### Calculation of CIs of the cell culture positivity curve (Fig. 3c)

Similar to the procedures performed for prediction of CIs of viral genome load trajectories, we randomly sampled 5,000 sets of parameter combinations from the distribution specified by the best-fit population parameters of the best model—that is, the saturation model assuming that Km has only a fixed effect (Supplementary Table 8). More specifically, we sampled parameters from a log-normal distribution for J and h, with their means and standard deviations at the best-fit values. Using the parameter combinations, we generated curves of probability of cell culture positivity at CN values ranging between 10 and 40. The median and the fifth and 95th quantiles of viral genome loads at each CN values are reported.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

#Daily #longitudinal #sampling #SARSCoV2 #infection #reveals #substantial #heterogeneity #infectiousness #Nature #Microbiology