Estimating Vaccine Efficacy from Household Data Using Surrogate Outcome and a Validation Sample

Household data are frequently used in estimating vaccine efficacy because they provide information about every individual's exposure to vaccinated and unvaccinated infected household members. This information is essential for reliable estimation of vaccine efficacy for infectiousness ($VE_I$), in addition to vaccine efficacy for susceptibility ($VE_S$). However, accurate infection outcome data are not always available for each person because of high cost or the lack of feasible methods to collect this information. Lack of reliable data on true infection status may result in biased or inefficient estimates of vaccine efficacy. In this paper, a semiparametric method that uses surrogate outcome data and a validation sample is introduced for estimation of $VE_S$ and $VE_I$ from a sample of households. The surrogate outcome data are usually based on illness symptoms. We report the results of simulations conducted to examine the performance of the estimates, compare the proposed semiparametric method with maximum likelihood methods that use either the validation data only or the surrogate data only, and address study design issues. The new method shows improved precision compared to a method based on the validation sample only and smaller bias compared to a method using surrogate outcome data only. In addition, the use of household data is shown to greatly reduce the attenuation in the estimate of $VE_S$ due to misclassification of the outcome, compared to the use of a random sample of unrelated individuals.


Introduction
Estimation of vaccine efficacy has traditionally focused on the vaccine-induced reduction in susceptibility to infection, or vaccine efficacy for susceptibility ($VE_S$). However, a vaccine, such as a prophylactic HIV vaccine, may also lower the infectiousness of a vaccinated person who became infected (Longini et al., 1996). The relative reduction in infectiousness due to a vaccine is the vaccine efficacy for infectiousness, or $VE_I$. Both $VE_S$ and $VE_I$ are measures of the true biological effects of a vaccine.
In general, $VE$ is expressed as $1 - RR$, where $RR$ is a measure of relative risk in vaccinated individuals compared to unvaccinated individuals, under the assumption of equal exposure to the infectious agent. Different levels of information are required to estimate $VE_S$ depending on what parameterization is used (Halloran et al., 1997). Haber et al. (1991) defined $VE_S$ in terms of the transmission probability to a susceptible individual who makes a contact with an infectious person. $VE_S$ is defined as one minus the ratio of the transmission probabilities to a vaccinated and an unvaccinated susceptible person when both are exposed to the same source of infection. $VE_I$ measures the effect of a vaccine on the infectiousness of a vaccinated infected person. It is defined as one minus the ratio of the transmission probabilities from a vaccinated and an unvaccinated infected individual when they make contacts with a susceptible person (Koopman and Little, 1995). Estimation of $VE_I$ is challenging because it requires information on exposure to infection, and gathering this type of information is often expensive, difficult or even impossible. Therefore, $VE_I$ cannot be estimated from a sample of unrelated individuals. Data based on a sample of households provide information on everyone's exposure to both vaccinated and unvaccinated infected individuals. The information on infections contracted from vaccinated persons who became infected is essential for reliable estimation of $VE_I$. Davis and Haber (2001) developed a maximum likelihood method for the estimation of $VE_S$ and $VE_I$ from household data.
The problem of estimating $VE_S$ and $VE_I$ is further complicated by the fact that reliable infection outcome data are often expensive or difficult to collect from each individual in a vaccine study. For example, in an influenza vaccine study, a culture or a quick test of a sputum or a blood sample would be required to confirm infection (Halloran and Longini, 2001). Confirming infection in all study participants by cultures or samples can be very expensive and time consuming. Often, a closely related outcome may be used as a surrogate for the infection outcome. For example, an illness outcome defined as 'any respiratory illness' can be used as a surrogate for the infection outcome in an influenza vaccine study.
The use of surrogate outcome variables is common in medical research, especially in clinical settings (Prentice, 1989; Wittes et al., 1989; Fleming et al., 1994). In identifying 'valid' surrogates, Prentice (1989) suggested the criterion that a test of the null hypothesis using a surrogate w provides valid inference regarding the true outcome x. He also provided general guidelines for choosing variables to satisfy this definition of surrogacy. According to his definition, a key property of a potential surrogate is that P(x|w, m) = P(x|w) almost surely, where m is a covariate or a treatment indicator. This implies that the effect of treatment on the true outcome should act solely through the surrogate w. This is the foundation for making inference about the true outcome based solely on the surrogate. However, this assumption may not be satisfied in many applications. For example, in the case of an infectious disease it is possible that the vaccine affects the probability that an ill person is indeed infected. To relax this assumption, Pepe (1992) proposed a semiparametric method that uses a validation sample to relate the true and surrogate outcomes. She showed that this semiparametric method allows direct inference regarding the association between the true outcome and the covariates. Golm et al. (1998, 1999) explored the use of semiparametric methods with validation samples for exposure-to-infection information to estimate $VE_I$ in trials of human immunodeficiency virus vaccines. Their methods assume that exposure-to-infection, which is a covariate, may be mismeasured while the outcome (infection) is always correctly assessed. Halloran and Longini (2001) illustrated the use of validation sets to correct the attenuated estimate of $VE_S$ for mismeasured outcome data. They used an example of influenza vaccine efficacy and effectiveness trials under the assumption that the group of influenza-like cases includes true and misspecified influenza infection cases. When estimating $VE_S$ alone from final attack rates, Halloran and Longini multiplied in, for each vaccination stratum (i.e., vaccinated or nonvaccinated), an estimated probability (assumed constant over time) that an influenza-like case is a true influenza infection. Currently, there is no method available for estimating $VE_S$ and $VE_I$ from data with mismeasured outcome information.
The purpose of this work is to develop and evaluate a semiparametric method for simultaneous estimation of $VE_S$ and $VE_I$ from household data when the true infection status is observed on everybody in a validation sample of households and a surrogate illness outcome is observed on every study participant. We extend the method of Pepe (1992) to the case where the units of analysis are households of various sizes, the true outcome is the array of the (correlated) infection statuses of all household members, the surrogate outcome is the corresponding array of illness statuses and the treatment indicator is the corresponding array of vaccination statuses. As mentioned earlier, household data are used because they contain information on the vaccination and infection or illness status of each household member. In other words, they provide information about every individual's exposure to vaccinated and unvaccinated infected or ill household members, which is necessary for reliable estimation of the vaccine effect on infectiousness. One should note that for a study participant in a household where the true infection statuses may be misclassified, we have incomplete information on both the outcome variable (her/his own infection status) and the exposure variables (the infection statuses of all other household members).

Study design
We consider an outbreak of an infectious disease which is transmitted from person to person in a closed community. Once a susceptible person becomes infected, she or he is infectious to others for a relatively short time and then becomes immune at least until the outbreak is over. The community consists of many small transmission units, which will be referred to as households. (Sexual partnerships can be viewed as households of size two.) For simplicity, we assume that everybody, except for a small number of initial infectives, is susceptible at the beginning of the study. (Individuals who are initially immune can be excluded from the study without loss of any relevant information.) A susceptible person can become infected from an infectious household member or from 'the community', i.e., from an infectious person in another household. Prior to the outbreak, individuals may be vaccinated with a 'leaky' vaccine, i.e., a vaccine that reduces their susceptibility by lowering their probability of becoming infected. The vaccine may also reduce an individual's infectiousness by lowering her/his probability of infecting others in case she/he becomes infected (a vaccine breakthrough). The main purpose of the study is to evaluate the vaccine's effects on the susceptibility and infectiousness of a vaccinee as compared to an unvaccinated person.
For the purpose of the study, we assume that two samples of households are selected from the community. In the first sample, which will be referred to as the validation sample, both the true infection outcome and a related surrogate outcome, which is usually based on illness symptoms, are available for all the members of every household. We denote the number of households in the validation sample by $N_v$. In the second sample, which will be called the surrogate sample, only the surrogate illness outcome is known for everyone. There are $N - N_v$ households in the surrogate sample, where $N$ is the total number of households in the study.
Consider a household with $s = s_0 + s_1$ initial susceptibles, where $s_0$ and $s_1$ are the numbers of unvaccinated and vaccinated susceptible household members, respectively. Let $m_i$ denote the vaccination status of person i, with $m_i = 1$ for vaccinated and $m_i = 0$ for unvaccinated. The array $m = (m_1, ..., m_s)$ denotes the vaccination statuses of all the susceptible household members. Let $x_i$ be the infection status of person i at the end of the outbreak, with $x_i = 1$ for infected and $x_i = 0$ for uninfected. Finally, let $w_i$ be the surrogate outcome (i.e., illness) of person i, with $w_i = 1$ for ill and $w_i = 0$ for not ill. For households in the validation sample, the true infection outcome array $x = (x_1, ..., x_s)$ and the surrogate outcome array $w = (w_1, ..., w_s)$ are known. For households in the surrogate sample, only the surrogate outcome array w is known. Table 1 describes the data structure.
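The data structure described above can be sketched as a small record type. This is only an illustration: the class name `Household` and its field and property names are hypothetical and do not come from the paper; the sketch simply encodes that x is available only for households in the validation sample.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Household:
    """One household; every tuple has length s (number of initial susceptibles)."""
    m: Tuple[int, ...]                    # vaccination statuses, 1 = vaccinated
    w: Tuple[int, ...]                    # surrogate (illness) outcomes, 1 = ill
    x: Optional[Tuple[int, ...]] = None   # true infection outcomes; None in the surrogate sample

    def __post_init__(self):
        if len(self.m) != len(self.w):
            raise ValueError("m and w must have the same length")
        if self.x is not None and len(self.x) != len(self.m):
            raise ValueError("x must have the same length as m")

    @property
    def in_validation_sample(self) -> bool:
        # The true outcome is observed only for validation households.
        return self.x is not None

    @property
    def s1(self) -> int:
        # Number of vaccinated susceptibles.
        return sum(self.m)

    @property
    def s0(self) -> int:
        # Number of unvaccinated susceptibles.
        return len(self.m) - self.s1


# Hypothetical example records (not data from the study).
validated = Household(m=(0, 1, 1), w=(1, 0, 1), x=(1, 0, 0))
surrogate_only = Household(m=(0, 0, 1, 1), w=(0, 1, 0, 0))
```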
Table 1: Sample data structure of a household study with a validation sample of $N_v$ households and a surrogate sample of $N - N_v$ households. s is the size (number of initial susceptibles) of the household; m is the array of vaccination statuses of all household members; x is the true infection outcome; w is the surrogate illness outcome. The validation indicator equals 1 for a household in the validation sample and 0 for a household in the surrogate sample.

Columns: Household, s, m, x, w.

Calculation of P (x|m)
To write an expression for the probability P(x|m) of infection outcome x in a household with vaccination pattern m, we first need to define the transmission probabilities and the effects of the vaccine. Let β denote the probability that an unvaccinated susceptible becomes infected from the community during the course of the epidemic, and let γ denote the probability that the same person is infected from an unvaccinated household member while the latter is infectious. The vaccine efficacy for susceptibility, $VE_S$, is the relative reduction due to vaccination in the transmission probability to a vaccinated susceptible. The vaccine efficacy for infectiousness, $VE_I$, is the relative reduction due to vaccination in the transmission probability from a vaccinated infectious person. Define $θ = 1 - VE_S$ and $ϕ = 1 - VE_I$. Then the transmission probability from the community to a vaccinated susceptible is βθ. The transmission probability from an infected person to a susceptible household member is γθ when the susceptible person is vaccinated and the infected person is unvaccinated; it is γϕ when the susceptible is unvaccinated and the infected is vaccinated; and it is γθϕ when both are vaccinated.
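The four within-household cases and the two community cases can be written as two short functions. This is a minimal sketch; the function names `community_prob` and `household_prob` are hypothetical, and the numeric values in the usage line are arbitrary.

```python
def community_prob(beta, theta, m_susceptible):
    """Probability of infection from the community: beta if unvaccinated, beta*theta if vaccinated."""
    return beta * (theta if m_susceptible else 1.0)


def household_prob(gamma, theta, phi, m_susceptible, m_infective):
    """Probability of infection from one infected household member.

    gamma is scaled by theta when the susceptible is vaccinated and by phi
    when the infective is vaccinated, as in the text.
    """
    p = gamma
    if m_susceptible:
        p *= theta
    if m_infective:
        p *= phi
    return p


# Arbitrary illustrative values: gamma = 0.2, theta = 0.4 (VE_S = 0.6), phi = 0.5 (VE_I = 0.5).
both_vaccinated = household_prob(0.2, 0.4, 0.5, m_susceptible=1, m_infective=1)
```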
For the infection outcome x, let $j_0$ and $j_1$ be the numbers of infected persons among the unvaccinated and vaccinated household members, respectively. Then $j = j_0 + j_1 = \sum_i x_i$. Let J denote the subset of the j household members who became infected. Then for j = 0, 1, 2, ..., s − 1 (i.e., not everybody in the household became infected), P(x|m) is given by a product of seven terms, referred to as (2.1). The first term in (2.1) denotes the probability that everybody in subset J would become infected if there were no other members in the household. The second and third terms are the probabilities that all non-infected unvaccinated and vaccinated household members, respectively, escaped infection from the community. The next two terms are the probabilities that all non-infected unvaccinated and vaccinated household members escaped infection from the $j_0$ unvaccinated infected members. The last two terms are the corresponding escape probabilities from the $j_1$ vaccinated infected members. For a proof of (2.1) see Longini et al. (1988).
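The display for (2.1) can be written out from the term-by-term description above. The following is a reconstruction under that description, not a verbatim copy of the original equation; in particular the notation $P(x_J = \mathbf{1} \mid m_J)$ for the first term is an assumption.

```latex
P(x \mid m) = P(x_J = \mathbf{1} \mid m_J)\,
  (1-\beta)^{s_0 - j_0}\,(1-\beta\theta)^{s_1 - j_1}\,
  (1-\gamma)^{j_0 (s_0 - j_0)}\,(1-\gamma\theta)^{j_0 (s_1 - j_1)}\,
  (1-\gamma\phi)^{j_1 (s_0 - j_0)}\,(1-\gamma\theta\phi)^{j_1 (s_1 - j_1)}
  \tag{2.1}
```

Here $P(x_J = \mathbf{1} \mid m_J)$ is the probability that all members of a household consisting only of the subset J become infected, $s_0 - j_0$ and $s_1 - j_1$ count the unvaccinated and vaccinated members who escaped infection, and each exponent of the form $j_a (s_b - j_b)$ counts the infective-susceptible pairs of the corresponding vaccination types.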
The probability that everybody in the household became infected, i.e., P(x = 1|m), is obtained as one minus the sum of the expressions (2.1) over all outcomes with j = 0, 1, 2, ..., s − 1. Thus, a recursive computation is involved in calculating the probabilities of the infection outcomes x. For a household of size s, one first needs to calculate the probabilities of all possible outcomes for households of sizes 1, 2, ..., s − 1.
If the true infection outcome is available for all the study participants, then the likelihood function is obtained as the product of all the terms P(x|m) over all the households in the study. Maximization of the likelihood will then provide estimates of the parameters β, γ, θ and ϕ (Davis and Haber, 2001).
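The recursion can be sketched as follows. This is an illustrative implementation based on the verbal description of (2.1), not the authors' Fortran code; the names `make_outcome_prob`, `q` and `prob_x_given_m` are hypothetical. It relies on the fact that, under this model, the probability of a specific outcome x depends on x and m only through the counts $(j_0, j_1, s_0, s_1)$.

```python
from functools import lru_cache
from math import comb


def make_outcome_prob(beta, gamma, theta, phi):
    """Return q(j0, j1, s0, s1): probability that a *specific* set of j0 unvaccinated
    and j1 vaccinated members is infected in a household with s0 unvaccinated and
    s1 vaccinated susceptibles, and everybody else escapes."""

    @lru_cache(maxsize=None)
    def q(j0, j1, s0, s1):
        if (j0, j1) == (s0, s1):
            if s0 + s1 == 0:
                return 1.0  # empty household: the only outcome has probability one
            # Everybody infected: one minus the sum over all other outcomes,
            # counting how many subsets share each (k0, k1).
            total = 0.0
            for k0 in range(s0 + 1):
                for k1 in range(s1 + 1):
                    if (k0, k1) != (s0, s1):
                        total += comb(s0, k0) * comb(s1, k1) * q(k0, k1, s0, s1)
            return 1.0 - total
        u0, u1 = s0 - j0, s1 - j1  # uninfected unvaccinated / vaccinated members
        escape = (
            (1 - beta) ** u0 * (1 - beta * theta) ** u1
            * (1 - gamma) ** (j0 * u0) * (1 - gamma * theta) ** (j0 * u1)
            * (1 - gamma * phi) ** (j1 * u0) * (1 - gamma * theta * phi) ** (j1 * u1)
        )
        # First term: the infected subset J treated as a household on its own.
        return q(j0, j1, j0, j1) * escape

    return q


def prob_x_given_m(x, m, q):
    """P(x | m) for arrays x and m of equal length, using the count-based function q."""
    s1 = sum(m)
    s0 = len(m) - s1
    j1 = sum(xi for xi, mi in zip(x, m) if mi == 1)
    j0 = sum(x) - j1
    return q(j0, j1, s0, s1)
```

With the true outcome observed for every household, the log-likelihood would be the sum of `log(prob_x_given_m(x, m, q))` over households, evaluated at candidate values of (β, γ, θ, ϕ).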

The semiparametric method
We propose a semiparametric method to estimate θ and ϕ (i.e., $VE_S$ and $VE_I$) using the surrogate and validation samples. The validation sample is used to relate the true and the surrogate outcomes (x and w) and thus to reduce the bias of the parameter estimates. The surrogate sample is used to improve the efficiency of the estimates. A semiparametric method is used to avoid specifying, and possibly misspecifying, the relationship between the true outcome and the surrogate outcome while still making valid inference on the parameters of interest (Pepe, 1992). A semiparametric method that places no structure on the conditional probability function P(w|x, m) is desirable since the relationship between the true outcome x and the surrogate outcome w is not of primary interest.
Given that no structure is specified for P(w|x, m), we assume that P(w|x, m) is independent of Θ, where Θ = (β, γ, θ, ϕ). In other words, the parameters related to transmission and vaccine effects do not affect the probability that an infected person develops illness symptoms. On the other hand, we allow the probability of illness given infection to depend on the actual vaccination status. Then, an empirical estimator of P(w|x, m) is found using the validation sample:

$$\hat P(w \mid x, m) = \frac{\hat P(w, x, m)}{\hat P(x, m)}, \qquad \hat P(w, x, m) = \frac{1}{N_v} \sum_{h \in V} I[w_h = w,\ x_h = x,\ m_h = m], \qquad \hat P(x, m) = \frac{1}{N_v} \sum_{h \in V} I[x_h = x,\ m_h = m],$$

where I[.] is the indicator function, V denotes the validation sample, $N_v$ is the number of households in the validation sample, and $(w_h, x_h, m_h)$ are the arrays observed for household h.
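The empirical estimator is a ratio of counts taken over the validation households, which can be sketched in a few lines. The function name and the input layout (an iterable of `(m, x, w)` tuples) are assumptions made for illustration. The sketch matches arrays exactly in member order; whether members should first be put in a canonical order is a design choice not addressed here.

```python
from collections import Counter


def empirical_p_w_given_xm(validation):
    """Estimate P(w | x, m) from validation households.

    validation: iterable of (m, x, w), each a tuple of 0/1 values of the same length.
    Returns a dict mapping (w, x, m) to the estimated conditional probability.
    """
    joint = Counter()     # counts of (w, x, m)
    marginal = Counter()  # counts of (x, m)
    for m, x, w in validation:
        joint[(w, x, m)] += 1
        marginal[(x, m)] += 1
    # The factor 1/N_v cancels in the ratio of the two empirical probabilities.
    return {key: n / marginal[key[1:]] for key, n in joint.items()}


# Hypothetical toy validation sample of three two-person households.
toy = [
    ((0, 1), (1, 0), (1, 0)),
    ((0, 1), (1, 0), (1, 1)),
    ((0, 1), (0, 0), (0, 0)),
]
p_hat = empirical_p_w_given_xm(toy)
```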
Then the estimated likelihood function is: (2.2)
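For orientation, an estimated likelihood of the kind proposed by Pepe (1992) can be written as below. This is a sketch of the general form adapted to the household notation of this section, under the assumption that validation households contribute through the true outcome and surrogate households through an estimated surrogate probability; it may differ in detail from the original display (2.2).

```latex
\hat L(\Theta) = \prod_{h \in V} P_\Theta(x_h \mid m_h)
                 \prod_{h \notin V} \hat P_\Theta(w_h \mid m_h),
\qquad
\hat P_\Theta(w \mid m) = \sum_{x} \hat P(w \mid x, m)\, P_\Theta(x \mid m)
```

The sum runs over all $2^s$ possible infection arrays x for a household of size s, $P_\Theta(x \mid m)$ is the model probability of the infection outcome, and $\hat P(w \mid x, m)$ is the empirical estimate from the validation sample.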

Properties of the maximum estimated likelihood estimates
Under regularity conditions, the maximum estimated likelihood estimate $\hat\Theta$ satisfies the score equation $\partial \log \hat L(\Theta)/\partial\Theta = 0$ and is consistent (Pepe, 1992).
If derivatives are available, the Newton-Raphson iteration scheme can be used to find $\hat\Theta$. The estimates of $VE_S$ and $VE_I$ are obtained as $1 - \hat\theta$ and $1 - \hat\phi$, respectively. The properties of $\hat\Theta$ (details of the proof can be found in Pepe, 1992) are:
a. If the validation sample fraction $N_v/N$ has a nonzero limit $\rho_v$, then $n^{1/2}(\hat\Theta - \Theta)$ converges in distribution to a mean-zero normal random variable, whose variance is given in Pepe (1992).
b. The estimate $\hat J(\Theta) = n^{-1}\,\partial^2 \log \hat L(\Theta)/\partial\Theta^2$ is consistent for $J(\Theta)$.

Simulation Results
We conducted a simulation study to investigate the empirical bias and precision of the estimates of θ and ϕ, and to compare the performance of the parameter estimates with different validation sample sizes and misclassification probabilities. Four estimation methods were used. (1) The full data method, i.e., the ML method that one would use if the true infection outcome could be measured on every study participant. (2) The validation method, which uses only the true outcomes in the validation sample. (3) The surrogate method, which uses the surrogate outcomes from all the N households. (4) The semiparametric method, which uses the true and the surrogate data from the validation sample and the surrogate data from the surrogate sample. One expects the first method to produce the most accurate and precise estimates as it uses the true infection outcome for all the households in the study. Obviously, this method cannot be used when the true outcome is only observed on a subset of households, but we included it in the simulation study for comparison with the other methods. The second method completely ignores the surrogate outcomes. The third method ignores the true outcomes in the validation sample; this method was included in the simulation study as it is based on the data that would be available if it were impossible to obtain the true outcome on any study participant. The fourth method uses all the available data, hence it is expected to produce estimates that are more accurate than those of method 3 and more precise than those of method 2.
The input parameters for the simulations are δ, θ, ϕ, $ε_0$ and $ε_1$. δ is the daily transmission probability from an unvaccinated infected person to an unvaccinated susceptible household member. $ε_0$ and $ε_1$ are the daily transmission probabilities from the community to an unvaccinated and a vaccinated person, respectively. Note that the simulation program uses the transmission probabilities in one day, and hence they differ from β and γ defined in Section 2.2. The probability of an unvaccinated person becoming infected in one day is $1 - (1 - ε_0)(1 - δ)^{x_0}(1 - δϕ)^{x_1}$. Here $x_0$ and $x_1$ are the numbers of infected unvaccinated and vaccinated persons in the household, respectively, on the previous day. The probability of a vaccinated person becoming infected in one day is $1 - (1 - ε_1)(1 - δθ)^{x_0}(1 - δθϕ)^{x_1}$. In all the simulations, the length of the infectious period was set to one day. Prior to the beginning of the 'outbreak', each individual was 'vaccinated' with a probability of 0.5, independently of all other individuals. Based on the results from our earlier paper (Davis and Haber, 2001), this random vaccination design produces the most precise parameter estimates. For each scenario, we generated 200 simulations and reported the mean parameter estimate and the mean estimated standard error over the 200 simulations.
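The daily transmission process can be sketched as below. This is an illustration, not the authors' Fortran program. It assumes the daily infection probabilities $1 - (1 - ε_0)(1 - δ)^{x_0}(1 - δϕ)^{x_1}$ for an unvaccinated person and $1 - (1 - ε_1)(1 - δθ)^{x_0}(1 - δθϕ)^{x_1}$ for a vaccinated person, a one-day infectious period, and a hypothetical parameter `n_days` for the length of the outbreak, which the text does not specify.

```python
import random


def simulate_household(m, delta, theta, phi, eps0, eps1, n_days, rng):
    """Simulate final infection statuses x for one household with vaccination array m."""
    s = len(m)
    infected = [0] * s
    infectious = []  # members infected on the previous day (infectious for one day)
    for _ in range(n_days):
        x0 = sum(1 for i in infectious if m[i] == 0)  # unvaccinated infectives
        x1 = len(infectious) - x0                     # vaccinated infectives
        newly = []
        for i in range(s):
            if infected[i]:
                continue  # already infected, now immune
            if m[i] == 0:
                escape = (1 - eps0) * (1 - delta) ** x0 * (1 - delta * phi) ** x1
            else:
                escape = ((1 - eps1) * (1 - delta * theta) ** x0
                          * (1 - delta * theta * phi) ** x1)
            if rng.random() < 1 - escape:
                newly.append(i)
        for i in newly:
            infected[i] = 1
        infectious = newly
    return tuple(infected)


# Random vaccination with probability 0.5, as in the text; other values are arbitrary.
rng = random.Random(2001)
m = tuple(int(rng.random() < 0.5) for _ in range(4))
x = simulate_household(m, delta=0.2, theta=0.4, phi=0.4,
                       eps0=0.01, eps1=0.004, n_days=30, rng=rng)
```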
The true infection outcome was obtained for each study participant in each simulation. We now describe the generation of the surrogate outcomes. For a given individual of vaccination status m, define P(w|x, m) as the probability of surrogate outcome w given infection outcome x. Four probabilities were used to generate the surrogate outcome given one's infection outcome and vaccination status: $P_1 = P(w = 1 \mid x = 1, m = 0)$, $P_2 = P(w = 1 \mid x = 1, m = 1)$, $P_3 = P(w = 1 \mid x = 0, m = 0)$, and $P_4 = P(w = 1 \mid x = 0, m = 1)$. To choose the values for these four probabilities, we first followed the assumption made by Halloran and Longini (2001) for a hypothetical influenza vaccine study. They assumed that every infected person becomes ill, and that an uninfected person may also develop illness symptoms. This implies that $P_1 = P_2 = 1$, $P_3 > 0$, and $P_4 > 0$. We then varied the values of $P_3$ and $P_4$ to explore the effect of the probability that an uninfected person becomes ill on the properties of the estimated parameters. Later we relaxed the assumption $P_1 = P_2 = 1$ and chose values less than 1.0 for these probabilities.
Fortran programs were used to generate the data and obtain the parameter estimates along with their standard errors. Since the likelihood is very complicated and there is no closed form for the derivatives, we followed conventional ways of obtaining the standard errors from Fortran IMSL routines. The subroutine DB2ONF was used to maximize the likelihood using a quasi-Newton method and a finite-difference gradient. The Hessian matrix is obtained from this subroutine, and the routine DLINRG was then used to compute the information matrix.
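Generating the surrogate outcome from the four probabilities amounts to one independent Bernoulli draw per person. A minimal sketch follows; the function name `draw_surrogate` is hypothetical.

```python
import random


def draw_surrogate(x, m, p1, p2, p3, p4, rng):
    """Draw illness statuses w given infection statuses x and vaccination statuses m.

    p1 = P(w=1 | x=1, m=0), p2 = P(w=1 | x=1, m=1),
    p3 = P(w=1 | x=0, m=0), p4 = P(w=1 | x=0, m=1).
    """
    prob_ill = {(1, 0): p1, (1, 1): p2, (0, 0): p3, (0, 1): p4}
    return tuple(int(rng.random() < prob_ill[(xi, mi)]) for xi, mi in zip(x, m))


# Every infected person becomes ill (p1 = p2 = 1); uninfected persons are ill with probability 0.2.
rng = random.Random(7)
w = draw_surrogate(x=(1, 0, 0, 1), m=(0, 0, 1, 1), p1=1.0, p2=1.0, p3=0.2, p4=0.2, rng=rng)
```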
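The general idea of obtaining standard errors from a numerically evaluated Hessian can be illustrated in plain Python. This is a generic sketch, not a translation of the IMSL routines: the function names are hypothetical, the Hessian is computed by central differences, and the square roots of the diagonal of its inverse are taken as standard errors. It does not attempt to reproduce the variance expression from Pepe (1992).

```python
def numerical_hessian(f, t, h=1e-4):
    """Central-difference Hessian of the scalar function f at the point t (a list)."""
    n = len(t)

    def shifted(di, i, dj, j):
        u = list(t)
        u[i] += di
        u[j] += dj
        return f(u)

    return [[(shifted(h, i, h, j) - shifted(h, i, -h, j)
              - shifted(-h, i, h, j) + shifted(-h, i, -h, j)) / (4 * h * h)
             for j in range(n)] for i in range(n)]


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def standard_errors(neg_log_lik, t_hat):
    """Square roots of the diagonal of the inverse Hessian of the negative log-likelihood."""
    cov = invert(numerical_hessian(neg_log_lik, t_hat))
    return [cov[i][i] ** 0.5 for i in range(len(t_hat))]


# Toy check on a quadratic "negative log-likelihood" with known curvature.
def toy(t):
    return 0.5 * ((t[0] - 1.0) / 0.2) ** 2 + 0.5 * ((t[1] - 2.0) / 0.5) ** 2


se = standard_errors(toy, [1.0, 2.0])
```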

Reduced model: estimating $VE_S$ when $VE_I = 0$
A reduced version of our model for estimating vaccine efficacy can be obtained by assuming that the vaccine affects only susceptibility, i.e., $VE_I = 0$ (ϕ = 1). We explored the performance of $\hat\theta$ under different scenarios. One would expect the bias of the methods that use surrogate outcomes to depend on the misclassification probabilities $P_3$ and $P_4$. $P_3$ and $P_4$ are the probabilities that an unvaccinated and a vaccinated person, respectively, develop illness symptoms when they are not infected.

The case $P_3 = P_4$
Table 2 presents the mean of $\hat\theta$ and of its standard error for various input parameter values for household sizes 3 and 4 when $P_3 = P_4$. The semiparametric method is more robust than the surrogate method and more precise than the validation method. This is more evident for larger values of θ and larger values of $P_3 = P_4$. We also see that for larger household sizes all four methods perform better (smaller bias and smaller standard error). In order to reduce the impact of the simulation-induced variability, we used the same seed in all the simulations with a fixed value of θ. Therefore, the results for the full data and the validation methods, which do not depend on $P_3$ and $P_4$, are the same for the same value of θ.

[Table: columns Full, Validation, Surrogate, Semiparametric]

Unequal $P_3$ and $P_4$
We now consider situations where $P_3$ and $P_4$ are not equal. One can expect the performance of the methods that use surrogate data to depend on both the magnitude and the ratio of the misclassification probabilities. The ratio is important because if the misclassification probabilities for vaccinated and unvaccinated persons are very different, then the ratio of the frequencies of ill persons between vaccinees and nonvaccinees will be a biased estimate of the vaccine effect. To investigate the effect of the ratio on the performance of the estimates, we conducted simulations with a fixed value of $P_3 + P_4$ while varying the ratio $P_3/P_4$. Table 3 presents the results of these simulations for $P_3 + P_4 = 0.2, 0.4$ and $P_3/P_4 = 1, 2, 4$. We can see that the semiparametric method is quite robust even when $P_3/P_4 = 4$. It is interesting to note that as $P_3/P_4$ increases, the standard error for the semiparametric method decreases while the standard error from the surrogate method increases.
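The dependence of a naive illness-frequency ratio on $P_3$ and $P_4$ can be seen from simple arithmetic. The sketch below assumes $P_1 = P_2 = 1$, so that the probability of illness is the infection attack rate plus the probability that an uninfected person is ill; the attack rates 0.30 and 0.12 are hypothetical values chosen for illustration, not results from the simulations.

```python
def illness_ratio(attack_u, attack_v, p3, p4):
    """Ratio of illness frequencies (vaccinated / unvaccinated) when every infected
    person is ill and uninfected persons are ill with probabilities p3 (unvaccinated)
    and p4 (vaccinated)."""
    ill_u = attack_u + (1 - attack_u) * p3
    ill_v = attack_v + (1 - attack_v) * p4
    return ill_v / ill_u


# Hypothetical infection attack rates whose ratio is 0.4.
no_misclassification = illness_ratio(0.30, 0.12, 0.0, 0.0)
equal_probabilities = illness_ratio(0.30, 0.12, 0.2, 0.2)     # P3 + P4 = 0.4, P3/P4 = 1
unequal_probabilities = illness_ratio(0.30, 0.12, 0.32, 0.08)  # P3 + P4 = 0.4, P3/P4 = 4
```

With these illustrative inputs the illness ratio moves from 0.4 to roughly 0.67 when $P_3 = P_4 = 0.2$ and to roughly 0.36 when $P_3/P_4 = 4$, which shows how both the magnitude and the ratio of the misclassification probabilities distort a comparison based on illness alone.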

Different sampling fractions for the validation sample
Let $P_v$ denote the fraction of households in the study that are included in the validation sample. Table 4 lists the simulation results for various sampling fractions with $P_3 = P_4 = 0.4$ and a total of 500 households of size 4 in the study. We can see that the sampling fraction of the validation sample does not have a noticeable effect on the bias of the semiparametric estimate of θ. For the standard errors, a slight decrease is observed as the sampling fraction of the validation sample increases. Thus, it seems that the semiparametric method works as well for a smaller sampling fraction ($P_v = 0.2$) as for a larger sampling fraction.

The full model: estimating $VE_S$ and $VE_I$
In this section we drop the assumption $VE_I = 0$ and compare the performance of the four methods with respect to the simultaneous estimation of θ and ϕ. Table 5 presents the results for the case $P_3 = P_4 = 0.2$ when 100 out of a total of 500 households of size 4 are included in the validation sample. The estimates of θ produced by the semiparametric method have small bias and standard error.
The semiparametric estimates of ϕ have a positive bias, but this bias is usually smaller than the (negative) bias of the surrogate method. On the other hand, the standard errors of the semiparametric estimates are only slightly larger than those produced by the surrogate method. Hence, the use of the true outcomes from the validation sample improves the estimation of ϕ.
So far we have always assumed that $P_1 = P_2 = 1$, i.e., every person who is infected develops illness symptoms. We now consider situations where some of the infected persons remain symptom-free (silent infections). Here we report the results for the case $P_1 = P_2 = 0.9$, $P_3 = P_4 = 0.2$, θ = ϕ = 0.4, with all the remaining quantities set to the same values as in Table 5. For the estimation of θ, the surrogate method produced a severely biased estimate of 0.69, while the bias from each of the other three methods was very small. For the estimation of ϕ, the estimates produced by the full, validation, surrogate and semiparametric methods were 0.39, 0.37, 0.19 and 0.53, respectively. Thus, the last two methods produce biased estimates. Among the validation, surrogate and semiparametric methods, the semiparametric estimate has the smallest standard error (0.13), compared to 0.18 for the surrogate method and 0.27 for the validation method.

Discussion
Estimation of $VE_S$ and $VE_I$ is often complicated by lack of reliable information on exposure to infection and on the true infection outcome. This paper proposes a semiparametric method that uses data from two samples of households: (i) a surrogate sample, where only a surrogate outcome variable (such as illness symptoms) is observed, and (ii) a validation sample, where both the true infection outcome and the surrogate outcome are observed. In estimating $VE_S$ when $VE_I = 0$, this semiparametric method performs better than maximum likelihood methods that use the surrogate outcome data only or the true outcome data only. The semiparametric estimates have smaller standard errors than those based on the validation data only and smaller biases than those based on the surrogate data only. This suggests that the proposed method gains efficiency by including the surrogate data and corrects the misclassification bias associated with the surrogate data by including the true outcome data from the validation sample. In estimating $VE_S$ and $VE_I$ simultaneously, the semiparametric method estimates $VE_S$ with very small bias and standard error, but it tends to underestimate $VE_I$, even though this underestimation is not severe when the true $VE_I$ is small. The bias in estimating $VE_I$ is always larger than in estimating $VE_S$, even when the true outcome is observed for every study participant (Davis and Haber, 2001). While we fixed the household size in each set of simulations, the estimation methods can be used when households of different sizes are included in the study.
Several studies found estimates of vaccine efficacy ($VE_S$) to be severely attenuated when surrogate illness outcomes are used instead of the true infection outcomes (Belshe et al., 1998, 2000; Nichol et al., 1999; Longini et al., 2000). In this work we found that the use of household data from a study consisting of a surrogate and a validation sample reduces the bias resulting from the inaccuracy of the surrogate data. For example, Halloran and Longini (2001) used data from a random sample of unrelated individuals and obtained an estimated $VE_S$ of 0.25 when the true $VE_S$ was 0.89. Using the semiparametric method and the study design described in this paper, we found that the bias in the estimate of $VE_S$ was usually less than 0.1. In addition, using household data allows simultaneous estimation of both $VE_S$ and $VE_I$, while data on unrelated individuals are not suitable for the estimation of $VE_I$ (Davis and Haber, 2001).
Our simulation study shows that the semiparametric method is quite robust even when the number of households in the validation sample is quite small (e.g., 20 percent) compared to the total number of households included in the study. We also found that the performance of the semiparametric estimates remains quite stable when the misclassification probabilities for vaccinated and unvaccinated persons are very different.
The semiparametric method proposed in this study extends the method of Pepe (1992) to the case where both the true and the surrogate outcomes are arrays of infection or illness statuses of individuals in the same household. Our simulations show that despite the multivariate nature of the outcome variable, the semiparametric method is very robust when one is interested in estimating $VE_S$, regardless of the value of $VE_I$. The bias in the estimation of $VE_I$ is not more severe than the bias associated with estimating $VE_I$ when the true infection outcome is known for each individual.
Future studies can look into better ways to correct the bias in estimating $VE_I$ with household data and may add a component to the semiparametric method to correct this underestimation. It is also desirable to explore methods to find the optimal sampling fraction for the semiparametric method proposed for household data. Finally, one may try to extend this method to cases where the true infection outcome and the surrogate illness outcome are observed for some of the household members while only the illness outcome is observed for other members of the same household. Data of this type were collected in an influenza vaccine trial described in Hurwitz et al. (2000).