A Statistical Method of Detecting Bioremediation

Hydrocarbon contaminated soils result from pipeline ruptures, petroleum manufacture spills, as well as storage and transportation accidents (Bossert and Bartha (1984)). The cost of removal of the contaminated solids followed by incineration or by disposal in a landfill is prohibitive. Bioremediation the use of microorganism populations to eliminate hydrocarbon contaminations from the environment is the most acceptable technology for hydrocarbon cleanup (Bossert and Bartha (1984)). It can be argued that a decrease of the oil concentration in soil is not due to biodegradation but due to sorption. If this were the case, since mass transfer of sorption is a gradual process, a slow decrease in the oil recovery rate may be observed after a spill. However, a rapid or sudden decrease in the oil concentration during the incubation should exclude sorption as the primary mechanism contributing to the observed hydrocarbon loss. A Bayesian procedure is given to detect a change of the linear relationship between the oil concentration (the dependent variable) and the time in days since the addition of the oil (the independent variable). The advantage of this procedure is that it does not need to assume that the variance of the error before the change is equal to that after the change. The implementation of this procedure is straightforward.


Introduction
In Ma (1999), an investigation of the potential enhanced rate, the potential enhanced extent, or both of bioremediation of 20W motor oil contaminated soil by inoculation of microorganisms was undertaken.The inoculant of the soil was with a commercially available product containing microbial isolates with bioremediation abilities.The product package also included a nutrient product for feeding the microbes, which was added to all treatments.Also examined was the effect of different soil types (sandy loam with low organic matter, denoted as sandy loam, sandy loam with high organic matter, denoted silt loam, and loam with high clay content, denoted loam) on bioremediation.
Ma noted that it could be argued that the observed decrease of the oil concentration in all soil types was not due to biodegradation but due to its sorption effects.Ma also noted that the mass transfer of sorption is a gradual process, and should result in an unchanged degradation rate.Rapid decrease of the oil concentration during the incubation period would exclude sorption as the primary mechanism contributing to the observed hydrocarbon loss, and therefore providing evidence of effective bioremediation.
When oil concentration is modeled via linear regression lines, such a rapid change would be indicated by a change in the regression line at some time during the study period.A statistical test that such a change exists would provide an objective method of eliminating sorption as the primary mechanism contributing to the observed hydrocarbon loss, and supporting the existence of bioremediation.In addition, estimating the time when such a change takes place is then meaningful and potentially useful.For example, the knowledge of the change point under a given set of conditions could determine the choice of inoculant or affect the strategy of dealing with a particular hydrocarbon spill.
Statistical methods have been developed to detect change points in regression.For example, one may use likelihood procedures (see, for example, Worsley (1983) and Srivastava and Worsley (1986)) and Bayesian methods (see, for example, Broemeling and Chin Choy (1981), Moen and Broemeling (1984), and Guttman and Srivastava (1987)).This paper focuses on Bayesian methods.In a Bayesian analysis one needs to give prior probability distributions to both change points and the parameters.The resulting posterior probabilities, based on the data, are then used to make neces-sary inferences.In particular, the posterior distribution of the change point can be employed to locate the "actual time" at which a sequence of observations undergoes sudden changes, that is, the time where the posterior probability assumes its maximum will be used to estimate the actual time of the change.Such a methodology has found many interesting applications in practice.For example, Smith (1975) developed Bayesian tests for structural stability, a topic of interest to economists.Moen and Broemeling (1984) developed a procedure for testing whether or not a change has occurred in the regression matrix of a multivariate linear model.The resulting test is based on the marginal posterior distribution of the change point.A numerical example using a bivariate regression model was used to illustrate the test procedure.Guttman and Srivastava (1987) provided a Bayesian method of finding the change point for the general multivariate linear model in which it is suspected that a change occurs from one linear model to another where the different models have some common parameters.They also discussed a certain change-point problem that involves a switch at some time from one growth-curve model to another.They illustrated the general results through the example of locating the time at which the effects of labor inputs on gross domestic product may undergo.
In this article, we developed a Bayesian method for detecting the time where the regression lines differ.One advantage of the procedure is that it does not assume that the variance of the error before the change of regression is equal to that after the change.This is practical since in many biological situations, the variance of concentrations changes with the magnitude of the concentrations.
The rest of the paper is organized as follows.Section 2 describes the experiment and dataset.Section 3 introduces the statistical models.Section 4 presents the analysis of bioremediation.Section 5 concludes our study.The derivation of formulas is given in an Appendix.

Experiment
Three types of uncontaminated soils (sandy loam, silt loam and loam) were collected from different locations of the University of Wisconsin-Green Bay.This soil was air-dried and screened by using U.S. Standard Sieve Series No. 10 (2.00 mm).20W motor oil was added to simulate a hydrocarbon spill with a concentration of 3000 mg oil per kilogram of dry soil.Each type of soil was then divided into three plots.These plots received one of three treatments.For treatment one, the control, no additional material was added.For treatment two, only the nutrients were added.This treatment was to provide information on whether the nutrients could stimulate the degradation of the oil by indigenous microorganisms.And for treatment three, both the nutrients and the inoculum were added.This treatment was to provide information on whether the inoculum could enhance the rate of oil degradation.These treated plots of the soil were then divided in half to provide a duplication of each treatment on each type of soil.Soil moisture was adjusted twice a week to maintain between a 40 and 50 percent of the soil content by weight at saturation.The soil was also mixed thoroughly twice per week to homogenize the soil and oil contaminant distribution and to enhance aeration.The results of both duplications of each of the three treatments on each of the three soil types are given in Table 1.

Models
The simplest framework of detecting change point in regression may be stated as follows.Consider a sequence of n pairs of observations (x i , y i ), i = 1, 2, . . ., n, where y i is the value of the response variable (dependent variable) in the ith trial and x i is the known value of the independent variable in the ith trial.Suppose the following model: where 3 ≤ t ≤ n − 3, 1i are independent N(0, σ 2 1 ), 2i are independent N(0, σ 2 2 ), and 1i and 2i are independent.The parameters β ij , σ 2 i , and t are all unknown.The above model indicates a switch in regression of y on x at "time" t.The main task is to estimate t.The range 3 ≤ t ≤ n − 3 is necessary in order to allow for the estimation of the two regression lines as well as the estimates of the associated error terms.
For completeness of the theory, we shall consider the following general Source: Ma (1999) case: where p + 1 ≤ t ≤ n − p − 1, x i is a known p vector of independent variables, and both β 1 and β 2 are p vectors of unknown parameters.This model shows that the linear relationship between the dependent and independent variables changes at "time" t.The range p + 1 ≤ t ≤ n − p − 1 is necessary in order to allow for the estimation of the two regression lines as well as the estimates of the associated error terms.
In the following, we shall impose some prior information on β 1 , β 2 , σ 2 1 , σ 2 2 , and t, and derive the posterior probability mass function of the change point t.This posterior distribution will be used to estimate t in (3.2).
Set the matrices X 1 and X 2 and vectors y 1 , y 2 , and y as follows.
Assume that X i is of full rank, i.e., the rank of X i is p, the number of columns of X i , for i = 1, 2. Then the square matrix X i X i is of full rank so that the inverse of X i X i is defined in the usual way.Let Given that no other information is available, what prior might we put on the parameters β 1 , β 2 , σ 2 1 , σ 2 2 , and t ?Intuitively, "noninformative" priors should be given in the absence of "enough" information.Let us consider first the time t.Clearly, we should assume that each of the time values p+1, • • •, n − p − 1 is equally likely.This fact is indicated by the uniform prior p(t) ∝ constant.For the vector β 1 , we may also assume the uniform prior, that is, uniform over the entire p-dimensional space.This is expressed by p(β 1 ) ∝ constant.Similarly, we have p(β 2 ) ∝ constant.To find a prior for σ 2 1 , we consider ln σ 2 1 .Since ln σ 2 1 can take any value between −∞ and ∞, we may assume the uniform prior on ln σ 2 1 , that is, uniform over the entire one-dimensional space.Thus we have p(ln σ 2 1 ) ∝ constant, so that p(σ 2 1 ) ∝ 1/σ 2 1 .Same reasoning yields p(σ 2 2 ) ∝ 1/σ 2 2 .Note that the noninformative priors for β 1 , β 2 , ln σ 2 1 , and ln σ 2 2 should be understood in terms of Jeffreys' improper probability distribution functions.Now we assume that all the above priors are independent.Then we have the following (uniform) prior in β 1 , β 2 , ln σ 2 1 , ln σ 2 2 , and t: It can be shown that given the prior in (3.3) and the data y, the posterior probability that a change point occurs at time t is (see the Appendix) (3.4) where the constant K is such that (3.5) Now one may estimate t by the mode of the distribution in (3.4), that is by that value of t at which p(t|y) has its maximum.
In this paper we use only noninformative priors (i.e., where no prior information is explicitly imposed).Noninformative priors are used when information about parameters is completely unknown or when proper priors such as conjugate priors do not apply.When more information on the parameters is known so that a reliable proper prior can be employed, the resulting estimate of the change point can be expected to be more accurate.For a vigorous discussion on the choice of priors, see Box and Tiao (1992).

Analysis
For simplicity, in this paper we will consider in detail the case of loam with high clay content receiving treatment 3, where both nutrients and inoculum were added.For this purpose we isolate the results of both duplications of treatment 3 on the loam from Table 1 to form Table 2.
Upon examination of the data in Table 2, it can be noted that the oil concentration measurements in Dup 2 are consistently larger than the oil concentration measurements in Dup 1.We suspect that there is some block effect associated with Dup 2. Specifically, the set of conditions such as raw material source, raw material purity, and room temperature that resulted in the oil concentrations observed in Dup 1 are not exactly the same as the set of conditions that resulted in the oil concentrations observed in Dup 2. Such differences may result in the fact that all responses in Dup 2 will be τ units lower (or higher) than in Dup 1, that is, ỹ = y + τ , where ỹ is a Dup 2 observation and y is an observation produced under the conditions for Dup 1.To get valid data for our change point analysis from Dup 2, we need to account for this effect τ .A straightforward way of doing this is as follows.The average from Dup 1 is A 1 = (2160+2180+• • •+870)/10 = 1501, and the average from Dup 2 is A 2 = (2470 + 2320 + • • • + 1190)/10 = 1817.The difference A 2 − A 1 = 316, denoted τ , will be used to estimate τ .Now subtract τ from each observation in Dup 2. See Table 3 for the modified dataset.This new dataset will be used to replace the old one (Table 2) for our analysis.In the above we discussed a simple way to remove the block effect from Dup 2. For more information on block effects, see Montgomery (2001).The data in Table 3 suggest that the relation between the oil concentration and the time changes after 11 days, since the observations during the first 11 days look larger than those for the following days.We now use the procedure described in Section 3 to estimate the time when such a change took place.
Let y denote the (modified) oil concentration, and let x = (1, x 2 ), where x 2 denotes the time in days.In order to apply the model (3.2) to the two duplications it is necessary to enumerate the observations.We do so alternating between duplicate 1 and duplicate 2, giving the sequence: (2160, 2154, 2180, 2004, • • •, 870, 874).The posterior probabilities that a particular observation is where a change point occurs are derived by (3.4) and (3.5).They are given in Table 4. 39 46 Dup 1 -0.001 0.000 0.005 0.000 0.000 0.000 0.000 0.002 -Dup 2 -0.000 0.732 0.001 0.000 0.000 0.001 0.257 -- The maximum probability p(t|y) = 0.732 is seen to be when t = 6, indicating a change in response after 11 days.A graph of the experimental points and the two resulting regression lines are given in Figure 1.
Note that in the above we used the sequence (2160, 2154, 2180, 2004, • • •, 870, 874) as our data y, that is, we arranged the data as if we always observed Dup 1 first in a given day.The fact is that the order of displaying Dup 1 and Dup 2 observations in a given day will not affect our final estimate of the change point.More specifically, if we display the data in the order of time (days) but without caring about the order of arranging Dup 1 and Dup 2 observations in a given day, then we have 2 10 = 1024 ways to write down our y, and each y will lead to the same conclusion that a change in response occurs after 11 days.Clearly, different y's may yield different posterior distributions p(t|y).For example, let us use the sequence (2154, 2160, 2004, 2180, • • •, 874, 870) as our y.Then using (3.4) and (3.5), we obtain Table 5 listing the posterior probabilities that a change point occurs.Note that Table 5 is not identical with Table 4.However, since the maximum posterior probability 0.731 occurs when t = 6, again we see that a change in response takes place after 11 days.
In the analysis of this data in the original bioremediation study Ma (1999)   39 46 Dup 1 --0.0000.731 0.001 0.000 0.000 0.001 0.256 ----Dup 2 --0.001 0.001 0.006 0.000 0.000 0.000 0.000 0.002 --noted change points only in the loam with high clay content under all treatments by visual inspection.By applying our methodology to treatment 1 and treatment 2 in loam, we found the change point occurs during day 11.Therefore, in all three treatments, the change takes place during day 11.This further corroborates the conclusions in Ma (1999) that ". . . the rapid decrease after day 11 of the incubation with loam with high clay content should exclude sorption as the primary mechanism contributing to the observed hydrocarbon loss.The microbial activity can change exponentially which may indicate that the rapid decrease of hydrocarbon in loam soil was contributed by biodegradation."

Conclusion
We have derived a practical Bayesian method for the estimation of a change point in a linear regression that does not rely on the hypothesis that the same variance exists both before and after the change point.Through our analysis of bioremediation data, we have illustrated the methods use and have shown it to be an effective means in estimating the time of a change point in this situation.As the literature contains many examples where the application of this method of finding the time of a change point would be useful for other fields, including economics and biology, we feel that this method can become a useful tool in a variety of undertakings.

Figure 1 :
Figure 1: Oil concentration in loam as a function of days of bioremediation.The scatter represents the experimental data.The lines represent the two fitted linear models.

Table 2 :
20W oil concentration in loam with treatment 3

Table 3 :
20W oil concentration in loam with treatment 3 (Modified)

Table 4 :
Posterior probabilities that a change point occurs