Factor Effects Testing for Mixture Distributions with Application to the Study of Emergence of Pontomyia Oceana

In this work, testing of factor effects to the observed data from finite mixture distributions are discussed. Likelihood ratio tests are used to test whether factors of interest have significant effects to the mixture distribution model. To carry out the likelihood ratio tests, different methods about the computation algorithm for the maximum likelihood estimation (MLE) of the parameters in the mixture models are studied. These methods are applied to the data obtained from a laboratory study on emergence of Pontomyia oceana, where the effects of factors, such as sex and temperature, to the distribution of the dates that Pontomyia oceana emerged are investigated. From the results obtained, in some cases, three-component logistic distributions are fitted to the data with two peaks very close to each other. This is somewhat surprising as merely from the histogram, it is not easy to see and usually not expected to say there are two very close peaks. From the practical point of view, as the laboratory conditions excluded the possible effects related to semi-lunar tidal fluctuations that may have a dominating influence in nature. Thus the laboratory results helps to identify all the possible factors that have minor effects. Based on the results of this study, the difference between males and females, nevertheless, suggests that sex hormone may be involved in affecting the emergence dates. The suggestion of a third peak is unexpected from our point of view and it implies that there are factors we never suspected. It is worth noting that through rigorous statistical analysis presented here, it helps to provide an objective estimation on the distribution of the emergence dates as well as the corresponding proportions and the peak synchronous emergence dates in each period under different factor effects. We only start to speculate its possible adaptive meaning after the differences have been established as a true phenomenon. From this study, it reveals some additional biological phenomena worthy of more investigations.


Introduction
Pontomyia oceana(P.oceana in short) is a kind of marine midges at ocean front.One of the characteristics of this kind of midges is that their life histories consist of a relatively long period of benthic larval development(30-45 days) and followed by a very short adult stage(2 hours).In nature, a semi-lunar rhythm of the midge emergence was observed.The midges emerged around new moons and full moons.It was also noticed that the midges might be concentrated in the first or the last few days of the window suitable for emergence depending on season.We had no idea what environmental factors might be affecting this trait, and we had no idea if an endogenous rhythm was involved.The temperature treatments in this study were an effort to test if an endogenous rhythm was involved.An important characteristic of an endogenous rhythm is temperature-compensation, i.e., the midges are expected to catch up under low temperatures.On the other hand, the difference between males and females in emergence time of the day is known with a convincing adaptive explanation (Soong et al 1999).However difference in dates within a month has been suspected, but never proven.In our study here, as the laboratory conditions excluded the possible effects related to semi-lunar tidal fluctuations that may have a dominating influence in nature.Thus the lab results helped to identify all the possible factors that have minor effects.On the other hand, as an adult remained active for only about 2 hours, during the short adult stage, males must find mates and females besides mating must place fertilized eggs in appropriate habitats.Hence, peak synchronous emergence become very important to these short-lived midges.See Soong et al (1999) for some details on the introduction of life histories about marine midges.In this work, we have investigated whether the factor of temperature and sex indeed have an effect on synchronous emergence of P. oceana from statistical point of view with the following laboratory experiment observations, where mixture model describing the emergence distribution and the corresponding proportions as well as peak synchronous dates for each factor combination has been used and estimated.Here the standard ANOVA is not used as we also concern about how many peak periods there are for the emergence date distribution, as well as what are the corresponding peak dates and proportion emergenced during each period.Standard ANOVA will not be able to distinguish all these differences.
The result of this analysis does not support the existence of an endogenous semi-lunar rhythm in controlling the emergence dates of this marine midge.Moreover, the difference between males and females from the analysis, nevertheless, suggests that sex hormone may be involved in affecting the emergence dates.The suggestion of a third peak is unexpected from our point of view and implies that there are factors we never suspected.The detailed modeling and analysis as well as discussion are presented below.Now consider the following experiment performed by a laboratory in National Sun Yat-sen University of Taiwan: at first adult P. oceana were collected in southern Taiwan and then placed in beakers to collect fertilized eggs.After approximately 20 days of culture, the emergence day observations from female and male P. oceana under three different temperatures, say 20 o , 25 o , and 30 o respectively were recorded.See Lee (2000) for a detailed description of the experiments.The sets of observed frequency counts n ijk for each of the factor combination(i, j), i = 1, 2, j = 1, 2, 3 on the kth day in histogram form are presented in Table 1 and Figure 1.It can be seen from Figure 1, the time distributions appear to be mixture models with two or three distributions.The effect of temperature seems to be more significant than sex.Meanwhile, it seems that high temperature advances the day of emergence.We discuss all these phenomena by using statistical technique.In this work, a two-factor factorial design with factors sex and temperature is considered although the response of emergence day for each P. oceana is assumed to be a random variable with mixture distribution, where the parameters of the distribution such as the mixture proportions, parameters for each distribution in the mixtures, depend on the factors.In the following, we will first consider the general mixture model and describe the methods we are going to use to do the estimation and perform the appropriate tests.
Let y ijk be the observed response of the emergence day with factor A at the ith level (i = 1, 2, . . ., a) and factor B at the jth level (j = 1, 2, . . ., b) for the kth replicate (k = 1, 2, . . ., n ij ).y ijk , k = 1, 2, . . ., n ij are assumed to be i.i.d.random variables with common mixture distribution function F (•; θ ij ), where θ ij is the unknown parameter vector for F (•; θ ij ).We will estimate these parameters first and perform the factor testing later.
In the practical problem here it is of interest to know whether the factors of temperature and sex have effects on the emergence day of P. oceana, and understand how those factors would affect the peak periods for the emergence to occur.In this work we first identify the pattern of the emergence time as following a mixture distribution, then distinguish whether there are significant differences on the mixture distributions of emergence day under different factor influences.Later the peak periods with the estimated mixture distributions are estimated through the mean estimate for each component.In order to do that, likelihood ratio tests are used here to accomplish our investigations.To perform the likelihood ratio tests, we have to determine what kind of mixture distribution function are appropriate for the data in hand first.Moreover, among the many popular existing methods for computing the maximum likelihood estimations(MLE), which one is more suitable to be used for our data will be explained in the following sections.The approaches and the methods we adopt here will be briefly described although still keep the presentation clear.
In Section 2 we introduce the definition of a finite mixture distribution and different existing methods to estimate the parameters, such as the method of moments and the maximum likelihood.The computation algorithms for finding MLE of parameter vectors of mixture distributions under different considerations to accommodate the practical situation are also introduced, namely, the EM algorithm proposed first by Demmpster, Laird,and Rubin (1977), the EM algorithm for grouped and truncated data proposed by McLachlan and Jones (1988) and Newton-Raphson iterative scheme by Hasselblad (1966).Later the mixture of normal distributions as well as the mixture of logistic distributions, both will be used to analyze our data.Moreover a method for estimating a probability density function nonparametrically proposed by Minnotte (1998) is considered, so as to recompute data in each interval and estimate mixture distributions afresh for comparison with the results from parametric models.The Pearson's chi-squared goodness of fit test for ascertaining whether an assumed probability distribution is consistent with a given set of data is also stated.In Section 3 likelihood ratio tests for testing the effects of main factors are formulated, a F-test is also used for testing the factor interaction.In Section 4 we applied all these methods to the emergence data of P. oceana, and in Section 5 we conclude with a discussion.

Mixture Distributions and Methods of Estimation of Parameters
The probability density function (p.d.f.) of a mixture distribution with finite components is usually expressed as (2.1) where p i , i = 1, . . ., c, are the mixing proportions satisfing c i=1 p i = 1 and g i (x; θ i ), i = 1, . . ., c, are the p.d.f.depending on an m-dimensional parameter vector θ i , see Everitt and Hand (1981, p.4) for more details.
Assume n independent observations, x 1 , . . ., x n , were obtained from a mixture distribution, the parameters of the distribution will be estimated.Let ∆ = (θ , p ) denote the vector of all unknown parameters of (2.1), where θ = (θ 1 , . . ., θ c ) and p = (p 1 , . . ., p c ) are the parameter vectors to be estimated.Many methods have been proposed and used for estimating ∆.Two of the wellknown methods namely the method of moments and maximum likelihood will be used in this work.
The problem of estimating the parameters in a mixture of two normal distributions was first considered by Karl Pearson (1894) where the method of moments was used.The evaluation of Pearson's estimators involved the solution of a ninth degree polynomial equation.Cohen (1967) presented a procedure with circumvents solution of the nonic equation which reduces the total computational effort required.But it is not difficult to find the solution with the help of computer today.
It is well known that the maximum likelihood estimator (MLE) has a number of desirable statistical properties.In the case of mixture distributions, some computational methods for finding the MLE is described briefly in the following subsection.

Algorithms for computing the MLE EM algorithm
EM algorithm is a powerful and useful tool for finding the MLE, which was proposed first by Dempster, Laird, and Rubin (1977).The M-steps and E-steps are repeated iteratively until some convergence criterion is satisfied.

EM algorithm for grouped and truncated data
In practice, data collected on the phenomenon of interest are frequently available only in grouped form and may also be truncated.In our study here, as the emergence number is recorded daily, the data obtained can be regarded as in grouped form, also we find that the incidence of emergence 21 days before and 56 days after the initial emergence time is very rare, we therefore have truncated our observations before 21 days and after 56 days of initial emergence.It seems to be interesting for us to see what influence it would have to our estimates under this kind of consideration.McLachlan and Jones (1988) proposed an EM algorithm for grouped and truncated data.More details can be found in the above paper.

Newton-Raphson iterative scheme
Other than using EM algorithm to obtain the MLE, it is also quite common to find the MLE by using a Newton iterative scheme (Hasselblad (1966)).Estimators obtained by method of moments can be regard as the initial values of the EM algorithm and the Newton's iteration procedure.

Density estimation with binned data
Other than estimating the p.d.f.parametrically assuming a possible form for the p.d.f., the approach of estimating the p.d.f.nonparametrically is also considered here for comparison.One popular method for estimating a p.d.f.nonparametrically is known as the kernel density estimation, where Minnotte (1998) proposed a method for achieving higher-order convergence rates in nonparametric density estimation.This will also be used for comparison here.

Pearson's chi-squared goodness-of-fit test
As soon as we have obtained the estimation of parameters, we need to test whether the particular estimated p.d.f. is consistent with those observed data.The Pearson's chi-squared goodness of fit test will be used.

Effects Testing
In a multi-factors experiment, once a model is fitted to the observed data in each combination of the treatments of all factors and is not rejected by the goodness of fit test, we may proceed to consider the likelihood ratio test to test the effect of each factor

Likelihood ratio test
To test θ ∈ Θ 0 versus θ ∈ Θ 1 , the well known likelihood ratio test statistic is defined through where Θ = Θ 0 ∪Θ 1 .Under some additional regularity conditions, the asymptotic distribution of the statistic −2logλ is χ 2 r−m , where Θ is an r-dimensional subset of R r and Θ 0 is an m-dimensional subset of Θ.One rejects θ ∈ Θ 0 whenever −2logλ > C, where C is determined by the desired level of the test.
For example, in our study here, for the observed data sets of female and male at different temperatures as in Figure 1, let ∆ i , i = 1, . . ., 6, denote the parameter vectors of corresponding distribution, respectively.If we want to test the effect of sex under temperature 20 o C, the test statistic used here is through where L( ∆i ) = n j=1 f (x j ; ∆i ), i = 1, 2 are the maximized likelihood for female and male at 20 o C respectively, ∆ 12 is the parameter vector of the combined data sets of female and male at temperatures 20 o C, and L( ∆12 ) is the maximized likelihood of the combined data set of female and male at 20 o C; the degrees of freedom equals to the difference between the sum of the number of parameters of ∆ 1 and ∆ 2 and the number of parameters of ∆ 12 .To test the effect of temperature under female, the test statistic λ 135 is adopted in the same manner, where Regardless of different temperatures, the entire effect of sex can be tested by The degrees of freedom equal to the difference in dimension between the sum of the number of parameters of ∆ 12 , ∆ 34 and ∆ 56 and the sum of the number of parameters of ∆ i , i = 1, . . ., 6. Similarly, the entire effect of temperature can be tested by . (3.5)

F-test
To test the interaction effect between temperature and sex, we use the Fstatistics comparing effect of factor A at low level of factor B with respect to effect of factor A at high level of factor B as follows.
for ij, hk∈ {12, 34, 56} and the critical region would be the tails of the F distribution with corresponding degrees of freedom.The reason for using F temp , ij = 12, hk = 34 to test the interaction between sex and temperatures at 20 o C and 25 o C is that we are testing whether the effect of sex under 20 o C would be significantly different from the effect of sex under 25 o C, and so on.Then if the value of F temp or F sex is too small or too large, it means that the differences of variation of sex between different temperatures are significant or the differences of variation of temperature between female and male are significant, then it indicates the effect of interaction does exist.

Case Study
We now examine the effects of sex and temperature on the day of emergence of P. oceana.In a laboratory trial, the ova fertilized at the same day were collected and kept in constant temperature 20 o C, 25 o C and 30 o C separately.The observed counts of emergence of female and male P. oceana at 20 o C, 25 o C and 30 o C from the 21th day to 56th day were recorded every day as are presented in Table 1.The total observed counts of each combination n is listed at the last row of Table 1.The 21th and 56th day were the lower and upper truncated values, and 36 grouped intervals with 1 day equal width.

Mixture of normal distributions
The sets of observed frequency counts in histogram form given in Figure 1 suggests that the time distributions are mixture models, which also seems to be able to explain the real situation reasonably.Since the experimenter collected data in grouped and truncated form, we first fit mixture of three normal distributions to these data via formulas deduced by McLachlan and Jones(1988).The estimate of parameter vectors ∆ i , including parameters µ 1i , σ 1i , µ 2i , σ 2i , µ 3i , σ 3i and proportions p 1i and p 2i , i = 1, . . ., 6, are listed at the first row of Table 2 and denoted by NL.But the fits are rejected since the modes of the histograms have higher frequencies and smaller variations than normal distributions do.Hence a mixture of logistic distributions is considered to fit to each data set.

Mixture of logistic distributions
Since mixture of normal distributions does not work well, we have tried some other distributions.It seems that mixture of logistic distributions is more suitable.In order to obtain the MLE of mixture of logistic distributions, the formulas for performing EM algorithm and EM algorithm for grouped and truncated data are derived and used in our estimates, the derivation can be obtained following the steps illustrated in the corresponding papers and is omitted.
The p.d.f. of logistic distribution is To estimate the MLE of parameter vector ∆ = (µ 1 , β 1 , µ 2 , β 2 , µ 3 , β 3 , p 1 , p 2 ) of a mixture of three logistic distributions four methods were used.The first is the EM algorithm where the effect of truncation is not considered since the part of the tails seems to be ignorable.Moreover, since the experimental data were collected in grouped form, we regard the individual observations x in each interval as the central point of the interval.For example, at 20 o C we observed that there were 3 female P. oceana emerged at the 26th day, then x 3 = x 4 = x 5 = 25.5.According to this rule and EM algorithm for mixture of logistic distributions, we obtained the estimates of parameter vectors ∆ i , i = 1, . . ., 6, listed at row 2 of Table 2 and denoted by EM.
Secondly, we estimate the parameter vectors with consideration of data are initially grouped and truncated or by the Newton-Raphson.The results are listed at row 3 and 4 of Table 2 and denoted by GT and NR respectively.
The last method we used is fitting a nonparametric density, to each set of experimentally observed data in order to smooth the original rugged curve.Then estimate the probable counts at each day by multipling total number to this nonparametric density, and re-estimate the parameter vectors of mixture of logistic distributions.The results are listed at row 5 of Table 2 and denoted by BD.
Compare the results of these four methods presented in Table 2, there is no significant difference between the estimates obtained by EM algorithm, EM algorithm for grouped and truncated data and Newton scheme except for ∆4 .The purpose of estimating a probability function nonparametrically is to smooth down the rough data style initially so as to obtain a better parametric form.But the estimates obtained by BD do not make significant differences here.On the other hand, it seems that the estimates obtained by GT has comparatively larger log-likelihood value listed at the bottom of Table 2 than the others and the curve of the density function with GT estimate also has a better peak estimate at the mode than the others.Hence in the following the discussion will be restricted to the estimates obtained by GT only, and the plots of the estimated density function for each data set are also presented in Figure 2.

Distribution fits
To test whether the mixture of logistic distributions is consistent with the observed data, the Pearson's chi-square goodness-of-fit statistic is considered.Before we calculate the test statistics value, we make two adjustments on some intervals.First we have combined several intervals with extremely small probabilities P j ( ∆), since even only one observation occurred in this kind of intervals will cause the test statistics value increased rapidly.Next, we have combined two neighboring intervals when in one interval the observed counts decrease rapidly and in the next interval the response increases to the seemly normal counts.The reason for this adjustment is because no matter what type of distribution was fitted, the unsmoothed rapid increase or decrease would almost always cause the null hypothesis of fit by a distribution with smooth p.d.f.being rejected.Hence combining intervals to alleviate the concussion seems to be a reasonable thing to do.Then the degrees of freedom are reduced accordingly.The results are given in Table 3 for each data set along with the associated p-value.It shows that those fittings are acceptable.

Effects significance of temperature and sex
From Table 3, it can be seen that three-component logistic distribution fits the data sets of 20 o C and 25 o C and two-component model fits the data sets of 30 o C quite well.The estimates of the parameter vectors at the same temperature are close for female and male.It seems that the effect of temperature is more significant than sex.In order to test the effect of sex and temperature, we combine the data sets for different combinations and fit mixture of logistic distributions to each combined data set.The MLE for combined data set of sets i and j or sets i , j and k are denoted by ∆ij and ∆ijk respectively, which are displayed in Table 4.The maximized likelihood L( ∆) of each data set are listed in Table 3 and Table 4 separately.Consider the likelihood ratio test statistics λ in (3.2) and (3.3), the effect of sex at 20 o C, 25 o C and 30 o C are indicated by λ 12 , λ 34 and λ 56 respectively; the effect of temperature for female and male are indicated by λ 135 and λ 246 respectively.All these value are listed in Table 5.Now, we test the effect of interaction between temperature and sex, i.e., the null hypothesis is H 0 : there is no interaction between temperature and sex.We check each of the statistics F temp and F sex as in (3.6) and (3.7) at different levels of temperature and sex.The test results are shown in Table 5 and we conclude that there is no indication that there is significant interaction between temperature and sex.This seems reasonable since in Figure 2 it shows that the patterns of female and male at the same temperature are similar and in Table 3 the scales and locations of the distributions at the same temperature seem to be quite close.
Next, we are going to test the effect of sex under the consideration of no significant interaction between temperature and sex, i.e., the null hypothesis is H 0 : there is no sex effect.The results listed in Table 6 show that at level α = .05the entire effect of sex tested by λ s as in (3.4) is significant, since p-value of −2logλ s is .0002.The effect of sex at different temperatures are tested by the likelihood ratio test statistic as in (3.2).As shown in Table 6, the effect of sex at 20 o C is less significant and is significant when the temperature is 25 o C or 30 o C. The most significant one is at 25 o C where the p-value is .001.Finally, we test the effect of temperature, i.e., the null hypothesis is H 0 : there is no temperature effect.Through the same argument as testing the effect of sex, the likelihood ratio test statistics λ t as in (3.5) is used, and it is not surprising that the effect is significant as shown in Table 7.The effects of temperature on female and male are tested by λ 135 and λ 246 separately, both show that the effect is significant, since the p-values are almost zero as shown in Table 7. Compare the results given in Tables 6 and 7, it is clear that the effect of temperature is more significant than sex.The p-values of goodness-of-fit test for the MLE of combined data sets listed in Table 8 which show that the fitting of ∆34 and ∆135 are not suitable and this is reasonable since we have already shown that the effect of sex at 25 o C and the effect of temperature for female are more significant than others, it indicates that there is an interaction between temperature and sex but it is not as significant enough so as to make the total interaction become significant.

Other related results
In Figure 2 it seems that the curve has only two peaks, but in Table 3, a threecomponent logistic distribution has been fitted to the data sets of 25 o C. Note that the third peak is estimated to occur at the 42th day close to the second one at the 39th day, so that the third peak in the curve is not easy to be distinguished just by a rough observation.The suggestion of a third peak is unexpected from our point of view.As the laboratory conditions excluded the possible effects related to semilunar tidal fluctuations that may have a dominating influence in nature.Thus the lab results helps to identify all the possible factors that have minor effects.The result of this analysis does not support the existence of an endogenous semilunar rhythm in controlling the emergence dates of this marine midge.Moreover, the difference between males and females from the analysis, nevertheless, suggests that sex hormone may be involved in affecting the emergence dates.The third peak implies that there are factors we never suspected.
Furthermore, from the analysis it shows that amounts of emergence of P. oceana are the highest at 25 o C. Also, it can be seen from the fitted models that the higher the temperature is, the higher proportion of emergence at the first high peak is.Finally, the models show that the day of emergence of P. oceana has a high peak first at 30 o C, next at 25 o C and 20 o C the last.That is, high temperature advances the day of emergence.In nature there are cues to constrain which days are allowed to emerge ( new moons and full moons).Advanced emergence may be observed within those allowed dates.The temperature effect may determine the proportion to emerge in the first available window allowable (emergence dates).Moreover from the experimental data and the estimates of parameters p 1 , p 2 , µ 1 , µ 2 , β 1 , β 2 etc., we are able to estimate the number of populations and the corresponding peak date of the emergence for experiments in the laboratory, which is helpful to study the synchronous emergence pattern of P. oceana.For the pattern in nature there are still a lot of unanswered questions to be investigated.

Discussion
In the process of finding suitable models for the data, some problems have arised.It can be seen that for the male data sets, the fits are not as good for the mode heights using the mixture of normal distributions.Some other distributions have also been used to fit these data sets, but it seems that the mixture of logistic distributions approaches the high peak of the observed data more closely.This is an interesting phenomenon as the logistics seems to be able to provide a steeper pattern for the density estimate than other types of continuous distributions, such as the normal.The major differences in the ability to fit data with steep mode between these two families of distributions are of interest for further investigation.In the case of our study the observed mixture density pattern is not quite smooth due to the nature of the midge, but a mixture of logistic fit does seem to be able to present a reasonable approximation of the practical situation quite well.It is worth noting that through rigorous statistical analysis presented here, it helps to provide an objective estimation on the distribution of the emergence dates as well as the corresponding proportions and the peak synchronous emergence dates in each period under different factor effects.We only start to speculate its possible adaptive meaning after the differences have been established as a true phenomenon.From this study, it reveals some additional biological phenomena worthy of more investigations.

Figure 1 :
Figure 1: Histograms for quantities of emerged P. oceana

Figure 2 :
Figure 2: Plots of mixture of logistic distributions

Table 1 :
Observed frequency counts 20 o C 20 o C 25 o C 25 o C 30 o C 30 o C Day Female Male Female Male Female Male

Table 2 :
Results of fitting a three(two)-component mixture of logistic distributions to day of emergency (Standard errors are in parentheses)

Table 3 :
Goodness-of-fit test for mixture of logistic distributions by GT (Standard errors are in parentheses)

Table 4 :
Results of fitting a three-component mixture of logistic distributions to combined data (Standard errors are in parentheses)

Table 5 :
Analysis of interaction between temperature and sex -H 0 : no interaction between temperature and sex

Table 6 :
Likelihood ratio test under the hypothesis -H 0 : the effect of sex is not significant

Table 7 :
Likelihood ratio test under the hypothesis -H 0 : the effect of temperature is not significant