A New Polya Tree Construction Facilitating A Goodness-of-Fit Test

Abstract: The Polya tree, by embedding parametric families as a special case, is naturally suited to testing the goodness of fit of a parametric null against nonparametric alternatives. For this purpose, we present a new Polya tree construction for random probability measures that supports an easy multiple χ2 test of goodness of fit. Simulation studies and a data analysis are provided to highlight the performance of the proposed methods.


Introduction
The Polya tree (PT), as a random probability measure, has recently received increasing attention because it is far more tractable than general tailfree processes and, unlike the Dirichlet process (Ferguson, 1973, 1974), can generate continuous or absolutely continuous probability measures. The basic idea of Polya trees arises from Ferguson (1974), Lavine (1992, 1994), and Mauldin, Sudderth, and Williams (1992). Later generalizations, such as mixtures of Polya trees (Hanson and Johnson, 2002; Hanson, 2006), multivariate Polya trees (Paddock et al., 2003), and optional Polya trees (Wong and Ma, 2010), were developed and have flourished in many areas, for example Polya trees for spatial frailty modeling (Zhao et al., 2009), the Polya tree sampler, Polya trees for multiple-sample tests, and Polya trees for truncated data.
A Polya tree F is defined as F ~ PT(Π, A) (Lavine, 1992, 1994). The parameter Π = ∪_{j=0}^∞ Π_j is a nested sequence of partitions, where the jth partition Π_j = {B_ε : ε ∈ E_j} consists of 2^j disjoint base sets B_ε, indexed by the binary strings ε ∈ E_j = {0, 1}^j of length j, with E_0 = {∅}; that is, B_ε ∩ B_ε′ = ∅ whenever ε ≠ ε′. Specifically, Π_0 = {Ω}, where Ω is a separable measurable space, and B_∅ = Ω. In general, given an assumed parametric distribution F_θ (also called the centering distribution), the nested partitions are obtained in such a way that, at each partition Π_j for j ∈ {1, 2, ..., J}, each of the 2^j base sets B_ε satisfies F_θ(B_ε) = 1/2^j. For example, at Π_1, B_0 = (F_θ^{-1}(0), F_θ^{-1}(1/2)] and B_1 = (F_θ^{-1}(1/2), F_θ^{-1}(1)]. The refinement of a partition, say to Π_{j+1}, is obtained by first independently assigning Beta priors to the 2^j conditional probabilities. Specifically, let B_ε ∈ Π_j be the parent of the two children B_ε0 and B_ε1, with B_ε0 ∪ B_ε1 = B_ε and B_ε0 ∩ B_ε1 = ∅; define Y_ε0 = F(B_ε0 | B_ε) and assign the prior Y_ε0 ~ Beta(α_ε0, α_ε1). Note that Y_ε0 + Y_ε1 = 1, where Y_ε1 = F(B_ε1 | B_ε). The posterior of Y_ε0 is conjugate, i.e., Y_ε0 | Data ~ Beta(α_ε0 + n_ε0, α_ε1 + n_ε1), where n_ε = Σ_{i=1}^n I(x_i ∈ B_ε) and I(x ∈ A) is the indicator function, equal to 1 if x ∈ A and 0 otherwise; in particular, n_∅ ≡ n is the size of the data set. The nonnegative parameter A is the set of hyperparameters α for the Beta distributions. By continuing to refine every base set at every partition, the true distribution F is recovered asymptotically (Lavine, 1992, 1994). In Section 2, we introduce a new Polya tree construction that is similar to this scheme except in one respect: at each partition Π_j, the base sets are obtained directly from the previously refined distribution F_{j-1} at Π_{j-1}, instead of from the common assumed F_θ.
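The conjugate update above is easy to sketch. The following Python fragment, a minimal illustration under assumptions of ours (a standard normal centering distribution, the common choice α_j = c·j², and a support truncated to [-10, 10]; the helper names `posterior_beta_params` and `quantile` are our own), computes the posterior Beta parameters for every conditional split up to level J:

```python
import math
import random

def F_theta(x):
    """Centering distribution: a standard normal CDF (an illustrative choice)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantile(p, lo=-10.0, hi=10.0):
    """Invert F_theta by bisection on a truncated support."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if F_theta(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def posterior_beta_params(data, J, c=1.0):
    """For each level j <= J and each parent cell, return the posterior Beta
    parameters of Y_{eps0} = F(B_{eps0} | B_eps), using alpha_j = c * j**2."""
    post = {}
    for j in range(1, J + 1):
        alpha = c * j * j
        for k in range(2 ** (j - 1)):
            # parent B_eps carries F_theta-mass (k, k+1)/2^(j-1); split it in half
            a = quantile(k / 2 ** (j - 1))
            m = quantile((2 * k + 1) / 2 ** j)
            b = quantile((k + 1) / 2 ** (j - 1))
            n_left = sum(1 for x in data if a < x <= m)
            n_right = sum(1 for x in data if m < x <= b)
            post[(j, k)] = (alpha + n_left, alpha + n_right)
    return post

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
post = posterior_beta_params(data, J=3)
```

At level 1 the two cell counts sum to n, so the two posterior parameters sum to 2α_1 + n, mirroring the conjugacy statement above.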
In Section 3, a multiple χ2 test with adjusted Bonferroni corrections (BC) is constructed to examine whether a nonparametric model is needed when the centering parametric model F_θ is pre-specified. We conduct simulation studies in Sections 4.1 and 4.2 to examine the performance of the methods proposed in Sections 2 and 3, respectively. The Galaxy data set is studied in Section 5, and Section 6 concludes.

Construction
Let x_i ~ F for i = 1, 2, ..., n, where n is the sample size and F is an unknown continuous probability distribution function with support domain Ω, constructed as follows. Given the refined partition Π_j and the associated refined distribution F_j, and similarly to the typical Polya tree construction, let r_ε0 = F_{j+1}(B_ε0 | B_ε) and r_ε1 = F_{j+1}(B_ε1 | B_ε), so that r_ε0 + r_ε1 = 1, where B_ε0 and B_ε1 are the two children of B_ε satisfying B_ε0 ∪ B_ε1 = B_ε and B_ε0 ∩ B_ε1 = ∅. By independently assigning a Beta prior r_ε0 ~ Beta(α_ε0, α_ε1), the conjugate posterior of r_ε0 is

    r_ε0 | Data ~ Beta(α_ε0 + n_ε0, α_ε1 + n_ε1),    (1)

where n_ε0 and n_ε1 are defined as for the typical Polya tree in Section 1. Obviously, the posterior of r_ε0 refines F_j on the base set B_ε0 given B_ε. If we keep refining all base sets at Π_{j+1} in this way, F_j is eventually refined at level j + 1; we denote this refinement by F_{j+1}. However, unlike the typical Polya tree, the base set B_ε0 is obtained from the current refined distribution, B_ε0 = {x ∈ B_ε : F_j(x) ≤ m_ε0}, where

    m_ε0 = (min_ε + max_ε)/2.    (3)

Before illustrating how min_ε and max_ε are obtained, first define the nested partitioning m-matrix, whose entries m[j, k] store the partition values m_ε0 from Equation (3), together with a companion b-matrix, whose entries b[j, k] store the refined probabilities

    b_ε0 = min_ε + r_ε0 (max_ε − min_ε),    (6)

where r_ε0 is the posterior obtained from (1). Now define min_ε = b[j, (k_ε0 − 1)/2] if B_ε is not the leftmost child in Π_j, and min_ε = 0 otherwise; and max_ε = b[j, (k_ε0 + 1)/2] if B_ε is not the rightmost child in Π_j, and max_ε = 1 otherwise. Here k_ε0 is the (odd) position index of the new partition point at level j + 1, e.g., k_00 = 1 and k_10 = 3, and b[i, j] denotes the element of the b-matrix in the ith row and jth column. Then, for x ∈ B_ε ∈ Π_j, the refined distribution function F_{j+1}(x) at Π_{j+1} can be obtained from

    F_{j+1}(x) = min_ε + 2 r_ε0 (F_j(x) − min_ε)  for x ∈ B_ε0,
    F_{j+1}(x) = b_ε0 + 2 r_ε1 (F_j(x) − m_ε0)   for x ∈ B_ε1.    (7)

Here, F_0 (i.e.
F_θ) can be chosen arbitrarily from any continuous parametric family as long as Ω_F ⊆ Ω_{F_θ} is satisfied, where Ω_F and Ω_{F_θ} are the support domains of the true distribution F and the centering distribution F_θ respectively; this guarantees convergence to the true distribution F (Theorem 1 in Lavine, 1994). Note that, throughout the construction, we have F_j(B_ε0 | B_ε) = 1/2, because each partition point is placed where F_j reaches the midpoint of its cell's refined mass. Illustratively, starting from the first partition level j = 1: since min_∅ = 0 and max_∅ = 1, the partition value is m_0 = (0 + 1)/2 = 0.5, and thus, by (6), the refined b_0 equals the posterior of r_0, i.e., F_1(B_0 | Ω). The refined distribution at j = 1, for x ∈ Ω, then follows from Equation (7) with min_∅ = 0 and max_∅ = 1. Continuing by refining the base set B_0 at j = 2, we get B_00 = {x ∈ B_0 : F_1(x) ≤ m_00} with m_00 = (0 + b_0)/2, and b_00 is obtained from Equation (6). Recursively, for each base set at each partition level, the corresponding elements of the m-matrix and the b-matrix are refined in nested order. As long as Ω_F ⊆ Ω_{F_θ}, this refinement process has full support on the set of all continuous probability measures as j → ∞ (Theorem 1 in Lavine, 1994). However, it is not necessary to fit an infinite Polya tree: on robust-Bayesian grounds, near the bottom of the tree the refinement of the partitions is quite likely to depend only on the prior (Lavine, 1994). Thus the Polya tree can be updated only up to a predetermined level, provided the α's increase rapidly toward the bottom of the tree so as to guarantee probability one for the set of absolutely continuous distributions. Throughout the paper, we cap j at J = ⌊log_4(n)⌋ and find that this works well in the simulation studies.
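To make the recursion concrete, here is a minimal Python sketch of one refinement sweep under simplifying assumptions of ours: posterior means stand in for posterior draws of r, the within-cell interpolation of Equation (7) is replaced by plain linear interpolation, and Ω is truncated to a finite interval covering the data. The names `refine_once` and `inv_interp` are our own illustrative choices.

```python
import random

def inv_interp(xs, bs, q):
    """Piecewise-linear inverse of the refined CDF given by points (xs, bs)
    (a simplification: the construction interpolates via Equation (7))."""
    for i in range(len(xs) - 1):
        if bs[i] <= q <= bs[i + 1]:
            t = (q - bs[i]) / (bs[i + 1] - bs[i]) if bs[i + 1] > bs[i] else 0.5
            return xs[i] + t * (xs[i + 1] - xs[i])
    return xs[-1]

def refine_once(xs, bs, data, alpha):
    """One sweep of the proposed refinement: split every cell at the point
    where the current F_j crosses the midpoint of the cell's refined mass,
    then set the new CDF value there from the Beta posterior mean of r."""
    new_xs, new_bs = [xs[0]], [bs[0]]
    for i in range(len(xs) - 1):
        m = inv_interp(xs, bs, 0.5 * (bs[i] + bs[i + 1]))   # conditional median
        n_left = sum(1 for x in data if xs[i] < x <= m)
        n_right = sum(1 for x in data if m < x <= xs[i + 1])
        r_post = (alpha + n_left) / (2 * alpha + n_left + n_right)
        new_xs += [m, xs[i + 1]]
        new_bs += [bs[i] + r_post * (bs[i + 1] - bs[i]), bs[i + 1]]
    return new_xs, new_bs

random.seed(7)
data = [random.gauss(0.0, 1.0) for _ in range(256)]
# truncate Omega to a finite interval covering the data (a practical shortcut)
xs, bs = [min(data) - 1.0, max(data) + 1.0], [0.0, 1.0]
J = 4                                    # J = floor(log4(256))
for j in range(1, J + 1):
    xs, bs = refine_once(xs, bs, data, alpha=1.0 * j * j)   # alpha_j = c j^2, c = 1
```

After J sweeps, (xs, bs) holds the 2^J + 1 partition points and the refined CDF values b at those points; by construction each split carries half of its cell's refined conditional mass.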
There are many ways to choose the hyperparameters α. One typical choice is α_ε = c·j² for ε ∈ E_j, which makes F absolutely continuous with probability 1 (Ferguson, 1974); another is α_ε = c/2^j, which makes F a Dirichlet process (Blackwell, 1973; Blackwell and MacQueen, 1973). The parameter c > 0 can be considered a precision parameter (Ferguson, 1973, 1974; Lavine, 1992, 1994): as c → ∞, F → F_θ, while as c → 0, an empirical distribution is obtained. Usually, one can let c = 1 or assign a prior c ~ Γ(5, 1) (Hanson, 2006).
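The two canonical choices can be written as a one-line helper; the function name `alpha` and the `scheme` labels are our own illustrative conventions:

```python
def alpha(j, c=1.0, scheme="continuous"):
    """Two canonical hyper-parameter choices at level j:
    'continuous': alpha_j = c * j**2 -> F absolutely continuous (Ferguson, 1974)
    'dirichlet':  alpha_j = c / 2**j -> F a Dirichlet process (Blackwell, 1973)"""
    return c * j * j if scheme == "continuous" else c / 2 ** j
```

The precision role of c is visible here: scaling c up inflates every Beta prior relative to the cell counts, pulling F toward F_θ, while c near zero lets the counts dominate.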
For inference purposes, the quantile function F_J^{-1}(q) of a finite Polya tree, where 0 ≤ q ≤ 1, can easily be calculated by a grid-search algorithm: start from an arbitrarily chosen x*, then increase x* by a small step δ if F_J(x*) < q, or decrease it by δ if F_J(x*) > q.
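The grid search just described can be sketched as follows; we add a stopping rule (return once the grid point straddles q) so the walk terminates, and the name `quantile_grid` is our own. The demo inverts a standard normal CDF rather than a fitted F_J:

```python
import math

def quantile_grid(F, q, x_start=0.0, delta=1e-3, max_steps=10**6):
    """Grid search for F^{-1}(q): step x up by delta while F(x) < q, step
    down while the previous grid point still satisfies F(x - delta) >= q."""
    x = x_start
    for _ in range(max_steps):
        if F(x) < q:
            x += delta
        elif F(x - delta) >= q:
            x -= delta
        else:
            return x          # F(x - delta) < q <= F(x): crossing found
    return x

# example: the 97.5% quantile of a standard normal CDF
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
z975 = quantile_grid(Phi, 0.975)
```

The returned point is within δ of the true quantile; bisection would be faster, but the simple walk matches the description above.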

Goodness-of-Fit Test
Polya trees are naturally suited to goodness-of-fit testing because they can embed continuous parametric families as a special case (Berger and Guglielmi, 2001). Along this line, we propose in this section a new approach to answer the question of whether it is necessary to fit a nonparametric model when the centering parametric model F_θ is pre-specified. By the construction in Section 2, the answer to this question is naturally the answer to the following multiple comparisons test:

    H_0j: F_{j-1} = F_j versus H_1j: F_{j-1} ≠ F_j, for j = 1, 2, ..., J.    (10)

A multiple χ2 test, a natural approach to such a goodness-of-fit question, is then constructed by comparing, at each partition, the refined frequencies of the base sets with the frequencies before refinement. How do we obtain the χ2 test statistic at each partition level? By the construction, at partition Π_2 for example, before the refinement we have F_1(B_00) = m_00, F_1(B_01) = b_0 − m_00, F_1(B_10) = m_10 − b_0, and F_1(B_11) = 1 − m_10, whereas after the refinement F_2(B_00) = b_00, F_2(B_01) = b_0 − b_00, F_2(B_10) = b_10 − b_0, and F_2(B_11) = 1 − b_10. Consequently, the frequencies of the base sets in Π_2, before and after the refinement, are simply n times these probabilities. Thus we define the test statistic for the jth individual hypothesis test as

    ST_j = Σ_{ε ∈ E_j} [n F_j(B_ε) − n F_{j-1}(B_ε)]² / [n F_{j-1}(B_ε)].

If the jth equality holds, i.e., F_{j-1} = F_j, then ST_j follows a χ2 distribution with 2^j − 1 degrees of freedom, denoted ST_j ~ χ²_{2^j−1}. If all J equalities in (10) can simultaneously be shown to hold, we conclude that there is no need to fit a nonparametric model, since the parametric one, F_θ, already describes the data well.
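The level-j statistic is a standard Pearson comparison of refined versus pre-refinement cell frequencies, and can be sketched in a few lines (the name `chisq_stat` is ours; the degrees of freedom at level j are 2^j − 1):

```python
def chisq_stat(counts, probs):
    """Pearson chi-square statistic: observed level-j cell counts versus
    expected counts n * p under the previous refinement's cell probabilities."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

# four cells at level j = 2, compared against equal expected probabilities
stat = chisq_stat([30, 20, 25, 25], [0.25, 0.25, 0.25, 0.25])
```

With n = 100 and expected count 25 per cell, only the first two cells contribute, giving (25 + 25)/25 = 2, to be referred to a χ2 distribution with 2² − 1 = 3 degrees of freedom.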
Maintaining the familywise error rate (FWER) across multiple dependent or independent hypothesis tests is another critical concern. An easy and often-used approach is the Bonferroni correction, which tests each individual hypothesis at significance level α/J, where J is the number of hypotheses under the null. Its validity follows from Boole's inequality: FWER = Pr{∪_j (p_j ≤ α/J)} ≤ Σ_j Pr(p_j ≤ α/J) ≤ J·(α/J) = α. Based on our construction, however, giving each individual hypothesis the same correction might not be appropriate: as j increases, a prior-driven pattern becomes more likely. We therefore adjust the Bonferroni correction by assigning a smaller correction α_j to a larger j, to compensate for the greater chance of witnessing a prior-driven pattern at that level. To this end, define weights w_j, decreasing in j and summing to one, so that if j > j′ then w_j < w_j′. The adjusted Bonferroni correction for the jth individual hypothesis test is then α_j = w_j α. If p_j < α_j, where p_j is the p-value of the jth individual test obtained from the jth χ2 test, we conclude that there is a significant difference between F_{j-1} and F_j; otherwise, the equality F_{j-1} = F_j is retained.
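The adjusted thresholds are easy to compute once the weights are fixed. The specific weight formula is not reproduced here; as an illustrative assumption we take w_j proportional to 2^{-j}, normalized to sum to one, which is decreasing in j and preserves the FWER bound via Boole's inequality. The name `adjusted_levels` is our own:

```python
def adjusted_levels(J, alpha=0.05):
    """Per-level thresholds alpha_j = w_j * alpha with weights decreasing in j.
    Illustrative weight choice (an assumption, not the paper's formula):
    w_j proportional to 2**(-j), normalized so the weights sum to one."""
    raw = [2.0 ** (-j) for j in range(1, J + 1)]
    total = sum(raw)
    return [alpha * r / total for r in raw]

levels = adjusted_levels(J=4)
```

Because the weights sum to one, Σ_j α_j = α, so Boole's inequality again gives FWER ≤ α, while deeper, more prior-driven levels face stricter thresholds.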

Model Fit
In this section, we examine the performance of the PT construction proposed in Section 2. Data are generated from the simulation cases below with sample sizes n = 100, 200, 500, and 1000; to keep the paper concise, we report only the results for n = 500 and n = 1000. The results for n = 100 and n = 200 behave quite similarly, except that the variance is larger than that obtained from a larger sample. Under each simulation case, we consider two different centering distributions F_θ. We cap J at ⌊log_4(n)⌋ and set c = 1. One hundred fitted distribution functions F_J are plotted in each of Figures 1 to 12.

1. x ~ Γ(2, 5)
   a: F_θ = Γ(shape = 1, scale = 0.5)

From the simulation study (Figures 1-9 and Figure 11), we find that, as long as Ω_F ⊆ Ω_{F_θ} is satisfied, the fitted distribution F_J approaches the true distribution F very well, and the deviation between F and F_J decreases as the sample size increases. However, when Ω_F ⊃ Ω_{F_θ} (Figures 10 and 12), F_J fails to approach the true distribution F because F_Γ(x) is undefined for x ≤ 0; see simulation case 3(b). Obtaining a robust F_J is actually very simple: one can, for example, always choose a centering distribution with Ω_{F_θ} ≡ R to guarantee Ω_F ⊆ Ω_{F_θ}; or one can conduct a pilot study, which not only helps find an appropriate support domain but also helps the convergence rate. Here, a good convergence rate means that F_J can approach F with a small sample size.
To further illustrate the help of a pilot study, we conduct another simulation study. Let x ~ T(d.f. = 30) and n = 50. Applying the R built-in function qqnorm, we found the data approximately normally distributed, and from the sample mean and sample variance we found that a standard normal distribution might be an appropriate choice of F_θ. We therefore set F_θ^1 = N(mean = 0, sd = 1). For comparison, we chose another centering distribution, F_θ^2 = T(d.f. = 1). Clearly, when centering at F_θ^1 (see Figure 13), the 100 fitted distributions F_J approach the true distribution F well compared to those centered at F_θ^2 (see Figure 14). In conclusion, a good centering distribution helps improve the convergence rate.

Goodness-of-Fit Test
In this section, we study the performance of the goodness-of-fit procedure introduced in Section 3. Data are generated from the cases below with sample sizes n = 100, n = 200, and n = 500.

• (iii) x ~ SN(location = 0, scale = 1, shape = 2) (Azzalini, 1985)
  a: F_θ = SN(location = 0, scale = 1, shape = 2)

For each (a) scenario, the centering distribution F_θ is chosen to be exactly the true distribution F; for the other scenarios, F_θ is chosen from a set of competing distributions. For example, in case (i)(a) the centering distribution F_θ equals the true distribution N(0, 1), and the two competing centering distributions are a mean-shifted normal, case (i)(b), and a variance-shifted normal, case (i)(c). We simulate MC = 1000 datasets for each scenario and calculate the proportion of detections of a significant difference between F and F_θ, i.e., Power = Σ_{i=1}^{MC} I(Significant)/MC. For each (a) scenario this is the Type I error, since it estimates Pr(Reject H_0 | H_0); for the other scenarios it is the power, since it estimates Pr(Reject H_0 | H_1). The simulation results are listed in Table 1. Overall, the proposed method works well in these scenarios. In each (a) scenario, the Type I error stays below the significance level 0.05 and decreases further as the sample size increases, which might suggest that, for a large sample size n, a different criterion for the truncation level J should be used to increase the test power; we therefore applied J = ⌊log_3(n)⌋ for n = 500, and this criterion works well. For the other scenarios, the power increases with the sample size n and quickly approaches 1, except in scenario (ii)(b). Investigating this case further, we find the reason is that only a little information is available in the tails.
Thus, to gain more power there, we must increase the sample size so that the tree can be updated in the tails.
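The Monte Carlo computation of Power = Σ I(Significant)/MC can be sketched generically as follows; the z-test on the sample mean used in the demo is a toy stand-in of ours, not the multiple χ2 test of Section 3, and `mc_power` is our own name:

```python
import random

def mc_power(simulate_data, test_rejects, MC=1000, seed=0):
    """Monte Carlo rejection rate: a Type I error estimate when the data are
    drawn under H0, a power estimate when drawn under an alternative."""
    random.seed(seed)
    return sum(test_rejects(simulate_data()) for _ in range(MC)) / MC

# toy illustration: data drawn under H0 (standard normal, n = 100), rejection
# when |sample mean| exceeds 1.96 / sqrt(100), so the estimate is near 0.05
est = mc_power(lambda: [random.gauss(0.0, 1.0) for _ in range(100)],
               lambda d: abs(sum(d) / len(d)) > 1.96 / 10.0)
```

Swapping in a sampler for an alternative distribution turns the same estimate into the power Pr(Reject H_0 | H_1).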

Galaxy Data
In this section, we apply the proposed methods to the Galaxy data set reported by Roeder (1990). The data set, available in the DPpackage in R, contains only the velocities (km/second) of 82 galaxies, denoted x*_1, x*_2, ..., x*_n with n = 82. The transformed data x_i = log(x*_i/1000), for i = 1, 2, ..., n, are used to fit the model. A pilot study shows that it might be appropriate to take F_θ = SN(location = 3, scale = 0.25, shape = −1). We set J = ⌈log_4(n)⌉ = 4 and c = 1. The median speed is obtained by the grid-search algorithm introduced in Section 2. Repeating the fitting process 1000 times on Galaxy data re-sampled with replacement, and computing the median speed each time, we obtain a sample of median speeds from which 95% confidence intervals can easily be computed; see Table 2 for the results. For each re-sampled data set, we also conducted the goodness-of-fit test introduced in Section 3 and calculated the proportion of rejections of H_0, which was 980 out of 1000. We therefore conclude that a nonparametric model is needed if the above skew-normal distribution is specified as the centering distribution. This conclusion coincides with that of a goodness-of-fit test for the family of skew-normal models (function ks.sn in the R package GOFSN; Romero, 2012), which has a p-value less than 0.001.
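The bootstrap confidence interval for the median used above can be sketched as follows; the data here are toy values of ours (NOT the Galaxy velocities), the plain sample median stands in for the Polya-tree median of Section 2, and `bootstrap_median_ci` is our own name:

```python
import math
import random
import statistics

def bootstrap_median_ci(x, B=1000, level=0.95, seed=42):
    """Percentile-bootstrap CI for the median: resample with replacement,
    record each resample's median, take the empirical quantiles."""
    random.seed(seed)
    n = len(x)
    meds = sorted(statistics.median(random.choices(x, k=n)) for _ in range(B))
    lo_i = int((1.0 - level) / 2.0 * B)
    return meds[lo_i], meds[B - 1 - lo_i]

# toy velocities already on the x*/1000 scale (NOT the Galaxy data)
velocities = [9.2, 9.5, 10.2, 16.2, 19.5, 20.2, 21.5, 22.8, 24.1, 26.9]
data = [math.log(v) for v in velocities]
lo, hi = bootstrap_median_ci(data)
```

In the analysis above, each resample would instead be run through the full Section 2 fit, with the median read off via the grid-search quantile algorithm.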

Conclusion
In Section 2, we proposed a new construction of Polya trees for random probability measures. In Section 3, we constructed a multiple χ2 test to examine whether a nonparametric model is needed when the parametric centering distribution is pre-specified. We found, however, that the capped level J chosen for fitting a finite Polya tree is a critical concern, since both the fitting and the testing results depend on this choice. With too small a J the model may be underfitted, while with too large a J the model may be prior-driven and thus overfitted. In practice, we capped J at J = ⌊log_4(n)⌋, where n is the sample size; in simulation studies this value works well for small and moderately large samples. But when n is extremely large, the null might be incorrectly rejected: any tiny difference between the b-matrix and the m-matrix at the bottom of the tree produces a very large test statistic, so the null can be rejected even when it holds. Therefore, for an extremely large sample, one might assign J a relatively small value, apply a criterion other than J = ⌊log_4(n)⌋, or use differently weighted Bonferroni corrections.