Language Rhythm Model Selection by Weighted Kappa

Given processes that assign binary vectors to data, one wishes to test models that simulate those processes and to uncover groupings among the processes. It is shown that a suitable test can be derived from a kappa-type agreement measure. This is applied to the analysis of stress placement in spoken phrases, based on experimental data previously obtained. The processes were Portuguese speakers and the grouping corresponds to the Brazilian and European varieties of that language. Optimality Theory gave rise to different models. The agreement measure was successful in pointing out the relative fitness of the models to the language varieties.


Introduction
This work originates in the research on mathematical models for language rhythm. In particular, Portuguese is a language with two very distinct varieties, Brazilian Portuguese (BP) and European Portuguese (EP), which differ in many rhythmic aspects. One that stands out is the use of stress: when a person utters a phrase, the syllables that are stressed seem to reflect whether the person is a BP or EP speaker. A question posed in Sandalo et al. (2006) was whether such stressing could be modelled within the confines of Optimality Theory (see Kager, 1999), in such a way that the provenience of a speaker could be gleaned just by looking at the stress pattern in speech; this article is, in a sense, a complement to that work.
In Optimality Theory models there are two main ingredients: structures and restrictions. The model has to choose among the structures; the restrictions are used to define a quasi-order on the structures, and the choice is for the optimal structures (of which there may be many). For instance, from the restrictions one gets a real-valued "cost" function on the structures, and the model chooses the minimal-cost structures. This is how optimality was instantiated in that article: the structures associated to each phrase were factorizations of the phrase into segments of successive syllables. There were some feasibility criteria for the segments, such that from each segmentation one could directly read off a stress placement for the phrase. These segmentations can be conveniently encoded as paths on a directed graph, the segmentation graph of the phrase. Then, after some experimenting, a collection of restrictions was chosen; those are linguistically significant constructs and involve further information gathered from the phrase.
As a final step, there is a choice of weights for the restrictions. Each choice of restriction weights yields costs on the edges of each segmentation graph, and the preferred segmentations are those corresponding to shortest paths linking two special vertices.
In what follows, we refer to each weighting of the restrictions as a model. So, this is the data flow: from a phrase one gets the graph, from a model one gets costs in the graph, and then some paths. Those are decoded to produce a stress placement. In the end, a model produces, for each phrase, a collection of binary vectors, each describing a stress placement for the phrase.
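The phrase-to-stress pipeline described above can be sketched as a shortest-path computation on the segmentation graph. The following is a hypothetical illustration only: the feasible-segment sets, the cost function, and the stress-decoding rule below are placeholders, not the linguistic restrictions actually used in Sandalo et al. (2006).

```python
import heapq

def best_segmentation(n, segments, cost):
    """Shortest path from vertex 0 to vertex n in the segmentation graph.
    Vertices are syllable boundaries 0..n; an edge (i, j) is a feasible
    segment covering syllables i..j-1, with cost given by the model."""
    dist = {0: 0.0}
    prev = {}
    pq = [(0.0, 0)]
    while pq:
        d, i = heapq.heappop(pq)
        if d > dist.get(i, float("inf")):
            continue
        for j in segments.get(i, []):
            nd = d + cost(i, j)
            if nd < dist.get(j, float("inf")):
                dist[j] = nd
                prev[j] = i
                heapq.heappush(pq, (nd, j))
    # walk back from n to recover the chosen segment boundaries
    path, v = [n], n
    while v != 0:
        v = prev[v]
        path.append(v)
    return list(reversed(path))

def decode_stress(n, boundaries):
    """Toy decoding rule (an assumption of this sketch): stress the
    first syllable of each segment."""
    stress = [0] * n
    for b in boundaries[:-1]:
        stress[b] = 1
    return stress
```

With, say, a cost penalizing segments away from length 2, a four-syllable phrase with segments `{0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [4]}` yields the segmentation `[0, 2, 4]` and the stress vector `[1, 0, 1, 0]`.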
This process was tested in the following experimental setting, as reported in Sandalo et al. (2006). There was a fixed collection P of phrases for which different models could be tried. The phrases were given to speakers of both varieties of Portuguese (we further refer to them as readers); it was known, for each reader, which variety of Portuguese she speaks. They read the phrases aloud and the reading was recorded. The researchers then assigned to each reading a binary vector O_r(p) (reader r, phrase p). As in the case of models, each binary vector associated to a phrase p has length equal to the number of syllables of p. The actual test bed consisted of 20 phrases and 4 readers.
It is worth noticing that in Portuguese each word has a primary stress, which does not vary; the variation occurs in the placement of the secondary stresses, which are needed for the utterance of long words. This has implications for the modelling, as will be explained in the next section.
An a priori grouping of the readers into two classes, BP and EP, was known. The main question was whether models could be chosen in such a way that this classification could be recovered, that is, whether one could choose, for each group, a model that reasonably predicted the stress placements uttered by its members. That was done in an ad hoc manner: one model was chosen for each group, and the adequacy of the models was argued intuitively.
We suggest here a quantitative approach for evaluating the models vis-à-vis the readers' grouping. That will be done through an agreement measure: members of the same group should agree strongly with each other and with their group's model, while there should be little agreement between different groups or with the other group's model. The (weighted) kappa coefficient (Cohen, 1960), which has been used to assess the degree of agreement between two ratings on the presence or absence of a characteristic (see, for example, Fleiss, 1971; Poentius, 2000), turns out to be a useful statistic for this purpose. Applying this kappa-based criterion to the data and models of Sandalo et al. (2006) results in a qualified vindication of those models: each of the two models proposed is a good fit for one language variety and a poor fit for the other.
Given the small number of observations available, we reprocess the data through bootstrap techniques to enhance the confidence in the earlier results. Section 2 presents the techniques used and the results thus obtained. Section 3 presents the bootstrapping. Section 4 contains concluding remarks.

Weighted Kappa
Cohen's kappa is an index of agreement of observations of categorical data. We will use it to measure the agreement between the observations of the readers and the vectors assigned by the models, for each input phrase. We present a short description of kappa, specialized for binary data and modified to allow for weights, following Poentius (2000).
We consider a binary vector as an assignment of category 0 or 1 to each of its components; in our application, the components are syllables, 0 means not stressed and 1 means stressed. Given a pair u, v of binary vectors of the same length, the standard definition of kappa is based on the contingency table D of paired observations (u_k, v_k), where k ranges over the components. The weighted version allows for the presence of a nonnegative weight vector w of the same length, so that each component k counts w_k towards the contingency. More precisely, the 2 × 2 matrix D has entries

  d_ij = ∑_{k : u_k = i, v_k = j} w_k,  where i, j ∈ {0, 1}.

In particular, d_00 + d_11 is the total weight of the components where the vectors agree. There is no loss of generality in supposing that ∑_k w_k = 1, and, since it simplifies some expressions, we assume it throughout. In particular, it follows that d_00 + d_01 + d_10 + d_11 = 1.
The marginal distributions of D give the weighted proportions of 0's and 1's in u and v. If, given these, the pairs (u_k, v_k) occurred independently, the expected agreement would be

  P(u, v) = (d_00 + d_01)(d_00 + d_10) + (d_10 + d_11)(d_01 + d_11).

Kappa is defined from the actual proportion of agreement, A(u, v) = d_00 + d_11, centered and normalized relative to P(u, v):

  κ(u, v) = (A(u, v) − P(u, v)) / (1 − P(u, v)).

It is easy to see that κ satisfies −1 ≤ κ(u, v) ≤ 1, the value 1 being attained only if u = v, and −1 being attained when they completely disagree and the components where u = 0 and v = 1 have total weight 1/2 (so that d_01 = d_10 = 1/2). The value 0 reflects independent observations.

We turn back now to the experimental setting, which consists of a collection P of phrases and a set of readers. Each reader r assigns each phrase p a single binary vector O_r(p); each model m assigns to each phrase p a nonempty set of binary vectors O_m(p). The agreement between a reader's vectors and a model's vectors should be greater when the model is appropriate for the reader's variety of Portuguese. Recall that all vectors assigned to a phrase have the same length, the number of syllables.
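The weighted kappa just defined computes directly from the contingency entries; a minimal sketch in Python (the function name is ours):

```python
def weighted_kappa(u, v, w):
    """Weighted Cohen's kappa for binary vectors u, v with
    nonnegative weights w (renormalized to sum to 1)."""
    assert len(u) == len(v) == len(w)
    total = sum(w)
    w = [wk / total for wk in w]          # enforce sum of weights = 1
    # 2x2 weighted contingency matrix d[i][j]
    d = [[0.0, 0.0], [0.0, 0.0]]
    for uk, vk, wk in zip(u, v, w):
        d[uk][vk] += wk
    A = d[0][0] + d[1][1]                 # observed weighted agreement
    # expected agreement under independence, from the marginals
    P = (d[0][0] + d[0][1]) * (d[0][0] + d[1][0]) \
        + (d[1][0] + d[1][1]) * (d[0][1] + d[1][1])
    if P == 1.0:
        return 1.0                        # degenerate: no variation at all
    return (A - P) / (1.0 - P)
```

As a check of the boundary values: identical vectors give κ = 1, while u = (0, 1), v = (1, 0) with equal weights gives d_01 = d_10 = 1/2, hence A = 0, P = 1/2 and κ = −1.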
We wish to recover the grouping of readers from the agreement between the binary vectors generated by readers and those generated by models. So, for each phrase we compute the agreement between readers and models, and summarize these values in order to drive the clustering decision.
For each phrase p, reader r and model m, consider the following multiset:

  K(r, m, p) = {κ(O_r(p), v) | v ∈ O_m(p)},

and define K(r, m) as the multiset union of K(r, m, p) over all p ∈ P. We will consider two different weight types. Weighting 1 is uniform on the syllables, and is used as a ballpark measure. Weighting 2 is driven by a more accurate assessment of readers and models: linguistic reasons imply that for each phrase there are a few precisely identified coordinates on which all assigned vectors will agree¹. It is natural then to assign weight 0 to those coordinates, and give equal weights to the others, keeping a total sum of 1. In what follows, all calculations will be done separately for each weighting.
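Under these conventions, weighting 2 amounts to zeroing the fixed coordinates and distributing the mass uniformly over the rest; a small sketch (the coordinate sets here are illustrative, not the linguistically identified ones):

```python
def weighting2(n_syllables, fixed):
    """Weight vector that assigns 0 to coordinates where all vectors
    are known to agree (`fixed`), uniform weight elsewhere, total 1."""
    free = [k for k in range(n_syllables) if k not in fixed]
    w = [0.0] * n_syllables
    for k in free:
        w[k] = 1.0 / len(free)
    return w
```

For a five-syllable phrase with fixed coordinates {0, 2}, this yields weight 1/3 on each of the three remaining syllables.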
The summary statistics of each K(r, m) are presented in Table 1. Each column is labeled by reader (a, b, c, d), model (b, e), and weighting (1, 2) (we use different symbols, instead of integer indices, in order to improve readability). The a priori grouping of readers was BP = {a, b}, EP = {c, d}.

First, we note that assessing what constitutes a good value for κ is problematic in itself, and different scales have been proposed. For example, Landis and Koch (1977) and Rietveld and van Hout (1993) consider 0.21 ≤ κ ≤ 0.40 as indicating fair agreement, 0.41 ≤ κ ≤ 0.60 as indicating moderate agreement, and 0.61 ≤ κ ≤ 0.80 and 0.81 ≤ κ ≤ 1.00 as indicating substantial and almost perfect agreement, respectively. Krippendorff (1980) discounts agreement when κ < 0.67, allows tentative conclusions when 0.67 ≤ κ < 0.80, and definite conclusions when κ ≥ 0.81. In this work we are interested in the comparison of the values of kappa in each case; thus these scales serve here only as a guide, and other studies would be necessary to determine the most appropriate scale for the weighted kappa.

At a first glance one notices that, for each reader and model, weighting 2 affords kappa a smaller mean and a bigger dispersion (Std Dev and SE mean) than weighting 1. That is to be expected, as the move from weighting 1 to weighting 2 was done by striking out components where agreement was fixed; as expected, this move increased the discriminatory power of κ.
For each weighting, one notes that for readers a, b the mean κ is bigger for model b than for model e; the opposite occurs for readers c, d. That is the first evidence for our main conclusion about the data: model b is a better fit for readers a and b, while model e is a better fit for readers c and d.
That was, indeed, the ad hoc conclusion offered in Sandalo et al. (2006); what we have shown is that their conclusion has better support than intuition.
More support for this clustering is given by an analysis of how the adequacy of each model is evidenced at the individual phrase level. For this purpose, for each reader r we consider the statistic ∆_r = κ_e − κ_b, where κ_e ∈ K(r, e, p), κ_b ∈ K(r, b, p), and p ranges over all phrases. We expect ∆_r to be negative for r = a, b, because the agreement measure κ_e should be less than κ_b for r ∈ BP, and to be positive for r = c, d.
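Since a model assigns a set of vectors to each phrase, each phrase contributes all pairwise differences between its two kappa multisets; a sketch of one phrase's stratum of differences:

```python
from itertools import product

def delta_stratum(kappas_e, kappas_b):
    """One phrase's stratum of differences: kappa_e - kappa_b for
    every pairing of a model-e kappa with a model-b kappa."""
    return [ke - kb for ke, kb in product(kappas_e, kappas_b)]
```

For example, multisets {0.5, 0.7} for model e and {0.4} for model b yield the stratum {0.1, 0.3}; pooling these strata over all phrases gives the sample on which ∆_r is summarized.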
The summary statistics for these differences ∆_r, over all phrases, are presented in Table 2, columns indexed by reader and weighting². These differences are generally bigger when weighting 2 is used, again attesting to its better discriminating ability.

Bootstrap Based Inference
The reason for using bootstrap inference is that hypothesis tests and confidence intervals based on asymptotic theory can be seriously misleading when the sample size is not large. Here we use the bootstrap to evaluate confidence intervals for δ^w_r, the statistical mean of ∆^w_r, for each reader r and weighting w. Note that for each reader the set of kappa values is naturally stratified by input phrase, and each stratum is correlated from inception. As we do not have any hypothesis or knowledge of the theoretical distribution of kappa, we appeal to non-parametric methods.
For this reason we consider non-parametric bootstrap confidence limits and the achieved significance level (ASL) of the test for the comparison of kappas (see, for example, Efron and Tibshirani, 1993).
For each resample, a bootstrap sample is drawn separately for each stratum {κ_e − κ_b | κ_e ∈ K(r, e, p), κ_b ∈ K(r, b, p)}, and those are combined to give the full resample. The sample mean of ∆^w_r is calculated for the resample as a whole. The bootstrap summary statistics based on 10,000 bootstrap replications are presented in Table 3. Empirical percentiles and BCa (bias-corrected and accelerated) confidence limits are shown in Table 4. We can observe in Tables 3 and 4 that none of the intervals contains the value zero; thus the previous conclusions about model-reader fit are confirmed.

As a further check, the procedure was repeated on modified data sets. In Table 5, we present the value of kappa associated to reader a for model m and weighting w (κ^w_m); we observe that κ^w_b decreases and κ^w_e increases with the increase in the number of substituted phrases. Bootstrap summary statistics, empirical percentiles and BCa confidence limits for the statistic mean ∆^w_a = κ^w_e − κ^w_b were obtained. We observe that the values of ∆^w_a tend to be positive when the number of substituted phrases increases, being practically null when the number of phrases of the two languages is the same. All these results are as expected, indicating a very good performance of kappa.
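The stratified resampling scheme described above can be sketched as follows (percentile limits only; the BCa correction is omitted from this sketch):

```python
import random

def stratified_bootstrap_ci(strata, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of the pooled strata,
    resampling each stratum separately, with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        pooled = []
        for s in strata:
            # resample within the stratum, preserving its size
            pooled.extend(rng.choices(s, k=len(s)))
        means.append(sum(pooled) / len(pooled))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling within each phrase's stratum, rather than from the pooled sample, respects the within-phrase correlation of the kappa differences noted above.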

Selection of the weights
One can argue that, for a given reader r, the higher the absolute value of ∆^w_r, the bigger the evidence that one of the two models fits r. We compare now the two weightings on this basis, by studying the statistic D_r = |∆²_r| − |∆¹_r|, where the superscripts again indicate weightings.
We obtained the usual summary statistics, bootstrap summary statistics, bootstrap confidence limits, and minimum and maximum values of the bootstrap replicates of D_r for each reader. These results confirm that, as expected, there is strong evidence that D_r > 0, a further support to the intuition that weighting 2 is a better choice than weighting 1.

Conclusion
We have shown an example where weighted kappa can be a useful agreement measure for model selection. The use of the stratified bootstrap was driven by the small sample size and by the multi-valued character of the models. The analysis also exemplifies that a judicious choice of weighting can lead to better supported conclusions.
This can be further improved: given the quantitative quality measure provided by kappa, one could aim to eliminate the ad hoc component in the choice of models. Such a choice can perhaps be construed as an optimization problem in a suitable "model space".

Special thanks are due to the referee for the careful reading and deep comments.

Table 1 :
Summary statistics for K(r, m)

Table 2 :
Summary statistics for ∆ r

Table 3 :
Bootstrap Summary Statistics

Table 4 :
Empirical percentiles and BCa confidence limits based on 10,000 bootstrap replications