Application Of Statistical Control Charts To Detect Unusual Frequency Of Earthquake In The World

Earthquake in recent years has increased tremendously. This paper outlines an evaluation of Cumulative Sum (𝐶𝑈𝑆𝑈𝑀) and Exponentially Weighted Moving Average (𝐸𝑊𝑀𝐴) charting technique to determine if the frequency of earthquake in the world is unusual. The frequency of earthquake in the world is considered from the period 1973 to 2016. As our data is auto correlated we cannot use the regular control chart like Shewhart control chart to detect unusual earthquake frequency. An approach that has proved useful in dealing with auto correlated data is to directly model time series model such as Autoregressive Integrated Moving Average (𝐴𝑅𝐼𝑀𝐴) , and apply control charts to the residuals. The 𝐸𝑊𝑀𝐴 control chart and the 𝐶𝑈𝑆𝑈𝑀 control chart have detected unusual frequencies of earthquake in the year 2012 and 2013 which are state of statistically out of control.


Introduction
Earthquake is one of the most horrific and devastating natural phenomena in the world. It is the sudden, rapid shaking of the earth, caused by the breaking and shifting of subterranean rock as it releases strain that has accumulated over a long time. Earthquakes may damage household items, building to move off foundations or collapse, damage roads, bridges and dams, cause fires and explosions. They may also trigger landslides, avalanches and tsunamis.
The origin of earthquake is as old as the origin of the Earth, but due to lack of knowledge and scientific instruments it was tremendous challenge for ancient scientists to collect and analysis the earthquake data. Now-a-days the advancement of science and technology makes it easy to collect and analysis the earthquake data. Chen et al. (2010) analysed earthquake data with the help of frequency time analysis. Machado and Lopes (2013) analysed global earthquake data covering the period from 1962 up to 2011. Gupta and Gupta (2016) analysed earthquake data with the help of Big Data technology and visualized with the help of Tableau in India from 1800 to 2014. A multivariate non-parametric hazard model (Ata and Ozel, 2011) was used to analyse 111 destructive earthquakes having magnitude greater than 5 between the years 1903 to 2009 in Turkey. Reyes (2013) studied the spatial distribution of cluster associated to the aftershocks of the megathrust Maule earthquake for the year 2010. Besides the analysis of earthquake data, there are some studies that predict the occurrences of large scale earthquakes. For example, Amei et al. (2012) predicted a total number of 12 large scale earthquakes based on the worldwide earthquake data during 1986 to 2009. Alam (2015) forecasted the earthquake behaviour in Indonesia based on the earthquake data during the year 1980 to 2007. Last et al. (2016) predicted the magnitude of the earthquakes in the area of Israel and its neighbouring countries using the earthquake data from the year 1983 to 2010.
Recently, much research in the literature has focused on whether there is an increase in the frequency of earthquake occurrences. Any increase in the frequency of earthquakes is due to climate change in the world (Yi˘giter, 2012). There is considerable debate on whether climate change really does increase the frequency of natural disasters such as earthquakes and volcano eruptions. In many studies, it is emphasized that there is serious concern about impact of climate change on the frequencies of hazardous events (Lindsey, 2007;Mandeville, 2007 etc.).
The earth has experienced several earthquakes starting from the end of 19th century. Although this amount has been increased a little higher than the previous years, statistical evidence is thus required to determine whether this recent number of earthquake is just a random phenomenon or a genuine shift. Statistical control chart is a technique that can be used to determine the shift in number of earthquake. This technique was used (Justin et al. 2012) to detect climate change in Masvingo city in Zimbabwe.
The standard assumptions that are used to apply control charts are that the data are normally and independently distributed. When these assumptions are satisfied, one may apply conventional control charts and draw conclusions. The conventional control charts do not work well and give misleading results if the data exhibit even low levels of correlation over time (Bisgaard and Kulahci 2005). An approach that has proved useful in dealing with auto correlated data is to directly model time series model such as Autoregressive Integrated Moving Average (ARIMA), and apply control charts to the residuals (Montgomery and Mastrangelo, 1991). The purpose of this study is to determine whether the recent number of earthquake is just a random phenomenon or a genuine shift using statistical control chart techniques. The earthquake data is an auto correlated data. So to detect shift in number of earthquake, we identify an appropriate model to the earthquake data and then apply the residuals of this model to Cumulative Sum Control Chart (Page, 1961) and Exponentially Weighted Moving Average (Roberts, 1959).

ARIMA time series analysis
In statistics and econometrics and in particular time series analysis, an autoregressive integrated moving average ( ) model is a generalization of an autoregressive moving average ( ) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
ARIMA model is denoted by ( , ) where represents autoregressive terms and represents moving average terms. Combine both autoregressive terms and moving average terms, ( , ) with mean can be written as: where is the original series, for every t. The are the parameters of the autoregressive part of the model and are the parameters of the moving average part. Assumed that t is independent of − ; = 1, . . . , . If the process is non-stationary then the model will be written as ( , , ).

Modelling an
( , ) process requires stationarity. A stationary process has a mean and variance that do not change over time and the process does not have trends. Stationarity test can be performed in many ways. In this study we perform the graphical test and the Augmented Dickey-Fuller (ADF) test.
In graphical procedure, observations are plotted against the time. If there is any trend of increasing or decreasing, then it will be interpreted as that the process is non-stationary and then an appropriate procedure has to be taken to achieve stationarity. However, if there is no trend, then the process will be interpreted as stationary.
The most widely used test to check stationarity is the Dickey-Fuller (DF) test. Assuming an (1) model as below: we test the null hypothesis that * = 0, meaning that the time series is not stationary. In addition to the model (2) above, a drift µ and additional lags of the dependent variable can be added: The augmented Dickey-Fuller test evaluates the null hypothesis that * = 0. The model (3) will be non-stationary if * = 0. The model with a time trend can be considered as: Then we test the hypothesis that β = 0 and * = 0. Again, the model will be nonstationary if * = 0.
After checking stationarity, model identification is required. In this paper we use two procedures for model identification. One is graphical procedure (ACF and PACF plot). And another is Akaike Information Criterion (AIC).
An ARIMA model can be chosen upon inspection of the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF). This approach relies on the following facts: (i) the ACF of a stationary AR process of order p goes to zero at an exponential rate, while the PACF becomes zero after lag p. (ii) for an MA process of order q the theoretical ACF and PACF exhibit the reverse behaviour (the ACF truncates after lag q and the PACF goes to zero relatively quickly). It is usually clear to detect the order of an AR or MA model. However, with processes that include both an AR and MA part the lag at which they are truncated may be blurred because both the ACF and PACF will decay to zero. One way to proceed is to fit first an AR or MA model (the one that seems more clear in the ACF and PACF) of low order. Then, if there is some further structure it will show up in the residuals, so the ACF and PACF of the residuals is checked to determine if additional AR or MA terms are necessary.
The Akaike information criterion was developed by Hirotugu Akaike (Akaike, 1974), originally under the name "an information criterion". The Akaike Information Criterion (AIC) is a way of selecting a model from a set of models. It is based on information theory, but a heuristic way to think about it is as a criterion that seeks a model that has a good fit to the truth but few parameters. It is defined as: Where likelihood is the probability of the data given a model and K is the number of free parameters in the model. AIC scores are often shown as ∆AIC scores, or difference between the best model (smallest AIC) and each model (so the best model has a ∆AIC of zero).
For further use of the model, first we have to check whether the fitted model is appropriate. Whether the parameters of the model are significant or not is need to be checked as well as if the residuals are white noise or not. Whether the parameters are significant, can be examined by p-values and whether the residuals are white noise, can be checked by Ljung-Box statistic and the ACF plot of residuals. If the plot shows that the residuals stay within the limits, then it is said that residuals are white noise.
The Ljung-Box test is a type of statistical test of whether any of a group of autocorrelations of a time series is different from zero. Instead of testing randomness at each distinct lag, it tests the "overall" randomness based on a number of lags. Here, the null hypothesis is of data that are independently distributed and the alternative hypothesis is of data that are not independently distributed; they exhibit serial and the test statistic is: where, n is the sample size, ̂k is the sample autocorrelation at lag k, and h is the number of lags being tested. Under the null hypothesis that the series is white noise (data are independently distributed), Q has a limiting χ 2 distribution with p degrees of freedom.

Cumulative Sum(CUSUM) chart
The CUSUM chart directly incorporates all the information in the sequence of sample values by plotting the cumulative sums of observations of the deviations of the sample values from a target value. Let be the ℎ observation, which has a normal distribution with 0 and standard deviation σ. Now if 0 is the target, the CUSUM control chart is formed by plotting the quantity = ∑ ( − 0 ) =1 against the sample number . Ci is called the cumulative sum up to and including the ℎ sample. The tabular CUSUM works by accumulating derivations from 0 that are above target with one statistic C + and accumulating derivations from 0 that are below target with another statistic C − . The statistics C + and C − are called one-sided upper and lower CUSUMs, respectively. They are where the starting values are 0 + = 0 − = 0. K is usually called the reference value and it is often chosen about halfway between the target 0 and the out-of-control value of the mean 1 , and it can be calculated by 0 + and 0 − both are greater than K and if either 0 + or 0 − exceed the decision interval H, the process is considered to be out of control. We define = ℎ and = , where σ is the standard deviation of the sample variable used in forming the CUSUM. Using h = 4 or h = 5 and = 1 2 will generally provide a CUSUM that has good ARL properties against a shift of about 1σ in the process mean.

Exponentially Weighted Moving Average (EWMA) Control Chart
The EWMA control chart is approximately the same as CUSUM control chart and in some cases it is easier to set up and operate. This chart considers all previous points using a weighting factor that makes the outcome more influenced by recent points.
The EWMA is computed sequentially as a linear interpolation between the present observation and −1 , the previous EWMA. So the exponentially weighted moving average is defined as Where, 0 < < 1 is a weighted constant and the starting value is the process target, i.e. 0 = 0 . Sometimes the average of preliminary data is used as the starting value of EWMA, so that 0 = ̅.
The EWMA is sometimes called a geometric moving average (GMA). The EWMA is used widely in time series modelling and forecasting. Since the EWMA can be viewed as a weighted average of all past and current observations, it is very insensitive to the normality assumption. It is therefore an ideal control chart to use with individual observations.
If the observations are independent random variables with variance 2 , then the variance of is By plotting versus the sample number (or time), the EWMA control chart would be constructed. The UCL, CL and LCL for the EWMA control chart are as follows: where factor L is the width of the control limits. L = 3 (the usual three-sigma limits) works reasonably well.

Data description
United States Geological Survey (USGS) is one of the providers of earthquake data from all over the world. The data provided by USGS for year 1900 to 2016 contains too many variables. Our variables of interest are year of occurrence and magnitude. Figure 1(a) displays the total number of earthquake for different magnitudes and Figure 1(b) displays the total number of earthquake for different years. The time series plot for the earthquake data (Figure 1 Though from the graphical method it is quite clear that there are two shifts in the mean, however we can check this by performing global mean test (ANOVA). The test of hypothesis is: 0 : 1 = 2 = 3 vs : at least two are not equal. Where 1 , 2 , 3 are the mean of the number of earthquake for year 1900-1949, 1950-1972, 1973-2016 respectively. From Table 1 we can say that, as the p-value is less than the significant level α = 0.05, we reject the null hypothesis. Thus we conclude that there is mean difference in year 1900-1949, 1950-1972 and 1973-2016 for total. As the number of occurrences is higher during the period 1973 to 2016, in this study we analyse this period.

Analysis of the data for 1973 to 2016
To analyse the time series earthquake data first we check for stationarity of the data. Figure  2 shows the time against frequency of the earthquake data. From the figure we find that sometimes the frequency of earthquake is increasing and sometimes decreasing and the mean and variance are not constant. And from ADF test (Table 2), we observe that for zero difference the p-value is insignificant. So we cannot reject the null hypothesis. That means, both techniques provide the same results, that the earthquake data is nonstationary.  For difference 1, the data also remain nonstationary. However, the earthquake data become stationary at difference 2. So, the integrated part of the model is 2. As the integrated part i.e. d of the models for the earthquake data is known, to identify the value of and , we use two techniques, graphical procedure and AIC value. Figure 3 shows the ACF and PACF of second difference of the data. These figures suggested an (2) and (0) model is operating. The minimum value of AIC for the data is also for (2,2,0) ( Table 3). So for the earthquake data, we have identified an (2,2,0) model.  After model identification, we have to check the model adequacy i.e. if the residuals are white noise or not. The ACF plot (Figure 4) of the residuals from the (2,2,0) model shows all correlations within the threshold limits indicating that the residuals are behaving like white noise. A Ljung-Box test (Table 4) returns a large p-value, also suggesting the residuals are white noise i.e. residuals are independently distributed. And the selected models can be fitted for further procedure.  . Here, we define for UCL, H = ℎ and for LCL, H = −ℎ . Figure 5 shows that most of the points lie on the central line 0, however there are two points (year 2012 and 2013) which are beyond the LCL line. So there is abnormality in the frequency of earthquake data and we may conclude that it is out of statistical control. For control chart, we use the value of λ equals to 0.4 for our earthquake data. Figure 6 shows that, the starting residual is zero. Then the residuals gradually increase and decrease over time. And for the year 2012, we find that the state of the process is out of control.

Conclusion
Now-a-days with the help of science and technology, it is possible to collect and analyse earthquake data. Earthquake is a natural disaster and increase of the frequency of earthquake indicates the change of world climate which is a big issue now-a-days. In this paper we are trying to identify whether the number of earthquake in the world is under state of statistical control or not.
In this article it is observed that there are two shifts in the mean of the number of earthquakes. For the year 1900-1949, the mean was very small. Then, for the year 1950-1972 the mean had increased slightly. However, we witnessed a tremendous shift in mean at 1973. There can be several reasons for this significant difference. It is possible that at the beginning of the 1900s, the measurement and detecting equipments of earthquakes were so ancient that those could not detect the earthquakes correctly. Back then science and technology was not as advanced as now. But at the end of the century, the situation has changed. Better and advanced tools have invented that can detect the magnitude correctly. Yet, may be in the 1900s, earthquake did not appear as frequent as now.
In recent year, world has experienced several earthquakes. At the beginning of the study we had hypothesised that may be there are some abnormalities for these consequences. Though we have found most of the points are in the control zone, but some of them are in the state of out of control. So we may conclude that the predominant number of earthquake is not a random phenomenon, moreover it is a genuine shift that we should be concerned about and perform necessary research to find the real reasons behind those earthquakes.