Dynamic Co-movement Detection of High Frequency Financial Data

In this study, we propose a pattern matching procedure to identify similar price movements of two stocks. First, a longest common subsequence search is used to sieve out the time periods in which the two stocks have the same integrated volatility levels and price rise/drop trends. Next, we transform the price data in the matched time periods to Bollinger Percent b (%b) data. The low-frequency power spectra of the transformed data are used to extract trends. Pearson's chi-square test is used to assess the similarity of the price movement patterns in the matching periods. Simulation results show that the proposed procedure can effectively detect the co-movement periods of two price sequences. Finally, we apply the proposed procedure to empirical high frequency transaction data from the NYSE.


Introduction
In security markets, stock price movements are closely linked to market information. For example, news of the subprime mortgage crisis triggered a global financial crisis through 2007 and 2008, and prices dropped in virtually every stock market in the world. After the Federal Reserve took several steps to address the crisis, the stock markets gradually stabilized. For intraday trading, the finance literature highlights that the arrival of information at intradaily frequencies also has a strong impact on both prices and volatility and affects security market activity. Traders in securities markets are often divided into two groups: informed traders and liquidity traders. Informed traders carry private information. Securities prices become more informative when there are more informed traders in the market, and liquidity traders prefer to trade in an informative market. Traders reacting to the same information on stocks from the same sector produce similar price movements, although their reaction times might differ. The study of the arrival of informed traders, or of asymmetric information, is an important subject in the microstructure analysis of financial markets. Thanks to modern computer technology, ultra high frequency financial data such as transaction-by-transaction data have become available and provide a rich source for studying intraday market microstructure dynamics. In this paper, we use high frequency transaction data to investigate the similar price movement patterns of two stocks around information arrivals.
Pattern matching is an important subject in the prediction of future movements, rule discovery and computer-aided diagnosis. In the literature, the longest common subsequence (LCS) method is widely used in bioinformatics for biological sequence alignment. The LCS problem (Hirschberg, 1975; Agrawal, Faloutsos and Swami, 1993; Bergroth, Hakonen and Raita, 2000; Dacorogna, Gençay, Müller, Olsen and Pictet, 2001) is to find the longest subsequence common to all sequences in a set of sequences (often just two). It is a classical problem in computer science and has applications in many fields. For example, biologists can judge the similarity of two DNA sequences by the length of their LCS, and the Unix program "diff" compares two versions of the same file by finding an LCS of the lines in the two files.
In this study, we propose a four-stage procedure to search for similar patterns in intraday high frequency transaction data. First, we apply the LCS method to sieve out the time intervals in which the two stocks have the same integrated volatility levels as well as the same price rise/drop trends. Second, we transform the price data sieved out in the first step to Bollinger Percent b data. Third, we use the power spectrum to extract the low-frequency components. Fourth, we assess the similarity of the price movement patterns in the matching periods by Pearson's chi-square test. The proposed approach has several advantages: the LCS algorithm makes the search for periods of similar price patterns efficient, the power spectrum is easily obtained with standard software packages, and Pearson's chi-square test provides a powerful and objective test.
The remainder of the paper is organized as follows. In Section 2, we introduce the stock price models. In Section 3, the LCS method is introduced. In Section 4, the proposed method is introduced and simulation and empirical studies are performed. Conclusion is given in Section 5. Tables and figures are in the Appendix.

Model Assumptions
In real markets, high frequency transactions arrive randomly. An equally spaced sample can be obtained via a synchronization scheme such as the previous-tick interpolation scheme (Dacorogna, Gençay, Müller, Olsen and Pictet, 2001). In this study we assume the stock prices are available at equidistant times $t_i = i\Delta$, $i = 1, \cdots, n$, where $\Delta$ denotes the length of the sampling interval. Let $S^A_i$ and $S^B_i$ denote the stock prices of Companies A and B at time $t_i$, respectively. Assume the log return of stock A has a conditional normal distribution (2.1), where $\mu_A$ is the annualized average return and $\sigma_A^2$ is the annualized volatility. The New York Stock Exchange trades for 6.5 hours per day, from 09:30 to 16:00. To simulate the real market, we generate 5201 equally spaced stock prices in 6.5 hours, with sampling length $\Delta = \frac{1}{250} \times \frac{1}{5200} = 7.69 \times 10^{-7}$ (year). We divide the 6.5 hours into 26 non-overlapping 15-minute time intervals denoted by $b_1, b_2, \cdots, b_{26}$, with 200 returns in each interval. In the first 10 intervals, $b_1, b_2, \cdots, b_{10}$, A receives information 15 minutes ahead of B. In the next 6 intervals, $b_{11}, b_{12}, \cdots, b_{16}$, the two companies receive information synchronously. In the last 10 intervals, $b_{17}, b_{18}, \cdots, b_{26}$, A receives information 15 minutes after B. That is, we consider the postulated models (2.2) for Companies A and B, where $\alpha$ and $\beta$ are constants. Since time-varying heteroscedastic features are frequently observed in financial time series, we assume the noise term $\varepsilon_i$ comes from a GARCH(1,1) model (2.3), where $\alpha_0$, $\alpha_1$ and $\beta_1$ are positive constants and $\alpha_1 + \beta_1 < 1$. Figure 1 shows the time plots of the generated stock prices of Companies A and B. We are interested in detecting the dynamic co-movements of stocks A and B.
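As an illustration of the noise term only, the following minimal sketch simulates a GARCH(1,1) sequence. It assumes the standard recursion $\sigma_i^2 = \alpha_0 + \alpha_1 \varepsilon_{i-1}^2 + \beta_1 \sigma_{i-1}^2$ with standard normal innovations; both the recursion and the innovation distribution are assumptions made here, and the function name `simulate_garch11` is hypothetical. The price models for A and B themselves are not reproduced.

```python
import math
import random


def simulate_garch11(n, alpha0, alpha1, beta1, seed=0):
    """Simulate n GARCH(1,1) noise terms eps_i = sigma_i * z_i.

    Assumed (standard) recursion, not quoted from the paper:
        sigma_i^2 = alpha0 + alpha1 * eps_{i-1}^2 + beta1 * sigma_{i-1}^2
    with z_i i.i.d. standard normal (also an assumption).
    """
    if not (alpha0 > 0 and alpha1 > 0 and beta1 > 0 and alpha1 + beta1 < 1):
        raise ValueError("need positive parameters with alpha1 + beta1 < 1")
    rng = random.Random(seed)
    # Initialise at the unconditional variance alpha0 / (1 - alpha1 - beta1).
    var = alpha0 / (1.0 - alpha1 - beta1)
    eps_prev_sq = var
    eps, variances = [], []
    for _ in range(n):
        var = alpha0 + alpha1 * eps_prev_sq + beta1 * var
        e = math.sqrt(var) * rng.gauss(0.0, 1.0)
        eps.append(e)
        variances.append(var)
        eps_prev_sq = e * e
    return eps, variances


# Sampling interval from the text: 250 trading days, 5200 returns per day.
DELTA = 1.0 / 250 / 5200

# One simulated trading day of noise with one of the parameter settings
# mentioned in the text (alpha0 = 1.2e-8, alpha1 = 0.6, beta1 = 0.3).
eps, variances = simulate_garch11(5200, alpha0=1.2e-8, alpha1=0.6, beta1=0.3)
```

Under these assumptions the unconditional variance is $\alpha_0 / (1 - \alpha_1 - \beta_1)$, which grows when any of $\alpha_0$, $\alpha_1$ or $\beta_1$ grows.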
Since the stock price process is generally non-stationary, regression models might result in spurious regression unless the processes are cointegrated. Thus regression analysis is not suitable for this study. Moreover, due to the nonlinear relationship between A and B, the linear correlation coefficient is not useful in this case either. Define the integrated volatility of interval $b_i$ as $\sum_{j=1}^{200} r_{i,j}^2$, where $\{r_{i,j}\}_{j=1}^{200}$ are the log returns in the time interval $b_i$. The correlation between the returns of stocks A and B is 0.240 and the correlation between their integrated volatilities is 0.062; neither suggests a significant linear relationship between Companies A and B. In the following section, we introduce the LCS method to search for the time intervals in which the two stocks have similar integrated volatility and price rise/drop trends.
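The integrated volatility of each interval can be computed directly from the price sequence. The sketch below assumes a list of equally spaced prices whose number of returns is a multiple of the block size (200 in the simulation setting); the function names are hypothetical.

```python
import math


def log_returns(prices):
    """Log returns log(S_i / S_{i-1}) of an equally spaced price sequence."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def integrated_volatility(prices, block_size=200):
    """Sum of squared log returns in each non-overlapping block of returns."""
    r = log_returns(prices)
    if len(r) == 0 or len(r) % block_size != 0:
        raise ValueError("number of returns must be a positive multiple of block_size")
    return [
        sum(x * x for x in r[start:start + block_size])
        for start in range(0, len(r), block_size)
    ]
```

With 5201 prices and `block_size=200`, this returns 26 values, one per 15-minute interval.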

Longest Common Subsequence
A string $u = u_1 u_2 \cdots u_k$ is called a subsequence of a string $v = v_1 v_2 \cdots v_n$ if there is a mapping $F : \{1, 2, \cdots, k\} \to \{1, 2, \cdots, n\}$, $k \le n$, such that $F(i) = l$ only if $u_i = v_l$ and $F$ is a strictly increasing function, that is, if $F(i) = p$, $F(j) = q$ and $i < j$, then $p < q$. For example, "coin" is a subsequence of "correlation". In addition, a string $u$ is called a common subsequence of two strings $v$ and $w$ if $u$ is a subsequence of both $v$ and $w$. Formally, we define a common subsequence of strings $v = v_1 v_2 \cdots v_n$ and $w = w_1 w_2 \cdots w_m$ as a sequence of positions in $v$, $1 \le i_1 < i_2 < \cdots < i_k \le n$, and a sequence of positions in $w$, $1 \le j_1 < j_2 < \cdots < j_k \le m$, such that the symbols at the corresponding positions in $v$ and $w$ coincide: $v_{i_t} = w_{j_t}$, $t = 1, 2, \cdots, k$.
For example, "eat" is a common subsequence of "correlation" and "relationship". Finally, we define a string $u$ to be a longest common subsequence of strings $v$ and $w$ if $u$ is a common subsequence of $v$ and $w$ of maximal length. For example, "relation" is the longest common subsequence of "correlation" and "relationship".
The longest common subsequence problem can be solved efficiently by dynamic programming. To do this, we introduce the following recursive solution. Define $s_{i,j}$ to be the length of an LCS between $v_1 \cdots v_i$, the $i$-prefix of $v$, and $w_1 \cdots w_j$, the $j$-prefix of $w$. Clearly, $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$. One can see that $s_{i,j}$ satisfies the following recurrence:
$$s_{i,j} = \max \begin{cases} s_{i-1,j}, \\ s_{i,j-1}, \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j. \end{cases}$$
The first term corresponds to the case when $v_i$ is not present in the LCS of the $i$-prefix of $v$ and the $j$-prefix of $w$; the second term corresponds to the case when $w_j$ is not present in this LCS; and the third term corresponds to the case when both $v_i$ and $w_j$ are present in the LCS. Note that the matching positions of the LCS, $\{(i_1, j_1), (i_2, j_2), \cdots, (i_k, j_k)\}$, may not be unique. For example, if the string $v$ is abcdbb and the string $w$ is cbacbaaba, then both bcbb and acbb are LCSs, with corresponding matching positions {(2,2),(3,4),(5,5),(6,8)} and {(1,3),(3,4),(5,5),(6,8)}, respectively. Figure 2 illustrates the matching positions of the LCSs of strings $v$ and $w$.
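The dynamic programming recurrence described above can be sketched as follows. This is a minimal illustration with hypothetical function names; it returns one set of matching positions (1-indexed, as in the text), and because matching positions need not be unique, the pairs it returns depend on the tie-breaking rule used in the backtracking step.

```python
def lcs_table(v, w):
    """Fill the table s[i][j] = LCS length of the i-prefix of v and j-prefix of w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
    return s


def lcs_positions(v, w):
    """Return one list of 1-indexed matching positions (i_t, j_t) of an LCS."""
    s = lcs_table(v, w)
    i, j = len(v), len(w)
    pairs = []
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            pairs.append((i, j))      # both symbols are part of the LCS
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1                    # v_i is not used
        else:
            j -= 1                    # w_j is not used
    pairs.reverse()
    return pairs
```

The functions accept any indexable sequences, so they apply equally to strings and to lists of category labels.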
In this study, the LCS method is used to find the periods in which two stocks react similarly to intradaily information. As mentioned in the previous section, the normal trading hours (9:30am-4:00pm) are divided into 26 non-overlapping periods, each of length 15 minutes. Let $v_i$ denote the integrated volatility in the $i$th time period $b_i$, that is, $v_i = \sum_{j \in I_i} r_{i,j}^2$, where $r_{i,j}$ is the observed $j$th log return in the time period $b_i$. The price movements within a time period are classified into 8 categories by their integrated volatility levels and the price trends in the period. Let $\{s^A_i\}_{i=1}^{26}$ and $\{s^B_j\}_{j=1}^{26}$ denote the categorized sequences of the stock prices of Companies A and B, respectively. We apply the LCS method to find the matching time intervals of $\{s^A_i\}_{i=1}^{26}$ and $\{s^B_j\}_{j=1}^{26}$ with the same integrated volatility levels and price trends. The true matching position sets $C^{(1)}_{AB}$ (A leads B), $C^{(2)}_{AB}$ (A and B contemporaneous), $C^{(3)}_{AB}$ (A lags B) and their union $C_{AB}$ have lengths 9, 6, 9 and 24, respectively. In the following, we perform a simulation study to investigate the LCS method for pattern matching under Model (2.2). The correct rate of the LCS method measures how well it captures the true matching positions; it is computed overall and separately for the periods when A leads B, when A and B are contemporaneous, and when A lags B. Table 1 summarizes the simulation results of the correct rates for different $\alpha_0$ based on 2000 replications, where the parameters $(\alpha, \beta, \alpha_1, \beta_1) = (0.6, 5, 0.6, 0.3)$ are kept fixed. Note that the unconditional variance of $\varepsilon_t$ (defined by (2.3)) increases as any of the parameters $\alpha_0$, $\alpha_1$ or $\beta_1$ increases. In Table 1, the second column gives the ratio of $\sigma_0$ to $\sigma_A \sqrt{\Delta}$ (the conditional standard deviation of the log price of stock A, cf. (2.1)), which represents the noise size. The correct rate increases as $\sigma_0 / (\sigma_A \sqrt{\Delta})$ decreases.
Moreover, for fixed $\alpha_0 \le 1.2 \times 10^{-8}$, the correct rates of the LCS method are about the same in the fourth to sixth columns of Table 1. This suggests that the performance of the LCS method is not affected by the order in which the information is received. Nevertheless, the correct rates of the LCS method (cf. column seven in Table 1) can still be improved. In the next section, a new pattern matching scheme is proposed to improve the correct rates of the matching positions found by the LCS method.

Spectral Analysis of the Bollinger Percents
Real market stock price processes are well recognized as non-stationary. One can apply Bollinger Bands to convert a price sequence into a stationary %b sequence; see, for example, Wu, Salzberg and Zhang (2004). Bollinger Bands were created by John Bollinger in the early 1980s and are widely used as indicators of whether a price is relatively high or low. The Bollinger Percent (%b) is obtained from the Bollinger Bands and measures how high or low the price is relative to previous trades. The bands are curves drawn above and below a simple moving average of period p (the typical value of p is 20) at a distance measured in standard deviations. The three curves are defined as follows:
Middle Bollinger Band = the p-period simple moving average,
Upper Bollinger Band = Middle Bollinger Band + 2 × p-period standard deviation,
Lower Bollinger Band = Middle Bollinger Band − 2 × p-period standard deviation.
The formula for %b is
$$\%b = \frac{\text{Last Price} - \text{Lower Bollinger Band}}{\text{Upper Bollinger Band} - \text{Lower Bollinger Band}}.$$
Figure 3 illustrates the Bollinger Bands and %b of a stock price sequence. Next, we compute the power spectrum of the %b sequence. The power spectrum of a stationary sequence decomposes the sequence into a sum of fluctuating components from low to high frequencies. The low-frequency power spectra represent the longer-term trend of the original sequence, and the high-frequency power spectra characterize the shorter-term oscillation and the noise. Therefore, we use the low-frequency power spectra of the %b sequence to acquire its trend. In the following, we excerpt the definition of the power spectrum described in Jones and Pevzner (2004).
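The %b transformation can be sketched as below. The sketch assumes the population standard deviation over each window (the text does not say whether the population or sample version is used) and returns 0.5 when a window is perfectly flat, which is a convention chosen here to avoid division by zero. The function name and the `width` parameter are hypothetical.

```python
import statistics


def bollinger_percent_b(prices, period=20, width=2.0):
    """Return the %b sequence, one value per complete window of `period` prices."""
    if period < 1 or len(prices) < period:
        raise ValueError("need at least `period` prices")
    out = []
    for t in range(period - 1, len(prices)):
        window = prices[t - period + 1:t + 1]
        middle = statistics.mean(window)        # middle band: simple moving average
        sd = statistics.pstdev(window)          # assumption: population std. dev.
        upper = middle + width * sd
        lower = middle - width * sd
        if upper == lower:
            out.append(0.5)                     # flat window: convention chosen here
        else:
            out.append((prices[t] - lower) / (upper - lower))
    return out
```

A value above 1 means the last price is above the upper band, and a value below 0 means it is below the lower band.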
Suppose that the complex exponential functions are defined on a finite number of $n$ points, that is,
$$e^{i 2\pi k t / n}, \quad (4.1)$$
for $t = 1, 2, \cdots, n$. For $-\lfloor n/2 \rfloor + 1 \le k \le \lfloor n/2 \rfloor$, where $\lfloor x \rfloor$ is the floor function of $x$, the system contains exactly $n$ functions. The system (4.1) is a collection of orthogonal functions. Let $Z_1, Z_2, \cdots, Z_n$ be a sequence of $n$ numbers. This sequence can be regarded as the coordinates of a point in an $n$-dimensional space, and it can be written as a linear combination of the elements of a basis. In an $n$-dimensional space, any set of $n$ orthogonal vectors forms a basis; hence the given sequence $\{Z_t\}_{t=1}^{n}$ can be written as a linear combination of the orthogonal complex exponential functions given in (4.1). That is,
$$Z_t = \sum_{k=-\lfloor n/2 \rfloor + 1}^{\lfloor n/2 \rfloor} c_k e^{i 2\pi k t / n}, \quad (4.2)$$
where
$$c_k = \frac{1}{n} \sum_{t=1}^{n} Z_t e^{-i 2\pi k t / n}. \quad (4.3)$$
Equation (4.2) is called the Fourier series of the sequence $Z_t$ and the $c_k$ are called the Fourier coefficients. The coefficient $c_0 = \sum_{t=1}^{n} Z_t / n$ is the average value of the sequence. In the following, we denote $2\pi k/n$ by $\omega_k$, $k = 0, 1, \cdots, \lfloor n/2 \rfloor$. These frequencies are called the Fourier frequencies.
For a given periodic sequence $Z_t$ of period $n$, the energy associated with the sequence in one period is defined as $\sum_{t=1}^{n} Z_t^2$. Multiplying both sides of (4.2) by $Z_t$, summing from $t = 1$ to $t = n$, and using the relation (4.3), we have
$$\sum_{t=1}^{n} Z_t^2 = n \sum_{k=-\lfloor n/2 \rfloor + 1}^{\lfloor n/2 \rfloor} |c_k|^2, \quad (4.4)$$
where $|c_k|^2 = c_k \bar{c}_k$. Equation (4.4) is known as Parseval's relation for Fourier series. By (4.4), the total energy of a periodic sequence over the whole time horizon $t = 0, \pm 1, \pm 2, \cdots$ is infinite. Hence, we only consider its energy per unit time, which is called the power of the sequence. This is given by
$$\text{Power} = \sum_{k=-\lfloor n/2 \rfloor + 1}^{\lfloor n/2 \rfloor} |c_k|^2.$$
The $j$th harmonic components include the terms for both $k = j$ and $k = -j$, as they correspond to the same frequency $j(2\pi/n)$. Therefore, we can interpret the quantity
$$f_k = \begin{cases} c_0^2, & k = 0, \\ |c_k|^2 + |c_{-k}|^2 = 2|c_k|^2, & 0 < k < n/2, \\ |c_{n/2}|^2, & k = n/2 \text{ ($n$ even)}, \end{cases} \quad (4.5)$$
as the contribution to the total power from the terms in the Fourier series of $Z_t$ at the $k$th frequency $\omega_k = 2\pi k/n$. The quantity $f_k$ is called the power spectrum and describes how the total power is distributed over the various frequency components of the sequence $\{Z_t\}_{t=1}^{n}$. By using the first $m$ low-frequency power spectra of a stationary sequence, one can obtain a smooth line describing the dynamic trend of the sequence. For example, Figure 4(a) is the time plot of a %b sequence and Figure 4(b) is the corresponding trend estimate based on the first 10 low-frequency power spectra. The smooth line in Figure 4(b) mimics the trend of the %b sequence.
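The power spectrum and a low-frequency trend estimate can be sketched with a direct discrete Fourier transform. The sketch assumes the coefficients $c_k = \frac{1}{n}\sum_{t=1}^{n} Z_t e^{-i 2\pi k t/n}$ and the convention $f_k = 2|c_k|^2$ for $0 < k < n/2$, with $f_0 = c_0^2$ and, for even $n$, $f_{n/2} = |c_{n/2}|^2$. It also assumes that the trend is the mean plus the first $m$ harmonics; the text does not state whether the mean is included. Function names are hypothetical, and the direct transform costs $O(n^2)$, which is adequate for blocks of a few hundred points.

```python
import cmath
import math


def fourier_coefficients(z):
    """c_k = (1/n) * sum_{t=1..n} z_t * exp(-i*2*pi*k*t/n) for k = 0..floor(n/2)."""
    n = len(z)
    coeffs = []
    for k in range(n // 2 + 1):
        w = 2.0 * math.pi * k / n
        c = sum(z[t - 1] * cmath.exp(-1j * w * t) for t in range(1, n + 1)) / n
        coeffs.append(c)
    return coeffs


def power_spectrum(z):
    """f_k for k = 0..floor(n/2); frequencies with a distinct -k partner count twice."""
    n = len(z)
    spectrum = []
    for k, ck in enumerate(fourier_coefficients(z)):
        single = (k == 0) or (2 * k == n)   # no separate -k term at these k
        spectrum.append(abs(ck) ** 2 if single else 2.0 * abs(ck) ** 2)
    return spectrum


def low_frequency_trend(z, m):
    """Rebuild the sequence from the mean and the first m harmonics only."""
    n = len(z)
    c = fourier_coefficients(z)
    top = min(m, len(c) - 1)
    trend = []
    for t in range(1, n + 1):
        value = c[0].real
        for k in range(1, top + 1):
            term = (c[k] * cmath.exp(1j * 2.0 * math.pi * k * t / n)).real
            value += term if 2 * k == n else 2.0 * term
        trend.append(value)
    return trend
```

With these conventions the values returned by `power_spectrum` sum to the mean of $Z_t^2$, which is Parseval's relation expressed per unit time.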
Next we employ Pearson's chi-square test to assess the similarity of the trends of the two sources. The procedure is explained below. Reweight the $m$ lowest-frequency power spectra $f_k$ (see (4.5)) by
$$\tilde{f}_k = \frac{f_k}{\sum_{i=1}^{m} f_i}, \quad k = 1, 2, \cdots, m.$$
Since $\sum_{i=1}^{m} \tilde{f}_i = 1$, the reweighted power spectrum $\{\tilde{f}_k\}_{k=1}^{m}$ can be viewed as a probability mass function.
We apply Pearson's chi-square test to test whether the spectrum distributions of two sequences are the same. We regard $n\tilde{f}_k$ as the number of observations in class $k$ (corresponding to the $k$th lowest frequency), for $k = 1, 2, \cdots, m$. In practice, when applying Pearson's chi-square test, we need:
1. None of the expected numbers of observations is less than 1;
2. No more than 20% of the classes have expected numbers of observations smaller than 5.
If some of $\{n\tilde{f}_k\}_{k=1}^{m}$ do not satisfy the above rules, then we merge the $m$ classes into $m'$ ($m' \le m$) classes to satisfy the rules and denote the numbers in these new classes by $\{n f^*_k\}_{k=1}^{m'}$, where $f^*_k$ is the adjusted spectrum after merging. Let $f^*_{A,i}$ and $f^*_{B,i}$ denote the adjusted spectra after merging of Stock A and Stock B, respectively. Consider the following hypothesis testing problem:
$$H_0: f^*_{A,i} = f^*_{B,i} \text{ for all } i = 1, 2, \cdots, m'$$
versus the alternative
$$H_1: f^*_{A,i} \ne f^*_{B,i} \text{ for some } i.$$

Under $H_0$, Pearson's chi-square test statistic has approximately a chi-square distribution with $m' - 1$ degrees of freedom.
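The reweighting and testing steps can be sketched as follows. The exact form of the test statistic is not reproduced in this text, so the sketch uses the usual two-sample Pearson homogeneity statistic with pooled expected counts; this choice is an assumption. The class-merging step is omitted, and all function names are hypothetical.

```python
def reweight(low_spectra):
    """Normalise the m lowest-frequency power spectra so that they sum to one."""
    total = sum(low_spectra)
    if total <= 0:
        raise ValueError("spectra must have a positive sum")
    return [x / total for x in low_spectra]


def chi_square_homogeneity(p_a, p_b, n):
    """Two-sample Pearson statistic, treating n * p as class counts.

    Assumed form: sum over both samples and all classes of
    (observed - expected)^2 / expected, where expected is the pooled count.
    Returns (statistic, degrees_of_freedom).
    """
    if len(p_a) != len(p_b) or len(p_a) < 2:
        raise ValueError("need two spectra with the same number (>= 2) of classes")
    stat = 0.0
    for a, b in zip(p_a, p_b):
        obs_a, obs_b = n * a, n * b
        expected = (obs_a + obs_b) / 2.0     # pooled expected count per sample
        if expected > 0:
            stat += (obs_a - expected) ** 2 / expected
            stat += (obs_b - expected) ** 2 / expected
    return stat, len(p_a) - 1
```

The returned statistic is then compared with the upper quantile of the chi-square distribution with the returned degrees of freedom; the null hypothesis of equal spectrum distributions is rejected when the statistic exceeds that quantile.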

Simulation and Empirical Studies
We perform a simulation study to investigate the performance of the proposed method in detecting the co-movement periods of the two price sequences of Stocks A and B. Simulation results based on 2000 replications are presented in Tables 2-6 for different parameter settings. In the tables, "TA" signifies the ratio of "True and Accept", which means that the matching positions found by the LCS method are correct and the chi-square test also accepts the null hypothesis. "FR" stands for the ratio of "False and Reject", which means that the matching positions found by the LCS method are incorrect and the chi-square test rejects $H_0$. Similarly, "TR" and "FA" are short for "True and Reject" and "False and Accept", respectively. Hence, "TA+FR" is the correct rate of the proposed method in choosing the co-movement periods of two price sequences and "TR+FA" is the error rate. If the "TA+FR" ratio is close to one, then the proposed method significantly improves the accuracy of the LCS method for the co-movement detection problem.
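The four ratios can be tallied from per-pair outcomes as in the sketch below. It assumes each record is a pair of booleans (whether the LCS match is correct, and whether the chi-square test accepts $H_0$); the function name and the record layout are hypothetical.

```python
def tally_outcomes(records):
    """Compute TA, TR, FA, FR ratios plus the correct and error rates.

    records: iterable of (match_is_correct, test_accepts_h0) boolean pairs.
    """
    counts = {"TA": 0, "TR": 0, "FA": 0, "FR": 0}
    for correct, accepts in records:
        key = ("T" if correct else "F") + ("A" if accepts else "R")
        counts[key] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to tally")
    ratios = {key: value / total for key, value in counts.items()}
    ratios["correct_rate"] = ratios["TA"] + ratios["FR"]
    ratios["error_rate"] = ratios["TR"] + ratios["FA"]
    return ratios
```

By construction, the correct rate and the error rate sum to one.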
As in Table 1, the correct rates of the matching positions found by the LCS method alone are still not satisfactory, especially when the noise effect increases. Recall from (3.3) that the standard deviation $\sigma_0$ of the noise increases as any of the parameters $\alpha_0$, $\alpha_1$ or $\beta_1$ increases. The correct rates of the LCS method also decrease (see Tables 2-4) when any of the parameters $\alpha_0$, $\alpha_1$ or $\beta_1$ increases. Similarly, when the parameter $\alpha$ or $\beta$ decreases, the impact of the noise term increases and the performance of the LCS method becomes worse (see Tables 5-6). However, when we apply Pearson's chi-square test to the adjusted spectra of the Bollinger Percent, significant improvements are achieved. The ratios of "TA+FR" are all greater than 0.980 in the tables. The results indicate that combining the LCS method with the scheme introduced in the previous section can effectively detect the co-movement periods of the two price sequences.
For the real data application, we consider the intra-daily high frequency stock price data of Bank of America Corporation (BAC) and Bank of New York Mellon Corporation (BK) on June 12, 2002. We divide the normal trading hours into 35 non-overlapping time periods, $I_1, \cdots, I_{35}$, each of length 11 minutes, and obtain the integrated volatilities in each $I_i$ for the two stocks, denoted by $\{v^{BAC}_i\}$ and $\{v^{BK}_i\}$. Next, we use the method proposed in the previous section to examine the co-movement in the intervals of the 20 matching pairs found by the LCS method. Only two matching pairs, $(I_{19}, I_{19})$ and $(I_{23}, I_{23})$, are concluded to have similar co-movement patterns. Figure 5 plots the price movements within these two matching pairs, which also show great similarity visually.

Conclusion
This study proposes a scheme to detect the co-movement periods of two stock price processes. The proposed scheme includes 4 steps: (1) Apply the LCS method to find the matching positions of the original sequences; (2) For each matching pair, convert the non-stationary price processes to Bollinger Percent b sequences; (3) Compute the low-frequency power spectra of the %b sequences to characterize the dynamic trends; (4) Employ Pearson's chi-square test to assess the similarity of the two spectrum distributions. Simulation and empirical studies show that the proposed scheme can effectively detect the co-movement periods of the price sequences. In the future, we will extend the results of this study to develop financial trading strategies or arbitrage strategies for periods when similar price movements occur.

Table 1: The LCS results of simulation data for different $\alpha_0$ and fixed $\alpha = 0.6$, $\beta = 5$, $\alpha_1 = 0.6$, and $\beta_1 = 0.3$

Table 2: Simulation results of detecting the dynamic co-movement of two stock price sequences by the LCS and the proposed methods with various $\alpha_0$ and fixed $\alpha = 0.6$, $\beta = 5$, $\alpha_1 = 0.6$, and $\beta_1 = 0.3$. Column values of $\alpha_0$: $1.2 \times 10^{-7}$, $6.0 \times 10^{-8}$, $3.0 \times 10^{-8}$, $1.2 \times 10^{-8}$, $6.0 \times 10^{-9}$, $3.0 \times 10^{-9}$, $1.2 \times 10^{-9}$

Table 6: Simulation results of detecting the dynamic co-movement of two stock price sequences by the LCS and the proposed methods with various $\beta$ and fixed $\alpha = 0.6$, $\alpha_0 = 1.2 \times 10^{-8}$, $\alpha_1 = 0.6$, and β 1 = 0