A Hybrid Monitoring Procedure for Detecting Abnormality with Application to Energy Consumption Data

The complexity of energy infrastructure at large institutions increasingly calls for data-driven monitoring of energy usage. This article presents a hybrid monitoring algorithm for detecting consumption surges using statistical hypothesis testing, leveraging the posterior distribution and its information about uncertainty to introduce randomness in the parameter estimates, while retaining the frequentist testing framework. This hybrid approach is designed to be asymptotically equivalent to the Neyman-Pearson test. We show via extensive simulation studies that the hybrid approach enjoys control over type-1 error rate even with ﬁnite sample sizes whereas the naive plug-in method tends to exceed the speciﬁed level, resulting in overpowered tests. The proposed method is applied to the natural gas usage data at the University of Connecticut.


Introduction
Large academic institutions face challenges in energy management due to complex energy infrastructure (University of Michigan, 2011;Worcester Polytechnic Institute, 2007).The number of buildings is just one factor that makes energy management difficult.In most institutions, energy management envelops energy auditing, energy bills, life cycle costing, electrical distribution systems, boilers and fired systems, steam distribution, cogeneration, energy management control systems, insulation, compressed air, renewable energy sources and water management, distributed generation, and codes standards and legislation (Doty and Turner, 2004;Capehart et al., 2020).The sheer complexity of energy management calls for a screening process to identify anomalous energy usage with the goal of reducing the number of accounts that have to be inspected manually.Manual inspection is time-consuming and labor-intensive, and facilities professionals' time is wasted if an inspection turns out to have been unnecessary.A screening process to detect energy usage anomalies in buildings ultimately depends on what defines an anomalous behavior.In general, anomalies are observations in energy usage that substantially deviate from what is expected.Statistical monitoring learns normal behaviors from data to locate atypical behaviors.Zhang and Paschalidis (2018) derive a Hoeffding test statistic for network systems, leveraging large deviation theory and assuming that observations follow a finite-state Markov chain.They use the relative entropy between the empirical and theoretical probability laws to define what is anomalous.Likewise, in Fu and Jeske (2014), anomaly detection in network traffic is formulated as a hypothesis test of H 0 : μ = β 0 versus H 1 : μ = cβ 0 , where β 0 indicates a coefficient vector of fixed effects (the daily or weekly mean network traffic counts) and c is a percentage increment tuning parameter that represents the maximum acceptable multiplicative increase.The true parameter β 0 is often unknown in anomaly detection problems, whereas it should usually be prespecified under a conventional hypothesis testing setting.The main contribution of Fu and Jeske (2014) is that their proposed method resolves the problem of unknown β 0 and potential nuisance parameters in a model.However, they essentially take an estimate for β 0 as truth, which inherently harbors unaddressed randomness by virtue of being estimated.Their confidence in using plug-in estimates stems from the large amount of network traffic data readily available due to its high-frequency nature.
Existing literature on "anomaly detection" methods for energy consumption proposes various ways to quantify "what is expected" and how severely the observed data depart from it.Seem (2007) proposes a modified z-score after identifying outliers using the generalized studentized deviates.Zhao (2014) assumes that the difference between adjacent observations should approximately be the same, which works as a measure of large discontinuity.Rashid and Singh (2018) propose distance-based abnormality scores.To the best of our knowledge, these papers on anomaly detection are mainly focused on high-frequency demand or usage data, which do not use properties available for coarser-grained monthly data.In the time series data framework, anomaly detection is formulated as a change point detection problem.For instance, Ross et al. (2011) and Ross et al. (2013) develop change point detection models based on hypothesis tests where p-values govern statistical decision-making, while Raftery and Akman (1986) model the data as a stream of Poisson random variables whose parameter changes from λ 1 to λ 2 after an unknown break point.
Our paper was primarily motivated by Fu and Jeske (2014) where the data set is partitioned into historical data and monitored data (see Section 2 for a detailed description of the data) and the historical data set is used to learn normal behavior, which is then used to test the monitored data.This framework inevitably involves estimating parameters to be used in hypothesis testing, which requires large data.However, unlike network traffic systems, the amount of data available to train the model is usually smaller for energy usage data, yielding estimates with higher uncertainty.To overcome this shortcoming, multiple buildings a priori believed to behave similarly are grouped together to borrow strength from each other.This alone cannot remedy the uncertainty incurred by replacing truth with an estimate.In that direction, the posterior distribution is a powerful object that encodes uncertainty in its own way and can be used to reintroduce randomness that was ignored in the plug-in scheme.Unfortunately, the posterior predictive distribution of the test statistic is not available in closed form, and a computationally intensive method is required to construct it via Monte Carlo sampling.

A Motivating Case: UConn Energy Data
Facilities Operations at the University of Connecticut (UConn) collects and monitors data on utility consumption.UConn's main campus in Storrs, Connecticut, has 120 buildings with over 400 separate external utility accounts for natural gas and electricity provided by local utility service companies, CNG and Eversource.The main campus has an additional 150 buildings with over 240 internal meter accounts for utilities produced by UConn's Central Utility Plant providing the campus with electricity, steam, and chilled water.We mainly focus on UConn's natural gas consumption measured monthly by UConn Facilities Operations.The raw data consist of monthly measurements of natural gas consumption collected by UConn Facilities Operations.A total of 259 separate utility accounts are available across 120 buildings, each building having a varying number of meters.On a monthly basis, CNG, the local utility service provider for natural gas, provides UConn with a spreadsheet that contains all the account information including natural gas usage.In what follows, the building names and meter codes have been masked due to privacy considerations.UConn CNG data span 14 years from February 2007 to December 2020.Out of 245 CNG meters, we select 71 meters installed in residential buildings on UConn Storrs campus.Residential buildings have been grouped together a priori due to their similar energy consumption profiles.Energy meters have been installed and activated in university buildings at different times, making data availability different for each account.All 70 accounts for UConn's residential buildings were available from January 2008 and December 2016.One account was removed due to at least one unavailable measurement between Jan. 2008 andDec. 2016.Each observation corresponds to a single utility bill for the associated meter in one billing period, measured as the difference between two readings in hundred cubic feet (abbreviated to CCF).The billing start and end dates do not align uniformly across accounts, and therefore, the meter readings are adjusted for the differing number of days within each billing cycle to amount to 30-day use.Furthermore, the building size must be accounted for, since larger buildings tend to use more gas.Each measurement is adjusted to represent gas usage for 100 sqft.The final normalization becomes as follows: where y is the observation, sqft indicates the square footage and degree-days is the number of degree days of the corresponding month (see Section 6).A set of 12 months forms a "cycle" regardless of the month the cycle starts with.For example, Feb. 2012 to Jan. 2013 can form a cycle.For practical purposes, most interesting cycles are based on calendar years, fiscal years, or academic years.We represent the measurement for the ith year, j th month, and kth account as y h ij k , where i = 1, . . ., T , j = 1, . . ., J , and k = 1, . . ., K. A batch of observations within a cycle need not be 12, and therefore, we let j take on numbers up to an arbitrary J smaller than (or equal to) 12-i.e., j = 1, . . ., J (see Section 4.2).The batch that we are interested in regarding the existence of anomalous activities is in the last cycle, which we call monitored or test data, denoted by (y T +1,1 , . . ., y T +1,K ), where y T +1,k = (y T +1,1,k , . . ., y T +1,J,k ) .This naturally gives rise to T cycles of historical data, which we write (y h i1 , . . ., y h iK ) for i = 1, . . ., T , where y h ik = (y h ij k , . . ., y h iJ k ) and the superscript h indicates historical data.Figure 1 contains two heat maps, showing the standard deviations of the normalized natural gas use data for 70 residential buildings on UConn's main campus, with higher standard deviations colored darker.The left panel illustrates the standard deviations across years given a month and an account.Except for the first account (leftmost column), cells in each row are colored similarly.This suggests that the normalization removed various sources of variation and allowed years to be assumed as exhibiting similar variability.The same observation is possible for the right panel, containing standard deviations across accounts given a month and a year.Account variability is noticeably consistent within a month across years.

Testing Parameters Under Uncertainty
In energy monitoring, there is no one-size-fits-all number representing the energy use that is considered normal for all buildings.This very lack of reference point complicates the formulation of hypothesis testing.An intuitive solution would be to check whether observations have significantly changed relative to the historical data, assuming observations were stable and consistent for several years (Fu and Jeske, 2014).Unfortunately, this involves replacing truth with an estimate, which leaves the uncertainty unresolved.Assume y T +1,k independently follows a J -dimensional multivariate normal distribution, N(μ k , ), where μ k is a mean vector and is an unknown covariance matrix for k = 1, . . ., K. Consider the following hypotheses: where c is a prespecified scalar greater than one, indicating multiplicative excess energy use beyond a tolerable level, and β 0k is a J -dimensional vector that reflects acceptable energy usage, prespecified using historical data.Note that also needs to be specified using historical data.In essence, this hypothesis test compares the mean energy consumption represented by μ k to an acceptable level, β 0k .The only truly known, or user-specified, component in our setting is the percentage increment tuning parameter c.Provided that β 0k 's and are given, a traditional test would proceed to find the most powerful rejection region given by the Neyman-Pearson lemma (Neyman and Pearson, 1933;Casella and Berger, 2002).Estimated β 0k 's in a classical testing procedure may yield biased statistical inference due to the variability in the estimate.In the most general case, repeatedly splitting the data at random provides a way to reintroduce some randomness in the estimate by employing one data set for parameter estimation and the rest for testing-known as the train-test split.This bootstrapping scheme extracts information about uncertainty from subsetting the data to inject missing randomness into the test statistic (Efron and Tibshirani, 1994).When data are more structured than the general case, random splits no longer work.For example, the motivating UConn CNG data (see Section 2) are observed one batch per cycle, with a set of historical data {y h i1 , . . ., y h iK } for i = 1, . . ., T and a set of test data {y T +1,1 , . . ., y T +1,K } corresponding to the last cycle.The subscript T +1 in test data is henceforth omitted.We are only interested in testing the last cycle, where the historical data are assumed to have been generated from the null model, H 0 : μ k = β 0k .The specific interest in testing only the last cycle restricts our ability to generate random splits.Although it is technically possible to bootstrap estimates from the historical data, it is challenged by prohibitively small bootstrap sample sizes, and will produce an overpowered procedure similar to the naive plug-in procedures.Instead, we turn to an alternative object that contains the uncertainty information: the posterior distribution.

A Motivating Special Case
We illustrate the single-account setting where only one utility account is of interest (i.e., k is given) and lay out the proposed algorithm as a motivating example.Let y be a random vector from a J -dimensional multivariate normal distribution N(μ, σ 2 I J ).This is a special case of N(μ, ) where the covariance matrix = σ 2 I J .This assumption implies that the energy consumption is independent across different months and shares a common variance σ 2 , allowing the estimation of σ 2 without borrowing information from other accounts.Under this simple setting, the historical data are expressed as {y h 1 , . . ., y h T } (or {y h ik } for a given k and i = 1, . . ., T for notational consistency), and the test data are y.We drop the subscript k henceforth.The procedure is twofold: (1) the historical data are used to estimate the parameters; and (2) the test data are used to derive the rejection region and compute the test statistic.The hypotheses remain identical to those in Equation ( 2).The historical data y h i for i = 1, . . ., T are used to estimate the parameters β 0 and σ 2 .This model can be recast into the following matrix-vector form, y h = Xβ 0 + , where ) for which vec(•) denotes vectorization stacking the column vectors, X = ⊕ J j =1 1 T for which ⊕ indicates direct sum and 1 T is the T -dimensional vector of ones, and ∼ N(0, σ 2 I T J ).Then the least-squares estimator of , where It is well-known that the least-squares estimator follows a normal distribution, i.e., β 0 ∼ N(β 0 , σ 2 T I J ).Also, the unbiased estimator of σ 2 and its sampling distribution are given as follows: where χ 2 ν indicates a chi-squared distribution with ν degrees of freedom.This ultimately gives the posterior distributions under a uniform prior, π(β 0 , σ 2 ) ∝ 1, as follows: where X ∼ IG(a, b) denotes the inverse-gamma distribution whose density function is proportional to x −(a+1) e −b/x .Subsequently, the test data is used to derive the test statistic and rejection region corresponding to the most powerful test through the Neyman-Pearson lemma.Writing the likelihood ratio of the alternative hypothesis to the null hypothesis as (i.e., = L(cβ 0 ,σ 2 ) L(β 0 ,σ 2 ) ), log follows a normal distribution under H 0 (see Appendix B), which upon standardization yields The posterior distributions of the parameters capture the variability introduced in the estimation step.The main idea of our procedure lies in using the posterior distribution to calibrate the Algorithm 1 Single-account testing procedure. 11: 13: return H 1 if p obs γ α and H 0 otherwise frequentist testing procedure.The step-by-step description of our hybrid procedure is given in Algorithm 1.In our procedure, the test statistic is computed repeatedly using the parameter values generated from the posterior distribution, and the corresponding simulated data.This simulated data set is written as y f where the superscript f indicates future values.Then, the same calibration is performed on the test data to compute the calibrated p-value p obs , which is ultimately compared to p, the sample of posterior predictive p-values.If the observed p-value p obs associated with y exceeds the upper α-th quantile of p, the null hypothesis is rejected.

Hybrid Monitoring for Multiple Accounts
The common variance assumption and independence of energy usage across months in Section 2 are too restrictive.Estimating an unstructured typically requires a large amount of historical data (i.e., a sufficiently large T ) if we are using data from a single account.There are two key motivations: (1) estimating by pooling information; and (2) monitoring energy usage of all accounts.However, this approach requires that all accounts share the same covariance matrix, which is well-supported by Figure 1.Assume y k follows a J -dimensional multivariate normal distribution N(μ k , ) with a mean vector μ k and an unknown covariance matrix for k = 1, . . ., K. The data are now modeled as being correlated within a cycle but independent between cycles.Now consider the following hypotheses: where C is the percentage increment matrix.An example of C is diag(c 1 , . . ., c J ) with c j > 1 for all j .The historical data can be expressed as y h ik = β 0k + ik for i = 1, . . ., T and k = 1, . . ., K, where β 0k = (β 1,0k , . . ., β J,0k ) is a mean vector for the kth account, ik IID ∼ N(0, ), and is an unknown J × J symmetric positive-definite matrix.Unbiased estimators for β 0 and are well-known, computed using the historical data: The hypothesis testing expressed in Equation ( 3) is inherently a multiple testing problem.Denoting the test data by y k = (y 1k , . . ., y J k ) , the logarithm of the likelihood ratio follows a normal distribution under H 0 by the same logic as in the motivating special case, where assum- The sum-of-squares matrix , where W J (ν, ) is the Wishart distribution with ν degrees of freedom and a J × J scale matrix whose density function is J is the multivariate gamma function defined by and S J ++ is the space of J ×J symmetric positive-definite matrices.This in turn yields a posterior distribution | y h ∼ IW(K(T −1)−J −1, S) under a noninformative prior, π(β 01 , . . ., β 0K , ) ∝ 1, where IW J (ν, ) is the Inverse-Wishart distribution with ν degrees of freedom and a J × J scale matrix whose density function is Likewise, the (conditional) posterior distribution of β 0k is given by There is extensive research on how to adjust K p-values to control the familywise error rate or the false discovery rate.In multiple-testing literature, the null hypothesis, H 0 : μ k = β 0k for all k, is referred to as the grand null, and H 1 : μ k = β 0k for at least one k is equivalently the grand alternative hypothesis.For our procedure (see Algorithm 2), we consider five multiplicity corrections including the Bonferroni correction (Dunn, 1961;Holm, 1979;Hochberg, 1988;Hommel, 1988;Simes, 1986;Šidák, 1967;Benjamini and Hochberg, 1995).The adjusted p-value (Wright, 1992) for each method is computed as follows: • • Hommel: no one-line expression for adjusting p-values Algorithm 2 Multiple-account testing procedure.
1: procedure multipleTesting(y h , y, α, B 1 , B 2 , B 3 , h) 2: 14: δ b ← any( p1 < α, . . ., pK < α) 19: return H 1 if p obs γ α and H 0 otherwise Holm's procedure is replaced with the Holm-Šidák method in our simulation studies to increase power.The corresponding adjusted p-value becomes p Algorithm 2 describes the proposed multiple-account procedure, where h(u 1 , . . ., u K ) is the chosen algorithm for computing the adjusted p-values.The multiple-account procedure follows the same logic as that of Algorithm 1.However, unlike Algorithm 1, this multipleaccount procedure defies an analytical expression of its statistical power for a given multiplicity correction.We therefore provide a Monte Carlo scheme for computing the statistical power once we have full knowledge of the data generation process.See Section 5 for detail.

Complexity Analysis
The computationally intensive nature of Algorithm 2 merits a brief discussion about its computational cost.Since it is well-known that nested for loops are costly, we focus our discussion on the innermost for loop that repeats a sequence of test statistic calculation B 1 times.The most computational overhead comes from either line 7 or line 9 in Algorithm 2. Generating a J × J inverse-Wishart random matrix produces its inverse −1 as a byproduct.The inversion here is of O(J 3 ), where O(f (n)) indicates a class of algorithms whose order is f (n)-i.e., an algorithm g(n) is an element of O(f (n)) if there exist a positive integer N and a positive real number c such that 0 g(n) cf (n) for all n N (Cormen et al., 2022).In line 9, −1 (y f k − β k ) can be done in one step for all k = 1, . . ., K by stacking {y Then, using the fact that the computational complexity of matrix multiplication between an m × n matrix and an n × p matrix is O(mnp), calculating −1 D is of O(J 2 K).In theory, the computational complexity of one iteration of the innermost for loop is therefore O(J 2 max(K, J )).In practice, however, J is the number of months in a cycle, bounded above by 12.This makes it highly probable that K is greater than J .All in all, the complexity of Algorithm 2 is therefore O(B 1 B 2 J 2 max(K, J )).As daunting as it may seem, there is a source of significant computational gains in the proposed method: parellelization.The posterior distribution π(β 1 , . . ., β K , ) does not require Markov chain Monte Carlo sampling, which lends itself to parallelization.There are various ways to parallelize a program (see Section 5 for our settings).Although parallelizing Algorithm 2 considerably speeds up computation, a high value of B 1 B 2 can easily cancel out the acceleration.We recommend that both B 1 and B 2 be smaller than 2,000 for a good balance between performance and computational feasibility.

Testing Before a Full Cycle
Testing the last full cycle of data was initially proposed in Fu and Jeske (2014) for high-frequency data collection.However, testing full-cycle may be too prohibitive for practical use.For instance, we may want to test energy usage before having observed a full 12 months.The aggregate nature of the UConn CNG data lends itself well to the central limit theorem, allowing the use of the multivariate normal distribution as our sampling distribution.Fortunately, the affine property of the multivariate normal distribution (Rao, 1973;Ravishanker et al., 2022) allows the restriction to be relaxed to any subset of the observations.The index j = 1, . . ., J has been kept arbitrary on purpose to accommodate this relaxation.That is, where E is a J × a selection matrix whose columns are standard unit vectors e j = (0, 0, . . ., 1, . . ., 0) whose j th element is one and zeros elsewhere, and ỹk indicates an a-dimensional subvector of y k of selected elements to be tested, for which a is the number of selected elements.Redefining J := a and y k := ỹk brings us back to the original formulation-Equation (3)-with a mild abuse of notation.

Bayesian Interpretation
Our proposed procedures permit a straightforward Bayesian interpretation under noninformative priors.The posterior distribution of | y and the conditional posterior distribution β | , y are given in Sun and Berger (2007).In our case, the distribution β 0k | , y is easily obtained as N(y •k , /T ).As for the marginal posterior of , assume the Jeffreys-type prior: (Jeffreys, 1998;Geisser and Cornfield, 1963;Sun and Berger, 2007).Then the marginal posterior density becomes (5) If we choose d to be the dimension of , that is let π(β, ) to be Jeffreys' prior, then | y ∼ IW(K(T − 1) − 1, S).The first nested for loops in Algorithm 2 correspond to a Monte Carlo sampling scheme to construct the distribution of the random variables defined as an expectation as follows.Writing U k = 1 − (W k (y f , β k , )) with being the distribution function of the Lim, D. et al.
where y f follows the posterior predictive distribution.This expectation is a random variable whose randomness stems from y f .By sequentially sampling from (b) ∼ [ | y] and β (b) ∼ [β | (b) , y] for b = 1, . . ., B, the expectation in Equation 6 is approximated by as B → ∞.The entire procedure is asymptotically equivalent to the frequentist Neyman-Pearson test due to posterior consistency (see Proposition 1 in Appendix A).

Simulation Studies
In this section, we conduct extensive simulation studies to investigate the performance of the proposed algorithms relative to the naive plug-in method.The size, or equivalently the type-1 error, and the power are the quantities of interest.Both the size analysis and the power analysis consist of a simulation study for the motivating special case with = σ 2 I J for a given k and another for the multiple-account case.We detail the data generation setting here; note that the historical data are assumed to follow the null hypothesis whereas the test data follow the null hypothesis and alternative hypothesis, respectively, in size analysis and power analysis.The related data sets are generated accordingly using R (R Core Team, 2021).All algorithms are implemented in C++ using Rcpp (Eddelbuettel and Balamuta, 2018) and RcppArmadillo (Eddelbuettel and Sanderson, 2014) for linear algebra, and OpenMP (OpenMP Architecture Review Board, 2018) for parallel programming.For the single-account case, we generate 10,000 data sets each for T = 3, T = 5, T = 10, and T = 15 from the model specification, where true β is set to (6,7,8,9,10,11,12,13,14,15,16,17) and σ 2 = 2.5.Recall that T indicates the number of cycles in the historical data used to estimate the parameters.We run the algorithm for α = 0.01, α = 0.025, and α = 0.05.There are three tuning parameters: B 1 , B 2 , and B 3 indicating the numbers of Monte Carlo iterations within Algorithm 1 and Algorithm 2. The single-account case involves relatively few computational intensive matrix operations like matrix inversion or the Cholesky decomposition-the tuning parameters were chosen to be large numbers (B 1 = 4000, B 2 = 5000, B 3 = 4000).The seed number was set to 2797542 for data generation and 18007 for running the algorithms, respectively.For simulations related to the multiple-account case illustrated in Section 4, we generate 1,000 data sets each for T = 3, T = 5, T = 10, T = 15 from the appropriate model.That is, all historical data were generated from the model y k

Size Analysis
The three panels on the left column of Figure 3 visualize the comparison between Algorithm 1 and the naive plug-in method in the single-account case.It is clearly observed that the type-1 error is not controlled when plug-in estimates are used.Meanwhile, the proposed method effectively contains the type-1 error below the desired level.The discrepancy between the methods is the most pronounced when T = 3 since the number of samples used to estimate the (plug-in) unbiased estimator falls well short of that required for the law of large numbers to be at play, leaving much of the randomness in the estimator unaccounted for.
Similarly, the comparison between Algorithm 2 and the naive plug-in methods in the multiple-account case is summarized in Figure 3.The proposed methods, as well as the naive plug-in methods, depend on the multiple testing corrections denoted by h(•) in Algorithm 2. Four multiplicity corrections are considered-Holm, Hochberg, Hommel, and Benjamini-Hochberg.The familywise error rate, the multivariate equivalent of the type-1 error rate, is slightly elevated to avoid too low a probability of anomaly detection (α ∈ {0.025, 0.05, 0.1}).
The bar plots show that the inflated type-1 error issue that plagued the plug-in methods in the single-account case carries over to the multiple-account case regarding the FWER.All three panels in the right column of Figure 3 have bars labeled with the letter "N", indicating that the corresponding values were computed using "naive" (plug-in) methods, exceeding the desired FWERs marked as red lines.Note that the inflated FWERs of the plug-in estimates subside as more data become available, i.e., T increases.On the other hand, the proposed methods exhibit a rather conservative behavior.We observe that the proposed algorithms "play it safe" with scarce data, or equivalently with large randomness, and grow more confident to reject the grand null hypothesis as more data come in.Little difference in the performances exists across multiplicity corrections.It has been observed that the proposed method exhibits type-1 error rate that approaches the desired FWER from below, whereas the plug-in method does so from above.In theory, both methods should converge in the presence of sufficient data, due to frequentist consistency.To confirm this, an additional set of simulations with 4,629 data sets has been conducted for T = 100.Note that 100 years of historical data are unrealistic in practice.As evidenced in Figure 4, the two methods return indistinguishable size estimates with a wealth of data.This finding supports our method's control over size in circumstances with small-to-moderate data.
Figure 3: (Left column) Simulation results for the size analysis in the single-account case.The red dashed lines are the desired α-level.(Right column) Simulation results from 1,000 data sets in the multiple-account case.The letters "P" and "N" are short for "Proposed" and "Naive" respectively, indicating the proposed methods and naive plug-in methods.Multiple testing corrections are distinguished by colors, as shown in the legend.The red dashed lines are the desired FWER.
Figure 4: Simulation results for size convergence from 4,629 data sets in the multiple-account case for T = 100.The letters "P" and "N" are short for "Proposed" and "Naive", respectively, indicating the proposed methods and naive plug-in methods.Multiplicity corrections are distinguished by colors, as shown in the legend.The red dashed lines are the desired FWER.11).(Right column) Simulation results in the multiple-account case for power analysis.The letters "P" and "N" are short for "Proposed" and "Naive", respectively, indicating the proposed methods and naive plug-in methods.The multiple testing corrections are distinguished by colors, as shown in the legend.True power is marked as a dashed line for each method, computed through Monte Carlo simulations of 10,000 iterations.

Power Analysis
The left column of Figure 5 contains three bar plots comparing the estimated statistical power using the proposed method to that of plug-in estimates.The red lines indicate the theoretical power for the corresponding configuration, computed by 1 − (z 1−α − (c − 1) β 0 /σ ), where z α is the α-th quantile of the standard normal distribution.The statistical power is by and large comparable except when T is low where the plug-in estimate yields moderately higher statistical power than the proposed method.However, this comes at the expense of allowing exceedingly high type-1 error as demonstrated in Figure 3.The gains in statistical power are not worth relinquishing control over type-1 error, considering the differences in type-1 error and statistical power.
In the multiple-account case, the right column of Figure 5 shows the statistical power of the proposed methods and plug-in methods labeled as "P" and "N", respectively.Each multiple testing correction is distinguished by a different color.Despite the lack of a closed-form expression Lim, D. et al.
of the true power, it can be computed with Monte Carlo simulations, i.e., 10,11,12, β is the type-2 error with respect to the grand hypotheses, and h is the multiple testing correction chosen from {Holm, Hochberg, Hommel, BH} which yields the adjusted pvalues.Considering the true power computed through Monte Carlo simulations is marked as dashed lines in all three panels, it is easily observed that the naive plug-in methods outperform the Neyman-Pearson test which is supposed to be the most powerful.Despite its ostensibleyet misleading-outperformance, this is in fact due to the uncontrolled FWER, as shown in Figure 3.In the context of energy use monitoring, this indicates that the plug-in methods set off false alarms twice as many times as the acceptable rate specified by the type-1 error, incurring wasteful human inspection.Even with 15 cycles of historical data, the plug-in methods are still overpowered.On the other hand, the proposed procedure produces statistical power that falls short of that of the Neyman-Pearson test; however, it catches up reasonably quickly with 10 cycles of historical data, with which the naive plug-in procedures are still overpowered.

Real Data Analysis
In this section, we apply the proposed hybrid energy monitoring algorithm detailed in Section 4 to UConn CNG data described in Section 2. As briefly mentioned in Section 2, it is not optimal to flag an already expected surge in natural gas use as abnormal in cases where the building itself is large or the temperature for a particular month drops unusually.To that end, prior to analysis, we normalize each observation into consumption for a 100 square-foot building for a 30-day month per degree day (as in Equation ( 1)).For buildings with more than one meter, the square footage is divided by the number of meters.
Facilities Operations collects daily weather data from the National Oceanic and Atmospheric Administration (NOAA) through the web application programming interface (API) provided by the Applied Climate Information System.Of all available weather information, the heating degree days and cooling degree days defined as HDD = max(0, 65− ū) and CDD = max(0, ū−65) are used to encapsulate the weather on a given day, where ū is the average of the maximum and minimum temperatures of a day, and 65 degrees Fahrenheit is the neutral balance point.Thus, the observations are further divided by the sum of degree days to carve out the uneven impact of the weather on the natural gas usage.We set the tuning parameters to B 1 = 1000, B 2 = 2000, and B 3 = 2000, and run our procedure for the first four months of a calendar year (January, February, March, and April) of 2008 through 2016.Figure 7 contains the histograms of p of the proposed procedure for four multiplicity corrections: Holm, Hochberg, Hommel, and BH.The solid vertical line in each panel indicates the upper α-th quantile of the empirical distribution p, the exact value of which is written by the line with the Greek letter γ α .And the dashed line is the corresponding observed p-value of the test data, next to which the corresponding value is written as p obs .All four multiplicity-correction methods agree that there exists an anomalous account in the first four months of 2017.
Upon the grand null hypothesis being rejected, individual adjusted p-values have been examined to locate the possible anomalous surge in natural gas usage.It is beneficial to check each Figure 6: A line plot of the degree-day-adjusted data from UConn residential buildings.
Figure 7: The histograms of constructed distributions of p-values regarding the grand null.Only the first four months were selected to be tested-January, February, March, and April.adjusted p-value as the algorithm tends to drive up the critical value for the grand hypothesis test, which makes it more difficult to reject than in each individual case.This is because it requires just one pk smaller than α (line 11 of Algorithm 2) for δ b 1 to be counted as one, whereas the account-specific equivalent of δ b 1 may remain zero.All four correction methods agreed on k = 45, Apartment Building 45.All methods agreed upon k = 41, Apartment Building 41, being anomalous as well.Figure 8 contains the time plots for Apartment Building 41 and Apartment Building 45.The top two panels display the unadjusted native use recorded in CCF whereas the bottom two panels show the adjusted usage for 30 days and 100 square feet after each observation is divided by the number of degree days.Observations for each year's first four months (Jan-Apr) are colored red.Focusing on the last four red points that were monitored for anomalous behavior, the unadjusted native use in the top two panels do not particularly deviate from historical patterns.However, once adjusted, it becomes visible that the last cycle stands Figure 8: Two buildings detected as anomalous according to the proposed procedure.The top two panels show the (unadjusted) native use in CCF whereas the bottom two panels are adjusted for the number of days in a billing cycle, the square footage of each building, and the degree days.Points marked red are the first four months (Jan-Apr) in each year.The last four red dots were monitored.Orange lines indicate the average monthly temperature in Fahrenheit, whose axis is on the right side of each plot.out from the other cycles.For Apartment Building 45, in particular, the average native use of the four months in 2016 was 27.5 percent higher than that of the previous year.The monthly average temperature, colored orange in the plots, indicates that the first four months of 2016 were warmer than previous years' months, which defies the expectation that gas usage would go down with warm weather.There are no cut-and-dried explanations for the anomalies discovered.However, there are a few common scenarios that could have led to such increases in gas usage.First, utilities managers from UConn Facilities Operations have reported that everyone has a different "temperature comfort zone."The residents in 2016 may have had higher-thannormal temperature preferences.Second, utilities managers have also recounted that residents in these buildings frequently forget to shut their windows, especially when they leave their rooms.If these behaviors happened unusually frequently between January and April of 2016 in the two apartment buildings, it could have caused the gas consumption to balloon beyond what is considered normal as defined by the historical data.However, that only two buildings-out of 70-were flagged as anomalous increases in gas usage suggests that the gas usage within UConn's residential buildings is consistent and well-synchronized with the average temperature each month.
To further compare the performance of our proposed algorithm to that of the plug-in scheme, we derive prediction intervals using both methods.For the naive approach, the 100(1 − α)% prediction interval is given by β+z 1−α σ , whereas for our proposed method, we obtain γ α following Algorithm 1 and compute the upper γ α -th quantile of the realizations of {β (s) +z 1−α σ (s) } S s=1 , where β (s) and σ (s) are the s-th realized values from their posterior distribution.We have obtained these prediction intervals for Apartment Building 41 and Apartment Building 45.The results for 1 − α = 0.95 are shown in Figure 9, where blue lines indicate the upper bounds of the 95% posterior predictive interval under the proposed model, and the long-dashed lines mark the upper 95% prediction bounds using the naive plug-in approach.We see from this figure that our proposed method produces a markedly wider interval than the plug-in scheme as anticipated, since the estimation randomness accounted for in our method guards against overconfidence.These results are consistent with the simulation results in Section 5 where the proposed method was noticeably more conservative in rejecting the null hypothesis.

Discussions and Conclusion
We proposed a hybrid monitoring algorithm that takes advantage of the posterior distributions of the parameters to inject randomness in order to account for unaddressed uncertainty in statistical anomaly detection.To the best of our knowledge, the proposed algorithm is the first of its kind.From uncertainty calibration's perspective, our general idea in this application bears a resemblance to the posterior or prior predictive p-values (Meng, 1994;Hjort et al., 2006;Gelman et al., 1996).The proposed algorithm estimates the unknown parameters using historical data and tests the last cycle by computationally constructing the predictive distribution of the pvalues, to which the observed p-value is compared.We have shown that our method adjusts for the estimation uncertainty and properly controls type-1 error rate.
There are several possible extensions for future research.As more data become available, both incorporating new information and phasing out old information affect the performance of the algorithm.The former can be handled efficiently through an online updating scheme, which significantly alleviates computational burden.This online updating scheme will be useful for high-frequency data where repeated parameter estimation with an increasing amount of data can quickly strain computational resources.The latter can be addressed by retiring old data, which goes back to the question of what the minimum amount of data to achieve a desired level of statistical power is.Furthermore, it is crucial that all past anomalous data points be handled in a way that they do not undermine subsequent monitoring.Moreover, these fields are by nature highly collaborative, and domain experts play a substantial role in modeling the system correctly.We emphasize the importance of domain expert opinions in handling the detected anomalies.Methodologically, the hypothesis tests can be extended to composite hypotheses such as H 0 : μ cβ 0k versus H 1 : μ k < cβ 0k or H 0 : μ k cβ 0k versus H 1 : μ k > cβ 0k with inequalities applied elementwise.By picking arbitrary c 0 and c 1 for null hypothesis and alternative hypothesis respectively such that either c 0 > c 1 or c 1 > c 0 , the likelihood ratio becomes a monotone function of the data due to the fixed sign of c 1 − c 0 , which gives a uniformly most powerful test.In this setting, the test statistic is easily generalized as For future research, combining anomalous past data and online updating is currently under investigation.One potential useful method of handling flagged data is assigning weights.The online updating scheme will also need to be adjusted accordingly since deleting past data will disrupt updates.Gradually downweighting past data will automatically address retiring old data.
where φ(• | μ, ) is the density of a multivariate normal distribution with mean vector μ and covariance matrix .Under H 0 , log follows the following normal distribution: .
Although the rejection region does not depend on c, the increment parameter c plays a role in the power function, given by where (•) is the distribution function of the standard normal distribution, and z 1−α is the (1 − α)-th quantile of the standard normal distribution, i.e., (z 1−α ) = 1 − α.In the multivariate case, the logarithm of the likelihood ratio follows a normal distribution under H 0 , i.e., ∼ N(0, 1) under H 0 .

Figure 1 :
Figure 1: Heat maps of standard deviations.Left panel shows standard deviations across years for a given month and an account.Right panel shows standard deviations across accounts while month and account are held fixed.

IND∼
N(β k , )for k = 1, . . ., 70, while the test data for power analysis were generated from y k IND ∼ N(cβ k , ) for k = 10, 11, 12 where c = 1.2.Note that y k for k ∈ {10, 11, 12} still follow N(β k , ) even in power analysis.The specific values of the true parameters, β k and were selected to mirror the UConn CNG data, in which case J = 12, and c = 1.2 to indicate 20% increase in natural gas usage, a threshold beyond which utility engineers at the Facilities Operations deem excessive.

Figure 2 :
Figure 2: The coefficients of variation between β k and the diagonal entries of across 70 CNG meters installed in residential buildings at UConn.

Figure 5 :
Figure 5: (Left column) Simulation results in the single-account case for power analysis.The red dashed lines indicate the corresponding theoretical power when the true β and σ 2 were used, computed by Equation (11).(Right column) Simulation results in the multiple-account case for power analysis.The letters "P" and "N" are short for "Proposed" and "Naive", respectively, indicating the proposed methods and naive plug-in methods.The multiple testing corrections are distinguished by colors, as shown in the legend.True power is marked as a dashed line for each method, computed through Monte Carlo simulations of 10,000 iterations.

Figure 9 :
Figure 9: Prediction bands of natural gas consumption in Apartments 41 and 45 from Jan. to Apr., 2016.Blue solid lines indicate the upper bounds of the 95% Bayesian posterior predictive band under the proposed model, and the long-dashed lines mark the upper 95% prediction bound using the naive plug-in approach.