Econometrics at Scale: Spark up Big Data in Economics

Bluhm, Benjamin; Cutura, Jannic Alexander

doi:10.6339/22-JDS1035

Journal of Data Science

Econometrics at Scale: Spark up Big Data in Economics^✩

Volume 20, Issue 3 (2022): Special Issue: Data Science Meets Social Sciences, pp. 413–436

Benjamin Bluhm, Jannic Alexander Cutura

https://doi.org/10.6339/22-JDS1035

Pub. online: 7 April 2022
Type: Data Science Reviews

Open Access

^✩ The views expressed in this paper are those of the authors alone and do not represent the view of the European Central Bank (ECB).

Received: 9 November 2021
Accepted: 12 January 2022
Published: 7 April 2022

Abstract

This paper provides an overview of how to use “big data” for social science research, with an emphasis on economics and finance. We investigate the performance and ease of use of different Spark applications running on a distributed file system, which enable the handling and analysis of data sets that were previously unusable due to their size. More specifically, we explain how to use Spark to (i) explore big data sets that exceed the memory of retail-grade computers and (ii) run typical statistical/econometric tasks, including cross-sectional, panel data, and time series regression models, which are prohibitively expensive to evaluate on stand-alone machines. By bridging the gap between the abstract concept of Spark and ready-to-use examples that can easily be altered to suit the researcher’s needs, we provide economists, and social scientists more generally, with the theory and practice to handle the ever-growing datasets available. The ease of reproducing the examples in this paper makes this guide a useful reference for researchers with a limited background in data handling and distributed computing.
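To make the idea of running a regression on data that does not fit on one machine more concrete, the following is a minimal sketch in plain Python, not Spark code and not taken from the paper. It assumes a single regressor with an intercept and hypothetical toy data split into three "partitions". Each partition is reduced to a handful of sums (the map step), the sums are added together (the reduce step), and the OLS coefficients are computed from the totals alone. The function names are illustrative assumptions.

```python
from functools import reduce


def partition_stats(rows):
    """Map step: sufficient statistics for y = a + b*x on one partition."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(left, right):
    """Reduce step: add the statistics of two partitions element-wise."""
    return tuple(u + v for u, v in zip(left, right))


def ols_from_stats(stats):
    """Closed-form simple OLS computed from the aggregated sums only."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical toy data generated from y = 1 + 2x, split into three partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

total = reduce(combine, map(partition_stats, partitions))
intercept, slope = ols_from_stats(total)
print(intercept, slope)
```

Because only five numbers per partition are exchanged, the raw rows never need to be held in one machine's memory at the same time, which is the property that distributed frameworks exploit for this kind of estimator.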

Supplementary material

Supplementary material is available on our GitHub page, containing all code to replicate the results along with links to the data. Additional instructions detailing how to set up the AWS infrastructure are also available: https://github.com/benjaminbluhm/econometrics_at_scale.

© 2022 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.

Open access article under the CC BY license.

Keywords

Apache Spark, distributed computing, econometrics

Funding

We gratefully acknowledge a travel grant sponsored by the Bank of England. We gratefully acknowledge research support from the Leibniz Institute for Financial Research SAFE.
