Producing Fast and Convenient Machine Learning Benchmarks in R with the stressor Package
Volume 22, Issue 2 (2024): Special Issue: 2023 Symposium on Data Science and Statistics (SDSS): “Inquire, Investigate, Implement, Innovate”, pp. 239–258
Pub. online: 4 June 2024
Type: Statistical Data Science
Open Access
Received
28 July 2023
Accepted
21 February 2024
Published
4 June 2024
Abstract
The programming overhead required to implement machine learning workflows creates a barrier for many discipline-specific researchers with limited programming experience. The stressor package provides an R interface to Python’s PyCaret package, which automatically tunes and trains 14–18 machine learning (ML) models for use in accuracy comparisons. In addition to providing an R interface to PyCaret, stressor also contains functions that facilitate synthetic data generation and variants of cross-validation that allow for easy benchmarking of the ability of machine learning models to extrapolate or compete with simpler models on simpler data forms. We show the utility of stressor on two agricultural datasets, one using classification models to predict crop suitability and another using regression models to predict crop yields. Full ML benchmarking workflows can be completed in only a few lines of code at relatively small computational cost. The results, and more importantly the workflow, provide a template for how applied researchers can quickly generate accuracy comparisons of many machine learning models with very little programming.
Supplementary material
The supplementary materials associated with this paper contain all the data and code necessary to reproduce the figures and tables shown in this paper. Dataset descriptions have been provided in the text, but additional information about the files can be found in the README file contained in the supplementary materials folder.