Journal of Data Science


Predicting Stunted Growth in Two Year Old Bangladeshi Children via the Super Learner
Heather L. Cook, Jennie Z. Ma, Daniel M. Keenan, et al. (8 authors)

https://doi.org/10.6339/25-JDS1197
Pub. online: 4 August 2025      Type: Data Science In Action      Open Access

Received: 14 November 2024
Accepted: 2 July 2025
Published: 4 August 2025

Abstract

Stunted growth in children is a worldwide problem that can have long-term consequences for individuals who are stunted as early as two years of age. Predicting stunted growth accurately is complex, and machine learning offers a distinct advantage in this regard. While several techniques are available for predictive modeling, the Super Learner stands out as an ensemble method that integrates multiple algorithms into a single predictive model with enhanced performance. In this study, the Super Learner model, comprising a generalized linear model, bagged trees, random forests, conditional random forests, stochastic gradient boosting, Bayesian additive regression trees, neural networks, and model-averaged neural networks, performed well as measured by the area under the receiver operating characteristic curve, the Brier score, and the minimum of precision and recall. However, after analyzing the cross-validation results, the final model selected was Bayesian additive regression trees. Within the final model, the most important variables for predicting stunted growth at two years of age were the height-for-age z-score at one year, income, expenditure, anti-lipopolysaccharide antibody at week 6 and at week 18, plasma retinol binding protein at week 6, plasma soluble cluster designation 14 at week 18, fecal Reg 1B at week 12, vitamin D at week 18, mother’s weight and height at enrollment, fecal calprotectin at week 12, fecal myeloperoxidase at week 12, the number of days of diarrhea through the first year of life, and the number of days of exclusive breastfeeding through the first year of life.
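For readers unfamiliar with the three performance measures named in the abstract, the following is a minimal Python sketch of how each can be computed for a binary outcome. It is an illustration only and is not the authors' code, which the supplementary material says is written in R. The function names, the toy labels and probabilities, and the 0.5 classification threshold are all assumptions made for this example. Note that a lower Brier score indicates better-calibrated predictions, while higher values are better for the other two measures.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case; ties count one half.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def min_precision_recall(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> float:
    """Smaller of precision and recall after thresholding the probabilities.

    The 0.5 default threshold is an assumption for illustration.
    """
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, h in zip(y_true, pred) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(y_true, pred) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, pred) if y == 1 and h == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return min(precision, recall)


# Made-up toy data: 1 = stunted at two years, 0 = not stunted.
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]

print("Brier score:", round(brier_score(y_true, y_prob), 4))
print("AUC:", round(auc(y_true, y_prob), 4))
print("min(precision, recall):", round(min_precision_recall(y_true, y_prob), 4))
```

Taking the minimum of precision and recall rewards a model only when both are high, which is useful when the outcome classes are imbalanced.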

Supplementary material

The supplementary material consists of the Supplementary Tables and Analyses PDF, a README file, R scripts to run all analyses, an RData file with the data formatted for the analyses, and RData files with the corresponding models. The Supplementary Tables and Analyses PDF contains data descriptions, a summary of methods, and results for the NNLS optimization for the continuous outcome. The README file briefly explains each file.



Copyright
2025 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.
Open access article under the CC BY license.

Keywords
children’s health; classification; ensemble method; machine learning


Journal of data science

  • Online ISSN: 1683-8602
  • Print ISSN: 1680-743X
