<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">JDS</journal-id>
<journal-title-group><journal-title>Journal of Data Science</journal-title></journal-title-group>
<issn pub-type="epub">1683-8602</issn>
<issn pub-type="ppub">1680-743X</issn>
<issn-l>1680-743X</issn-l>
<publisher>
<publisher-name>School of Statistics, Renmin University of China</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">JDS995</article-id>
<article-id pub-id-type="doi">10.6339/21-JDS995</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Computing in Data Science</subject></subj-group></article-categories>
<title-group>
<article-title>A Simple Aggregation Rule for Penalized Regression Coefficients after Multiple Imputation</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Peterson</surname><given-names>Ryan A.</given-names></name><email xlink:href="mailto:ryan.a.peterson@cuanschutz.edu">ryan.a.peterson@cuanschutz.edu</email><xref ref-type="aff" rid="j_jds995_aff_001">1</xref><xref ref-type="fn" rid="cor1">∗</xref>
</contrib>
<aff id="j_jds995_aff_001"><label>1</label>Department of Biostatistics and Informatics, Colorado School of Public Health, <institution>University of Colorado-Denver Anschutz Medical Campus</institution>, Aurora, Colorado, <country>USA</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Email: <ext-link ext-link-type="uri" xlink:href="mailto:ryan.a.peterson@cuanschutz.edu">ryan.a.peterson@cuanschutz.edu</ext-link>.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2021</year></pub-date><pub-date pub-type="epub"><day>28</day><month>1</month><year>2021</year></pub-date>
<volume>19</volume><issue>1</issue><fpage>1</fpage><lpage>14</lpage><supplementary-material id="S1" content-type="archive" xlink:href="jds995_s001.zip" mimetype="application" mime-subtype="x-zip-compressed">
<caption>
<title>Supplementary Material</title>
<p>A script to reproduce simulations under varied parameters has been provided as supplemental material online, along with an appendix containing additional tables and figures pertaining to the simulations described herein.</p>
</caption>
</supplementary-material>
<history>
<date date-type="received"><month>7</month><year>2020</year></date>
<date date-type="accepted"><month>10</month><year>2020</year></date>
</history>
<permissions><copyright-statement>2021 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin University of China.</copyright-statement><copyright-year>2021</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Early in the course of the COVID-19 pandemic in Colorado, researchers wished to fit a sparse predictive model to intubation status for newly admitted patients. Unfortunately, the training data had considerable missingness, which complicated the modeling process. I developed a quick solution to this problem: Median Aggregation of penaLized Coefficients after Multiple imputation (MALCoM). This fast, simple solution proved successful on a prospective validation set. In this manuscript, I show that MALCoM performs comparably to a popular alternative (MI-lasso) and can be implemented in more general penalized regression settings. A simulation study and an application to local COVID-19 data are included.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>elastic net</kwd>
<kwd>LASSO</kwd>
<kwd>minimax concave penalty</kwd>
<kwd>missing data</kwd>
<kwd>regularization</kwd>
</kwd-group>
</article-meta>
</front>
<back>
<ref-list id="j_jds995_reflist_001">
<title>References</title>
<ref id="j_jds995_ref_001">
<mixed-citation publication-type="journal"> <string-name><surname>Breheny</surname> <given-names>P</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>J</given-names></string-name> (<year>2011</year>). <article-title>Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection</article-title>. <source>The Annals of Applied Statistics</source>, <volume>5</volume>(<issue>1</issue>): <fpage>232</fpage>–<lpage>253</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_002">
<mixed-citation publication-type="journal"> <string-name><surname>Chen</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name> (<year>2013</year>). <article-title>Variable selection for multiply-imputed data with application to dioxin exposure study</article-title>. <source>Statistics in Medicine</source>, <volume>32</volume>(<issue>21</issue>): <fpage>3646</fpage>–<lpage>3659</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_003">
<mixed-citation publication-type="journal"> <string-name><surname>Collins</surname> <given-names>L</given-names></string-name>, <string-name><surname>Schafer</surname> <given-names>JL</given-names></string-name>, <string-name><surname>Kam</surname> <given-names>C</given-names></string-name> (<year>2001</year>). <article-title>A comparison of inclusive and restrictive strategies in modern missing data procedures</article-title>. <source>Psychological Methods</source>, <volume>6</volume>(<issue>4</issue>): <fpage>330</fpage>–<lpage>351</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_004">
<mixed-citation publication-type="journal"> <string-name><surname>Friedman</surname> <given-names>J</given-names></string-name>, <string-name><surname>Hastie</surname> <given-names>T</given-names></string-name>, <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name> (<year>2010</year>). <article-title>Regularization paths for generalized linear models via coordinate descent</article-title>. <source>Journal of Statistical Software</source>, <volume>33</volume>(<issue>1</issue>): <fpage>1</fpage>–<lpage>22</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_005">
<mixed-citation publication-type="journal"> <string-name><surname>Gong</surname> <given-names>J</given-names></string-name>, <string-name><surname>Ou</surname> <given-names>J</given-names></string-name>, <string-name><surname>Qiu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Jie</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>L</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <article-title>A tool for early prediction of severe coronavirus disease 2019 (COVID-19): A multicenter study using the risk nomogram in Wuhan and Guangdong, China</article-title>. <source>Clinical Infectious Diseases</source>, <volume>71</volume>(<issue>15</issue>): <fpage>833</fpage>–<lpage>840</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_006">
<mixed-citation publication-type="journal"> <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Feng</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wall</surname> <given-names>MM</given-names></string-name> (<year>2016</year>). <article-title>Variable selection and prediction with incomplete high-dimensional data</article-title>. <source>The Annals of Applied Statistics</source>, <volume>10</volume>(<issue>1</issue>): <fpage>418</fpage>–<lpage>450</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_007">
<mixed-citation publication-type="journal"> <string-name><surname>Long</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Johnson</surname> <given-names>BA</given-names></string-name> (<year>2015</year>). <article-title>Variable selection in the presence of missing data: Resampling and imputation</article-title>. <source>Biostatistics</source>, <volume>16</volume>(<issue>3</issue>): <fpage>596</fpage>–<lpage>610</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_008">
<mixed-citation publication-type="journal"> <string-name><surname>Meier</surname> <given-names>L</given-names></string-name>, <string-name><surname>Van De Geer</surname> <given-names>S</given-names></string-name>, <string-name><surname>Bühlmann</surname> <given-names>P</given-names></string-name> (<year>2008</year>). <article-title>The group lasso for logistic regression</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>, <volume>70</volume>(<issue>1</issue>): <fpage>53</fpage>–<lpage>71</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_009">
<mixed-citation publication-type="journal"> <string-name><surname>Robin</surname> <given-names>X</given-names></string-name>, <string-name><surname>Turck</surname> <given-names>N</given-names></string-name>, <string-name><surname>Hainard</surname> <given-names>A</given-names></string-name>, <string-name><surname>Tiberti</surname> <given-names>N</given-names></string-name>, <string-name><surname>Lisacek</surname> <given-names>F</given-names></string-name>, <string-name><surname>Sanchez</surname> <given-names>JC</given-names></string-name>, <etal>et al.</etal> (<year>2011</year>). <article-title>pROC: An open-source package for R and S+ to analyze and compare ROC curves</article-title>. <source>BMC Bioinformatics</source>, <volume>12</volume>: <fpage>77</fpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_010">
<mixed-citation publication-type="book"> <string-name><surname>Rubin</surname> <given-names>DB</given-names></string-name> (<year>2004</year>). <source>Multiple Imputation for Nonresponse in Surveys</source>. <publisher-name>Wiley</publisher-name>, <publisher-loc>New York</publisher-loc>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_011">
<mixed-citation publication-type="journal"> <string-name><surname>Seaman</surname> <given-names>SR</given-names></string-name>, <string-name><surname>White</surname> <given-names>IR</given-names></string-name> (<year>2013</year>). <article-title>Review of inverse probability weighting for dealing with missing data</article-title>. <source>Statistical Methods in Medical Research</source>, <volume>22</volume>(<issue>3</issue>): <fpage>278</fpage>–<lpage>295</lpage>. <comment>PMID: 21220355</comment>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_012">
<mixed-citation publication-type="journal"> <string-name><surname>Sirimongkolkasem</surname> <given-names>T</given-names></string-name>, <string-name><surname>Drikvandi</surname> <given-names>R</given-names></string-name> (<year>2019</year>). <article-title>On regularisation methods for analysis of high dimensional data</article-title>. <source>Annals of Data Science</source>, <volume>6</volume>(<issue>4</issue>): <fpage>737</fpage>–<lpage>763</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_013">
<mixed-citation publication-type="journal"> <string-name><surname>Tibshirani</surname> <given-names>R</given-names></string-name> (<year>1996</year>). <article-title>Regression shrinkage and selection via the lasso</article-title>. <source>Journal of the Royal Statistical Society: Series B (Methodological)</source>, <volume>58</volume>(<issue>1</issue>): <fpage>267</fpage>–<lpage>288</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_014">
<mixed-citation publication-type="book"> <string-name><surname>Van Buuren</surname> <given-names>S</given-names></string-name> (<year>2018</year>). <source>Flexible Imputation of Missing Data</source>. <publisher-name>CRC Press</publisher-name>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_015">
<mixed-citation publication-type="journal"> <string-name><surname>Van Buuren</surname> <given-names>S</given-names></string-name>, <string-name><surname>Groothuis-Oudshoorn</surname> <given-names>K</given-names></string-name> (<year>2011</year>). <article-title>mice: Multivariate imputation by chained equations in R</article-title>. <source>Journal of Statistical Software</source>, <volume>45</volume>(<issue>3</issue>): <fpage>1</fpage>–<lpage>67</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_016">
<mixed-citation publication-type="journal"> <string-name><surname>Wynants</surname> <given-names>L</given-names></string-name>, <string-name><surname>Van Calster</surname> <given-names>B</given-names></string-name>, <string-name><surname>Collins</surname> <given-names>GS</given-names></string-name>, <string-name><surname>Riley</surname> <given-names>RD</given-names></string-name>, <string-name><surname>Heinze</surname> <given-names>G</given-names></string-name>, <string-name><surname>Schuit</surname> <given-names>E</given-names></string-name>, <etal>et al.</etal> (<year>2020</year>). <article-title>Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal</article-title>. <source>BMJ</source>, <volume>369</volume>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_017">
<mixed-citation publication-type="journal"> <string-name><surname>Yang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>H</given-names></string-name> (<year>2018</year>). <article-title>Model selection consistency of lasso for empirical data</article-title>. <source>Chinese Annals of Mathematics, Series B</source>, <volume>39</volume>(<issue>4</issue>): <fpage>607</fpage>–<lpage>620</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_018">
<mixed-citation publication-type="journal"> <string-name><surname>Zhang</surname> <given-names>CH</given-names></string-name> (<year>2010</year>). <article-title>Nearly unbiased variable selection under minimax concave penalty</article-title>. <source>The Annals of Statistics</source>, <volume>38</volume>(<issue>2</issue>): <fpage>894</fpage>–<lpage>942</lpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_019">
<mixed-citation publication-type="journal"> <string-name><surname>Zhao</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Long</surname> <given-names>Q</given-names></string-name> (<year>2017</year>). <article-title>Variable selection in the presence of missing data: Imputation-based methods</article-title>. <source>WIREs Computational Statistics</source>, <volume>9</volume>(<issue>5</issue>): <fpage>e1402</fpage>.</mixed-citation>
</ref>
<ref id="j_jds995_ref_020">
<mixed-citation publication-type="journal"> <string-name><surname>Zou</surname> <given-names>H</given-names></string-name>, <string-name><surname>Hastie</surname> <given-names>T</given-names></string-name> (<year>2005</year>). <article-title>Regularization and variable selection via the elastic net</article-title>. <source>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</source>, <volume>67</volume>(<issue>2</issue>): <fpage>301</fpage>–<lpage>320</lpage>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
