These scripts run in R and reproduce results in Tables 1, 2, 6 and Online Supplement Tables A.2 and A.3, in Sikdar et al. 
   

We divide the Readme instructions into the following: EBIRD DATA RESULTS & SIMULATION STUDIES

###########EBIRD DATA RESULTS ######################

1) To generate the results in Table 6 of the manuscript we have supplied the following 6 script files. 

a) VIM_count.R: This calculates the VIM based on frequency of variable use in MVRF
b) VIM_incidence.R: This calculates the VIM based on incidence (1/0) of variable use in MVRF
c) VIM_meanstructure.R: This calculates the split improvement based VIM for use in MVRF as proposed in the paper Sikdar et al.
d) VIM_meanstructureFtest.R: This calculates the split improvement (with F-test) based VIM  for use in MVRF as proposed in the paper Sikdar et al.
e) VIM_Outcomedifference.R: This calculates the per-outcome split difference based VIM for use in MVRF as proposed in the paper Sikdar et al.
f) VIM_outcomediffFtest.R: This calculates the per-outcome split difference (with F-test) based VIM for use in MVRF as proposed in the paper Sikdar et al.

 
Each of these script files performs the following tasks:

i) The script calls for the e-bird data from the R-package "MulvariateRandomForestVarImp". 

ii) This package is installed from: 

devtools::install_github("Megatvini/VIM")

iii) At the start of each script, there are three user-defined inputs
	- Specify the file path where output results need to be stored. Store each script output in a different sub-folder to avoid overwriting of output files.
	- Specify the number of trees (ntree) for each forest
	- Specify the number of iterations (niter) for the recursive feature elimination strategy

NOTE: The run time of each script file with ntree = 2000 and nforest = 20 as used in the paper Sikdar et al. is approximately 15 days. So, for code testing purposes lower the number of ntree and nforest.

iv) Each script runs for a given VIM in Table 6. The data is split into training and testing and the script runs for user-defined number of trees for an MVRF and iterations (niter) of the recursive feature elimination and creates iteration-specific output files. 

The iterative output files are described below (iter = 1, 2,..niter):
 	- Predicted values on test set at the given iteration of the MVRF (iter_avg_predict.csv).
 	- Print of the variables retained at the end of the iteration (itercovariates.csv).
	- VIM for the forest of the retained variables and the Gaussian noise at the end of iteration (iteravgVIM.csv).
	- Mean Absolute Error (MAE) estimated on the test set at the given iteration of the MVRF (iterMAE_mat.csv)
	- Mean Squared Error (MSE) estimated on the test set at the given iteration of the MVRF (iterMSE_mat.csv)
	- Number of variables retained at the end of the iteration (iterdimX1_iterations.csv)
	- Infinitesimal Jackknife (IJ) variance of VIM estimated at the end of the iteration (iterIJ_VIM_corr etc.)
	- SE of the calculated VIM estimated at the end of the iteration (iterSE_mat.csv).

v) The post-processing steps are done manually in Excel where the output from "iterdimX1_iterations.csv" files across all iterations are stacked and plotted to check the number of variables retained per iteration. The elbow method is used to determine the optimal iteration with the steepest drop in the number of variables retained. If there are multiple kink points or elbows (i.e., sharp drops in the number of variables retained), the optimal iteration is decided by the respective MSEs of the competing iterations. The iteration with the lower MSE is then chosen as the optimal one. 
The MSE and SE corresponding to the optimal iteration are recorded in Table 6.



############SIMULATION STUDIES#####################

2) The second set of files Nonsparsedatauncorrerrors_SimStudy1.R, sparsedata_uncorrerrors_SimStudy2.R, Nonsparsedata_correlatederrors_SimStudy3.R and Sparse_correlatederrors_SimStudy4.R are for the simulation studies reported in Tables 1, 2 and Online Supplement Tables A.2 and A.3

	- Nonsparsedatauncorrerrors_SimStudy1.R conducts the simulations for non-sparse data generated using a linear model with no error correlation.  It produces the variable rank ordering shown in Table 1 in the manuscript.

	- sparsedata_uncorrerrors_SimStudy2.R conducts the simulation for sparse data generated using a non-linear model with no error correlation. This produces the variable rank ordering shown in Table 2 in the manuscript.

	- Nonsparsedata_correlatederrors_SimStudy3.R conducts the simulation for non-sparse data using a linear model with error correlation. It produces the variable ranking shown in Table A.2. in the Online Supplement.

	- Sparse_correlatederrors_SimStudy4.R conducts the simulation for non-sparse data using a non-linear model with error correlation. It produces the variable ranking shown in Table A.3. in the Online Supplement.

Each of the simulation study scripts performs the following tasks:

i) At the start of each script, there are four user-defined inputs

 	- Specify the file path where output results need to be stored. Store each script output in a different sub-folder to avoid overwriting of output files.
	- n: This is the user-defined number of sample data points. For the simulation results in the paper Sikdar et al. this was set at 500, and split into N = 300 (training) and N-test = 200 (testing) sets. For code testing, this can be reduced to n = 100
	- ntree: The number of trees for each MVRF. For the results in the paper Sikdar et al., this was set to 3000. Change this to ntree =10 for code testing.
	- nforest: This is the total number of MVRFs that is run per simulation study. In the paper, this was set to 10. Change this to nforest = 1 for code testing.

NOTE: Each simulation code for the full specification as used in the paper (n = 500, ntree = 3000, nforest = 10) can take more than 24 hours to run.

ii) Each simulation script generates data (multivariate outcome = 4 x 1 vector, with 5 true covariates and 10 spurious covariates) based on the data generating process (DGP) used and runs the nforest specified number of forests. For each nforest iteration, each script computes the 6 different VIMs on the simulated data - VIM count, VIM incidence, VIM mean structure, VIM mean structure with F-test, VIM outcome difference and VIM outcome difference with F-test.

iii) Each script generates the iteration-specific output files (num_forest = 1, 2,..,nforest). These are listed below:

	- num_forest_avg_VIM_count3K: VIM count estimate for the given MVRF iteration

	- num_forest_avg_VIM_ind3K: VIM incidence estimate for the given MVRF iteration

	- num_forest_avg_VIM_SI_oob3K: VIM mean structure estimates (two columns) on training and OOB samples for the given MVRF iteration

	- num_forest_avg_VIM_SIF_oob3K: VIM mean structure estimates (two columns) on training and OOB samples for the given MVRF iteration

	- num_forest_avg_VIM_Ydiff_train_oob3K: VIM outcome difference estimated on a per outcome level (number of columns = dimY1 = 4) on the training set for the given MVRF iteration

	- num_forest_avg_VIM_Ydiff_test_oob3K: VIM outcome difference estimated on a per outcome level (number of columns = dimY1 = 4) on the OOB sample for the given MVRF iteration

	- num_forest_avg_VIM_Ydiff_F_train_oob3K: VIM outcome difference with F-test estimated on a per outcome level (number of columns = dimY1 = 4) on the training set for the given MVRF iteration

	- num_forest_avg_VIM_Ydiff_F_test_oob3K: VIM outcome difference estimate on a per outcome level (number of columns = dimY1 = 4) on the OOB sample for the given MVRF iteration

iv) The post-processing steps are done manually in Excel. The iteration-specific VIM for each definition (count, incidence, mean structure - w/o and w/ F test, outcome difference - w/o and w/ F test) are averaged across the nforest iterations and the variables are rank-ordered based on the average VIM calculated per definition used. 

NOTE: For mean structure - w/o and w/ F test and outcome difference - w/o and w/ F test, two sets of averages are obtained, one for training set and the other for the OOB sample. The variable ranking will vary based on the averages used, as noted in the Tables 1, and 2 in the main manuscript and Tables A.2. and A.3. in the Online Supplement.




