Visualizing Bootstrapped Stepwise Regression in R using Plotly

Most of us have used stepwise regression at some point. Stepwise regression is known to be sensitive to its inputs: small changes in the data can lead to a different final model. One way to mitigate this sensitivity is to repeatedly run stepwise regression on bootstrap samples and see which results hold up across them.
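To make the idea concrete, here is a rough hand-rolled sketch of the bootstrap loop. It is not the implementation of any package. It assumes the Boston data that ships with MASS (where chas is stored as a number rather than a factor), crim as the response, backward selection, and only 20 resamples so that it runs quickly; all of these are illustrative choices.

```r
library(MASS)  # provides stepAIC() and the Boston data

set.seed(1)
n_boot <- 20  # kept small for speed; an assumption, not a recommendation

predictors <- setdiff(names(Boston), "crim")
picked <- setNames(numeric(length(predictors)), predictors)

for (b in seq_len(n_boot)) {
  # Resample rows with replacement to get one bootstrap sample
  idx <- sample(nrow(Boston), replace = TRUE)
  boot_data <- Boston[idx, ]

  # Fit the full model on the resample, then let stepAIC() prune it
  full_fit <- lm(crim ~ ., data = boot_data)
  step_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

  # Record which covariates survived in this resample
  kept <- attr(terms(step_fit), "term.labels")
  picked[kept] <- picked[kept] + 1
}

# Percentage of resamples in which each covariate was kept
pct_picked <- round(100 * picked / n_boot, 1)
print(pct_picked)
```

Covariates that are kept in nearly every resample are more likely to be stable choices than ones that come and go.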

R has a nice package called bootStepAIC which (from its description) “Implements a Bootstrap procedure to investigate the variability of model selection under the stepAIC() stepwise algorithm of package MASS.” Its main function is boot.stepAIC().

It returns a lot of information, and keeping track of all of it can get challenging, especially when there are many covariates. In this post we’ll try to come up with a simple visualization that summarizes the output of boot.stepAIC().

Running boot.stepAIC()

Using boot.stepAIC() is fairly simple. Just pass in an already fitted lm/glm model and the associated dataset.

We’ll use the BostonHousing dataset from the mlbench package. More details are available in the mlbench documentation.
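A minimal sketch of the call might look like the following. The choice of crim as the response and the object names fit and fit.boot are assumptions made for illustration. B is the number of bootstrap samples and is set explicitly here.

```r
library(bootStepAIC)  # boot.stepAIC()
library(mlbench)      # BostonHousing data

data("BostonHousing")

# Fit the full linear model first (crim as the response is an assumption)
fit <- lm(crim ~ ., data = BostonHousing)

# Run stepAIC() on B bootstrap samples of the data
fit.boot <- boot.stepAIC(fit, data = BostonHousing, B = 100)

# Printing the object shows the summary tables
fit.boot
```

Expect this to take a little while, since every bootstrap sample triggers a full stepwise search.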

Note that this post has been updated for Plotly 4.0 syntax.

Collecting required information

The output from boot.stepAIC() contains the following. Note that each quantity is reported as a percentage (based on the total number of bootstrapped samples):

  • Number of times a covariate was featured in the final model from stepAIC()
  • Number of times a covariate’s coefficient sign was positive / negative
  • Number of times a covariate was statistically significant (default alpha = 5%)

We’ll collect all of this information first and create data frames so as to make charting easier later on.

Note that in this particular example there is a variable named chas, which is a factor with levels 0 and 1. R names its coefficient chas1 by default, so the names need to be reconciled before the tables can be combined.
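One way to gather these pieces into a single data frame is sketched below. It assumes the Covariates, Sign and Significance components of the returned object, and it assumes that the first column of Sign holds the positive share and the second the negative share; check colnames() on your own output before relying on that. The name coef.df and a smaller B are illustrative choices.

```r
library(bootStepAIC)
library(mlbench)

data("BostonHousing")
fit <- lm(crim ~ ., data = BostonHousing)  # response choice is an assumption
fit.boot <- boot.stepAIC(fit, data = BostonHousing, B = 50)

coef.pick <- fit.boot$Covariates    # how often each covariate was selected
coef.sign <- fit.boot$Sign          # positive / negative coefficient signs
coef.stat <- fit.boot$Significance  # how often each covariate was significant

# The factor chas appears as "chas1" in the coefficient-based tables
rownames(coef.sign) <- sub("^chas1$", "chas", rownames(coef.sign))
rownames(coef.stat) <- sub("^chas1$", "chas", rownames(coef.stat))

# Row order can differ between tables, so align everything by name
covariates <- rownames(coef.pick)
sign.idx <- match(covariates, rownames(coef.sign))
stat.idx <- match(covariates, rownames(coef.stat))

coef.df <- data.frame(
  covariate   = covariates,
  picked      = coef.pick[, 1],
  positive    = coef.sign[sign.idx, 1],  # assumed: column 1 is the "+" share
  negative    = coef.sign[sign.idx, 2],  # assumed: column 2 is the "-" share
  significant = coef.stat[stat.idx, 1],
  row.names   = NULL
)

print(coef.df)
```

Matching by name rather than by position avoids hard-coding the row that chas happens to occupy.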

Plot

Now that we have all the information we need, all that is left is to plot it. The plot is built up in layers:

  • One layer for the number of times a variable was picked up by stepAIC() (barplot)
  • One layer for the positive and negative coefficients (scatter plot using triangles)
  • One layer for the number of times a variable was significant (vertical line chart)
  • Annotation for some other information
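The layers above could be put together roughly as follows, using Plotly 4 syntax. This is a simplified sketch rather than the exact chart: the horizontal layout (covariates on the y-axis, so that the significance line runs vertically), the colours, the trace names and the single annotation are all assumptions, as are crim as the response and the column order of the Sign table.

```r
library(bootStepAIC)
library(mlbench)
library(plotly)

data("BostonHousing")
n_boot <- 50  # fewer resamples than usual, to keep the sketch quick
fit <- lm(crim ~ ., data = BostonHousing)  # response choice is an assumption
fit.boot <- boot.stepAIC(fit, data = BostonHousing, B = n_boot)

# Collect the three tables and align them by covariate name
sel <- fit.boot$Covariates
sgn <- fit.boot$Sign
sig <- fit.boot$Significance
rownames(sgn) <- sub("^chas1$", "chas", rownames(sgn))
rownames(sig) <- sub("^chas1$", "chas", rownames(sig))

vars <- rownames(sel)
plot.df <- data.frame(
  covariate   = factor(vars, levels = rev(vars)),  # keep table order top-down
  picked      = sel[, 1],
  positive    = sgn[match(vars, rownames(sgn)), 1],  # assumed "+" column
  negative    = sgn[match(vars, rownames(sgn)), 2],  # assumed "-" column
  significant = sig[match(vars, rownames(sig)), 1]
)

p <- plot_ly(plot.df, y = ~covariate) %>%
  # Layer 1: how often each covariate was picked (bars)
  add_bars(x = ~picked, orientation = "h", name = "Selected (%)",
           marker = list(color = "lightgrey")) %>%
  # Layer 2: coefficient signs (triangles)
  add_markers(x = ~positive, name = "Positive sign (%)",
              marker = list(symbol = "triangle-up", size = 10)) %>%
  add_markers(x = ~negative, name = "Negative sign (%)",
              marker = list(symbol = "triangle-down", size = 10)) %>%
  # Layer 3: statistical significance (line running down the chart)
  add_trace(x = ~significant, type = "scatter", mode = "lines",
            name = "Significant (%)") %>%
  # Layer 4: axis titles and a short annotation
  layout(
    xaxis = list(title = "Percentage"),
    yaxis = list(title = ""),
    annotations = list(
      x = 1, y = 1.08, xref = "paper", yref = "paper", showarrow = FALSE,
      text = paste("Bootstrap samples:", n_boot)
    )
  )

# Type p at the console (or call print(p)) to render the chart
```

Putting all layers on a shared percentage axis makes it easy to compare, for each covariate, how often it was selected, how consistent its sign was, and how often it was significant.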