Getting averaged variable importance from bootstrap with the cubist models

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Getting averaged variable importance from bootstrap with the cubist models

Hounkpatin Ozias
Dear All,

I am using a Bootstrapping approach with the Cubist model. It is possible
to get the variable importance (percentages  in variable usage in the
models) after running the models. For each run of the model based on the
random sampling, the variable importance is
different. A robust estimate may be determined by taking the average of all
the percentages of usage for each specific variable involved in the models.
For this purpose, I have (1) saved each model to disk,  (2) read back in R
each model, (2) got the variable importance
 for each model, (3) got a final table with all the models and the
percentages allocated by each model for the variables. the final table I am
expecting should like the table below. It is possible to do  it manually by
calling in each model but the workload is too high when you have 100 number
of bootstraps.

Here below a reproducible example with dataset from the ithir package. I
have limited the bootstrap to 3 for quick run. Also below I have shown how
I could get the table focusing on one model at a time.

Is there anyway to have all the models called in a loop, and extract the
variable importance for each model and arrange them finally
to get the table instead of doing it for each model separately ? I am
actually running the models 100 times.

I will appreciate any help.

Expected table-------------------------------------------------------
Variables VImp cubist.type
MRVBF 62.5 Vimp.cubist1
AACN 72.5 Vimp.cubist1
NDVI 41 Vimp.cubist1
Mid_Slope_Positon 24.5 Vimp.cubist1
Landsat_Band1 24 Vimp.cubist1
Terrain_Ruggedness_Index 44.5 Vimp.cubist1
TWI 42 Vimp.cubist1
Hillshading 41.5 Vimp.cubist1
Slope 41.5 Vimp.cubist1
Light_insolation 16 Vimp.cubist1
Elevation 11.5 Vimp.cubist1
MRVBF 81.5 Vimp.cubist2
Elevation 48.5 Vimp.cubist2
NDVI 62.5 Vimp.cubist2
Mid_Slope_Positon 43 Vimp.cubist2
TWI 54.5 Vimp.cubist2
AACN 41.5 Vimp.cubist2
Landsat_Band1 38.5 Vimp.cubist2
Hillshading 37.5 Vimp.cubist2
Terrain_Ruggedness_Index 27 Vimp.cubist2
Slope 21 Vimp.cubist2
Light_insolation 20.5 Vimp.cubist2
MRVBF 78 Vimp.cubist3
Hillshading 62 Vimp.cubist3
TWI 57 Vimp.cubist3
Terrain_Ruggedness_Index 55.5 Vimp.cubist3
AACN 55 Vimp.cubist3
Light_insolation 38.5 Vimp.cubist3
NDVI 50.5 Vimp.cubist3
Slope 49 Vimp.cubist3
Landsat_Band1 47.5 Vimp.cubist3
Mid_Slope_Positon 45.5 Vimp.cubist3
Elevation 31.5 Vimp.cubist3


## REPRODUCIBLE EXAMPLE

rm(list = ls())

library(Cubist)
library(ithir) # install.packages("ithir", repos="
http://R-Forge.R-project.org <http://r-forge.r-project.org/>")
library(caret)

#Set a working directory
setwd("")

# Point data
data(HV_subsoilpH)

# subset data for modeling
set.seed(123)
training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH))
cDat <- HV_subsoilpH[training, ]
vDat <- HV_subsoilpH[-training, ]

#Create folder to store models
dir.create("models", recursive = TRUE)

# Number of bootstraps
nbag <- 3

# Fit cubist models for each bootstrap
for (i in 1:nbag) {
  trainingREP <- sample.int(nrow(cDat), 1.0 * nrow(cDat),replace = TRUE)
  fit_cubist <- cubist(x = cDat[trainingREP,
                                c("Terrain_Ruggedness_Index",
                                  "AACN", "Landsat_Band1", "Elevation",
"Hillshading",
                                  "Light_insolation", "Mid_Slope_Positon",
"MRVBF", "NDVI",
                                  "TWI", "Slope")],
                       y = cDat$pH60_100cm[trainingREP],
cubistControl(rules = 5,
                                       extrapolation = 5), committees = 3)

  modelFile <-paste(getwd(), "./models/",sep = "", "bootMod_", i,".rds")
  saveRDS(object = fit_cubist, file = modelFile)
}


# List all files in directory
c.models <- list.files(path = paste(getwd(),"./models",
                      sep = ""), pattern = "\\.rds$", full.names = TRUE)

#Reads the models
fit_cubist<-  readRDS(c.models[i])
varImp(fit_cubist) # But it only give variable importance for the final
model

#Alternatively, I can call the model one by one and apply varImp
#Call models one by one
fit_cubist1<-  readRDS("./models/bootMod_1.rds")
fit_cubist2<-  readRDS("./models/bootMod_2.rds")
fit_cubist3<-  readRDS("./models/bootMod_3.rds")

#Apply varImp on each model
VImp1<-varImp(fit_cubist1)
VImp2<-varImp(fit_cubist2)
VImp3<-varImp(fit_cubist3)

#Get a data frame for each variable importance
cub.VImp1<-as.data.frame(VImp1[1])
names(cub.VImp1)[1]<-"VImp"
cub.VImp2<-as.data.frame(VImp2[1])
names(cub.VImp2)[1]<-"VImp"
cub.VImp3<-as.data.frame(VImp3[1])
names(cub.VImp3)[1]<-"VImp"

#Create a column for each variable importance related to each model
cub.VImp1$cubist.type<-"Vimp.cubist1"
cub.VImp2$cubist.type<-"Vimp.cubist2"
cub.VImp3$cubist.type<-"Vimp.cubist3"

#Join the row names to each dataframe
cub.VImp1<-setNames(cbind(rownames(cub.VImp1),
cub.VImp1),c("Variables","VImp","cubist.model"))
cub.VImp2<-setNames(cbind(rownames(cub.VImp2),
cub.VImp2),c("Variables","VImp","cubist.model"))
cub.VImp3<-setNames(cbind(rownames(cub.VImp3),
cub.VImp3),c("Variables","VImp","cubist.model"))

#Bind all the models df together by rows
All.VarImp<-bind_rows(cub.VImp1,cub.VImp2,cub.VImp3)

#Get the average of variable importance for each variable
mean.VarImp <- ddply(All.VarImp, .(Variables), summarise,
               mean = mean(VImp),
               sd   = sd( VImp))
mean.VarImp


Ozias Hounkpatin

Post doc



Sveriges lantbruksuniversitet

Swedish University of Agricultural Sciences



Dept. of Soil and Environment

PO Box 1234, SE-123 45 Uppsala

Visiting address: Lennart Hjems våg 9

Phone: +46 18 67 12 51, Mobile: +46 72 207 85 62

[hidden email] , www.slu.se

Hounkpatin Ozias <[hidden email]>
17:41 (10 minutes ago)
to R-sig-geo
ReplyForward
<https://drive.google.com/u/0/settings/storage?hl=en-GB>
<https://www.google.com/intl/en-GB/policies/terms/>
<https://www.google.com/intl/en-GB/policies/privacy/>
<https://www.google.com/gmail/about/policy/>

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: [FORGED] Getting averaged variable importance from bootstrap with the cubist models

Rolf Turner
On 8/04/19 3:52 AM, Hounkpatin Ozias wrote:

<SNIP>

> rm(list = ls())

<SNIP>

NOOOOOOOOOOOOOOOOOO!!!!  Don't do this to other people!!!  As (I think
it was) Jenny Bryan said at the NZSA Conference in December 2017:  "If
you do this I will come to your office and set fire to your computer!!!"

cheers,

Rolf Turner

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: [FORGED] Getting averaged variable importance from bootstrap with the cubist models

Roman Luštrik
Running reproducible examples in clear sessions should be something that is
not very hard to do, far far away from (semi)production environment.
Also, people might not stand idly while their work computer is being
damaged in a fire-hazardous way.

Cheers,
Roman

On Mon, Apr 8, 2019 at 4:09 AM Rolf Turner <[hidden email]> wrote:

> On 8/04/19 3:52 AM, Hounkpatin Ozias wrote:
>
> <SNIP>
>
> > rm(list = ls())
>
> <SNIP>
>
> NOOOOOOOOOOOOOOOOOO!!!!  Don't do this to other people!!!  As (I think
> it was) Jenny Bryan said at the NZSA Conference in December 2017:  "If
> you do this I will come to your office and set fire to your computer!!!"
>
> cheers,
>
> Rolf Turner
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>


--
In God we trust, all others bring data.

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: [FORGED] Getting averaged variable importance from bootstrap with the cubist models

Hounkpatin Ozias
I am sorry for sections of the code that could adversely affect (e.g.
rm(list = ls())) your computers. Still learning in R. I agree that the code
could have been better presented and will work on it .
Best,
Ozias

On Mon, 8 Apr 2019 at 09:52, Roman Luštrik <[hidden email]> wrote:

> Running reproducible examples in clear sessions should be something that is
> not very hard to do, far far away from (semi)production environment.
> Also, people might not stand idly while their work computer is being
> damaged in a fire-hazardous way.
>
> Cheers,
> Roman
>
> On Mon, Apr 8, 2019 at 4:09 AM Rolf Turner <[hidden email]>
> wrote:
>
> > On 8/04/19 3:52 AM, Hounkpatin Ozias wrote:
> >
> > <SNIP>
> >
> > > rm(list = ls())
> >
> > <SNIP>
> >
> > NOOOOOOOOOOOOOOOOOO!!!!  Don't do this to other people!!!  As (I think
> > it was) Jenny Bryan said at the NZSA Conference in December 2017:  "If
> > you do this I will come to your office and set fire to your computer!!!"
> >
> > cheers,
> >
> > Rolf Turner
> >
> > --
> > Honorary Research Fellow
> > Department of Statistics
> > University of Auckland
> > Phone: +64-9-373-7599 ext. 88276
> >
> > _______________________________________________
> > R-sig-Geo mailing list
> > [hidden email]
> > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> >
>
>
> --
> In God we trust, all others bring data.
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo