spatialreg::predict.sarlm

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

spatialreg::predict.sarlm

Weldensie Embaye
I am trying to run  out-of-sample prediction
using  spatialreg::predict.sarlm package. I tried to prepare my data set as
per the requirements.However, it did not work.

Below is the code.
 csv_train <- read.csv("train.csv") #upload your csv. R needs to know your
region ID
> hhid <- csv_train$hhid #Explicitly created a variable to hold the regions
> nb_train <- read.gwt2nb('train.gwt', region.id=hhid)
> Q_train<-nb2listw(nb_train)
> csv_test <- read.csv("test.csv") #upload your csv. R needs to know your
region ID
> hhid <- csv_test$hhid #Explicitly created a variable to hold the regions
> nb_test <- read.gwt2nb('test.gwt', region.id=hhid)
> Q_test<-nb2listw(nb_test)
> y_test <- dataTable$annualrent[test]
> ######## spatial lag process model
> GM5<-spatialreg::lagsarlm(formula, data=traindata, listw=Q_train)
> summary(GM5)

Call:spatialreg::lagsarlm(formula = formula, data = traindata, listw =
Q_train)

Residuals:
      Min        1Q    Median        3Q       Max
-54.38677  -7.32082  -0.78772   5.69047  90.56889

Type: lag
Coefficients: (asymptotic standard errors)
             Estimate Std. Error z value  Pr(>|z|)
(Intercept) -20.63671    4.08382 -5.0533 4.343e-07
coveredpri   -4.04663    4.71961 -0.8574  0.391219
coveredsha   -3.70054    2.57827 -1.4353  0.151207
vipprivate   -3.19640    4.57774 -0.6982  0.485022
vipshare     -2.26742    2.30487 -0.9838  0.325238
unoveredla   -1.29846    3.62671 -0.3580  0.720322
flushpriva   10.22320    3.31060  3.0880  0.002015
electricit   11.25641    2.07508  5.4246 5.810e-08
privatetap    6.48159    2.63420  2.4606  0.013872
publictap    -2.91930    3.46341 -0.8429  0.399285
watertanke   -5.52987    7.58483 -0.7291  0.465959
protectewe   -5.44122    5.17858 -1.0507  0.293389
river        -1.76395    3.57625 -0.4932  0.621843
numberofro   12.47265    0.94757 13.1628 < 2.2e-16
roof         -2.83454    3.62586 -0.7818  0.434357
externalwa   10.19840    1.92089  5.3092 1.101e-07
floor         3.81212    2.55901  1.4897  0.136307

Rho: 0.3742, LR test value: 4.3541, p-value: 0.03692
Asymptotic standard error: 0.16111
    z-value: 2.3227, p-value: 0.020197
Wald statistic: 5.3948, p-value: 0.020197

Log likelihood: -1842.303 for lag model
ML residual variance (sigma squared): 273.25, (sigma: 16.53)
Number of observations: 436
Number of parameters estimated: 19
AIC: 3722.6, (AIC for lm: 3725)
LM test for residual autocorrelation
test value: 1.3748, p-value: 0.24099

> GM5_predict <- spatialreg::predict.sarlm(GM5, newdata = testdata, listw =
Q_test)
Error in spatialreg::predict.sarlm(GM5, newdata = testdata, listw = Q_test)
:
  mismatch between newdata and spatial weights. newdata should have
region.id as row.names
>

Any idea, what is going on?

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: spatialreg::predict.sarlm

Roger Bivand
Administrator
On Fri, 15 May 2020, Weldensie Embaye wrote:

> I am trying to run out-of-sample prediction using
> spatialreg::predict.sarlm package. I tried to prepare my data set as per
> the requirements.However, it did not work.
>
> Below is the code.

This is your code, but it cannot be reproduced. Please provide a fully
reproducible example that has available data, and which others can help
you to debug. If you can make the input data and weights files available
on a link, that might be easier than trying to use built-in data,
especially as you are loading weights generated outside the workflow.

A further problem is that prediction needs both data sets and weights,
because predictions for test observations are autocorrelated with training
set observations, linked through the weights links.

This leads to the problematic status of train/test on spatial data,
because ever-present autocorrelation (or entity misspecification) leaks
between training and test sets (which are then no longer independent).
There are steps that can be taken to handle this (see for point support
the blockCV package), not so far there is no such literature (I think) for
spatial econometrics models.

Please also post plain text, not HTML, to make copying and pasting code
simpler.

Hope this helps,

Roger

> csv_train <- read.csv("train.csv") #upload your csv. R needs to know your
> region ID
>> hhid <- csv_train$hhid #Explicitly created a variable to hold the regions
>> nb_train <- read.gwt2nb('train.gwt', region.id=hhid)
>> Q_train<-nb2listw(nb_train)
>> csv_test <- read.csv("test.csv") #upload your csv. R needs to know your
> region ID
>> hhid <- csv_test$hhid #Explicitly created a variable to hold the regions
>> nb_test <- read.gwt2nb('test.gwt', region.id=hhid)
>> Q_test<-nb2listw(nb_test)
>> y_test <- dataTable$annualrent[test]
>> ######## spatial lag process model
>> GM5<-spatialreg::lagsarlm(formula, data=traindata, listw=Q_train)
>> summary(GM5)
>
> Call:spatialreg::lagsarlm(formula = formula, data = traindata, listw =
> Q_train)
>
> Residuals:
>      Min        1Q    Median        3Q       Max
> -54.38677  -7.32082  -0.78772   5.69047  90.56889
>
> Type: lag
> Coefficients: (asymptotic standard errors)
>             Estimate Std. Error z value  Pr(>|z|)
> (Intercept) -20.63671    4.08382 -5.0533 4.343e-07
> coveredpri   -4.04663    4.71961 -0.8574  0.391219
> coveredsha   -3.70054    2.57827 -1.4353  0.151207
> vipprivate   -3.19640    4.57774 -0.6982  0.485022
> vipshare     -2.26742    2.30487 -0.9838  0.325238
> unoveredla   -1.29846    3.62671 -0.3580  0.720322
> flushpriva   10.22320    3.31060  3.0880  0.002015
> electricit   11.25641    2.07508  5.4246 5.810e-08
> privatetap    6.48159    2.63420  2.4606  0.013872
> publictap    -2.91930    3.46341 -0.8429  0.399285
> watertanke   -5.52987    7.58483 -0.7291  0.465959
> protectewe   -5.44122    5.17858 -1.0507  0.293389
> river        -1.76395    3.57625 -0.4932  0.621843
> numberofro   12.47265    0.94757 13.1628 < 2.2e-16
> roof         -2.83454    3.62586 -0.7818  0.434357
> externalwa   10.19840    1.92089  5.3092 1.101e-07
> floor         3.81212    2.55901  1.4897  0.136307
>
> Rho: 0.3742, LR test value: 4.3541, p-value: 0.03692
> Asymptotic standard error: 0.16111
>    z-value: 2.3227, p-value: 0.020197
> Wald statistic: 5.3948, p-value: 0.020197
>
> Log likelihood: -1842.303 for lag model
> ML residual variance (sigma squared): 273.25, (sigma: 16.53)
> Number of observations: 436
> Number of parameters estimated: 19
> AIC: 3722.6, (AIC for lm: 3725)
> LM test for residual autocorrelation
> test value: 1.3748, p-value: 0.24099
>
>> GM5_predict <- spatialreg::predict.sarlm(GM5, newdata = testdata, listw =
> Q_test)
> Error in spatialreg::predict.sarlm(GM5, newdata = testdata, listw = Q_test)
> :
>  mismatch between newdata and spatial weights. newdata should have
> region.id as row.names
>>
>
> Any idea, what is going on?
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand
Department of Economics
Norwegian School of Economics
Helleveien 30
N-5045 Bergen, Norway
Reply | Threaded
Open this post in threaded view
|

Re: spatialreg::predict.sarlm

Roger Bivand
Administrator
On Fri, 15 May 2020, Roger Bivand wrote:

> On Fri, 15 May 2020, Weldensie Embaye wrote:
>
>>  I am trying to run out-of-sample prediction using
>>  spatialreg::predict.sarlm package. I tried to prepare my data set as per
>>  the requirements.However, it did not work.
>>
>>  Below is the code.
>
> This is your code, but it cannot be reproduced. Please provide a fully
> reproducible example that has available data, and which others can help you
> to debug. If you can make the input data and weights files available on a
> link, that might be easier than trying to use built-in data, especially as
> you are loading weights generated outside the workflow.
>
> A further problem is that prediction needs both data sets and weights,
> because predictions for test observations are autocorrelated with training
> set observations, linked through the weights links.
>
> This leads to the problematic status of train/test on spatial data, because
> ever-present autocorrelation (or entity misspecification) leaks between
> training and test sets (which are then no longer independent). There are
> steps that can be taken to handle this (see for point support the blockCV
> package), not so far there is no such literature (I think) for spatial
> econometrics models.
>
> Please also post plain text, not HTML, to make copying and pasting code
> simpler.

Although the OP has only replied off list, and I do not follow up off-list
replies, here is a reproducible example also using the blockCV package:

data(house, package="spData")
library(sf)
house <- st_as_sf(house)
pts <- list(x = c(506449.047777553, 520341.830738478, 519078.850469303,
  506133.302710259), y = c(217578.479394771, 217368.535168049,
  226606.081143817, 225871.27635029)) # digitized subset using locator()
subset <- st_sfc(st_cast(st_linestring(do.call("cbind", pts), dim=2),
  "POLYGON"), crs=st_crs(house))
house_1 <- st_intersection(house, st_buffer(subset, dist=-1500))
library(blockCV)
sB <- spatialBlock(house_1, rows=3, cols=3, showBlocks=FALSE,
  progress=FALSE, verbose=FALSE)
crds <- st_coordinates(house_1)
plot(crds[sB$folds[[1]][[1]],])
points(crds[sB$folds[[1]][[2]],], col="red")
house_1$id <- 1:nrow(house_1)
row.names(house_1) <- as.character(house_1$id)
train <- house_1[sB$folds[[1]][[1]],]
test <- house_1[sB$folds[[1]][[2]],]
nb <- spdep::knn2nb(spdep::knearneigh(house_1, k=6), row.names=house_1$id,
  sym=TRUE) # the nb IDs should match the data IDs
nb_train <- subset(nb, 1:nrow(house_1) %in% sB$folds[[1]][[1]])
fm <- log(price) ~ TLA + frontage + rooms + beds
library(spatialreg)
GM5 <- lagsarlm(fm, data=train, listw=spdep::nb2listw(nb_train),
  method="Matrix") # large object, do not use default method
preds <- predict(GM5, newdata=test, listw=spdep::nb2listw(nb),
  pred.type="TS")
GM5a <- lagsarlm(fm, data=train, listw=spdep::nb2listw(nb_train),
  method="Matrix", Durbin=TRUE)
predsa <- predict(GM5a, newdata=test, listw=spdep::nb2listw(nb),
  pred.type="TS", legacy.mixed=TRUE)
cor(log(test$price), predsa)
cor(log(test$price), preds)

This gives one fold of a blockCV approach. spatialreg::predict.sarlm()
matches the IDs in the newdata object to the rows and columns in the
weights matrix, and out-of-sample prediction must be given the full set of
weights, because members of the test set are neighbours of training set
members. Using blockCV reduces the spillover between training and test
sets.

Please follow up here, especially if training/test using spatial
econometrics estimation methods is of interest.

Hope this clarifies,

Roger


>
> Hope this helps,
>
> Roger
>
>>  csv_train <- read.csv("train.csv") #upload your csv. R needs to know your
>>  region ID
>>>  hhid <- csv_train$hhid #Explicitly created a variable to hold the
>>>  regions
>>>  nb_train <- read.gwt2nb('train.gwt', region.id=hhid)
>>>  Q_train<-nb2listw(nb_train)
>>>  csv_test <- read.csv("test.csv") #upload your csv. R needs to know your
>>  region ID
>>>  hhid <- csv_test$hhid #Explicitly created a variable to hold the regions
>>>  nb_test <- read.gwt2nb('test.gwt', region.id=hhid)
>>>  Q_test<-nb2listw(nb_test)
>>>  y_test <- dataTable$annualrent[test]
>>>  ######## spatial lag process model
>>>  GM5<-spatialreg::lagsarlm(formula, data=traindata, listw=Q_train)
>>>  summary(GM5)
>>
>>  Call:spatialreg::lagsarlm(formula = formula, data = traindata, listw =
>>  Q_train)
>>
>>  Residuals:
>>       Min        1Q    Median        3Q       Max
>>  -54.38677  -7.32082  -0.78772   5.69047  90.56889
>>
>>  Type: lag
>>  Coefficients: (asymptotic standard errors)
>>              Estimate Std. Error z value  Pr(>|z|)
>>  (Intercept) -20.63671    4.08382 -5.0533 4.343e-07
>>  coveredpri   -4.04663    4.71961 -0.8574  0.391219
>>  coveredsha   -3.70054    2.57827 -1.4353  0.151207
>>  vipprivate   -3.19640    4.57774 -0.6982  0.485022
>>  vipshare     -2.26742    2.30487 -0.9838  0.325238
>>  unoveredla   -1.29846    3.62671 -0.3580  0.720322
>>  flushpriva   10.22320    3.31060  3.0880  0.002015
>>  electricit   11.25641    2.07508  5.4246 5.810e-08
>>  privatetap    6.48159    2.63420  2.4606  0.013872
>>  publictap    -2.91930    3.46341 -0.8429  0.399285
>>  watertanke   -5.52987    7.58483 -0.7291  0.465959
>>  protectewe   -5.44122    5.17858 -1.0507  0.293389
>>  river        -1.76395    3.57625 -0.4932  0.621843
>>  numberofro   12.47265    0.94757 13.1628 < 2.2e-16
>>  roof         -2.83454    3.62586 -0.7818  0.434357
>>  externalwa   10.19840    1.92089  5.3092 1.101e-07
>>  floor         3.81212    2.55901  1.4897  0.136307
>>
>>  Rho: 0.3742, LR test value: 4.3541, p-value: 0.03692
>>  Asymptotic standard error: 0.16111
>>     z-value: 2.3227, p-value: 0.020197
>>  Wald statistic: 5.3948, p-value: 0.020197
>>
>>  Log likelihood: -1842.303 for lag model
>>  ML residual variance (sigma squared): 273.25, (sigma: 16.53)
>>  Number of observations: 436
>>  Number of parameters estimated: 19
>>  AIC: 3722.6, (AIC for lm: 3725)
>>  LM test for residual autocorrelation
>>  test value: 1.3748, p-value: 0.24099
>>
>>>  GM5_predict <- spatialreg::predict.sarlm(GM5, newdata = testdata, listw
>>>  =
>>  Q_test)
>>  Error in spatialreg::predict.sarlm(GM5, newdata = testdata, listw =
>>  Q_test)
>>  :
>>  mismatch between newdata and spatial weights. newdata should have
>>  region.id as row.names
>>>
>>
>>  Any idea, what is going on?
>>
>>   [[alternative HTML version deleted]]
>>
>>  _______________________________________________
>>  R-sig-Geo mailing list
>>  [hidden email]
>>  https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand
Department of Economics
Norwegian School of Economics
Helleveien 30
N-5045 Bergen, Norway