Spatial Autocorrelation Estimation Method

Robert R
I have a large pooled cross-section data set.

I would like to estimate a regression using spatial autocorrelation methods. I am assuming for now that spatial dependence is present in both the dependent variable and the error term.

My data set covers a period of 4 years of monthly data (54 periods). For this purpose, I have created a time dummy variable for each time period.

I also created a weight matrix using the functions "poly2nb" and "nb2listw".

Now I am trying to figure out a way to estimate my model on a really big data set.

Basically, my model is as follows: y = γD + ρW1y + Xβ + u, with u = λW2u + ε

My questions are:

1) My spatial weight matrix for the whole data set will probably be an enormous matrix with a submatrix for each time period. I don't think it would be possible to calculate this.
What I would like to know is how to estimate each time period separately (to compare different periods on their own). How can I do this?

2) Which package should I use: spdep or splm?

Thank you and best regards,
Robert


_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Re: Spatial Autocorrelation Estimation Method

Roger Bivand
On Tue, 5 Nov 2019, Robert R wrote:

> I have a large pooled cross-section data set. I would like to estimate
> a regression using spatial autocorrelation methods. I am assuming for
> now that spatial dependence is present in both the dependent variable
> and the error term. My data set covers a period of 4 years of monthly
> data (54 periods). For this purpose, I have created a time dummy
> variable for each time period. I also created a weight matrix using the
> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
> to estimate my model on a really big data set. Basically, my model is
> as follows: y = γD + ρW1y + Xβ + u, with u = λW2u + ε
>
> My questions are:
>
> 1) My spatial weight matrix for the whole data set will probably be an
> enormous matrix with a submatrix for each time period. I don't think it
> would be possible to calculate this. What I would like to know is how
> to estimate each time period separately (to compare different periods
> on their own). How can I do this?
>
> 2) Which package should I use: spdep or splm?
>
> Thank you and best regards,
> Robert

Please do not post HTML, only plain text. Almost certainly your model
specification is wrong (SARAR/SAC is always a bad idea if alternatives are
untried). What is your cross-sectional size? Using sparse Kronecker
products, the "enormous" matrix may not be very big. Does it make any
sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
of the covariates time-varying? Please provide motivation and use area,
preferably with affiliation (your email and user name are not
informative). This feels like a real estate problem, probably wrongly
specified. You should use splm if time makes sense in your case, but if it
really doesn't, simplify your approach, as much of the data will be
subject to very large temporal autocorrelation.

If this is a continuation of your previous question about using
self-neighbours, be aware that you should not use self-neighbours in
modelling, they are only useful for the Getis-Ord local G_i^* measure.

Roger
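
The remark about sparse Kronecker products can be illustrated with a minimal sketch. It assumes a balanced panel in which the same n spatial units appear in every period, which may not hold for the data discussed here. The sizes (177 areas, 54 periods) are figures mentioned in this thread; the ring-shaped neighbour pattern and the object names are purely hypothetical, and the Matrix package is used for the sparse storage.

```r
library(Matrix)

n_units   <- 177  # number of zip code areas mentioned in the thread
n_periods <- 54   # number of monthly periods mentioned in the thread

# Hypothetical row-standardised weights: each unit has two neighbours on a ring
i <- seq_len(n_units)
W <- sparseMatrix(
  i = c(i, i),
  j = c(i %% n_units + 1, (i - 2) %% n_units + 1),
  x = rep(0.5, 2 * n_units),
  dims = c(n_units, n_units)
)

# Block-diagonal space-time weights: one copy of W per period, no links across periods
W_big <- kronecker(Diagonal(n_periods), W)

dim(W_big)     # number of rows and columns of the stacked matrix
nnzero(W_big)  # number of entries that are actually stored
```

Only the non-zero weights are stored, so the stacked matrix grows with the number of neighbour links times the number of periods rather than with the square of the number of rows.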


--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Thank you for your reply. I disabled HTML; my e-mails should now be in plain text.

I will give a better context for my desired outcome.

I am taking Airbnb's listings information for New York City available on: http://insideairbnb.com/get-the-data.html

I save every listings.csv.gz file available for NYC (2015-01 to 2019-09) - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a Listings/ folder. When importing all these 54 files into one single data set, I create a new "date_compiled" variable/column.

In total, after the data cleansing process, I have a little more than 2 million observations.

I created one time dummy variable for each of the 54 time periods available.

Using a hedonic spatial time dummy model, I want to estimate the impact of a variety of characteristics that potentially determine the daily rate of Airbnb listings in New York City over time (e.g. characteristics of the listing such as the number of bedrooms, whether the host is professional, proximity to downtown (New York City Hall) and to the nearest subway station, income per capita, etc.).

My dependent variable is price (log price, as is common in the hedonic price literature).

The OLS model is done.

For the spatial model, I am assuming that hosts, when deciding on the pricing of their listings, take into account not only their structural and location characteristics, but also the prices charged by nearby listings with similar characteristics. Spatial autocorrelation is then present; at the least, spatial dependence is present in the dependent variable.

As I wrote in my previous post, I was willing to consider each unit as its own neighbor.

Parts of my code can be found below:

########

## packages

packages_install <- function(packages){
  new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(new.packages))
    install.packages(new.packages, dependencies = TRUE)
  sapply(packages, require, character.only = TRUE)
}

packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
packages_install(packages_required)

# Working directory
setwd("C:/Users/User/R")



## shapefile_us

# Shapefile zips import and Coordinate Reference System (CRS) transformation
# Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")

# Columns removal
shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))

# Column rename: ZCTA5CE10
setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))

# Column class change: zipcode
shapefile_us$zipcode <- as.character(shapefile_us$zipcode)



## polygon_nyc

# Zip code not available in shapefile: 11695
polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)



## weight_matrix

# Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)

# Include neighbour itself as a neighbour
# for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
polygon_nyc_nb <- include.self(polygon_nyc_nb)

# Weights to each neighboring polygon
lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)



## listings

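# Data import
# Anchor the pattern so that only files ending in ".csv" are matched
files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
# SIMPLIFY = FALSE keeps one data frame per file instead of letting mapply() try to build a matrix
listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
listings <- listings %>% bind_rows

# Characters removal
# fixed = TRUE matches the literal strings rather than treating "." as a regular-expression wildcard
listings$date_compiled <- gsub("Listings/", "", listings$date_compiled, fixed = TRUE)
listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed = TRUE)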
listings$price <- gsub("\\$", "", listings$price)
listings$price <- gsub(",", "", listings$price)



## timedummy

timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
timedummy <- paste(timedummy, sep = "", collapse = " + ")
timedummy <- gsub("-", "_", timedummy)



## OLS regression

# Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")

########

Some of my ids repeat in multiple time periods.

I use NYC's zip codes to left join my data with zip-code-specific neighborhood characteristics, such as income per capita for that zip code.

Now I want to apply the hedonic model with the time dummy variables.

Do you know how to proceed? 1) Which package should I use (spdep or splm)? 2) Do I have to join polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?

Again, thank you very much for the help provided so far.

Best regards,
Robert
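
As a side note on the time dummies built by hand in the code above: R's formula interface can generate them from a single column. The following is a minimal sketch on hypothetical toy data (the values and the "bedrooms" covariate are made up for illustration; "log_price" and "date_compiled" follow the names used in the message). It uses lm() rather than plm() only to keep the example self-contained.

```r
set.seed(1)

# Hypothetical toy data: 12 listings observed in 3 monthly snapshots
listings <- data.frame(
  log_price     = rnorm(12, mean = 4.5, sd = 0.3),
  bedrooms      = rep(c(1, 2, 3), times = 4),
  date_compiled = rep(c("2015-01-01", "2015-02-01", "2015-03-01"), each = 4)
)

# factor() expands into one dummy per period, dropping the first as the reference level
fit <- lm(log_price ~ bedrooms + factor(date_compiled), data = listings)
names(coef(fit))
```

With this approach the reference period is omitted automatically, which avoids perfect collinearity between a full set of period dummies and the intercept.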


Re: Spatial Autocorrelation Estimation Method

Roger Bivand
On Tue, 5 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your reply. I disabled HTML; my e-mails should now be in
> plain text.
>
> I will give a better context for my desired outcome.
>
> I am taking Airbnb's listings information for New York City available
> on: http://insideairbnb.com/get-the-data.html
>
> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
> Listings/ folder. When importing all these 54 files into one single data
> set, I create a new "date_compiled" variable/column.
>
> In total, after the data cleansing process, I have a little more than 2
> million observations.

You have repeat lettings for some, but not all properties. So this is at
best a very unbalanced panel. For those properties with repeats, you may
see temporal movement (trend/seasonal).

I suggest (strongly) taking a single borough or even zipcode with some
hundreds of properties, and working from there. Do not include the
observation as its own neighbour; perhaps identify repeats and handle them
specially (create or use a property ID). Unbalanced panels may also create
a selection bias issue (why are some properties only listed sometimes?).

So this, although promising, isn't simple, and getting to a hedonic model
may be hard, but not (just) because of spatial autocorrelation. I wouldn't
necessarily trust OLS output either, partly because of the repeat property
issue.

Roger
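
The first steps suggested above (restrict to one zip code, then identify repeated properties) might look roughly like the following base R sketch. The toy data are hypothetical; only the column names "id", "zipcode" and "date_compiled" are taken from the thread, and "is_repeat" is a made-up helper column.

```r
# Hypothetical toy data using the column names from the thread
listings <- data.frame(
  id            = c(1, 1, 2, 3, 3, 3, 4),
  zipcode       = c("10001", "10001", "10001", "10002", "10002", "10002", "10001"),
  date_compiled = c("2015-01-01", "2015-02-01", "2015-01-01",
                    "2015-01-01", "2015-02-01", "2015-03-01", "2015-03-01"),
  stringsAsFactors = FALSE
)

# Work with a single zip code first
one_zip <- listings[listings$zipcode == "10001", ]

# Count how often each property id occurs and flag the repeats
n_obs   <- table(one_zip$id)
repeats <- names(n_obs)[n_obs > 1]
one_zip$is_repeat <- as.character(one_zip$id) %in% repeats

# How many properties appear once, twice, ... (shows how unbalanced the panel is)
table(n_obs)
```

How the flagged repeats are then handled (for example keeping one observation per property, or modelling the property as a panel unit) is a modelling decision that this sketch does not settle.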


Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Many thanks for your help.

I have an additional question:

Is it possible to create a "separate" lw (nb2listw) (with a different number of rows) from my data set? For now, I am taking my data set and merging it with the sf object polygon_nyc using "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create a huge n x n matrix (depending on the size of my data set).

Taking polygon_nyc alone and turning it into an lw (weights list) object gives only n = 177.

Of course running

spatialreg::lagsarlm(formula=model, data = listings_sample, polygon_nyc_lw, tol.solve=1.0e-10)

does not work ("Input data and weights have different dimensions").

Is the only option to take my data set, merge it with polygon_nyc (by zipcode) and then create the weights list lw? Or is there another option?

Best regards,
Robert

________________________________________
From: Roger Bivand <[hidden email]>
Sent: Wednesday, November 6, 2019 15:07
To: Robert R
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

On Tue, 5 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your reply. I disabled HTML; my e-mails should be now in
> plain text.
>
> I will give a better context for my desired outcome.
>
> I am taking Airbnb's listings information for New York City available
> on: http://insideairbnb.com/get-the-data.html
>
> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
> Listings/ folder. When importing all these 54 files into one single data
> set, I create a new "date_compiled" variable/column.
>
> In total, after the data cleansing process, I have a little more 2
> million observations.

You have repeat lettings for some, but not all properties. So this is at
best a very unbalanced panel. For those properties with repeats, you may
see temporal movement (trend/seasonal).

I suggest (strongly) taking a single borough or even zipcode with some
hindreds of properties, and working from there. Do not include the
observation as its own neighbour, perhaps identify repeats and handle them
specially (create or use a property ID). Unbalanced panels may also create
a selection bias issue (why are some properties only listed sometimes?).

So this although promising isn't simple, and getting to a hedonic model
may be hard, but not (just) because of spatial autocorrelation. I wouldn't
necessarily trust OLS output either, partly because of the repeat property
issue.

Roger

>
> I created 54 timedummy variables for each time period available.
>
> I want to estimate using a hedonic spatial timedummy model the impact of
> a variety of characteristics which potentially determine the daily rate
> on Airbnb listings through time in New York City (e.g. characteristics
> of the listing as number of bedrooms, if the host if professional,
> proximity to downtown (New York City Hall) and nearest subway station
> from the listing, income per capita, etc.).
>
> My dependent variable is price (log price, common in the related
> literature for hedonic prices).
>
> The OLS model is done.
>
> For the spatial model, I am assuming that hosts, when deciding the
> pricing of their listings, take not only into account its structural and
> location characteristics, but also the prices charged by near listings
> with similar characteristics - spatial autocorrelation is then present,
> at least spatial dependence is present in the dependent variable.
>
> As I wrote in my previous post, I was willing to consider the neighbor
> itself as a neighbor.
>
> Parts of my code can be found below:
>
> ########
>
> ## packages
>
> packages_install <- function(packages){
> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
> if (length(new.packages))
> install.packages(new.packages, dependencies = TRUE)
> sapply(packages, require, character.only = TRUE)
> }
>
> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
> packages_install(packages_required)
>
> # Working directory
> setwd("C:/Users/User/R")
>
>
>
> ## shapefile_us
>
> # Shapefile zips import and Coordinate Reference System (CRS) transformation
> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>
> # Columns removal
> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>
> # Column rename: ZCTA5CE10
> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>
> # Column class change: zipcode
> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>
>
>
> ## polygon_nyc
>
> # Zip code not available in shapefile: 11695
> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>
>
>
> ## weight_matrix
>
> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>
> # Include neighbour itself as a neighbour
> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>
> # Weights to each neighboring polygon
> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>
>
>
> ## listings
>
> # Data import
> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE)
> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
> listings <- mapply(cbind, listings, date_compiled = names(listings))
> listings <- listings %>% bind_rows
>
> # Characters removal
> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
> listings$price <- gsub("\\$", "", listings$price)
> listings$price <- gsub(",", "", listings$price)
>
>
>
> ## timedummy
>
> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
> timedummy <- paste(timedummy, sep = "", collapse = " + ")
> timedummy <- gsub("-", "_", timedummy)
>
>
>
> ## OLS regression
>
> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>
> ########
>
> Some of my id's repeat in multiple time periods.
>
> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>
> Now I want to apply the hedonic model with the timedummy variables.
>
> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>
> Again, thank you very much for the help provided until now.
>
> Best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Tuesday, November 5, 2019 15:30
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Tue, 5 Nov 2019, Robert R wrote:
>
>> I have a large pooled cross-section data set. ​I would like to
>> estimate/regress using spatial autocorrelation methods. I am assuming
>> for now that spatial dependence is present in both the dependent
>> variable and the error term.​ ​My data set is over a period of 4 years,
>> monthly data (54 periods). For this means, I've created a time dummy
>> variable for each time period.​ ​I also created a weight matrix using the
>> functions "poly2nb" and "nb2listw".​ ​Now I am trying to figure out a way
>> to estimate my model which contains a really big data set.​ ​Basically, my
>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε​ ​My questions are:​ ​1)
>> My spatial weight matrix for the whole data set will be probably a
>> enormous matrix with submatrices for each time period itself. I don't
>> think it would be possible to calculate this.​ What I would like to know
>> is a way to estimate each time dummy/period separately (to compare
>> different periods alone). How to do it?​ ​2) Which package to use: spdep
>> or splm?​ ​Thank you and best regards,​ Robert​
>
> Please do not post HTML, only plain text. Almost certainly your model
> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
> untried). What is your cross-sectional size? Using sparse kronecker
> products, the "enormous" matrix may not be very big. Does it make any
> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
> of the covariates time-varying? Please provide motivation and use area
> (preferably with affiliation (your email and user name are not
> informative) - this feels like a real estate problem, probably wrongly
> specified. You should use splm if time make sense in your case, but if it
> really doesn't, simplify your approach, as much of the data will be
> subject to very large temporal autocorrelation.
>
> If this is a continuation of your previous question about using
> self-neighbours, be aware that you should not use self-neighbours in
> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>
> Roger
>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
On Thu, 7 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Many thanks for your help.
>
> I have an additional question:
>
> Is it possible to create a "separate" lw (nb2listw) (with a different
> number of rows) from my data set? For now, I am taking my data set and
> merging it with the sf object polygon_nyc using
> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
> a huge n x n matrix (depending on the size of my data set).
>
> Taking polygon_nyc alone and turning it into a lw (weights list)
> object gives only n = 177.
>
> Of course running
>
> spatialreg::lagsarlm(formula=model, data = listings_sample,
> listw = polygon_nyc_lw, tol.solve=1.0e-10)
>
> does not work ("Input data and weights have different dimensions").
>
> Is the only option to take my data set, merge it with polygon_nyc (by
> zipcode) and then create the weights list lw? Or is there another option?

I think we are getting more clarity. You do not know the location of the
lettings beyond their zipcode. You do know the boundaries of the zipcode
areas, and can create a neighbour object from these boundaries. You then
want to treat all the lettings in a zipcode area i as neighbours, and
additionally lettings in zipcode areas neighbouring i as neighbours of
lettings in i. This is the data structure that motivated the
spdep::nb2blocknb() function:

https://r-spatial.github.io/spdep/reference/nb2blocknb.html

Try running the examples to get a feel for what is going on.
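
A minimal sketch of that data structure, assuming a toy 2 x 2 grid built with spdep::cell2nb() in place of the real zipcode polygons, and a made-up vector zip_of_listing holding the zone of each letting (both are illustrative and not taken from the thread):

```r
library(spdep)

# Toy stand-in for the zipcode polygons: a 2 x 2 rook-contiguity grid.
zip_nb <- cell2nb(2, 2, type = "rook")
zip_ids <- attr(zip_nb, "region.id")

# Hypothetical lettings: three per zone, located only by their zone.
zip_of_listing <- rep(zip_ids, each = 3)

# Lettings in the same zone, or in contiguous zones, become neighbours.
block_nb <- nb2blocknb(zip_nb, as.character(zip_of_listing))

# Row-standardised weights with one row per letting, not per zone.
block_lw <- nb2listw(block_nb, style = "W")

length(zip_nb)           # number of zones
length(block_nb)         # number of lettings
summary(card(block_nb))  # neighbours per letting
```

The weights object now has the same length as the letting-level data, which is what the model-fitting functions check for. The neighbour counts from card() grow with the number of lettings per zone.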

I feel that most of the variability will vanish in the very large numbers
of neighbours, over-smoothing the outcomes. If you do not have locations
for the lettings themselves, I don't think you can make much progress.

You could try a linear mixed model (or gam with a spatially structured
random effect) with a temporal and a spatial random effect. See the HSAR
package, articles by Dong et al., and maybe
https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
this nor Dong et al. handle spatio-temporal settings. MRF (Markov random
field) spatial random effects at the zipcode level might be a way forward,
together with an IID random effect at the same level (equivalent to
self-neighbours).
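
A hedged sketch of the two zone-level random-effect structures, using mgcv as one possible tool (the thread does not name a package for the gam). The data are simulated on a toy 3 x 3 lattice; the names zipcode, bedrooms and log_price and all numbers are invented for illustration, and the temporal effect is left out:

```r
library(mgcv)
set.seed(1)

# Toy 3 x 3 lattice of zones with rook neighbours, stored as a named list
# of neighbour indices, the form the "mrf" smooth takes in xt = list(nb = ...).
side <- 3
zones <- paste0("z", seq_len(side^2))
rc <- expand.grid(row = seq_len(side), col = seq_len(side))
nb <- lapply(seq_len(nrow(rc)), function(i) {
  which(abs(rc$row - rc$row[i]) + abs(rc$col - rc$col[i]) == 1)
})
names(nb) <- zones

# Simulated lettings: price depends on bedrooms plus a smooth zone effect.
n <- 900
d <- data.frame(
  zipcode = factor(sample(zones, n, replace = TRUE), levels = zones),
  bedrooms = sample(1:4, n, replace = TRUE)
)
zone_effect <- 0.1 * (rc$row + rc$col)
d$log_price <- 4 + 0.3 * d$bedrooms +
  zone_effect[as.integer(d$zipcode)] + rnorm(n, sd = 0.3)

# Step 1: unstructured (IID) random effect per zone.
m_iid <- gam(log_price ~ bedrooms + s(zipcode, bs = "re"),
             data = d, method = "REML")

# Step 2: spatially structured (MRF) effect using the zone adjacency.
m_mrf <- gam(log_price ~ bedrooms + s(zipcode, bs = "mrf", xt = list(nb = nb)),
             data = d, method = "REML")

coef(m_iid)["bedrooms"]
coef(m_mrf)["bedrooms"]
```

Both fits need only the zone each letting belongs to and the adjacency between zones, so they do not require letting-level coordinates.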

Hope this helps,

Roger

>
> Best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Wednesday, November 6, 2019 15:07
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Tue, 5 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your reply. I disabled HTML; my e-mails should now be in
>> plain text.
>>
>> I will give a better context for my desired outcome.
>>
>> I am taking Airbnb's listings information for New York City available
>> on: http://insideairbnb.com/get-the-data.html
>>
>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>> Listings/ folder. When importing all these 54 files into one single data
>> set, I create a new "date_compiled" variable/column.
>>
>> In total, after the data cleansing process, I have a little more than 2
>> million observations.
>
> You have repeat lettings for some, but not all properties. So this is at
> best a very unbalanced panel. For those properties with repeats, you may
> see temporal movement (trend/seasonal).
>
> I suggest (strongly) taking a single borough or even zipcode with some
> hundreds of properties, and working from there. Do not include the
> observation as its own neighbour, perhaps identify repeats and handle them
> specially (create or use a property ID). Unbalanced panels may also create
> a selection bias issue (why are some properties only listed sometimes?).
>
> So this, although promising, isn't simple, and getting to a hedonic model
> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
> necessarily trust OLS output either, partly because of the repeat property
> issue.
>
> Roger
>
>>
>> I created 54 time dummy variables, one for each time period available.
>>
>> Using a hedonic spatial time-dummy model, I want to estimate the impact
>> of a variety of characteristics which potentially determine the daily
>> rate of Airbnb listings through time in New York City (e.g.
>> characteristics of the listing such as number of bedrooms, whether the
>> host is professional, proximity to downtown (New York City Hall) and to
>> the nearest subway station, income per capita, etc.).
>>
>> My dependent variable is price (log price, common in the related
>> literature for hedonic prices).
>>
>> The OLS model is done.
>>
>> For the spatial model, I am assuming that hosts, when deciding the
>> pricing of their listings, take into account not only their structural
>> and location characteristics, but also the prices charged by nearby
>> listings with similar characteristics - spatial autocorrelation is then
>> present, or at least spatial dependence in the dependent variable.
>>
>> As I wrote in my previous post, I was willing to consider each
>> observation as its own neighbor.
>>
>> Parts of my code can be found below:
>>
>> ########
>>
>> ## packages
>>
>> # Install any missing packages, then attach them all
>> packages_install <- function(packages){
>>   new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>   if (length(new.packages))
>>     install.packages(new.packages, dependencies = TRUE)
>>   sapply(packages, require, character.only = TRUE)
>> }
>>
>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>> packages_install(packages_required)
>>
>> # Working directory
>> setwd("C:/Users/User/R")
>>
>>
>>
>> ## shapefile_us
>>
>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>
>> # Columns removal
>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>
>> # Column rename: ZCTA5CE10
>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>
>> # Column class change: zipcode
>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>
>>
>>
>> ## polygon_nyc
>>
>> # Zip code not available in shapefile: 11695
>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>
>>
>>
>> ## weight_matrix
>>
>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>
>> # Include neighbour itself as a neighbour
>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>
>> # Weights to each neighboring polygon
>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>
>>
>>
>> ## listings
>>
>> # Data import: one data frame per monthly file, tagged with its file name
>> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>> # SIMPLIFY = FALSE guarantees a list of data frames for bind_rows()
>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>> listings <- listings %>% bind_rows
>>
>> # Characters removal (fixed = TRUE matches ".csv" literally)
>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed = TRUE)
>> listings$price <- gsub("\\$", "", listings$price)
>> listings$price <- gsub(",", "", listings$price)
>>
>>
>>
>> ## timedummy
>>
>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>> timedummy <- gsub("-", "_", timedummy)
>>
>>
>>
>> ## OLS regression
>>
>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>
>> ########
>>
>> Some of my ids repeat in multiple time periods.
>>
>> I use NYC's zip codes to left join my data with zip-code-specific neighborhood characteristics, such as income per capita for that zip code, etc.
>>
>> Now I want to apply the hedonic model with the timedummy variables.
>>
>> Do you know how to proceed? 1) Which package to use (spdep/splm)? 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>
>> Again, thank you very much for the help provided until now.
>>
>> Best regards,
>> Robert
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Thank you for your answer.

I successfully used the function nb2blocknb() for a smaller dataset.

But for a dataset of over 2 million observations, I get the following error: "Error: cannot allocate vector of size 840 Kb".

I expect that at least 500,000 observations will be dropped due to missing values in the variables chosen for the regression model, so I will probably filter out the observations/rows that will not be used anyway - do you know of any package that does this automatically, given the variables/columns I have chosen?

Or would you recommend another approach to avoid the above-mentioned error?

Thank you and best regards,
Robert
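
One way to do this filtering in base R, sketched on a made-up data frame; the column names and the formula are illustrative assumptions, not the actual model:

```r
# Hypothetical listings with gaps in some columns.
listings <- data.frame(
  id = 1:5,
  log_price = c(4.1, NA, 5.0, 4.7, 4.4),
  bedrooms = c(1, 2, NA, 3, 1),
  income_per_capita = c(30, 45, 50, 38, NA),
  notes = c(NA, "a", "b", NA, "c")  # not in the model, so its NAs are ignored
)

model_formula <- log_price ~ bedrooms + income_per_capita

# Keep only rows that are complete for the variables named in the formula.
model_vars <- all.vars(model_formula)
listings_cc <- listings[complete.cases(listings[, model_vars]), ]

listings_cc$id  # 1 4
```

Dropping the incomplete rows before building the neighbour object keeps the data and the weights at the same length.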


Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
On Fri, 8 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your answer.
>
> I successfully used the function nb2blocknb() for a smaller dataset.
>
> But for a dataset of over 2 million observations, I get the following
> error: "Error: cannot allocate vector of size 840 Kb".

I don't think the observations are helpful. If you have repeat lets in the
same property in a given month, you need to handle that anyway. I'd go for
making the modelling exercise work (we agree that this is not panel data,
right?) on a small subset first. I would further argue that you need a
multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
random effect (RE). You could very well take (stratified) samples per
zipcode to represent your data. Once that works, introduce an MRF RE at the
zipcode level, where you do know relative position. Using SARAR is going to
be a waste of time unless you can geocode the letting addresses. A
multi-level approach will work. Having big data in your case with no useful
location information per observation is just adding noise and
over-smoothing, I'm afraid. The approach used in
https://doi.org/10.1016/j.spasta.2017.01.002 will work, also when you
sample the within-zipcode lets, given a split into training and test sets,
which makes cross-validation (CV) possible.
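
The sampling and splitting steps could look roughly like this in base R. The data frame, the cap of 100 lets per zipcode and the 80/20 split are all assumptions made for the sketch:

```r
set.seed(42)

# Hypothetical lets in three zipcodes of very different size.
listings <- data.frame(
  id = seq_len(1000),
  zipcode = rep(c("10001", "10002", "10003"), times = c(700, 250, 50))
)

# Stratified sample: draw at most n_per_zip lets from each zipcode.
n_per_zip <- 100
rows_by_zip <- split(seq_len(nrow(listings)), listings$zipcode)
sampled_rows <- unlist(lapply(rows_by_zip, function(rows) {
  rows[sample.int(length(rows), min(n_per_zip, length(rows)))]
}))
listings_sample <- listings[sampled_rows, ]

# Random split of the sample into training and test sets (about 80/20).
in_train <- runif(nrow(listings_sample)) < 0.8
train <- listings_sample[in_train, ]
test <- listings_sample[!in_train, ]

table(listings_sample$zipcode)
```

Capping the draw per zipcode keeps small zipcodes in the sample while stopping the largest ones from dominating it.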

Roger

>
> I expect that at least 500,000 observations will be dropped due to
> missing values in the variables chosen for the regression model, so I
> will probably filter out the observations/rows that will not be used
> anyway - do you know of any package that does this automatically, given
> the variables/columns I have chosen?
>
> Or would you recommend another approach to avoid the above-mentioned
> error?
>
> Thank you and best regards,
> Robert
>
>>> Parts of my code can be found below:
>>>
>>> ########
>>>
>>> ## packages
>>>
>>> packages_install <- function(packages){
>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>> if (length(new.packages))
>>> install.packages(new.packages, dependencies = TRUE)
>>> sapply(packages, require, character.only = TRUE)
>>> }
>>>
>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>> packages_install(packages_required)
>>>
>>> # Working directory
>>> setwd("C:/Users/User/R")
>>>
>>>
>>>
>>> ## shapefile_us
>>>
>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>
>>> # Columns removal
>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>
>>> # Column rename: ZCTA5CE10
>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>
>>> # Column class change: zipcode
>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>
>>>
>>>
>>> ## polygon_nyc
>>>
>>> # Zip code not available in shapefile: 11695
>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>
>>>
>>>
>>> ## weight_matrix
>>>
>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>
>>> # Include neighbour itself as a neighbour
>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>
>>> # Weights to each neighboring polygon
>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>
>>>
>>>
>>> ## listings
>>>
>>> # Data import
>>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE)
>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>> listings <- mapply(cbind, listings, date_compiled = names(listings))
>>> listings <- listings %>% bind_rows
>>>
>>> # Characters removal
>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
>>> listings$price <- gsub("\\$", "", listings$price)
>>> listings$price <- gsub(",", "", listings$price)
>>>
>>>
>>>
>>> ## timedummy
>>>
>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>> timedummy <- gsub("-", "_", timedummy)
>>>
>>>
>>>
>>> ## OLS regression
>>>
>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>
>>> ########
>>>
>>> Some of my id's repeat in multiple time periods.
>>>
>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>
>>> Now I want to apply the hedonic model with the timedummy variables.
>>>
>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>
>>> Again, thank you very much for the help provided until now.
>>>
>>> Best regards,
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Tuesday, November 5, 2019 15:30
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>
>>>> I have a large pooled cross-section data set. ​I would like to
>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>> for now that spatial dependence is present in both the dependent
>>>> variable and the error term.​ ​My data set is over a period of 4 years,
>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>> variable for each time period.​ ​I also created a weight matrix using the
>>>> functions "poly2nb" and "nb2listw".​ ​Now I am trying to figure out a way
>>>> to estimate my model which contains a really big data set.​ ​Basically, my
>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε​ ​My questions are:​ ​1)
>>>> My spatial weight matrix for the whole data set will be probably a
>>>> enormous matrix with submatrices for each time period itself. I don't
>>>> think it would be possible to calculate this.​ What I would like to know
>>>> is a way to estimate each time dummy/period separately (to compare
>>>> different periods alone). How to do it?​ ​2) Which package to use: spdep
>>>> or splm?​ ​Thank you and best regards,​ Robert​
>>>
>>> Please do not post HTML, only plain text. Almost certainly your model
>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>> untried). What is your cross-sectional size? Using sparse kronecker
>>> products, the "enormous" matrix may not be very big. Does it make any
>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>> of the covariates time-varying? Please provide motivation and use area
>>> (preferably with affiliation (your email and user name are not
>>> informative) - this feels like a real estate problem, probably wrongly
>>> specified. You should use splm if time make sense in your case, but if it
>>> really doesn't, simplify your approach, as much of the data will be
>>> subject to very large temporal autocorrelation.
>>>
>>> If this is a continuation of your previous question about using
>>> self-neighbours, be aware that you should not use self-neighbours in
>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>
>>> Roger
>>>
>>>>
>>>>       [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> R-sig-Geo mailing list
>>>> [hidden email]
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>
>>> --
>>> Roger Bivand
>>> Department of Economics, Norwegian School of Economics,
>>> Helleveien 30, N-5045 Bergen, Norway.
>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>> https://orcid.org/0000-0003-2392-6140
>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; e-mail: [hidden email]
>> https://orcid.org/0000-0003-2392-6140
>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>
>
> --
> Roger Bivand
> Department of Economics, Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; e-mail: [hidden email]
> https://orcid.org/0000-0003-2392-6140
> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Again, thank you for your answer. I read the material provided and decided that Hierarchical Spatial Autoregressive (HSAR) could be the right model for me.

I indeed have the precise latitude and longitude information for all my listings for NYC.

I created a stratified sample (group = zipcode) of 22172 observations (1% of my data), called listings_sample, and tried to replicate the hsar model; please see below.

For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.

You then recommended introducing a Markov random field (MRF) random effect (RE) at the zipcode level, but I did not understand this very well. Could you elaborate a little more?

##############
library(spdep)
library(HSAR)
library(dplyr)
library(splitstackshape)
library(sf)  # needed for st_drop_geometry() below


# Stratified sample per zipcode (size = 1%)
listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)

# Keep only the zipcodes present in listings_sample, sorted so that the rows of M match the columns of Delta
polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% unique(as.character(listings_sample$zipcode))) %>% arrange(zipcode)


## Random effect matrix (N by J)

# N: 22172
# J: 154

# Arrange listings_sample by zipcode (ascending)
listings_sample <- listings_sample %>% arrange(zipcode)

# Count number of listings per zipcode
MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
# sum(MM$count)

# N by J zero matrix
Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])

# Neighbourhood (zipcode) index of each observation; relies on listings_sample being sorted by zipcode
Uid <- rep(c(1:dim(MM)[1]), MM[,2])

for(i in 1:dim(MM)[1]) {
  Delta[Uid==i,i] <- 1
}
rm(i)

Delta <- as(Delta,"dgCMatrix")


## Higher-level spatial weights matrix or neighbourhood matrix (J by J)

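# Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
# row.names must come from the filtered object, otherwise its length does not match the polygons
polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)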

# Include neighbour itself as a neighbour
polygon_nyc_nb <- include.self(polygon_nyc_nb)

# Spatial weights matrix for nb
polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
M <- as(polygon_nyc_nb_matrix,"dgCMatrix")


## Fit HSAR SAR upper level random effect
model <- as.formula(log_price ~ guests_included + minimum_nights)

betas = coef(lm(formula = model, data = listings_sample))
pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)

m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)

##############

Thank you and best regards
Robert
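
A side note on the Delta construction above: the dense N by J matrix is only an intermediate step before the conversion to a sparse class, and the loop relies on the rows being sorted by zipcode. A minimal sketch of building the same kind of indicator matrix directly in sparse form is given below. It assumes the Matrix package is available; the toy data frame and its values are made up for illustration and merely stand in for listings_sample.

```r
library(Matrix)

# Toy stand-in for listings_sample (hypothetical values, deliberately unsorted)
toy <- data.frame(zipcode = c("10003", "10001", "10002", "10001", "10003", "10003"))

# Column index of Delta for each observation: position of its zipcode
# among the sorted unique zipcodes
zip_levels <- sort(unique(toy$zipcode))
j <- match(toy$zipcode, zip_levels)

# Sparse N x J indicator matrix, built without a dense intermediate
Delta <- sparseMatrix(i = seq_len(nrow(toy)), j = j, x = 1,
                      dims = c(nrow(toy), length(zip_levels)))

# Each row has exactly one 1; column sums give the listings per zipcode
rowSums(Delta)
colSums(Delta)
```

Because the column index comes from match(), the rows do not have to be sorted beforehand, but the columns follow the sorted zipcodes, so the rows of M would need to be in the same order.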


Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
On Sun, 10 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Again, thank you for your answer. I read the material provided and
> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
> right model for me.
>
> I indeed have the precise latitude and longitude information for all my
> listings for NYC.
>
> I created a stratified sample (group = zipcode) with 22172 (1%) of my
> observations called listings_sample and tried to replicate the hsar
> model, please see below.
>
> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>

Unless you know definitely that you want to relate the response to its
lagged value, you do not need this. Do note that the matrix is very
sparse, so a cross-sectional model using it could be fitted without
difficulty with ML.
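
To illustrate the sparsity point, here is a minimal sketch on simulated data, not on the Airbnb listings. The coordinates, the variables x and y, the choice of k = 6 nearest neighbours and the use of method = "LU" are all assumptions made for the example; the thread does not say which neighbour definition or fitting method would suit the real data.

```r
library(spdep)
library(spatialreg)

set.seed(1)
n <- 2000
k <- 6

# Simulated point locations and a toy response (hypothetical, not the listings)
coords <- cbind(runif(n), runif(n))
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 0.5 * toy$x + rnorm(n)

# k-nearest-neighbour graph: each point gets exactly k neighbours, never itself
nb <- knn2nb(knearneigh(coords, k = k))
lw <- nb2listw(nb, style = "W")

# Non-zero weights compared with the n x n cells of a dense matrix
nonzero <- sum(card(nb))
nonzero / n^2   # share of non-zero cells, equal to k / n

# ML spatial lag model; "LU" is a sparse method that does not require
# symmetric weights (k-nearest-neighbour weights are usually asymmetric)
fit <- lagsarlm(y ~ x, data = toy, listw = lw, method = "LU")
summary(fit)
```

With k fixed, the number of non-zero weights grows linearly in n, so the 22172 x 22172 case would hold about 22172 * k non-zero entries rather than roughly 490 million cells.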

>
> You then recommended introducing a Markov random field (MRF) random
> effect (RE) at the zipcode level, but I did not understand this very
> well. Could you elaborate a little more?
>

Did you read the development in
https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
the paper includes code for fitting the Beijing housing parcels data set
from HSAR with many other packages (MCMC, INLA, hglm, etc.). I guess that
you should try to create a model that works on a single borough, using the
zipcodes in that borough as a proxy for unobserved neighbourhood effects.
Try for example using lme4::lmer() with only a zipcode IID random effect,
see if the hedonic estimates are similar to lm(), and leave adding an MRF
RE (with for example mgcv::gam() or hglm::hglm()) until you have a working
testbed. Then advance step-by-step from there.
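
The step-by-step route described above could be sketched roughly as follows. This is only an illustration on simulated data: the 20 areas, their chain-shaped neighbour structure, the effect sizes and the toy data frame are all invented, and the variable names log_price, guests_included and zipcode are borrowed from the code earlier in the thread. With real data the neighbour list would instead come from the zipcode polygons.

```r
library(lme4)
library(mgcv)

set.seed(42)
J <- 20       # number of areas (hypothetical)
n_per <- 50   # observations per area (hypothetical)
zip_levels <- sprintf("Z%02d", seq_len(J))
zipcode <- factor(rep(zip_levels, each = n_per), levels = zip_levels)

# Area effect that varies smoothly along a chain of neighbouring areas
u <- 0.4 * sin(seq_len(J) / 3)
guests_included <- rpois(J * n_per, lambda = 2)
log_price <- 4 + 0.1 * guests_included + u[as.integer(zipcode)] +
  rnorm(J * n_per, sd = 0.5)
toy <- data.frame(log_price, guests_included, zipcode)

# Step 1: plain lm() as the baseline
fit_lm <- lm(log_price ~ guests_included, data = toy)

# Step 2: IID random intercept per zipcode
fit_lmer <- lmer(log_price ~ guests_included + (1 | zipcode), data = toy)

# Step 3: MRF random effect; mgcv expects a named list with one element per
# factor level, each holding the indices of that area's neighbours
nb <- lapply(seq_len(J), function(j) setdiff(c(j - 1L, j + 1L), c(0L, J + 1L)))
names(nb) <- zip_levels
fit_mrf <- gam(log_price ~ guests_included +
                 s(zipcode, bs = "mrf", xt = list(nb = nb)),
               data = toy, method = "REML")

# Compare the hedonic slope across the three fits
c(lm = coef(fit_lm)[["guests_included"]],
  lmer = fixef(fit_lmer)[["guests_included"]],
  mrf = coef(fit_mrf)[["guests_included"]])
```

If the slope moves a lot between lm() and the random-effect fits, that would suggest the covariate is picking up unobserved neighbourhood effects in the plain regression.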

You still have not said how many repeat lettings you see - it will affect
the way you specify your model.
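
One possible way of counting the repeats, sketched with dplyr on a made-up data frame. It assumes that the id and date_compiled columns mentioned earlier in the thread identify a property and a time period; the toy values are hypothetical.

```r
library(dplyr)

# Toy stand-in for the listings data (hypothetical values)
toy <- data.frame(
  id = c(1, 1, 1, 2, 3, 3),
  date_compiled = c("2015-01-01", "2015-02-01", "2015-03-01",
                    "2015-01-01", "2015-02-01", "2015-03-01")
)

# Number of periods in which each property appears
per_property <- toy %>% count(id, name = "n_periods")

# How many properties there are, and how many appear more than once
repeat_summary <- per_property %>%
  summarise(n_properties = n(),
            n_repeated = sum(n_periods > 1),
            share_repeated = mean(n_periods > 1))
repeat_summary
```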

Roger

> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Friday, November 8, 2019 13:29
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Fri, 8 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your answer.
>>
>> I successfully used the function nb2blocknb() for a smaller dataset.
>>
>> But for a dataset of over 2 million observations, I get the following
>> error: "Error: cannot allocate vector of size 840 Kb".
>
> I don't think the observations are helpful. If you have repeat lets in the
> same property in a given month, you need to handle that anyway. I'd go for
> making the modelling exercise work (we agree that this is not panel data,
> right?) on a small subset first. I would further argue that you need a
> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
> RE. You could very well take (stratified) samples per zipcode to represent
> your data. Once that works, introduce an MRF RE at the zipcode level,
> where you do know relative position. Using SARAR is going to be a waste of
> time unless you can geocode the letting addresses. A multi-level approach
> will work. Having big data in your case with no useful location
> information per observation is just adding noise and over-smoothing, I'm
> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
> will work, also when you sample the within zipcode lets, given a split
> into training and test sets, and making CV possible.
>
> Roger
>
>>
>> I am expecting that at least 500.000 observations will be dropped due
>> to the lack of values for the chosen variables for the regression model, so
>> probably I will filter and remove the observations/rows that will not be
>> used anyway - do you know if there is any package that does this
>> automatically, given the variables/columns chosen by me?
>>
>> Or would you recommend me another approach to avoid the above mentioned
>> error?
>>
>> Thank you and best regards,
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Thursday, November 7, 2019 10:13
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Thu, 7 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Many thanks for your help.
>>>
>>> I have an additional question:
>>>
>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>> rownumbers) from my data set? For now, I am taking my data set and
>>> merging with the sf object polygon_nyc with the function
>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>> a huge n x n matrix (depending on the size of my data set).
>>>
>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>> object has only n = 177.
>>>
>>> Of course running
>>>
>>> spatialreg::lagsarlm(formula = model, data = listings_sample,
>>> listw = polygon_nyc_lw, tol.solve = 1.0e-10)
>>>
>>> does not work ("Input data and weights have different dimensions").
>>>
>>> Is the only option to take my data set, merge it with my polygon_nyc (by
>>> zipcode) and then create the weights list lw? Or is there another option?
>>
>> I think we are getting more clarity. You do not know the location of the
>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>> areas, and can create a neighbour object from these boundaries. You then
>> want to treat all the lettings in a zipcode area i as neighbours, and
>> additionally lettings in zipcode areas neighbouring i as neighbours of
>> lettings in i. This is the data structure that motivated the
>> spdep::nb2blocknb() function:
>>
>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>
>> Try running the examples to get a feel for what is going on.
>>
>> I feel that most of the variability will vanish in the very large numbers
>> of neighbours, over-smoothing the outcomes. If you do not have locations
>> for the lettings themselves, I don't think you can make much progress.
>>
>> You could try a linear mixed model (or gam with a spatially structured
>> random effect) with a temporal and a spatial random effect. See the HSAR
>> package, articles by Dong et al., and maybe
>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>> effects at the zipcode level might be a way forward, together with an IID
>> random effect at the same level (equivalent to self-neighbours).
>>
>> Hope this helps,
>>
>> Roger
>>
>>>
>>> Best regards,
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Wednesday, November 6, 2019 15:07
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>> plain text.
>>>>
>>>> I will give a better context for my desired outcome.
>>>>
>>>> I am taking Airbnb's listings information for New York City available
>>>> on: http://insideairbnb.com/get-the-data.html
>>>>
>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>> Listings/ folder. When importing all these 54 files into one single data
>>>> set, I create a new "date_compiled" variable/column.
>>>>
>>>> In total, after the data cleansing process, I have a little more than 2
>>>> million observations.
>>>
>>> You have repeat lettings for some, but not all properties. So this is at
>>> best a very unbalanced panel. For those properties with repeats, you may
>>> see temporal movement (trend/seasonal).
>>>
>>> I suggest (strongly) taking a single borough or even zipcode with some
>>> hundreds of properties, and working from there. Do not include the
>>> observation as its own neighbour, perhaps identify repeats and handle them
>>> specially (create or use a property ID). Unbalanced panels may also create
>>> a selection bias issue (why are some properties only listed sometimes?).
>>>
>>> So this although promising isn't simple, and getting to a hedonic model
>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>> necessarily trust OLS output either, partly because of the repeat property
>>> issue.
>>>
>>> Roger
>>>
>>>>
>>>> I created 54 timedummy variables for each time period available.
>>>>
>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>> a variety of characteristics which potentially determine the daily rate
>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>> from the listing, income per capita, etc.).
>>>>
>>>> My dependent variable is price (log price, common in the related
>>>> literature for hedonic prices).
>>>>
>>>> The OLS model is done.
>>>>
>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>> pricing of their listings, take not only into account its structural and
>>>> location characteristics, but also the prices charged by near listings
>>>> with similar characteristics - spatial autocorrelation is then present,
>>>> at least spatial dependence is present in the dependent variable.
>>>>
>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>> itself as a neighbor.
>>>>
>>>> Parts of my code can be found below:
>>>>
>>>> ########
>>>>
>>>> ## packages
>>>>
>>>> packages_install <- function(packages){
>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>> if (length(new.packages))
>>>> install.packages(new.packages, dependencies = TRUE)
>>>> sapply(packages, require, character.only = TRUE)
>>>> }
>>>>
>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>> packages_install(packages_required)
>>>>
>>>> # Working directory
>>>> setwd("C:/Users/User/R")
>>>>
>>>>
>>>>
>>>> ## shapefile_us
>>>>
>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>
>>>> # Columns removal
>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>
>>>> # Column rename: ZCTA5CE10
>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>
>>>> # Column class change: zipcode
>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>
>>>>
>>>>
>>>> ## polygon_nyc
>>>>
>>>> # Zip code not available in shapefile: 11695
>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>
>>>>
>>>>
>>>> ## weight_matrix
>>>>
>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>
>>>> # Include neighbour itself as a neighbour
>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>
>>>> # Weights to each neighboring polygon
>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>
>>>>
>>>>
>>>> ## listings
>>>>
>>>> # Data import
>>>> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>> # SIMPLIFY = FALSE keeps one data frame per file instead of letting mapply collapse them
>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>> listings <- listings %>% bind_rows
>>>>
>>>> # Characters removal
>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled, fixed = TRUE)
>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed = TRUE)
>>>> listings$price <- gsub("\\$", "", listings$price)
>>>> listings$price <- gsub(",", "", listings$price)
>>>>
>>>>
>>>>
>>>> ## timedummy
>>>>
>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>> timedummy <- gsub("-", "_", timedummy)
>>>>
>>>>
>>>>
>>>> ## OLS regression
>>>>
>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>
>>>> ########
>>>>
>>>> Some of my id's repeat in multiple time periods.
>>>>
>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>
>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>
>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>
>>>> Again, thank you very much for the help provided until now.
>>>>
>>>> Best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Tuesday, November 5, 2019 15:30
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>
>>>>> I have a large pooled cross-section data set. I would like to
>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>> for now that spatial dependence is present in both the dependent
>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>> monthly data (54 periods). For this purpose, I've created a time dummy
>>>>> variable for each time period. I also created a weight matrix using the
>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>> My spatial weight matrix for the whole data set will probably be an
>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>> think it would be possible to calculate this. What I would like to know
>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>> or splm? Thank you and best regards, Robert
>>>>
>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>> of the covariates time-varying? Please provide motivation and use area,
>>>> preferably with affiliation (your email and user name are not
>>>> informative); this feels like a real estate problem, probably wrongly
>>>> specified. You should use splm if time makes sense in your case, but if it
>>>> really doesn't, simplify your approach, as much of the data will be
>>>> subject to very large temporal autocorrelation.
>>>>
>>>> If this is a continuation of your previous question about using
>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>
>>>> Roger
>>>>
>>>>
>>>> --
>>>> Roger Bivand
>>>> Department of Economics, Norwegian School of Economics,
>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>> https://orcid.org/0000-0003-2392-6140
>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Thank you for your message and sorry for my late answer.

Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:

unique ids: 180.004
time periods: 54 (2015-01 to 2019-09)
number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)

It is important to note that hosts can change the room_type (between entire home/apt, private room and shared room) while keeping the same listing id. In my data, 7.204 unique ids changed their room_type at some point.
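
For readers who want to reproduce this kind of tabulation, here is a minimal sketch in base R. The column names id and room_type follow the thread; the toy data frame listings_toy and its values are invented for illustration and are not the Airbnb data.

```r
# Toy stand-in for the listings data (invented values).
listings_toy <- data.frame(
  id = rep(c(1, 2, 3, 4), times = c(3, 1, 2, 11)),
  room_type = c("Private room", "Private room", "Entire home/apt", # id 1 changes type
                "Shared room",                                      # id 2 appears once
                "Entire home/apt", "Entire home/apt",               # id 3
                rep("Private room", 11)),                           # id 4
  stringsAsFactors = FALSE
)

# Number of time periods in which each id appears
n_per_id <- table(listings_toy$id)

# Group ids into the bands used above: 1, 2-10, 11-30, 31-54 appearances
repeat_bands <- cut(as.integer(n_per_id),
                    breaks = c(0, 1, 10, 30, 54),
                    labels = c("1", "2-10", "11-30", "31-54"))
band_counts <- table(repeat_bands)
band_share  <- round(100 * prop.table(band_counts), 1)

# Ids that were observed with more than one room_type
types_per_id <- tapply(listings_toy$room_type, listings_toy$id,
                       function(x) length(unique(x)))
n_changed <- sum(types_per_id > 1)

print(band_counts)
print(band_share)
print(n_changed)
```

The same three steps (count per id, cut into bands, count distinct room types per id) should carry over to the full data set, assuming id and room_type are stored consistently across the monthly files.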

--

For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.

plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
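
As an aside on the time dummies, the 54 indicator columns do not have to be built by hand. A minimal sketch, assuming date_compiled is stored as a string per period: turning it into a factor lets the formula interface create the dummies and drop one reference period automatically. The data frame toy and its values are invented; lm() stands in here for the pooled plm() call.

```r
set.seed(1)

# Invented toy data with three periods instead of 54
toy <- data.frame(
  log_price       = rnorm(12, mean = 4.5, sd = 0.3),
  guests_included = rep(1:4, times = 3),
  date_compiled   = rep(c("2015-01-01", "2015-02-01", "2015-03-01"), each = 4)
)

# A factor is expanded into period dummies by the model formula
toy$date_compiled <- factor(toy$date_compiled)

fit <- lm(log_price ~ guests_included + date_compiled, data = toy)

# The design matrix shows which dummy columns were generated;
# the first period is absorbed into the intercept as the reference level.
X <- model.matrix(fit)
print(colnames(X))
```

With an intercept in the model, only the number of periods minus one dummies can be estimated, which is why one level is dropped.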


--
Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.

Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
--

I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).

If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).

According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) on a "one-level" basis only, i.e., at the higher level.

HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)

(Where the "model" formula contains the 54 time dummy variables)
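
The Delta argument in the call above is the N by J membership matrix linking each listing to its zipcode. A minimal sketch of building it directly in sparse form with the Matrix package, rather than filling a dense matrix in a loop, is given below. The zipcode vector is invented; the assumption is that the columns of Delta must follow the same zipcode order as the rows of M.

```r
library(Matrix)

# Invented example: six listings in three zipcodes
zipcode <- c("10001", "10001", "10002", "10003", "10003", "10003")
zip_f   <- factor(zipcode)

# One non-zero entry per row: listing i belongs to zipcode column j
Delta <- sparseMatrix(
  i    = seq_along(zip_f),
  j    = as.integer(zip_f),
  x    = rep(1, length(zip_f)),
  dims = c(length(zip_f), nlevels(zip_f)),
  dimnames = list(NULL, levels(zip_f))
)

print(Delta)
print(Matrix::colSums(Delta))  # listings per zipcode
```

Because the matrix is never stored densely, this construction stays small even for hundreds of thousands of rows.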

Do you think I can proceed with this model? I was able to calculate it.

If I remove all observations/rows with NAs in any of the chosen variables, 884.183 observations remain. If I were to create a W matrix for HSAR::hsar, it would be a gigantic 884.183 by 884.183 matrix. This is why I set W = NULL.
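
Two small illustrations of the points above, as a sketch only. The first part keeps the rows that are complete for the variables named in a model formula, using base R; the data frame toy is invented. The second part is plain arithmetic on what a dense 884183 by 884183 matrix of doubles would need in memory.

```r
# Formula from the thread's sample code
model <- log_price ~ guests_included + minimum_nights

# Invented toy data; unused_column is not in the model and is ignored
toy <- data.frame(
  log_price       = c(4.2, NA, 5.0, 4.8),
  guests_included = c(1, 2, NA, 4),
  minimum_nights  = c(2, 3, 1, 30),
  unused_column   = c(NA, "a", "b", NA)
)

# Keep rows with no NA in any variable used by the formula
vars         <- all.vars(model)
keep         <- complete.cases(toy[, vars])
toy_complete <- toy[keep, ]
print(nrow(toy_complete))

# Memory needed for a dense n x n matrix of 8-byte doubles
n           <- 884183
dense_bytes <- n^2 * 8
dense_tib   <- dense_bytes / 1024^4
print(round(dense_tib, 2))
```

The dense figure comes out at several tebibytes, which is why only a sparse representation of W would be workable at this size.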


Thank you and best regards

________________________________________
From: Roger Bivand <[hidden email]>
Sent: Monday, November 11, 2019 11:31
To: Robert R
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

On Sun, 10 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Again, thank you for your answer. I read the material provided and
> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
> right model for me.
>
> I indeed have the precise latitude and longitude information for all my
> listings for NYC.
>
> I created a stratified sample (group = zipcode) with 22172 (1%) of my
> observations called listings_sample and tried to replicate the hsar
> model, please see below.
>
> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.

Unless you know definitely that you want to relate the response to its
lagged value, you do not need this. Do note that the matrix is very
sparse, so could be fitted without difficulty with ML in a cross-sectional
model.
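
To illustrate the sparsity point, here is a minimal sketch that builds a row-standardised k-nearest-neighbour weights matrix in sparse form and compares its size with the dense distance matrix. The coordinates are simulated, k = 6 is an arbitrary choice, and the brute-force distance step is only feasible for a toy n. In practice spdep::knearneigh(), spdep::knn2nb() and spdep::nb2listw() would be the usual route.

```r
library(Matrix)
set.seed(42)

n <- 500   # toy size, not the 22172 sampled listings
k <- 6     # assumed number of neighbours per listing

# Simulated point coordinates
coords <- cbind(runif(n), runif(n))

# Brute-force distances; the diagonal is set to Inf to exclude self-neighbours
D <- as.matrix(dist(coords))
diag(D) <- Inf

# n x k matrix with the indices of the k nearest neighbours of each point
nn <- t(apply(D, 1, function(d) order(d)[1:k]))

# Row-standardised sparse weights: each neighbour gets weight 1/k
W <- sparseMatrix(
  i    = rep(seq_len(n), times = k),
  j    = as.vector(nn),
  x    = rep(1 / k, n * k),
  dims = c(n, n)
)

print(Matrix::nnzero(W))                       # n * k stored entries
print(format(object.size(W), units = "Kb"))    # sparse storage
print(format(object.size(D), units = "Kb"))    # dense n x n storage
```

Storage grows with n times k rather than n squared, so the same structure for tens of thousands of listings stays modest in size.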

>
> You recommended then to introduce a Markov random field (MRF) random
> effect (RE) at the zipcode level, but I did not understand it so well.
> Could you elaborate a little more?
>

Did you read the development in
https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
includes code for fitting the Beijing housing parcels data set from HSAR
with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
try to create a model that works on a single borough, using the zipcodes
in that borough as a proxy for unobserved neighbourhood effects. Try for
example using lme4::lmer() with only a zipcode IID random effect, see if
the hedonic estimates are similar to lm(), and leave adding an MRF RE
(with for example mgcv::gam() or hglm::hglm()) until you have a working
testbed. Then advance step-by-step from there.
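
A minimal sketch of the step-by-step route described above, on simulated data. To stay within one package it uses mgcv::gam() for both steps: the "re" basis stands in for the zipcode IID random effect that lme4::lmer(log_price ~ ... + (1 | zipcode)) would fit, and the "mrf" basis adds the spatially structured effect. The 20 zipcodes arranged in a chain, the neighbour list nb and all coefficients are invented for illustration.

```r
library(mgcv)
set.seed(123)

J      <- 20    # invented number of zipcodes
n_per  <- 100   # invented listings per zipcode
zip_levels <- sprintf("z%02d", seq_len(J))

# Spatially structured zipcode effects along a chain of neighbours
u <- cumsum(rnorm(J, sd = 0.2))
u <- u - mean(u)

d <- data.frame(
  zipcode         = factor(rep(zip_levels, each = n_per), levels = zip_levels),
  guests_included = sample(1:6, J * n_per, replace = TRUE),
  minimum_nights  = sample(1:30, J * n_per, replace = TRUE)
)
d$log_price <- 4 + 0.10 * d$guests_included - 0.01 * d$minimum_nights +
  u[as.integer(d$zipcode)] + rnorm(nrow(d), sd = 0.4)

# Step 1: plain hedonic regression
fit_lm <- lm(log_price ~ guests_included + minimum_nights, data = d)

# Step 2: zipcode IID random effect
fit_iid <- gam(log_price ~ guests_included + minimum_nights +
                 s(zipcode, bs = "re"),
               data = d, method = "REML")

# Step 3: MRF random effect; nb is a named list of neighbour indices
nb <- lapply(seq_len(J), function(i) setdiff(c(i - 1, i + 1), c(0, J + 1)))
names(nb) <- zip_levels
fit_mrf <- gam(log_price ~ guests_included + minimum_nights +
                 s(zipcode, bs = "mrf", xt = list(nb = nb)),
               data = d, method = "REML")

# Compare the hedonic estimates across the three fits
vars <- c("guests_included", "minimum_nights")
print(round(cbind(lm  = coef(fit_lm)[vars],
                  iid = coef(fit_iid)[vars],
                  mrf = coef(fit_mrf)[vars]), 3))
```

With real zipcode polygons, the neighbour list would come from the contiguity structure (for example the output of spdep::poly2nb(), named by zipcode) instead of the invented chain used here.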

You still have not said how many repeat lettings you see - it will affect
the way you specify your model.

Roger

> ##############
> library(spdep)
> library(sf)
> library(HSAR)
> library(dplyr)
> library(splitstackshape)
>
>
> # Stratified sample per zipcode (size = 1%)
> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>
> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% unique(as.character(listings_sample$zipcode)))
>
>
> ## Random effect matrix (N by J)
>
> # N: 22172
> # J: 154
>
> # Arrange listings_sample by zipcode (ascending)
> listings_sample <- listings_sample %>% arrange(zipcode)
>
> # Count number of listings per zipcode
> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
> # sum(MM$count)
>
> # N by J nulled matrix creation
> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>
> # The total number of neighbourhood
> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>
> for(i in 1:dim(MM)[1]) {
>  Delta[Uid==i,i] <- 1
> }
> rm(i)
>
> Delta <- as(Delta,"dgCMatrix")
>
>
> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>
> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)
>
> # Include neighbour itself as a neighbour
> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>
> # Spatial weights matrix for nb
> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>
>
> ## Fit HSAR SAR upper level random effect
> model <- as.formula(log_price ~ guests_included + minimum_nights)
>
> betas = coef(lm(formula = model, data = listings_sample))
> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>
> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>
> ##############
>
> Thank you and best regards
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Friday, November 8, 2019 13:29
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Fri, 8 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your answer.
>>
>> I successfully used the function nb2blocknb() for a smaller dataset.
>>
>> But for a dataset of over 2 million observations, I get the following
>> error: "Error: cannot allocate vector of size 840 Kb".
>
> I don't think the observations are helpful. If you have repeat lets in the
> same property in a given month, you need to handle that anyway. I'd go for
> making the modelling exercise work (we agree that this is not panel data,
> right?) on a small subset first. I would further argue that you need a
> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
> RE. You could very well take (stratified) samples per zipcode to represent
> your data. Once that works, introduce an MRF RE at the zipcode level,
> where you do know relative position. Using SARAR is going to be a waste of
> time unless you can geocode the letting addresses. A multi-level approach
> will work. Having big data in your case with no useful location
> information per observation is just adding noise and over-smoothing, I'm
> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
> will work, also when you sample the within zipcode lets, given a split
> into training and test sets, and making CV possible.
>
> Roger
>
>>
>> I am expecting that at least 500.000 observations will be dropped due
>> the lack of values for the chosen variables for the regression model, so
>> probably I will filter and remove the observations/rows that will not be
>> used anyway - do you know if there is any package that does this
>> automatically, given the variables/columns chosed by me?
>>
>> Or would you recommend me another approach to avoid the above mentioned
>> error?
>>
>> Thank you and best regards,
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Thursday, November 7, 2019 10:13
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Thu, 7 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Many thanks for your help.
>>>
>>> I have an additional question:
>>>
>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>> rownumbers) from my data set? For now, I am taking my data set and
>>> merging with the sf object polygon_nyc with the function
>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>> a huge n x n matrix (depending of the size of my data set).
>>>
>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>> object has only n = 177.
>>>
>>> Of course running
>>>
>>> spatialreg::lagsarlm(formula=model, data = listings_sample,
>>> spatialreg::polygon_nyc_lw, tol.solve=1.0e-10)
>>>
>>> does not work ("Input data and weights have different dimensions").
>>>
>>> The only option is to take my data set, merge it to my polygon_nyc (by
>>> zipcode) and then create the weights list lw? Or there another option?
>>
>> I think we are getting more clarity. You do not know the location of the
>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>> areas, and can create a neighbour object from these boundaries. You then
>> want to treat all the lettings in a zipcode area i as neighbours, and
>> additionally lettings in zipcode areas neighbouring i as neighbours of
>> lettings in i. This is the data structure that motivated the
>> spdep::nb2blocknb() function:
>>
>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>
>> Try running the examples to get a feel for what is going on.
>>
>> I feel that most of the variability will vanish in the very large numbers
>> of neighbours, over-smoothing the outcomes. If you do not have locations
>> for the lettings themselves, I don't think you can make much progress.
>>
>> You could try a linear mixed model (or gam with a spatially structured
>> random effect) with a temporal and a spatial random effect. See the HSAR
>> package, articles by Dong et al., and maybe
>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>> effects at the zipcode level might be a way forward, together with an IID
>> random effect at the same level (equivalent to sef-neighbours).
>>
>> Hope this helps,
>>
>> Roger
>>
>>>
>>> Best regards,
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Wednesday, November 6, 2019 15:07
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>> plain text.
>>>>
>>>> I will give a better context for my desired outcome.
>>>>
>>>> I am taking Airbnb's listings information for New York City available
>>>> on: http://insideairbnb.com/get-the-data.html
>>>>
>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>> Listings/ folder. When importing all these 54 files into one single data
>>>> set, I create a new "date_compiled" variable/column.
>>>>
>>>> In total, after the data cleansing process, I have a little more 2
>>>> million observations.
>>>
>>> You have repeat lettings for some, but not all properties. So this is at
>>> best a very unbalanced panel. For those properties with repeats, you may
>>> see temporal movement (trend/seasonal).
>>>
>>> I suggest (strongly) taking a single borough or even zipcode with some
>>> hindreds of properties, and working from there. Do not include the
>>> observation as its own neighbour, perhaps identify repeats and handle them
>>> specially (create or use a property ID). Unbalanced panels may also create
>>> a selection bias issue (why are some properties only listed sometimes?).
>>>
>>> So this although promising isn't simple, and getting to a hedonic model
>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>> necessarily trust OLS output either, partly because of the repeat property
>>> issue.
>>>
>>> Roger
>>>
>>>>
>>>> I created 54 timedummy variables for each time period available.
>>>>
>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>> a variety of characteristics which potentially determine the daily rate
>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>> of the listing as number of bedrooms, if the host if professional,
>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>> from the listing, income per capita, etc.).
>>>>
>>>> My dependent variable is price (log price, common in the related
>>>> literature for hedonic prices).
>>>>
>>>> The OLS model is done.
>>>>
>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>> pricing of their listings, take not only into account its structural and
>>>> location characteristics, but also the prices charged by near listings
>>>> with similar characteristics - spatial autocorrelation is then present,
>>>> at least spatial dependence is present in the dependent variable.
>>>>
>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>> itself as a neighbor.
>>>>
>>>> Parts of my code can be found below:
>>>>
>>>> ########
>>>>
>>>> ## packages
>>>>
>>>> packages_install <- function(packages){
>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>> if (length(new.packages))
>>>> install.packages(new.packages, dependencies = TRUE)
>>>> sapply(packages, require, character.only = TRUE)
>>>> }
>>>>
>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>> packages_install(packages_required)
>>>>
>>>> # Working directory
>>>> setwd("C:/Users/User/R")
>>>>
>>>>
>>>>
>>>> ## shapefile_us
>>>>
>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>
>>>> # Columns removal
>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>
>>>> # Column rename: ZCTA5CE10
>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>
>>>> # Column class change: zipcode
>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>
>>>>
>>>>
>>>> ## polygon_nyc
>>>>
>>>> # Zip code not available in shapefile: 11695
>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>
>>>>
>>>>
>>>> ## weight_matrix
>>>>
>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>
>>>> # Include neighbour itself as a neighbour
>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>
>>>> # Weights to each neighboring polygon
>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>
>>>>
>>>>
>>>> ## listings
>>>>
>>>> # Data import
>>>> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>> # SIMPLIFY = FALSE returns a list of data frames, as bind_rows() needs
>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>> listings <- listings %>% bind_rows
>>>>
>>>> # Characters removal
>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled, fixed=TRUE)
>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed=TRUE)
>>>> listings$price <- gsub("\\$", "", listings$price)
>>>> listings$price <- gsub(",", "", listings$price)
>>>>
>>>>
>>>>
>>>> ## timedummy
>>>>
>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>> timedummy <- gsub("-", "_", timedummy)
>>>>
>>>>
>>>>
>>>> ## OLS regression
>>>>
>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>
>>>> ########
>>>>
>>>> Some of my id's repeat in multiple time periods.
>>>>
>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>
>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>
>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>
>>>> Again, thank you very much for the help provided until now.
>>>>
>>>> Best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Tuesday, November 5, 2019 15:30
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>
>>>>> I have a large pooled cross-section data set. I would like to
>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>> for now that spatial dependence is present in both the dependent
>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>> variable for each time period. I also created a weight matrix using the
>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>> think it would be possible to calculate this. What I would like to know
>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>> or splm? Thank you and best regards, Robert
>>>>
>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>> of the covariates time-varying? Please provide motivation and use area
>>>> (preferably with affiliation; your email and user name are not
>>>> informative) - this feels like a real estate problem, probably wrongly
>>>> specified. You should use splm if time makes sense in your case, but if it
>>>> really doesn't, simplify your approach, as much of the data will be
>>>> subject to very large temporal autocorrelation.
>>>>
>>>> If this is a continuation of your previous question about using
>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>
>>>> Roger
>>>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en


Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
Sorry for late reply, am indisposed and unable to help further. I feel
that there is so much noise in your data (differences in offers, rental
lengths, repeats or not, etc.), that you will certainly have to subset
vigorously first to isolate response cases that are comparable. What you
are trying to disentangle are the hedonic components in the bundle where
you just have price as response, but lots of other bundle characteristics
on the right hand side (days, etc.). I feel you'd need to try to get to a
response index of price per day per rental area or some such. I'd
certainly advise examining responses to a specific driver (major concert
or sports event) to get a feel for how the market responds, and return to
spatial hedonic after finding an approach that gives reasonable aspatial
outcomes.
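
As a rough illustration of such an aspatial baseline, the sketch below compares lm() with lme4::lmer() using only a zipcode IID random effect, as suggested earlier in the thread. The data are simulated and every variable name and value (bedrooms, entire, the effect sizes) is an assumption made for the example, not taken from the Airbnb data; it is a starting point under those assumptions, not a recommended final specification.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for the listings data (all names and values are hypothetical)
n_zip <- 30   # number of zipcodes
n_per <- 40   # listings per zipcode
sim <- data.frame(
  zipcode  = factor(rep(seq_len(n_zip), each = n_per)),
  bedrooms = sample(1:4, n_zip * n_per, replace = TRUE),
  entire   = rbinom(n_zip * n_per, 1, 0.5)
)

# Unobserved zipcode-level effect plus listing-level noise
zip_effect <- rnorm(n_zip, sd = 0.3)
sim$log_price <- 4 + 0.25 * sim$bedrooms + 0.5 * sim$entire +
  zip_effect[as.integer(sim$zipcode)] + rnorm(nrow(sim), sd = 0.4)

# Plain hedonic regression versus the same model with a zipcode IID random effect
fit_lm   <- lm(log_price ~ bedrooms + entire, data = sim)
fit_lmer <- lmer(log_price ~ bedrooms + entire + (1 | zipcode), data = sim)

# Hedonic estimates side by side, then the variance components
cbind(lm = coef(fit_lm), lmer = fixef(fit_lmer))
VarCorr(fit_lmer)
```

If the two sets of hedonic estimates are similar and the zipcode variance is not negligible, that would be a reasonable testbed from which to add spatial structure later.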

Roger

On Sun, 17 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your message and sorry for my late answer.
>
> Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:
>
> unique ids: 180.004
> time periods: 54 (2015-01 to 2019-09)
> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)
>
> Note that hosts can change the room category (entire home/apt, private room or shared room) while keeping the same listing id. In my data, 7.204 unique ids changed room_type at some point.
>
> --
>
> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>
> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>
>
> --
> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>
> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
> --
>
> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>
> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).
>
> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I only estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) in a "one-level" basis, i.e., at the higher-level.
>
> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>
> (Where the "model" formula contains the 54 time dummy variables)
>
> Do you think I can proceed with this model? I was able to calculate it.
>
> If I remove all observations/rows with NAs in one of the chosen variables/observations, 884.183 observations remain. If I would create a W matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. This is the reason why I put W = NULL.
>
>
> Thank you and best regards
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Monday, November 11, 2019 11:31
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Sun, 10 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Again, thank you for your answer. I read the material provided and
>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>> right model for me.
>>
>> I indeed have the precise latitude and longitude information for all my
>> listings for NYC.
>>
>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>> observations called listings_sample and tried to replicate the hsar
>> model, please see below.
>>
>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>
> Unless you know definitely that you want to relate the response to its
> lagged value, you do not need this. Do note that the matrix is very
> sparse, so could be fitted without difficulty with ML in a cross-sectional
> model.
>
>>
>> You recommended then to introduce a Markov random field (MRF) random
>> effect (RE) at the zipcode level, but I did not understand it so well.
>> Could you elaborate a little more?
>>
>
> Did you read the development in
> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
> includes code for fitting the Beijing housing parcels data set from HSAR
> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
> try to create a model that works on a single borough, using the zipcodes
> in that borough as a proxy for unobserved neighbourhood effects. Try for
> example using lme4::lmer() with only a zipcode IID random effect, see if
> the hedonic estimates are similar to lm(), and leave adding an MRF RE
> (with for example mgcv::gam() or hglm::hglm()) until you have a working
> testbed. Then advance step-by-step from there.
>
> You still have not said how many repeat lettings you see - it will affect
> the way you specify your model.
>
> Roger
>
>> ##############
>> library(spdep)
>> library(HSAR)
>> library(dplyr)
>> library(sf)      # st_drop_geometry()
>> library(Matrix)  # "dgCMatrix" class
>> library(splitstackshape)
>>
>>
>> # Stratified sample per zipcode (size = 1%)
>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>
>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% c(unique(as.character(listings_sample$zipcode))))
>>
>>
>> ## Random effect matrix (N by J)
>>
>> # N: 22172
>> # J: 154
>>
>> # Arrange listings_sample by zipcode (ascending)
>> listings_sample <- listings_sample %>% arrange(zipcode)
>>
>> # Count number of listings per zipcode
>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>> # sum(MM$count)
>>
>> # N by J nulled matrix creation
>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>
>> # The total number of neighbourhood
>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>
>> for(i in 1:dim(MM)[1]) {
>>  Delta[Uid==i,i] <- 1
>> }
>> rm(i)
>>
>> Delta <- as(Delta,"dgCMatrix")
>>
>>
>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>
>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)
>>
>> # Include neighbour itself as a neighbour
>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>
>> # Spatial weights matrix for nb
>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>
>>
>> ## Fit HSAR SAR upper level random effect
>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>
>> betas = coef(lm(formula = model, data = listings_sample))
>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>
>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>
>> ##############
>>
>> Thank you and best regards
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Friday, November 8, 2019 13:29
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Fri, 8 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Thank you for your answer.
>>>
>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>
>>> But for a dataset of over 2 million observations, I get the following
>>> error: "Error: cannot allocate vector of size 840 Kb".
>>
>> I don't think the observations are helpful. If you have repeat lets in the
>> same property in a given month, you need to handle that anyway. I'd go for
>> making the modelling exercise work (we agree that this is not panel data,
>> right?) on a small subset first. I would further argue that you need a
>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>> RE. You could very well take (stratified) samples per zipcode to represent
>> your data. Once that works, introduce an MRF RE at the zipcode level,
>> where you do know relative position. Using SARAR is going to be a waste of
>> time unless you can geocode the letting addresses. A multi-level approach
>> will work. Having big data in your case with no useful location
>> information per observation is just adding noise and over-smoothing, I'm
>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>> will work, also when you sample the within zipcode lets, given a split
>> into training and test sets, and making CV possible.
>>
>> Roger
>>
>>>
>>> I am expecting that at least 500.000 observations will be dropped due to
>>> the lack of values for the chosen variables for the regression model, so
>>> probably I will filter and remove the observations/rows that will not be
>>> used anyway - do you know if there is any package that does this
>>> automatically, given the variables/columns chosen by me?
>>>
>>> Or would you recommend me another approach to avoid the above mentioned
>>> error?
>>>
>>> Thank you and best regards,
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Thursday, November 7, 2019 10:13
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Many thanks for your help.
>>>>
>>>> I have an additional question:
>>>>
>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>> merging with the sf object polygon_nyc with the function
>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>> a huge n x n matrix (depending of the size of my data set).
>>>>
>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>> object has only n = 177.
>>>>
>>>> Of course running
>>>>
>>>> spatialreg::lagsarlm(formula=model, data = listings_sample,
>>>> listw = polygon_nyc_lw, tol.solve=1.0e-10)
>>>>
>>>> does not work ("Input data and weights have different dimensions").
>>>>
>>>> Is the only option to take my data set, merge it to my polygon_nyc (by
>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>
>>> I think we are getting more clarity. You do not know the location of the
>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>> areas, and can create a neighbour object from these boundaries. You then
>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>> lettings in i. This is the data structure that motivated the
>>> spdep::nb2blocknb() function:
>>>
>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>
>>> Try running the examples to get a feel for what is going on.
>>>
>>> I feel that most of the variability will vanish in the very large numbers
>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>> for the lettings themselves, I don't think you can make much progress.
>>>
>>> You could try a linear mixed model (or gam with a spatially structured
>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>> package, articles by Dong et al., and maybe
>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>> effects at the zipcode level might be a way forward, together with an IID
>>> random effect at the same level (equivalent to self-neighbours).
>>>
>>> Hope this helps,
>>>
>>> Roger
>>>
>>>>
>>>> Best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Wednesday, November 6, 2019 15:07
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>> plain text.
>>>>>
>>>>> I will give a better context for my desired outcome.
>>>>>
>>>>> I am taking Airbnb's listings information for New York City available
>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>
>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>> set, I create a new "date_compiled" variable/column.
>>>>>
>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>> million observations.
>>>>
>>>> You have repeat lettings for some, but not all properties. So this is at
>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>> see temporal movement (trend/seasonal).
>>>>
>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>> hundreds of properties, and working from there. Do not include the
>>>> observation as its own neighbour, perhaps identify repeats and handle them
>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>
>>>> So this although promising isn't simple, and getting to a hedonic model
>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>> necessarily trust OLS output either, partly because of the repeat property
>>>> issue.
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> I created 54 timedummy variables for each time period available.
>>>>>
>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>> a variety of characteristics which potentially determine the daily rate
>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>>> from the listing, income per capita, etc.).
>>>>>
>>>>> My dependent variable is price (log price, common in the related
>>>>> literature for hedonic prices).
>>>>>
>>>>> The OLS model is done.
>>>>>
>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>> pricing of their listings, take not only into account its structural and
>>>>> location characteristics, but also the prices charged by near listings
>>>>> with similar characteristics - spatial autocorrelation is then present,
>>>>> at least spatial dependence is present in the dependent variable.
>>>>>
>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>> itself as a neighbor.
>>>>>
>>>>> Parts of my code can be found below:
>>>>>
>>>>> ########
>>>>>
>>>>> ## packages
>>>>>
>>>>> packages_install <- function(packages){
>>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>> if (length(new.packages))
>>>>> install.packages(new.packages, dependencies = TRUE)
>>>>> sapply(packages, require, character.only = TRUE)
>>>>> }
>>>>>
>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>> packages_install(packages_required)
>>>>>
>>>>> # Working directory
>>>>> setwd("C:/Users/User/R")
>>>>>
>>>>>
>>>>>
>>>>> ## shapefile_us
>>>>>
>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>
>>>>> # Columns removal
>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>
>>>>> # Column rename: ZCTA5CE10
>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>
>>>>> # Column class change: zipcode
>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>
>>>>>
>>>>>
>>>>> ## polygon_nyc
>>>>>
>>>>> # Zip code not available in shapefile: 11695
>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>
>>>>>
>>>>>
>>>>> ## weight_matrix
>>>>>
>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>
>>>>> # Include neighbour itself as a neighbour
>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>
>>>>> # Weights to each neighboring polygon
>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>
>>>>>
>>>>>
>>>>> ## listings
>>>>>
>>>>> # Data import
>>>>> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>>> # SIMPLIFY = FALSE returns a list of data frames, as bind_rows() needs
>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>>> listings <- listings %>% bind_rows
>>>>>
>>>>> # Characters removal
>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled, fixed=TRUE)
>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed=TRUE)
>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>
>>>>>
>>>>>
>>>>> ## timedummy
>>>>>
>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>
>>>>>
>>>>>
>>>>> ## OLS regression
>>>>>
>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>
>>>>> ########
>>>>>
>>>>> Some of my id's repeat in multiple time periods.
>>>>>
>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>
>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>
>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>
>>>>> Again, thank you very much for the help provided until now.
>>>>>
>>>>> Best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <[hidden email]>
>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>> To: Robert R
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>
>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>> for now that spatial dependence is present in both the dependent
>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>>> variable for each time period. I also created a weight matrix using the
>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>>> think it would be possible to calculate this. What I would like to know
>>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>>> or splm? Thank you and best regards, Robert
>>>>>
>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>> (preferably with affiliation; your email and user name are not
>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>> subject to very large temporal autocorrelation.
>>>>>
>>>>> If this is a continuation of your previous question about using
>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>
>>>>> Roger
>>>>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en


Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
Yes, this is expected: the neighbours within a single zip code block form a dense matrix, and there will be multiple such blocks. (15000^2)*8 is 1.8e+09 bytes, so a dense matrix of that size (about 1.8 GB) will max out your RAM. There is no way to look at block neighbours in that format without subsetting your data (think train/test); use a zip code random effect instead. I would certainly drop all attempts to examine spatial dependency until you get an aspatial multilevel hedonic model working.
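
The arithmetic can be checked directly. The following is only a sketch on made-up sizes (the three block sizes and the object names are assumptions, not the poster's data): it computes the storage of a dense n x n matrix of doubles and then builds a small block-diagonal neighbour matrix with the Matrix package to show that, even in sparse form, the number of stored entries grows with the square of each block size.

```r
library(Matrix)

# Dense storage: n^2 doubles at 8 bytes each
n <- 15000
dense_bytes <- n^2 * 8
dense_bytes / 1e9   # size in decimal GB

# Hypothetical membership: 3 zipcodes holding 4, 3 and 5 listings
zip_sizes <- c(4, 3, 5)

# Within a zipcode every listing neighbours every other one (no self-neighbours)
blocks <- lapply(zip_sizes, function(k) {
  b <- matrix(1, nrow = k, ncol = k)
  diag(b) <- 0
  b
})

# Sparse block-diagonal matrix over all listings
W_block <- bdiag(blocks)

dim(W_block)                        # total listings by total listings
nnzero(W_block)                     # stored neighbour links
sum(zip_sizes * (zip_sizes - 1))    # same count from the block sizes
```

With thousands of listings per zipcode the sum of k * (k - 1) stays very large, which is why subsetting or a zipcode-level random effect is the more practical route.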

Roger

--
Roger Bivand
Norwegian School of Economics
Helleveien 30, 5045 Bergen, Norway
[hidden email]


________________________________________
From: Robert R <[hidden email]>
Sent: Tuesday, November 26, 2019 21:04
To: Roger Bivand
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

Dear Roger,

Thank you for your e-mail. Actually there is less noise than it seems. Rental prices are daily rental prices and I have an extract of all Airbnb listings' daily prices once a month for a period of 4 years. Each listing's information contains the lat, lon, number of bedrooms, category (entire home/apt, shared room or private room), etc.

One question regarding the spdep::nb2blocknb function: it runs super fast with up to n = 1000, and always crashes my R session with n = 15000 or so. Is there an alternative to solve this problem?

Thank you and best regards,
Robert

________________________________________
From: Roger Bivand <[hidden email]>
Sent: Tuesday, November 26, 2019 20:48
To: Robert R
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

Sorry for late reply, am indisposed and unable to help further. I feel
that there is so much noise in your data (differences in offers, rental
lengths, repeats or not, etc.), that you will certainly have to subset
vigorously first to isolate response cases that are comparable. What you
are trying to disentangle are the hedonic components in the bundle where
you just have price as response, but lots of other bundle characteristics
on the right hand side (days, etc.). I feel you'd need to try to get to a
response index of price per day per rental area or some such. I'd
certainly advise examining responses to a specific driver (major concert
or sports event) to get a feel for how the market responds, and return to
spatial hedonic after finding an approach that gives reasonable aspatial
outcomes.

Roger

On Sun, 17 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your message and sorry for my late answer.
>
> Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:
>
> unique ids: 180.004
> time periods: 54 (2015-01 to 2019-09)
> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)
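
A breakdown like the one above can be reproduced along these lines. This is only a sketch on made-up data: listings_toy and its two columns are assumptions standing in for the real listings table.

```r
# Toy data standing in for the real listings table (assumed columns: id, date_compiled)
listings_toy <- data.frame(
  id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4),
  date_compiled = c("2015-01", "2015-02", "2015-03", "2015-01", "2015-01", "2015-02",
                    sprintf("2016-%02d", 1:12))
)

# Number of periods in which each id appears
appearances <- table(listings_toy$id)

# Bin the counts into the same ranges used in the list above
bins <- cut(as.integer(appearances),
            breaks = c(0, 1, 10, 30, 54),
            labels = c("1", "2-10", "11-30", "31-54"))
repeat_summary <- table(bins)

# Share of ids in each bin, in percent
round(100 * prop.table(repeat_summary), 1)
```
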
>
> Important to notice is that hosts can change the room_category (between entire home/apt, private room and shared room) while keeping the same listing id number. In my data, 7.204 unique ids changed the room_type at some point.
>
> --
>
> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>
> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>
>
> --
> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>
> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
> --
>
> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>
> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).
>
> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I only estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) in a "one-level" basis, i.e., at the higher-level.
>
> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>
> (Where the "model" formula contains the 54 time dummy variables)
>
> Do you think I can proceed with this model? I was able to calculate it.
>
> If I remove all observations/rows with NAs in one of the chosen variables, 884.183 observations remain. If I created a W matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. This is the reason why I put W = NULL.
>
>
> Thank you and best regards
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Monday, November 11, 2019 11:31
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Sun, 10 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Again, thank you for your answer. I read the material provided and
>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>> right model for me.
>>
>> I indeed have the precise latitude and longitude information for all my
>> listings for NYC.
>>
>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>> observations called listings_sample and tried to replicate the hsar
>> model, please see below.
>>
>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>
> Unless you know definitely that you want to relate the response to its
> lagged value, you do not need this. Do note that the matrix is very
> sparse, so could be fitted without difficulty with ML in a cross-sectional
> model.
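
To illustrate the sparsity point, here is a rough sketch using the Matrix package. The neighbour structure is artificial (each observation is linked to the next k observations by index, which is an assumption and not a real nearest-neighbour search); only the size n = 22172 comes from the message above.

```r
library(Matrix)

n <- 22172   # sample size quoted above
k <- 6       # assumed number of neighbours per listing (illustrative)

# Artificial neighbours: observation i is linked to i+1, ..., i+k (wrapping around),
# so there are no self-neighbours and exactly k links per row
i <- rep(seq_len(n), each = k)
j <- ((i - 1 + rep(seq_len(k), times = n)) %% n) + 1
W <- sparseMatrix(i = i, j = j, x = 1 / k, dims = c(n, n))

nnzero(W) / n^2                        # share of non-zero cells
format(object.size(W), units = "MB")   # storage used by the sparse matrix
n^2 * 8 / 1e6                          # MB a dense copy of the same matrix would need
```

With a fixed number of neighbours per row, storage grows linearly in n rather than with n squared.
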
>
>>
>> You recommended then to introduce a Markov random field (MRF) random
>> effect (RE) at the zipcode level, but I did not understand it so well.
>> Could you elaborate a little more?
>>
>
> Did you read the development in
> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
> includes code for fitting the Beijing housing parcels data set from HSAR
> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
> try to create a model that works on a single borough, using the zipcodes
> in that borough as a proxy for unobserved neighbourhood effects. Try for
> example using lme4::lmer() with only a zipcode IID random effect, see if
> the hedonic estimates are similar to lm(), and leave adding an MRF RE
> (with for example mgcv::gam() or hglm::hglm()) until you have a working
> testbed. Then advance step-by-step from there.
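
A minimal sketch of that first step, on simulated data rather than the NYC listings. The variable names, effect sizes and sample sizes are all assumptions; the point is only the comparison of lm() with lme4::lmer() using a zipcode IID random intercept.

```r
library(lme4)

set.seed(42)
n_zip <- 20
n <- 2000

# Simulated stand-in for one borough: 20 zipcodes, one covariate
zipcode <- factor(sample(sprintf("zip%02d", 1:n_zip), n, replace = TRUE))
zip_effect <- rnorm(n_zip, sd = 0.4)          # unobserved neighbourhood effect
bedrooms <- sample(1:4, n, replace = TRUE)
log_price <- 4 + 0.3 * bedrooms + zip_effect[as.integer(zipcode)] + rnorm(n, sd = 0.5)
toy <- data.frame(log_price, bedrooms, zipcode)

# Aspatial baseline and multilevel model with a zipcode IID random intercept
fit_lm  <- lm(log_price ~ bedrooms, data = toy)
fit_lmm <- lmer(log_price ~ bedrooms + (1 | zipcode), data = toy)

# Compare the hedonic (fixed-effect) estimates from the two fits
cbind(lm = coef(fit_lm), lmer = fixef(fit_lmm))
```

If the two columns differ markedly on real data, that would suggest the covariates are correlated with zipcode-level effects.
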
>
> You still have not said how many repeat lettings you see - it will affect
> the way you specify your model.
>
> Roger
>
>> ##############
>> library(spdep)
>> library(HSAR)
>> library(dplyr)
>> library(sf)  # provides st_drop_geometry(), used below
>> library(splitstackshape)
>>
>>
>> # Stratified sample per zipcode (size = 1%)
>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>
>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% c(unique(as.character(listings_sample$zipcode))))
>>
>>
>> ## Random effect matrix (N by J)
>>
>> # N: 22172
>> # J: 154
>>
>> # Arrange listings_sample by zipcode (ascending)
>> listings_sample <- listings_sample %>% arrange(zipcode)
>>
>> # Count number of listings per zipcode
>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>> # sum(MM$count)
>>
>> # N by J nulled matrix creation
>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>
>> # The total number of neighbourhood
>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>
>> for(i in 1:dim(MM)[1]) {
>>  Delta[Uid==i,i] <- 1
>> }
>> rm(i)
>>
>> Delta <- as(Delta,"dgCMatrix")
>>
>>
>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>
>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)
>>
>> # Include neighbour itself as a neighbour
>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>
>> # Spatial weights matrix for nb
>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>
>>
>> ## Fit HSAR SAR upper level random effect
>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>
>> betas = coef(lm(formula = model, data = listings_sample))
>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>
>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>
>> ##############
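
The N by J indicator matrix Delta in the code above can also be built directly in sparse form, without first allocating a dense matrix. A sketch with toy zipcode labels (the labels are made up; Matrix::sparseMatrix() returns a dgCMatrix here):

```r
library(Matrix)

# Toy zipcode labels, already sorted as in the code above
zipcode <- c("10001", "10001", "10002", "10003", "10003", "10003")

# Column index of each observation's zipcode (1..J)
Uid <- as.integer(factor(zipcode))

# N by J indicator matrix with exactly one non-zero entry per row
Delta <- sparseMatrix(i = seq_along(Uid), j = Uid, x = 1,
                      dims = c(length(Uid), max(Uid)))
Delta
```

This avoids the N x J dense allocation, which matters once N approaches the full data set.
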
>>
>> Thank you and best regards
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Friday, November 8, 2019 13:29
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Fri, 8 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Thank you for your answer.
>>>
>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>
>>> But for a dataset of over 2 million observations, I get the following
>>> error: "Error: cannot allocate vector of size 840 Kb".
>>
>> I don't think the observations are helpful. If you have repeat lets in the
>> same property in a given month, you need to handle that anyway. I'd go for
>> making the modelling exercise work (we agree that this is not panel data,
>> right?) on a small subset first. I would further argue that you need a
>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>> RE. You could very well take (stratified) samples per zipcode to represent
>> your data. Once that works, introduce an MRF RE at the zipcode level,
>> where you do know relative position. Using SARAR is going to be a waste of
>> time unless you can geocode the letting addresses. A multi-level approach
>> will work. Having big data in your case with no useful location
>> information per observation is just adding noise and over-smoothing, I'm
>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>> will work, also when you sample the within zipcode lets, given a split
>> into training and test sets, and making CV possible.
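
A sketch of a stratified split by zipcode in base R, on toy data. The 20% sampling share, the column names and the toy table are assumptions chosen for illustration:

```r
set.seed(123)
toy <- data.frame(id = 1:100,
                  zipcode = rep(c("10001", "10002", "10003", "10004"), each = 25))

# Within each zipcode, draw 20% of the rows for the training sample
train_rows <- unlist(lapply(split(seq_len(nrow(toy)), toy$zipcode), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.2 * length(rows)))]
}))

train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Every zipcode is represented in the training sample
table(train$zipcode)
```

Repeating the draw with different seeds gives the folds needed for cross-validation.
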
>>
>> Roger
>>
>>>
>>> I am expecting that at least 500.000 observations will be dropped due
>>> to the lack of values for the chosen variables for the regression model, so
>>> probably I will filter and remove the observations/rows that will not be
>>> used anyway - do you know if there is any package that does this
>>> automatically, given the variables/columns chosen by me?
>>>
>>> Or would you recommend me another approach to avoid the above mentioned
>>> error?
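
On dropping rows with missing values in the chosen variables only, one base R possibility is sketched below. The toy data frame and the formula are assumptions; all.vars() and complete.cases() are base/stats functions.

```r
toy <- data.frame(
  log_price = c(4.1, NA, 5.0, 4.7),
  bedrooms  = c(1, 2, NA, 3),
  notes     = c(NA, "a", "b", NA)   # not in the model, so its NAs are ignored
)
model <- log_price ~ bedrooms

# all.vars() extracts the variable names used in the formula
vars <- all.vars(model)

# Keep only rows that are complete in those variables
toy_complete <- toy[complete.cases(toy[, vars]), ]
nrow(toy_complete)
```

Filtering before building the weights keeps the data and the weights object at the same dimension.
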
>>>
>>> Thank you and best regards,
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Thursday, November 7, 2019 10:13
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Many thanks for your help.
>>>>
>>>> I have an additional question:
>>>>
>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>> merging with the sf object polygon_nyc with the function
>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>> a huge n x n matrix (depending of the size of my data set).
>>>>
>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>> object has only n = 177.
>>>>
>>>> Of course running
>>>>
>>>> spatialreg::lagsarlm(formula=model, data = listings_sample,
>>>> listw = polygon_nyc_lw, tol.solve=1.0e-10)
>>>>
>>>> does not work ("Input data and weights have different dimensions").
>>>>
>>>> Is the only option to take my data set, merge it to my polygon_nyc (by
>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>
>>> I think we are getting more clarity. You do not know the location of the
>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>> areas, and can create a neighbour object from these boundaries. You then
>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>> lettings in i. This is the data structure that motivated the
>>> spdep::nb2blocknb() function:
>>>
>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>
>>> Try running the examples to get a feel for what is going on.
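
To get a feel for how many neighbour links this block structure implies, here is a small base R sketch. The three areas, their counts and their adjacency are invented for illustration; it only counts links and does not call spdep.

```r
# Toy example: 3 zipcode areas in a row (1 - 2 - 3), with counts of lettings in each
n_per_zip <- c(500, 1000, 200)
zip_nb <- list(2L, c(1L, 3L), 2L)   # contiguity neighbours of each area

# Each letting in area i is linked to the other lettings in i
# plus all lettings in the areas neighbouring i
links_per_letting <- sapply(seq_along(n_per_zip), function(i) {
  (n_per_zip[i] - 1) + sum(n_per_zip[zip_nb[[i]]])
})

# Total number of directed links in the blocked-up neighbour list
total_links <- sum(n_per_zip * links_per_letting)
links_per_letting
total_links
```

Every letting ends up with well over a thousand neighbours even in this small case, which is the over-smoothing concern.
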
>>>
>>> I feel that most of the variability will vanish in the very large numbers
>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>> for the lettings themselves, I don't think you can make much progress.
>>>
>>> You could try a linear mixed model (or gam with a spatially structured
>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>> package, articles by Dong et al., and maybe
>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>> effects at the zipcode level might be a way forward, together with an IID
>>> random effect at the same level (equivalent to self-neighbours).
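
A rough sketch of an MRF random effect at the zone level with mgcv::gam(), on simulated data. The zone layout (eight zones on a line), the covariate and all effect sizes are assumptions; the neighbourhood list is passed through xt = list(nb = ...), with names matching the factor levels.

```r
library(mgcv)

set.seed(7)
J <- 8
zips <- sprintf("z%02d", 1:J)

# Assumed neighbourhood structure: zones on a line, each touching the next
nb <- lapply(1:J, function(i) setdiff(c(i - 1, i + 1), c(0, J + 1)))
names(nb) <- zips

n <- 800
toy <- data.frame(zipcode = factor(sample(zips, n, replace = TRUE), levels = zips),
                  bedrooms = sample(1:4, n, replace = TRUE))
zone_effect <- sin(seq_len(J) / 2)            # smooth spatial pattern across zones
toy$log_price <- 4 + 0.3 * toy$bedrooms +
  zone_effect[as.integer(toy$zipcode)] + rnorm(n, sd = 0.5)

# Spatially structured (MRF) random effect at the zipcode level
fit_mrf <- gam(log_price ~ bedrooms + s(zipcode, bs = "mrf", xt = list(nb = nb)),
               data = toy, method = "REML")
summary(fit_mrf)
```

An IID effect at the same level could be added with s(zipcode, bs = "re"), though the two terms may be hard to separate with few zones.
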
>>>
>>> Hope this helps,
>>>
>>> Roger
>>>
>>>>
>>>> Best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Wednesday, November 6, 2019 15:07
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>> plain text.
>>>>>
>>>>> I will give a better context for my desired outcome.
>>>>>
>>>>> I am taking Airbnb's listings information for New York City available
>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>
>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>> set, I create a new "date_compiled" variable/column.
>>>>>
>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>> million observations.
>>>>
>>>> You have repeat lettings for some, but not all properties. So this is at
>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>> see temporal movement (trend/seasonal).
>>>>
>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>> hundreds of properties, and working from there. Do not include the
>>>> observation as its own neighbour, perhaps identify repeats and handle them
>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>
>>>> So this although promising isn't simple, and getting to a hedonic model
>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>> necessarily trust OLS output either, partly because of the repeat property
>>>> issue.
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> I created 54 timedummy variables for each time period available.
>>>>>
>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>> a variety of characteristics which potentially determine the daily rate
>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>>> from the listing, income per capita, etc.).
>>>>>
>>>>> My dependent variable is price (log price, common in the related
>>>>> literature for hedonic prices).
>>>>>
>>>>> The OLS model is done.
>>>>>
>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>> pricing of their listings, take not only into account its structural and
>>>>> location characteristics, but also the prices charged by near listings
>>>>> with similar characteristics - spatial autocorrelation is then present,
>>>>> at least spatial dependence is present in the dependent variable.
>>>>>
>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>> itself as a neighbor.
>>>>>
>>>>> Parts of my code can be found below:
>>>>>
>>>>> ########
>>>>>
>>>>> ## packages
>>>>>
>>>>> packages_install <- function(packages){
>>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>> if (length(new.packages))
>>>>> install.packages(new.packages, dependencies = TRUE)
>>>>> sapply(packages, require, character.only = TRUE)
>>>>> }
>>>>>
>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>> packages_install(packages_required)
>>>>>
>>>>> # Working directory
>>>>> setwd("C:/Users/User/R")
>>>>>
>>>>>
>>>>>
>>>>> ## shapefile_us
>>>>>
>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>
>>>>> # Columns removal
>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>
>>>>> # Column rename: ZCTA5CE10
>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>
>>>>> # Column class change: zipcode
>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>
>>>>>
>>>>>
>>>>> ## polygon_nyc
>>>>>
>>>>> # Zip code not available in shapefile: 11695
>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>
>>>>>
>>>>>
>>>>> ## weight_matrix
>>>>>
>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>
>>>>> # Include neighbour itself as a neighbour
>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>
>>>>> # Weights to each neighboring polygon
>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>
>>>>>
>>>>>
>>>>> ## listings
>>>>>
>>>>> # Data import
>>>>> files <- list.files(path = "Listings", pattern = "\\.csv$", full.names = TRUE)
>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding = "UTF-8")), files)
>>>>> # SIMPLIFY = FALSE keeps a list of data frames instead of collapsing it into a matrix
>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>>> listings <- listings %>% bind_rows
>>>>>
>>>>> # Characters removal (fixed = TRUE matches the strings literally, not as regex)
>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled, fixed = TRUE)
>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed = TRUE)
>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>
>>>>>
>>>>>
>>>>> ## timedummy
>>>>>
>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>
>>>>>
>>>>>
>>>>> ## OLS regression
>>>>>
>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>
>>>>> ########
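
As an alternative to pasting 54 dummy names into a formula string, the period can enter the model as a factor. The sketch below uses toy data and plain lm(); the covariate list is hypothetical and stands in for the variables elided in the code above.

```r
set.seed(1)
toy <- data.frame(
  log_price = rnorm(60, mean = 4.5, sd = 0.3),
  bedrooms = sample(1:3, 60, replace = TRUE),
  date_compiled = rep(c("2015-01-01", "2015-02-01", "2015-03-01"), each = 20)
)

# Hypothetical covariate list; the real model uses the author's own variables
covariates <- c("bedrooms")

# factor(date_compiled) expands into one dummy per period (minus a baseline)
model <- reformulate(c(covariates, "factor(date_compiled)"), response = "log_price")
fit <- lm(model, data = toy)
names(coef(fit))
```

The first period becomes the baseline, so the period coefficients read as differences from it.
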
>>>>>
>>>>> Some of my id's repeat in multiple time periods.
>>>>>
>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>
>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>
>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>
>>>>> Again, thank you very much for the help provided until now.
>>>>>
>>>>> Best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <[hidden email]>
>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>> To: Robert R
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>
>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>> for now that spatial dependence is present in both the dependent
>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>>> variable for each time period. I also created a weight matrix using the
>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>>> think it would be possible to calculate this. What I would like to know
>>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>>> or splm? Thank you and best regards, Robert
>>>>>
>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>> (preferably with affiliation; your email and user name are not
>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>> subject to very large temporal autocorrelation.
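
The sparse Kronecker product point can be sketched with the Matrix package. The cross-sectional W below is an artificial chain of areas (an assumption, not the NYC contiguity), and the sizes N and T_periods are illustrative:

```r
library(Matrix)

N <- 177          # illustrative number of spatial units
T_periods <- 54   # number of time periods

# Assumed toy W: each area neighbours the next one along a line
idx <- seq_len(N - 1)
W <- sparseMatrix(i = c(idx, idx + 1), j = c(idx + 1, idx), x = 1, dims = c(N, N))
W <- Diagonal(x = 1 / rowSums(W)) %*% W     # row-standardise

# Block-diagonal space-time weights: one copy of W per period
W_big <- kronecker(Diagonal(T_periods), W)
dim(W_big)
nnzero(W_big)
```

Only the non-zero entries are stored, so the space-time matrix stays small even though its nominal dimension is N * T_periods squared.
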
>>>>>
>>>>> If this is a continuation of your previous question about using
>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>
>>>>> Roger

Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Again, thank you for your answer. What do you mean by "zip code random effect"? Do you mean I should use model = "random" in plm?

regression_re <- plm(formula = model, data = listings, model = "random", index = c("id", "date_compiled"))

And is there any other method for dealing with large weight matrices in spatialreg::lagsarlm?

Thank you and best regards,
Robert

________________________________________
From: Roger Bivand <[hidden email]>
Sent: Wednesday, November 27, 2019 13:53
To: Robert R
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

Yes, this is expected: the neighbours within a single zip code block form a dense matrix, and there will be multiple such matrices. (15000^2)*8 is 1.8e+09 bytes, so such a dense matrix will max out your RAM. There is no way to look at block neighbours in that format without subsetting your data (think train/test); use a zip code random effect instead. I would certainly drop all attempts to examine spatial dependency until you get an aspatial multilevel hedonic model working.

Roger

--
Roger Bivand
Norwegian School of Economics
Helleveien 30, 5045 Bergen, Norway
[hidden email]


________________________________________
From: Robert R <[hidden email]>
Sent: Tuesday, November 26, 2019 21:04
To: Roger Bivand
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

Dear Roger,

Thank you for your e-mail. Actually there is less noise than it seems. Rental prices are daily rental prices, and I have an extract of the daily prices of all Airbnb listings once a month for a period of 4 years. Each listing's information contains the lat, lon, number of bedrooms, category (entire home/apt, shared room or private room), etc.

One question regarding the spdep::nb2blocknb function: it runs super fast with up to n = 1000, and always crashes my R session with n = 15000 or so. Is there an alternative to solve this problem?

Thank you and best regards,
Robert

________________________________________
From: Roger Bivand <[hidden email]>
Sent: Tuesday, November 26, 2019 20:48
To: Robert R
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

Sorry for late reply, am indisposed and unable to help further. I feel
that there is so much noise in your data (differences in offers, rental
lengths, repeats or not, etc.), that you will certainly have to subset
vigorously first to isolate response cases that are comparable. What you
are trying to disentangle are the hedonic components in the bundle where
you just have price as response, but lots of other bundle characteristics
on the right hand side (days, etc.). I feel you'd need to try to get to a
response index of price per day per rental area or some such. I'd
certainly advise examining responses to a specific driver (major concert
or sports event) to get a feel for how the market responds, and return to
spatial hedonic after finding an approach that gives reasonable aspatial
outcomes.

Roger

On Sun, 17 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your message and sorry for my late answer.
>
> Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:
>
> unique ids: 180.004
> time periods: 54 (2015-01 to 2019-09)
> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)
>
> Important to notice is that hosts can change the room_category (between entire home/apt, private room and shared room) while keeping the same listing id number. In my data, 7.204 unique ids changed the room_type at some point.
>
> --
>
> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>
> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>
>
> --
> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>
> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
> --
>
> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>
> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).
>
> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I only estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) in a "one-level" basis, i.e., at the higher-level.
>
> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>
> (Where the "model" formula contains the 54 time dummy variables)
>
> Do you think I can proceed with this model? I was able to calculate it.
>
> If I remove all observations/rows with NAs in one of the chosen variables, 884.183 observations remain. If I created a W matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. This is the reason why I put W = NULL.
>
>
> Thank you and best regards
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Monday, November 11, 2019 11:31
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Sun, 10 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Again, thank you for your answer. I read the material provided and
>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>> right model for me.
>>
>> I indeed have the precise latitude and longitude information for all my
>> listings for NYC.
>>
>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>> observations called listings_sample and tried to replicate the hsar
>> model, please see below.
>>
>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>
> Unless you know definitely that you want to relate the response to its
> lagged value, you do not need this. Do note that the matrix is very
> sparse, so could be fitted without difficulty with ML in a cross-sectional
> model.
>
>>
>> You recommended then to introduce a Markov random field (MRF) random
>> effect (RE) at the zipcode level, but I did not understand it so well.
>> Could you elaborate a little more?
>>
>
> Did you read the development in
> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
> includes code for fitting the Beijing housing parcels data set from HSAR
> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
> try to create a model that works on a single borough, using the zipcodes
> in that borough as a proxy for unobserved neighbourhood effects. Try for
> example using lme4::lmer() with only a zipcode IID random effect, see if
> the hedonic estimates are similar to lm(), and leave adding an MRF RE
> (with for example mgcv::gam() or hglm::hglm()) until you have a working
> testbed. Then advance step-by-step from there.
>
> You still have not said how many repeat lettings you see - it will affect
> the way you specify your model.
>
> Roger
>
>> ##############
>> library(spdep)
>> library(HSAR)
>> library(dplyr)
>> library(splitstackshape)
>>
>>
>> # Stratified sample per zipcode (size = 1%)
>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>
>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% c(unique(as.character(listings_sample$zipcode))))
>>
>>
>> ## Random effect matrix (N by J)
>>
>> # N: 22172
>> # J: 154
>>
>> # Arrange listings_sample by zipcode (ascending)
>> listings_sample <- listings_sample %>% arrange(zipcode)
>>
>> # Count number of listings per zipcode
>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>> # sum(MM$count)
>>
>> # N by J nulled matrix creation
>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>
>> # The total number of neighbourhood
>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>
>> for(i in 1:dim(MM)[1]) {
>>  Delta[Uid==i,i] <- 1
>> }
>> rm(i)
>>
>> Delta <- as(Delta,"dgCMatrix")
>>
>>
>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>
>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc$zipcode, queen = TRUE)
>>
>> # Include neighbour itself as a neighbour
>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>
>> # Spatial weights matrix for nb
>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>
>>
>> ## Fit HSAR SAR upper level random effect
>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>
>> betas = coef(lm(formula = model, data = listings_sample))
>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>
>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>
>> ##############
>>
>> Thank you and best regards
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Friday, November 8, 2019 13:29
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Fri, 8 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Thank you for your answer.
>>>
>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>
>>> But for a dataset of over 2 million observations, I get the following
>>> error: "Error: cannot allocate vector of size 840 Kb".
>>
>> I don't think the observations are helpful. If you have repeat lets in the
>> same property in a given month, you need to handle that anyway. I'd go for
>> making the modelling exercise work (we agree that this is not panel data,
>> right?) on a small subset first. I would further argue that you need a
>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>> RE. You could very well take (stratified) samples per zipcode to represent
>> your data. Once that works, introduce an MRF RE at the zipcode level,
>> where you do know relative position. Using SARAR is going to be a waste of
>> time unless you can geocode the letting addresses. A multi-level approach
>> will work. Having big data in your case with no useful location
>> information per observation is just adding noise and over-smoothing, I'm
>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>> will work, also when you sample the within zipcode lets, given a split
>> into training and test sets, and making CV possible.
>>
>> Roger
>>
>>>
>>> I am expecting that at least 500.000 observations will be dropped due
>>> to the lack of values for the chosen variables for the regression model, so
>>> probably I will filter and remove the observations/rows that will not be
>>> used anyway - do you know if there is any package that does this
>>> automatically, given the variables/columns I have chosen?
>>>
>>> Or would you recommend me another approach to avoid the above mentioned
>>> error?
>>>
>>> Thank you and best regards,
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Thursday, November 7, 2019 10:13
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Many thanks for your help.
>>>>
>>>> I have an additional question:
>>>>
>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>> merging with the sf object polygon_nyc with the function
>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>> a huge n x n matrix (depending on the size of my data set).
>>>>
>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>> object has only n = 177.
>>>>
>>>> Of course running
>>>>
>>>> spatialreg::lagsarlm(formula=model, data = listings_sample,
>>>> listw = polygon_nyc_lw, tol.solve=1.0e-10)
>>>>
>>>> does not work ("Input data and weights have different dimensions").
>>>>
>>>> Is the only option to take my data set, merge it to my polygon_nyc (by
>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>
>>> I think we are getting more clarity. You do not know the location of the
>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>> areas, and can create a neighbour object from these boundaries. You then
>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>> lettings in i. This is the data structure that motivated the
>>> spdep::nb2blocknb() function:
>>>
>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>
>>> Try running the examples to get a feel for what is going on.
>>>
>>> I feel that most of the variability will vanish in the very large numbers
>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>> for the lettings themselves, I don't think you can make much progress.
>>>
>>> You could try a linear mixed model (or gam with a spatially structured
>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>> package, articles by Dong et al., and maybe
>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>> effects at the zipcode level might be a way forward, together with an IID
>>> random effect at the same level (equivalent to self-neighbours).
>>>
>>> Hope this helps,
>>>
>>> Roger
>>>
>>>>
>>>> Best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Wednesday, November 6, 2019 15:07
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>> plain text.
>>>>>
>>>>> I will give a better context for my desired outcome.
>>>>>
>>>>> I am taking Airbnb's listings information for New York City available
>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>
>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>> set, I create a new "date_compiled" variable/column.
>>>>>
>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>> million observations.
>>>>
>>>> You have repeat lettings for some, but not all properties. So this is at
>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>> see temporal movement (trend/seasonal).
>>>>
>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>> hundreds of properties, and working from there. Do not include the
>>>> observation as its own neighbour, perhaps identify repeats and handle them
>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>
>>>> So this although promising isn't simple, and getting to a hedonic model
>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>> necessarily trust OLS output either, partly because of the repeat property
>>>> issue.
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> I created 54 timedummy variables for each time period available.
>>>>>
>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>> a variety of characteristics which potentially determine the daily rate
>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>>> from the listing, income per capita, etc.).
>>>>>
>>>>> My dependent variable is price (log price, common in the related
>>>>> literature for hedonic prices).
>>>>>
>>>>> The OLS model is done.
>>>>>
>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>> pricing of their listings, take not only into account its structural and
>>>>> location characteristics, but also the prices charged by near listings
>>>>> with similar characteristics - spatial autocorrelation is then present,
>>>>> at least spatial dependence is present in the dependent variable.
>>>>>
>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>> itself as a neighbor.
>>>>>
>>>>> Parts of my code can be found below:
>>>>>
>>>>> ########
>>>>>
>>>>> ## packages
>>>>>
>>>>> packages_install <- function(packages){
>>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>> if (length(new.packages))
>>>>> install.packages(new.packages, dependencies = TRUE)
>>>>> sapply(packages, require, character.only = TRUE)
>>>>> }
>>>>>
>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>> packages_install(packages_required)
>>>>>
>>>>> # Working directory
>>>>> setwd("C:/Users/User/R")
>>>>>
>>>>>
>>>>>
>>>>> ## shapefile_us
>>>>>
>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>
>>>>> # Columns removal
>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>
>>>>> # Column rename: ZCTA5CE10
>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>
>>>>> # Column class change: zipcode
>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>
>>>>>
>>>>>
>>>>> ## polygon_nyc
>>>>>
>>>>> # Zip code not available in shapefile: 11695
>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>
>>>>>
>>>>>
>>>>> ## weight_matrix
>>>>>
>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>
>>>>> # Include neighbour itself as a neighbour
>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>
>>>>> # Weights to each neighboring polygon
>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>
>>>>>
>>>>>
>>>>> ## listings
>>>>>
>>>>> # Data import
>>>>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE)
>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings))
>>>>> listings <- listings %>% bind_rows
>>>>>
>>>>> # Characters removal
>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>
>>>>>
>>>>>
>>>>> ## timedummy
>>>>>
>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>
>>>>>
>>>>>
>>>>> ## OLS regression
>>>>>
>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>
>>>>> ########
>>>>>
>>>>> Some of my id's repeat in multiple time periods.
>>>>>
>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>
>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>
>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>
>>>>> Again, thank you very much for the help provided until now.
>>>>>
>>>>> Best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <[hidden email]>
>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>> To: Robert R
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>
>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>> for now that spatial dependence is present in both the dependent
>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>>> variable for each time period. I also created a weight matrix using the
>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>>> think it would be possible to calculate this. What I would like to know
>>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>>> or splm? Thank you and best regards, Robert
>>>>>
>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>> (preferably with affiliation; your email and user name are not
>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>> subject to very large temporal autocorrelation.
>>>>>
>>>>> If this is a continuation of your previous question about using
>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>
>>>>> Roger
>>>>>
>>>>>>
>>>>>>       [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> R-sig-Geo mailing list
>>>>>> [hidden email]
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>
>>>>> --
>>>>> Roger Bivand
>>>>> Department of Economics, Norwegian School of Economics,
>>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>>> https://orcid.org/0000-0003-2392-6140
>>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
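
A note on the N by J random-effect matrix Delta in the quoted HSAR code: it is filled as a dense matrix in a loop and only converted to "dgCMatrix" afterwards, which is unlikely to scale to the full data set. A minimal sketch of building the same indicator matrix directly in sparse form with Matrix::sparseMatrix() follows. The toy zipcode vector is made up for illustration, and the column order of Delta is assumed to match the row order of M.

```r
library(Matrix)

# Hypothetical toy data: 7 listings in 3 zipcodes (values are illustrative only)
zipcode <- c("10001", "10001", "10002", "10003", "10003", "10003", "10002")

# Column index of each listing's zipcode (columns follow sorted zipcodes)
zip_levels <- sort(unique(zipcode))
j <- match(zipcode, zip_levels)

# N x J indicator matrix built directly in sparse form: exactly one 1 per row
Delta <- sparseMatrix(i = seq_along(zipcode), j = j, x = 1,
                      dims = c(length(zipcode), length(zip_levels)),
                      dimnames = list(NULL, zip_levels))

# Dense, loop-based construction as in the quoted code, for comparison only
Delta_dense <- matrix(0, nrow = length(zipcode), ncol = length(zip_levels))
for (k in seq_along(zip_levels)) Delta_dense[j == k, k] <- 1
```

Because the column index comes from match(), the listings do not have to be sorted by zipcode before Delta is built.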

Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
On Wed, 4 Dec 2019, Robert R wrote:

> Dear Roger,
>
> Again, thank you for your answer. What do you mean by "zip code random
> effect"? Do you mean I should use model = "random" in plm?
>
> regression_re <- plm(formula = model, data = listings, model = "random",
> index = c("id", "date_compiled"))

No, obviously not, your data are not a balanced panel. I mean a multilevel
model, where the <200 zip codes cluster the data, and where a zip code
level IID RE will almost certainly do a better job than dummies. An
MRF/ICAR RE might be an extension.
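
A minimal sketch of what a zip code level IID random effect looks like in practice, on simulated data. All variable names and values below are made up for illustration. nlme::lme() is used here only as a stand-in because it ships with R; lme4::lmer(), named earlier in the thread, takes an analogous formula, log_price ~ bedrooms + (1 | zipcode).

```r
library(nlme)  # stand-in for lme4; part of the recommended R packages

set.seed(1)
n_zip <- 30   # hypothetical number of zip codes
n_per <- 40   # hypothetical listings per zip code

zipcode  <- factor(rep(seq_len(n_zip), each = n_per))
u        <- rnorm(n_zip, sd = 0.5)               # zip-code random intercepts
bedrooms <- rpois(n_zip * n_per, 1) + 1
log_price <- 4 + 0.3 * bedrooms + u[as.integer(zipcode)] +
  rnorm(n_zip * n_per, sd = 0.3)
toy <- data.frame(log_price, bedrooms, zipcode)

# Aspatial baseline and multilevel model with a zip code IID random intercept
fit_lm  <- lm(log_price ~ bedrooms, data = toy)
fit_mlm <- lme(log_price ~ bedrooms, random = ~ 1 | zipcode, data = toy)

# Compare the hedonic coefficients, then inspect the variance components
cbind(lm = coef(fit_lm), multilevel = fixef(fit_mlm))
VarCorr(fit_mlm)
```

The random intercept uses a single variance parameter for all zip codes, where zip code dummies would need one coefficient per zip code.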

>
> And is there any other method for dealing with large weight matrices in
> spatialreg::lagsarlm?

Please refer to Bivand et al. (2013), cited in the package. Probably
the weights would need to be symmetric and very sparse.
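
To illustrate why sparsity matters here, the sketch below repeats the dense-block arithmetic quoted further down in this message, (15000^2)*8, and compares it with the storage of a symmetric sparse matrix of the same dimension. The band structure (each observation linked to its two nearest index neighbours on each side) is an assumption chosen only to keep the example small; it is not the weights structure of the listings data.

```r
library(Matrix)

n <- 15000

# Dense storage: n^2 doubles at 8 bytes each
dense_bytes <- n^2 * 8
dense_bytes / 1e9   # 1.8, i.e. about 1.8 GB for one dense n x n block

# Toy symmetric sparse weights: links (i, i+1) and (i, i+2), upper triangle only
i <- c(seq_len(n - 1), seq_len(n - 2))
j <- c(seq_len(n - 1) + 1, seq_len(n - 2) + 2)
W <- sparseMatrix(i = i, j = j, x = 1, dims = c(n, n), symmetric = TRUE)

# Storage actually used by the sparse representation, relative to dense
sparse_bytes <- as.numeric(object.size(W))
sparse_bytes / dense_bytes
```

With symmetric = TRUE only one triangle is stored, which is one reason a symmetric, very sparse weights matrix stays manageable at sizes where a dense block does not.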

I still think that you should focus on a small subset of the data and on
improving the signal-noise ratio before trying to scale up.

Roger

>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Wednesday, November 27, 2019 13:53
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Yes, this is expected, since the neighbours in a single zip code block form a dense matrix, and there will be multiple such matrices. (15000^2)*8 is 1.8e+09 bytes, so such a dense matrix will max out your RAM. There is no way to look at block neighbours in that format without subsetting your data (think train/test); use a zip code random effect instead. I would certainly drop all attempts to examine spatial dependency until you get an aspatial multilevel hedonic model working.
>
> Roger
>
> --
> Roger Bivand
> Norwegian School of Economics
> Helleveien 30, 5045 Bergen, Norway
> [hidden email]
>
>
> ________________________________________
> From: Robert R <[hidden email]>
> Sent: Tuesday, November 26, 2019 21:04
> To: Roger Bivand
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Dear Roger,
>
> Thank you for your e-mail. Actually there is less noise than it seems. Rental prices are daily rental prices and I have an extract of all Airbnb listings' daily prices once a month for a period of 4 years. Each listing's information contains the lat, lon, number of bedrooms, category (entire home/apt, shared room or private room), etc.
>
> One question regarding the spdep::nb2blocknb function: it runs super fast with up to n = 1000, and always crashes my R session with n = 15000 or so. Is there an alternative to solve this problem?
>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Tuesday, November 26, 2019 20:48
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Sorry for late reply, am indisposed and unable to help further. I feel
> that there is so much noise in your data (differences in offers, rental
> lengths, repeats or not, etc.), that you will certainly have to subset
> vigorously first to isolate response cases that are comparable. What you
> are trying to disentangle are the hedonic components in the bundle where
> you just have price as response, but lots of other bundle characteristics
> on the right hand side (days, etc.). I feel you'd need to try to get to a
> response index of price per day per rental area or some such. I'd
> certainly advise examining responses to a specific driver (major concert
> or sports event) to get a feel for how the market responds, and return to
> spatial hedonic after finding an approach that gives reasonable aspatial
> outcomes.
>
> Roger
>
> On Sun, 17 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your message and sorry for my late answer.
>>
>> Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:
>>
>> unique ids: 180.004
>> time periods: 54 (2015-01 to 2019-09)
>> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
>> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
>> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
>> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)
>>
>> It is important to note that hosts can change the room_category (between entire home/apt, private room and shared room) while keeping the same listing id number. In my data, 7.204 unique ids changed the room_type at some point.
>>
>> --
>>
>> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>>
>> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>>
>>
>> --
>> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>>
>> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
>> --
>>
>> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>>
>> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).
>>
>> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I only estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) in a "one-level" basis, i.e., at the higher-level.
>>
>> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>
>> (Where the "model" formula contains the 54 time dummy variables)
>>
>> Do you think I can proceed with this model? I was able to calculate it.
>>
>> If I remove all observations/rows with NAs in one of the chosen variables, 884.183 observations remain. If I were to create a W matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. This is why I set W = NULL.
>>
>>
>> Thank you and best regards
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Monday, November 11, 2019 11:31
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Sun, 10 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Again, thank you for your answer. I read the material provided and
>>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>>> right model for me.
>>>
>>> I indeed have the precise latitude and longitude information for all my
>>> listings for NYC.
>>>
>>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>>> observations called listings_sample and tried to replicate the hsar
>>> model, please see below.
>>>
>>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>>
>> Unless you know definitely that you want to relate the response to its
>> lagged value, you do not need this. Do note that the matrix is very
>> sparse, so could be fitted without difficulty with ML in a cross-sectional
>> model.
>>
>>>
>>> You recommended then to introduce a Markov random field (MRF) random
>>> effect (RE) at the zipcode level, but I did not understand it so well.
>>> Could you elaborate a little more?
>>>
>>
>> Did you read the development in
>> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
>> includes code for fitting the Beijing housing parcels data set from HSAR
>> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
>> try to create a model that works on a single borough, using the zipcodes
>> in that borough as a proxy for unobserved neighbourhood effects. Try for
>> example using lme4::lmer() with only a zipcode IID random effect, see if
>> the hedonic estimates are similar to lm(), and leave adding an MRF RE
>> (with for example mgcv::gam() or hglm::hglm()) until you have a working
>> testbed. Then advance step-by-step from there.
>>
>> You still have not said how many repeat lettings you see - it will affect
>> the way you specify your model.
>>
>> Roger
>>
>>> ##############
>>> library(spdep)
>>> library(HSAR)
>>> library(dplyr)
>>> library(splitstackshape)
>>>
>>>
>>> # Stratified sample per zipcode (size = 1%)
>>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>>
>>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% c(unique(as.character(listings_sample$zipcode))))
>>>
>>>
>>> ## Random effect matrix (N by J)
>>>
>>> # N: 22172
>>> # J: 154
>>>
>>> # Arrange listings_sample by zipcode (ascending)
>>> listings_sample <- listings_sample %>% arrange(zipcode)
>>>
>>> # Count number of listings per zipcode
>>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>>> # sum(MM$count)
>>>
>>> # N by J nulled matrix creation
>>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>>
>>> # The total number of neighbourhood
>>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>>
>>> for(i in 1:dim(MM)[1]) {
>>>  Delta[Uid==i,i] <- 1
>>> }
>>> rm(i)
>>>
>>> Delta <- as(Delta,"dgCMatrix")
>>>
>>>
>>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>>
>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc$zipcode, queen = TRUE)
>>>
>>> # Include neighbour itself as a neighbour
>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>
>>> # Spatial weights matrix for nb
>>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>>
>>>
>>> ## Fit HSAR SAR upper level random effect
>>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>>
>>> betas = coef(lm(formula = model, data = listings_sample))
>>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>>
>>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>>
>>> ##############
>>>
>>> Thank you and best regards
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Friday, November 8, 2019 13:29
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Fri, 8 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Thank you for your answer.
>>>>
>>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>>
>>>> But for a dataset of over 2 million observations, I get the following
>>>> error: "Error: cannot allocate vector of size 840 Kb".
>>>
>>> I don't think the observations are helpful. If you have repeat lets in the
>>> same property in a given month, you need to handle that anyway. I'd go for
>>> making the modelling exercise work (we agree that this is not panel data,
>>> right?) on a small subset first. I would further argue that you need a
>>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>>> RE. You could very well take (stratified) samples per zipcode to represent
>>> your data. Once that works, introduce an MRF RE at the zipcode level,
>>> where you do know relative position. Using SARAR is going to be a waste of
>>> time unless you can geocode the letting addresses. A multi-level approach
>>> will work. Having big data in your case with no useful location
>>> information per observation is just adding noise and over-smoothing, I'm
>>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>>> will work, also when you sample the within zipcode lets, given a split
>>> into training and test sets, and making CV possible.
>>>
>>> Roger
>>>
>>>>
>>>> I am expecting that at least 500.000 observations will be dropped due
>>>> to the lack of values for the chosen variables for the regression model, so
>>>> probably I will filter and remove the observations/rows that will not be
>>>> used anyway - do you know if there is any package that does this
>>>> automatically, given the variables/columns chosen by me?
>>>>
>>>> Or would you recommend me another approach to avoid the above mentioned
>>>> error?
>>>>
>>>> Thank you and best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Thursday, November 7, 2019 10:13
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Many thanks for your help.
>>>>>
>>>>> I have an additional question:
>>>>>
>>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>>> merging with the sf object polygon_nyc with the function
>>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>>> a huge n x n matrix (depending on the size of my data set).
>>>>>
>>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>>> object has only n = 177.
>>>>>
>>>>> Of course running
>>>>>
>>>>> spatialreg::lagsarlm(formula=model, data = listings_sample,
>>>>> listw = polygon_nyc_lw, tol.solve=1.0e-10)
>>>>>
>>>>> does not work ("Input data and weights have different dimensions").
>>>>>
>>>>> Is the only option to take my data set, merge it with my polygon_nyc (by
>>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>>
>>>> I think we are getting more clarity. You do not know the location of the
>>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>>> areas, and can create a neighbour object from these boundaries. You then
>>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>>> lettings in i. This is the data structure that motivated the
>>>> spdep::nb2blocknb() function:
>>>>
>>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>>
>>>> Try running the examples to get a feel for what is going on.
>>>>
>>>> I feel that most of the variability will vanish in the very large numbers
>>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>>> for the lettings themselves, I don't think you can make much progress.
>>>>
>>>> You could try a linear mixed model (or gam with a spatially structured
>>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>>> package, articles by Dong et al., and maybe
>>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>>> effects at the zipcode level might be a way forward, together with an IID
>>>> random effect at the same level (equivalent to self-neighbours).
>>>>
>>>> Hope this helps,
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> Best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <[hidden email]>
>>>>> Sent: Wednesday, November 6, 2019 15:07
>>>>> To: Robert R
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>
>>>>>> Dear Roger,
>>>>>>
>>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>>> plain text.
>>>>>>
>>>>>> I will give a better context for my desired outcome.
>>>>>>
>>>>>> I am taking Airbnb's listings information for New York City available
>>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>>
>>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>>> set, I create a new "date_compiled" variable/column.
>>>>>>
>>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>>> million observations.
>>>>>
>>>>> You have repeat lettings for some, but not all properties. So this is at
>>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>>> see temporal movement (trend/seasonal).
>>>>>
>>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>>> hundreds of properties, and working from there. Do not include the
>>>>> observation as its own neighbour, perhaps identify repeats and handle them
>>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>>
>>>>> So this although promising isn't simple, and getting to a hedonic model
>>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>>> necessarily trust OLS output either, partly because of the repeat property
>>>>> issue.
>>>>>
>>>>> Roger
>>>>>
>>>>>>
>>>>>> I created 54 timedummy variables for each time period available.
>>>>>>
>>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>>> a variety of characteristics which potentially determine the daily rate
>>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>>>> from the listing, income per capita, etc.).
>>>>>>
>>>>>> My dependent variable is price (log price, common in the related
>>>>>> literature for hedonic prices).
>>>>>>
>>>>>> The OLS model is done.
>>>>>>
>>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>>> pricing of their listings, take not only into account its structural and
>>>>>> location characteristics, but also the prices charged by near listings
>>>>>> with similar characteristics - spatial autocorrelation is then present,
>>>>>> at least spatial dependence is present in the dependent variable.
>>>>>>
>>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>>> itself as a neighbor.
>>>>>>
>>>>>> Parts of my code can be found below:
>>>>>>
>>>>>> ########
>>>>>>
>>>>>> ## packages
>>>>>>
>>>>>> packages_install <- function(packages){
>>>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>>> if (length(new.packages))
>>>>>> install.packages(new.packages, dependencies = TRUE)
>>>>>> sapply(packages, require, character.only = TRUE)
>>>>>> }
>>>>>>
>>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>>> packages_install(packages_required)
>>>>>>
>>>>>> # Working directory
>>>>>> setwd("C:/Users/User/R")
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## shapefile_us
>>>>>>
>>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>>
>>>>>> # Columns removal
>>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>>
>>>>>> # Column rename: ZCTA5CE10
>>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>>
>>>>>> # Column class change: zipcode
>>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## polygon_nyc
>>>>>>
>>>>>> # Zip code not available in shapefile: 11695
>>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## weight_matrix
>>>>>>
>>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>>
>>>>>> # Include neighbour itself as a neighbour
>>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>>
>>>>>> # Weights to each neighboring polygon
>>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## listings
>>>>>>
>>>>>> # Data import
>>>>>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE)
>>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>>>> listings <- listings %>% bind_rows
>>>>>>
>>>>>> # Characters removal
>>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
>>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
>>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## timedummy
>>>>>>
>>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## OLS regression
>>>>>>
>>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>>
>>>>>> ########
>>>>>>
>>>>>> Some of my id's repeat in multiple time periods.
>>>>>>
>>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>>
>>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>>
>>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>>
>>>>>> Again, thank you very much for the help provided until now.
>>>>>>
>>>>>> Best regards,
>>>>>> Robert
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Roger Bivand <[hidden email]>
>>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>>> To: Robert R
>>>>>> Cc: [hidden email]
>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>>
>>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>>
>>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>>> for now that spatial dependence is present in both the dependent
>>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>>>> variable for each time period. I also created a weight matrix using the
>>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>>>> think it would be possible to calculate this. What I would like to know
>>>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>>>> or splm? Thank you and best regards, Robert
>>>>>>
>>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>>> (preferably with affiliation; your email and user name are not
>>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>>> subject to very large temporal autocorrelation.
>>>>>>
>>>>>> If this is a continuation of your previous question about using
>>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>>
>>>>>> Roger
>>>>>>
>>>>>>>
>>>>>>>       [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> R-sig-Geo mailing list
>>>>>>> [hidden email]
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>>
>>>>>> --
>>>>>> Roger Bivand
>>>>>> Department of Economics, Norwegian School of Economics,
>>>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>>>> https://orcid.org/0000-0003-2392-6140
>>>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Re: Spatial Autocorrelation Estimation Method

Robert R
Dear Roger,

Thank you for your answer. Regarding the sparse matrix, you're right - I tested creating one, as follows:

#####

listings_nb <- listings %>% spdep::poly2nb(queen = TRUE) %>% spdep::include.self()
# is.symmetric.nb(listings_nb)

nb_B  <- spdep::nb2listw(neighbours = listings_nb, style="B", zero.policy = FALSE)

B <- as(nb_B , "CsparseMatrix")
all(B == t(B))

nb_B1 <- mat2listw(as(B, "dgTMatrix"))

format(object.size(nb_B), units = "Mb")
# 85.6 Mb

format(object.size(nb_B1), units = "Mb")
# 85.6 Mb

#####

The size for both objects is the same - different from this example: https://cran.r-project.org/web/packages/spdep/vignettes/nb_igraph.html
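
A possible reason the two sizes match: nb_B and nb_B1 are both listw objects, which already store only each observation's neighbour ids and weights, so a round trip through a sparse matrix cannot shrink them. The saving shows up against a dense n x n matrix. A minimal sketch with the Matrix package, assuming a hypothetical chain of n = 1000 units in which each unit neighbours the next one (the names B_sparse and B_dense are illustrative, not from the thread):

```r
library(Matrix)  # recommended package shipped with R

n <- 1000L

# Hypothetical neighbour structure: unit i is linked to i - 1 and i + 1
i <- c(1:(n - 1), 2:n)
j <- c(2:n, 1:(n - 1))
B_sparse <- sparseMatrix(i = i, j = j, x = rep(1, length(i)), dims = c(n, n))

# The same binary weights stored as an ordinary dense matrix
B_dense <- as.matrix(B_sparse)

# Identical links, very different storage cost (in bytes)
size_sparse <- as.numeric(object.size(B_sparse))
size_dense <- as.numeric(object.size(B_dense))
print(c(sparse = size_sparse, dense = size_dense))

# Symmetry check, equivalent in spirit to all(B == t(B))
is_symmetric <- isSymmetric(B_sparse)
```

The dense copy needs 8 bytes for each of the n^2 cells, while the sparse one only stores the 2 * (n - 1) non-zero links.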


Summing up the data that I have: a monthly "picture" of Airbnb's listings/ads for NYC, incl. lat, lon, price (per night), id, and some characteristics of the listing/ad such as number of rooms, bedrooms, guests included, etc., for a period of 54 months. The data was taken from: http://insideairbnb.com/get-the-data.html (listings.csv.gz)

For the OLS, I used a pooled OLS with time dummy fixed effects (date_compiled, when the "picture" was compiled - how Airbnb listings for NYC were shown) because many of my observations (listings id) do not repeat for many periods. Also, many listings changed the room_type at least once during the whole time period analyzed (3 types of room_type: entire home/apt, private room, shared room).

I am now trying a random intercepts three-level hierarchy multilevel model, where _id_ (level 1) is nested within _zipcode_ (level 2), and the latter is nested within _borough_ (level 3). So groups: id:(zipcode:borough).

lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms + guests_included + minimum_nights + distance_downtown + distance_subway + number_of_reviews + review_scores_cleanliness + professional_host + host_is_superhost + is_business_travel_ready + offense_misdemeanor + offense_felony + income_per_capita + factor(date_compiled) + (1 | borough / zipcode / id), data = listings)

Roger, do you think it is okay to use factor(room_type) (for the 3 types of room) and factor(date_compiled) (for the dates when the NYC Airbnb listings/ads were extracted)?
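
For reference, this is what factor() does inside a model formula under R's default treatment contrasts: a factor with k levels becomes k - 1 dummy columns, with the first level as the baseline. A small sketch on made-up rows (the toy data frame is hypothetical; only the three room types come from the thread):

```r
# Hypothetical toy data: three room types, three extraction dates
toy <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Shared room", "Private room"),
  date_compiled = c("2015-01-01", "2015-01-01", "2015-02-01", "2015-03-01")
)

# Design matrix that lm() or lmer() would build for these two terms
X <- model.matrix(~ factor(room_type) + factor(date_compiled), data = toy)
print(colnames(X))

# Intercept + (3 - 1) room type dummies + (3 - 1) date dummies
n_cols <- ncol(X)
```

By the same rule, factor(date_compiled) with 54 distinct dates would contribute 53 dummy columns, the first date being absorbed into the intercept.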

Thank you and best regards,
Robert

________________________________________
From: Roger Bivand <[hidden email]>
Sent: Wednesday, December 4, 2019 09:07
To: Robert R
Cc: [hidden email]
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method

On Wed, 4 Dec 2019, Robert R wrote:

> Dear Roger,
>
> Again, thank you for your answer. What do you mean by "zip code random
> effect"? You mean I should use in plm the model "random"?
>
> regression_re <- plm(formula = model, data = listings, model = "random",
> index = c("id", "date_compiled"))

No, obviously not, your data are not a balanced panel. I mean a multilevel
model, where the <200 zip codes cluster the data, and where a zip code
level IID RE will almost certainly do a better job than dummies. An
MRF/ICAR RE might be an extension.
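
As a rough illustration of a zip code level IID random effect next to zip code dummies, here is a sketch on simulated data. It uses nlme::lme() only because nlme ships with R; lme4::lmer(log_price ~ bedrooms + (1 | zipcode)) would be the analogous call. The data, the effect sizes and the toy data frame are all made up for the example.

```r
library(nlme)  # recommended package shipped with R

set.seed(1)
n_zip <- 30L  # hypothetical number of zip codes
n_per <- 20L  # hypothetical listings per zip code

zipcode <- factor(rep(seq_len(n_zip), each = n_per))
u <- rnorm(n_zip, sd = 0.5)  # simulated zip code effects
bedrooms <- rpois(n_zip * n_per, lambda = 2)
log_price <- 4 + 0.3 * bedrooms + u[as.integer(zipcode)] +
  rnorm(n_zip * n_per, sd = 0.3)
toy <- data.frame(log_price, bedrooms, zipcode)

# Multilevel model: one variance parameter for all zip code intercepts
fit_re <- lme(log_price ~ bedrooms, random = ~ 1 | zipcode, data = toy)

# Dummy alternative: one extra coefficient per zip code (minus baseline)
fit_fe <- lm(log_price ~ bedrooms + zipcode, data = toy)

slope_re <- unname(fixef(fit_re)["bedrooms"])
slope_fe <- unname(coef(fit_fe)["bedrooms"])
n_par_re <- length(fixef(fit_re))  # intercept and slope only
n_par_fe <- length(coef(fit_fe))   # intercept, slope and zip code dummies
print(c(re = slope_re, fe = slope_fe))
```

Comparing slope_re with slope_fe (and with a plain lm() without zip codes) is one way to check that the hedonic estimates stay stable before any spatial structure is added.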

>
> And any other methodology in dealing with large weight matrices in
> spatialreg::lagsarlm?

Please refer to Bivand et al. (2013) referred to in the package. Probably
the weights would need to be symmetric and very sparse.

I still think that you should focus on a small subset of the data and on
improving the signal-noise ratio before trying to scale up.

Roger

>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Wednesday, November 27, 2019 13:53
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Yes, this is expected: the neighbours within a single zip code block form a dense matrix, and there will be multiple such matrices. (15000^2)*8 bytes is 1.8e+09, so such a dense matrix will max out your RAM. There is no way to look at block neighbours in that format without subsetting your data (think train/test); use a zip code random effect instead. I would certainly drop all attempts to examine spatial dependency until you get an aspatial multilevel hedonic model working.
>
> Roger
>
> --
> Roger Bivand
> Norwegian School of Economics
> Helleveien 30, 5045 Bergen, Norway
> [hidden email]
>
>
> ________________________________________
> From: Robert R <[hidden email]>
> Sent: Tuesday, November 26, 2019 21:04
> To: Roger Bivand
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Dear Roger,
>
> Thank you for your e-mail. Actually there is less noise than it seems. Rental prices are daily rental prices and I have an extract of the daily prices of all Airbnb listings once a month for a period of 4 years. Each listing's information contains the lat, lon, number of bedrooms, category (entire home/apt, shared room or private room), etc.
>
> One question regarding the spdep::nb2blocknb function: it runs super fast with up to n = 1000, and always crashes my R session with n = 15000 or so. Is there an alternative to solve this problem?
>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Tuesday, November 26, 2019 20:48
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Sorry for late reply, am indisposed and unable to help further. I feel
> that there is so much noise in your data (differences in offers, rental
> lengths, repeats or not, etc.), that you will certainly have to subset
> vigorously first to isolate response cases that are comparable. What you
> are trying to disentangle are the hedonic components in the bundle where
> you just have price as response, but lots of other bundle characteristics
> on the right hand side (days, etc.). I feel you'd need to try to get to a
> response index of price per day per rental area or some such. I'd
> certainly advise examining responses to a specific driver (major concert
> or sports event) to get a feel for how the market responds, and return to
> spatial hedonic after finding an approach that gives reasonable aspatial
> outcomes.
>
> Roger
>
> On Sun, 17 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your message and sorry for my late answer.
>>
>> Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:
>>
>> unique ids: 180.004
>> time periods: 54 (2015-01 to 2019-09)
>> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
>> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
>> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
>> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)
>>
>> It is important to note that hosts can change the room_type (between entire home/apt, private room and shared room) while keeping the same listing id number. In my data, the number of unique ids that at some point changed the room_type is 7.204.
>>
>> --
>>
>> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>>
>> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>>
>>
>> --
>> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>>
>> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
>> --
>>
>> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>>
>> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).
>>
>> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I only estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) in a "one-level" basis, i.e., at the higher-level.
>>
>> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>
>> (Where the "model" formula contains the 54 time dummy variables)
>>
>> Do you think I can proceed with this model? I was able to calculate it.
>>
>> If I remove all observations/rows with NAs in one of the chosen variables/observations, 884.183 observations remain. If I would create a W matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. This is the reason why I put W = NULL.
>>
>>
>> Thank you and best regards
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Monday, November 11, 2019 11:31
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Sun, 10 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Again, thank you for your answer. I read the material provided and
>>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>>> right model for me.
>>>
>>> I indeed have the precise latitude and longitude information for all my
>>> listings for NYC.
>>>
>>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>>> observations called listings_sample and tried to replicate the hsar
>>> model, please see below.
>>>
>>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>>
>> Unless you know definitely that you want to relate the response to its
>> lagged value, you do not need this. Do note that the matrix is very
>> sparse, so could be fitted without difficulty with ML in a cross-sectional
>> model.
>>
>>>
>>> You recommended then to introduce a Markov random field (MRF) random
>>> effect (RE) at the zipcode level, but I did not understand it so well.
>>> Could you develop a little more?
>>>
>>
>> Did you read the development in
>> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
>> includes code for fitting the Beijing housing parcels data set from HSAR
>> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
>> try to create a model that works on a single borough, using the zipcodes
>> in that borough as a proxy for unobserved neighbourhood effects. Try for
>> example using lme4::lmer() with only a zipcode IID random effect, see if
>> the hedonic estimates are similar to lm(), and leave adding an MRF RE
>> (with for example mgcv::gam() or hglm::hglm()) until you have a working
>> testbed. Then advance step-by-step from there.
>>
>> You still have not said how many repeat lettings you see - it will affect
>> the way you specify your model.
>>
>> Roger
>>
>>> ##############
>>> library(spdep)
>>> library(HSAR)
>>> library(dplyr)
>>> library(splitstackshape)
>>>
>>>
>>> # Stratified sample per zipcode (size = 1%)
>>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>>
>>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% unique(as.character(listings_sample$zipcode)))
>>>
>>>
>>> ## Random effect matrix (N by J)
>>>
>>> # N: 22172
>>> # J: 154
>>>
>>> # Arrange listings_sample by zipcode (ascending)
>>> listings_sample <- listings_sample %>% arrange(zipcode)
>>>
>>> # Count number of listings per zipcode
>>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>>> # sum(MM$count)
>>>
>>> # N by J nulled matrix creation
>>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>>
>>> # The total number of neighbourhood
>>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>>
>>> for(i in 1:dim(MM)[1]) {
>>>  Delta[Uid==i,i] <- 1
>>> }
>>> rm(i)
>>>
>>> Delta <- as(Delta,"dgCMatrix")
>>>
>>>
>>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>>
>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)
>>>
>>> # Include neighbour itself as a neighbour
>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>
>>> # Spatial weights matrix for nb
>>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>>
>>>
>>> ## Fit HSAR SAR upper level random effect
>>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>>
>>> betas = coef(lm(formula = model, data = listings_sample))
>>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>>
>>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>>
>>> ##############
>>>
>>> Thank you and best regards
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Friday, November 8, 2019 13:29
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Fri, 8 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Thank you for your answer.
>>>>
>>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>>
>>>> But for a dataset of over 2 million observations, I get the following
>>>> error: "Error: cannot allocate vector of size 840 Kb".
>>>
>>> I don't think the observations are helpful. If you have repeat lets in the
>>> same property in a given month, you need to handle that anyway. I'd go for
>>> making the modelling exercise work (we agree that this is not panel data,
>>> right?) on a small subset first. I would further argue that you need a
>>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>>> RE. You could very well take (stratified) samples per zipcode to represent
>>> your data. Once that works, introduce an MRF RE at the zipcode level,
>>> where you do know relative position. Using SARAR is going to be a waste of
>>> time unless you can geocode the letting addresses. A multi-level approach
>>> will work. Having big data in your case with no useful location
>>> information per observation is just adding noise and over-smoothing, I'm
>>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>>> will work, also when you sample the within zipcode lets, given a split
>>> into training and test sets, and making CV possible.
>>>
>>> Roger
>>>
>>>>
>>>> I am expecting that at least 500.000 observations will be dropped due
>>>> to the lack of values for the chosen variables for the regression model, so
>>>> probably I will filter and remove the observations/rows that will not be
>>>> used anyway - do you know if there is any package that does this
>>>> automatically, given the variables/columns chosen by me?
>>>>
>>>> Or would you recommend me another approach to avoid the above mentioned
>>>> error?
>>>>
>>>> Thank you and best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Thursday, November 7, 2019 10:13
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Many thanks for your help.
>>>>>
>>>>> I have an additional question:
>>>>>
>>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>>> merging with the sf object polygon_nyc with the function
>>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>>> a huge n x n matrix (depending on the size of my data set).
>>>>>
>>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>>> object has only n = 177.
>>>>>
>>>>> Of course running
>>>>>
>>>>> spatialreg::lagsarlm(formula = model, data = listings_sample,
>>>>> listw = polygon_nyc_lw, tol.solve = 1.0e-10)
>>>>>
>>>>> does not work ("Input data and weights have different dimensions").
>>>>>
>>>>> The only option is to take my data set, merge it to my polygon_nyc (by
>>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>>
>>>> I think we are getting more clarity. You do not know the location of the
>>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>>> areas, and can create a neighbour object from these boundaries. You then
>>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>>> lettings in i. This is the data structure that motivated the
>>>> spdep::nb2blocknb() function:
>>>>
>>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>>
>>>> Try running the examples to get a feel for what is going on.
>>>>
>>>> I feel that most of the variability will vanish in the very large numbers
>>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>>> for the lettings themselves, I don't think you can make much progress.
>>>>
>>>> You could try a linear mixed model (or gam with a spatially structured
>>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>>> package, articles by Dong et al., and maybe
>>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>>> effects at the zipcode level might be a way forward, together with an IID
>>>> random effect at the same level (equivalent to self-neighbours).
>>>>
>>>> Hope this helps,
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> Best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <[hidden email]>
>>>>> Sent: Wednesday, November 6, 2019 15:07
>>>>> To: Robert R
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>
>>>>>> Dear Roger,
>>>>>>
>>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>>> plain text.
>>>>>>
>>>>>> I will give a better context for my desired outcome.
>>>>>>
>>>>>> I am taking Airbnb's listings information for New York City available
>>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>>
>>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>>> set, I create a new "date_compiled" variable/column.
>>>>>>
>>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>>> million observations.
>>>>>
>>>>> You have repeat lettings for some, but not all properties. So this is at
>>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>>> see temporal movement (trend/seasonal).
>>>>>
>>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>>> hundreds of properties, and working from there. Do not include the
>>>>> observation as its own neighbour, perhaps identify repeats and handle them
>>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>>
>>>>> So this although promising isn't simple, and getting to a hedonic model
>>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>>> necessarily trust OLS output either, partly because of the repeat property
>>>>> issue.
>>>>>
>>>>> Roger
>>>>>
>>>>>>
>>>>>> I created 54 timedummy variables for each time period available.
>>>>>>
>>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>>> a variety of characteristics which potentially determine the daily rate
>>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>>>> from the listing, income per capita, etc.).
>>>>>>
>>>>>> My dependent variable is price (log price, common in the related
>>>>>> literature for hedonic prices).
>>>>>>
>>>>>> The OLS model is done.
>>>>>>
>>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>>> pricing of their listings, take not only into account its structural and
>>>>>> location characteristics, but also the prices charged by near listings
>>>>>> with similar characteristics - spatial autocorrelation is then present,
>>>>>> at least spatial dependence is present in the dependent variable.
>>>>>>
>>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>>> itself as a neighbor.
>>>>>>
>>>>>> Parts of my code can be found below:
>>>>>>
>>>>>> ########
>>>>>>
>>>>>> ## packages
>>>>>>
>>>>>> packages_install <- function(packages){
>>>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>>> if (length(new.packages))
>>>>>> install.packages(new.packages, dependencies = TRUE)
>>>>>> sapply(packages, require, character.only = TRUE)
>>>>>> }
>>>>>>
>>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>>> packages_install(packages_required)
>>>>>>
>>>>>> # Working directory
>>>>>> setwd("C:/Users/User/R")
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## shapefile_us
>>>>>>
>>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>>
>>>>>> # Columns removal
>>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>>
>>>>>> # Column rename: ZCTA5CE10
>>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>>
>>>>>> # Column class change: zipcode
>>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## polygon_nyc
>>>>>>
>>>>>> # Zip code not available in shapefile: 11695
>>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## weight_matrix
>>>>>>
>>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>>
>>>>>> # Include neighbour itself as a neighbour
>>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>>
>>>>>> # Weights to each neighboring polygon
>>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## listings
>>>>>>
>>>>>> # Data import
>>>>>> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
>>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>>>> # SIMPLIFY = FALSE keeps a list of data frames for bind_rows()
>>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>>>> listings <- listings %>% bind_rows
>>>>>>
>>>>>> # Characters removal: keep only the YYYY-MM-DD part of the file path
>>>>>> listings$date_compiled <- sub("\\.csv$", "", basename(as.character(listings$date_compiled)))
>>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## timedummy
>>>>>>
>>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## OLS regression
>>>>>>
>>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>>
>>>>>> ########
>>>>>>
>>>>>> Some of my id's repeat in multiple time periods.
>>>>>>
>>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>>
>>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>>
>>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>>
>>>>>> Again, thank you very much for the help provided until now.
>>>>>>
>>>>>> Best regards,
>>>>>> Robert
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Roger Bivand <[hidden email]>
>>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>>> To: Robert R
>>>>>> Cc: [hidden email]
>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>>
>>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>>
>>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>>> for now that spatial dependence is present in both the dependent
>>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>>>> variable for each time period. I also created a weight matrix using the
>>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>>>> think it would be possible to calculate this. What I would like to know
>>>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>>>> or splm? Thank you and best regards, Robert
>>>>>>
>>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>>> (preferably with affiliation; your email and user name are not
>>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>>> subject to very large temporal autocorrelation.
>>>>>>
>>>>>> If this is a continuation of your previous question about using
>>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>>
>>>>>> Roger
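
To make the point about sparse Kronecker products concrete, here is a small sketch with the Matrix package. It assumes a toy ring-shaped neighbour structure with 177 areas (the zip code polygon count mentioned in this thread) and 54 periods, with no links across periods; the weights themselves are invented for illustration. The space-time weights matrix is then the Kronecker product of an identity matrix over periods with W, and it stays sparse.

```r
library(Matrix)

n_areas   <- 177  # number of spatial units (assumed: zip code polygons)
n_periods <- 54   # number of time periods

# Toy row-standardised W: each area has two neighbours on a ring
i <- seq_len(n_areas)
W <- sparseMatrix(
  i = c(i, i),
  j = c(i %% n_areas + 1, (i - 2) %% n_areas + 1),
  x = rep(0.5, 2 * n_areas),
  dims = c(n_areas, n_areas)
)

# Same W repeated for every period, with no links across periods
W_big <- kronecker(Diagonal(n_periods), W)

dim(W_big)                                # 9558 x 9558
nnzero(W_big)                             # number of stored non-zero weights
format(object.size(W_big), units = "Kb")  # far below the dense equivalent
```

Only the non-zero weights are stored, so the size grows with the number of neighbour links rather than with the square of the number of rows.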
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Roger Bivand
>>>>>> Department of Economics, Norwegian School of Economics,
>>>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>>>> https://orcid.org/0000-0003-2392-6140
>>>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo


Re: Spatial Autocorrelation Estimation Method

Roger Bivand
Administrator
On Sun, 8 Dec 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your answer. Regarding the sparse matrix, you're right - I
> tested creating one, as follows:
>
> #####
>
> listings_nb <- listings %>% spdep::poly2nb(queen = TRUE) %>%
> spdep::include.self() # is.symmetric.nb(listings_nb)

Are listings point or polygon support (must be polygon for this function)?

>
> nb_B <- spdep::nb2listw(neighbours = listings_nb, style="B", zero.policy
> = FALSE)
>
> B <- as(nb_B , "CsparseMatrix")
> all(B == t(B))
>
> nb_B1 <- mat2listw(as(B, "dgTMatrix"))
>
> format(object.size(nb_B), units = "Mb")
> # 85.6 Mb
>
> format(object.size(nb_B1), units = "Mb")
> # 85.6 Mb
>
> #####
>
> The size for both objects is the same - different from this example:
> https://cran.r-project.org/web/packages/spdep/vignettes/nb_igraph.html
>
>
> Summing up the data that I have: a "picture" for Airbnb's listings/ads
> once a month for NYC, incl. lat, lon, price (per night), id, and some
> characteristics from the listing/ad as number of rooms, bedrooms, guests
> included, etc. for a period of 54 months. The data was taken here:
> http://insideairbnb.com/get-the-data.html (listings.csv.gz)
>
> For the OLS, I used a pooled OLS with time dummy fixed effects
> (date_compiled, when the "picture" was compiled - how Airbnb listings
> for NYC were shown) because many of my observations (listings id) do not
> repeat for many periods. Also, many listings changed the room_type at
> least once during the whole time period analyzed (3 types of room_type:
> entire home/apt, private room, shared room).
>

OK, I suppose.

> I am now trying a random intercepts three-level hierarchy multilevel
> model, where _id_ (level 1) are nested within _zipcode_ (level 2), and
> the last is nested within _borough_ (level 3). So groups:
> id:(zipcode:borough).
>

I'd try and see what happens, and watch machine resources (memory anyway).

> lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms +
> guests_included + minimum_nights + distance_downtown + distance_subway +
> number_of_reviews + review_scores_cleanliness + professional_host +
> host_is_superhost + is_business_travel_ready + offense_misdemeanor +
> offense_felony + income_per_capita + factor(date_compiled) + (1 |
> borough / zipcode / id), data = listings)
>
> Roger, do you think it is okay to factor(room_type) (for the 3 types of
> room) and factor(date_compiled) for the dates when the NYC Airbnb's
> listings/ads were extracted?

Do you see structured variability coming from these - if so, inclusion
makes sense.

Roger

>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <[hidden email]>
> Sent: Wednesday, December 4, 2019 09:07
> To: Robert R
> Cc: [hidden email]
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Wed, 4 Dec 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Again, thank you for your answer. What do you mean by "zip code random
>> effect"? You mean I should use in plm the model "random"?
>>
>> regression_re <- plm(formula = model, data = listings, model = "random",
>> index = c("id", "date_compiled"))
>
> No, obviously not, your data are not a balanced panel. I mean a multilevel
> model, where the <200 zip codes cluster the data, and where a zip code
> level IID RE will almost certainly do a better job than dummies. An
> MRF/ICAR RE might be an extension.
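
A minimal sketch of what a zip code level IID random effect looks like, on simulated data (all names and values below are made up). It uses mgcv::gam() with an "re" smooth only because mgcv ships with R; lme4::lmer() with a (1 | zipcode) term is the more usual tool for the same random-intercept structure. The point is to compare the hedonic slope with plain lm() before going any further.

```r
library(mgcv)

set.seed(1)
n_zip <- 20   # number of zip codes (toy value)
n_per <- 50   # listings per zip code (toy value)

zipcode  <- factor(rep(sprintf("Z%02d", seq_len(n_zip)), each = n_per))
u        <- rnorm(n_zip, sd = 0.5)  # unobserved zip code level effects
bedrooms <- sample(1:4, n_zip * n_per, replace = TRUE)
log_price <- 4 + 0.3 * bedrooms + u[as.integer(zipcode)] +
  rnorm(n_zip * n_per, sd = 0.2)
toy <- data.frame(log_price, bedrooms, zipcode)

# Aspatial baseline, ignoring the clustering
fit_lm <- lm(log_price ~ bedrooms, data = toy)

# Same model with an IID random intercept per zip code
fit_re <- gam(log_price ~ bedrooms + s(zipcode, bs = "re"),
              data = toy, method = "REML")

coef(fit_lm)["bedrooms"]
coef(fit_re)["bedrooms"]
```

With under 200 zip codes the random effect costs a single variance parameter, whereas dummies would add one coefficient per zip code.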
>
>>
>> And any other methodology in dealing with large weight matrices in
>> spatialreg::lagsarlm?
>
> Please see Bivand et al. (2013), referenced in the package. Probably
> the weights would need to be symmetric and very sparse.
>
> I still think that you should focus on a small subset of the data and on
> improving the signal-noise ratio before trying to scale up.
>
> Roger
>
>>
>> Thank you and best regards,
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Wednesday, November 27, 2019 13:53
>> To: Robert R
>> Cc: [hidden email]
>> Subject: SV: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> Yes this is expected, since the neighbours within a single zip code block form a dense matrix, and there will be multiple such matrices. (15000^2)*8 is 1.8e+09 bytes, so such a dense matrix will max out your RAM. There is no way to look at block neighbours in that format without subsetting your data (think train/test); use a zip code random effect instead. I would certainly drop all attempts to examine spatial dependency until you get an aspatial multilevel hedonic model working.
>>
>> Roger
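
The arithmetic above can be checked in a few lines of base R. The second half is only a sketch under an assumption: the block sizes are hypothetical, chosen to sum to 15000 listings, and between-zip links are ignored. It shows why block neighbours are dense even when stored as a neighbour list.

```r
# Memory for a dense n x n matrix of doubles (8 bytes each)
n <- 15000
dense_bytes <- n^2 * 8
dense_bytes / 1e9                  # 1.8 (GB)

# Links inside fully connected zip code blocks (hypothetical block sizes):
# every listing is a neighbour of every other listing in its block
block_sizes <- c(5000, 4000, 3000, 2000, 1000)
links <- sum(block_sizes * (block_sizes - 1))
links
links / sum(block_sizes)           # average number of neighbours per listing
```

Each listing ends up with thousands of neighbours, which is the over-smoothing problem as well as the memory problem.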
>>
>> --
>> Roger Bivand
>> Norwegian School of Economics
>> Helleveien 30, 5045 Bergen, Norway
>> [hidden email]
>>
>>
>> ________________________________________
>> From: Robert R <[hidden email]>
>> Sent: Tuesday, November 26, 2019 21:04
>> To: Roger Bivand
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> Dear Roger,
>>
>> Thank you for your e-mail. Actually there is less noise than it seems. Rental prices are daily rental prices and I have an extract of all Airbnb listings' daily prices once a month for a period of 4 years. Each listing's information contains the lat, lon, number of bedrooms, category (entire home/apt, shared room or private room), etc.
>>
>> One question regarding the spdep::nb2blocknb function: it runs super fast with up to n = 1000, and always crashes my R session with n = 15000 or so. Is there an alternative to solve this problem?
>>
>> Thank you and best regards,
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <[hidden email]>
>> Sent: Tuesday, November 26, 2019 20:48
>> To: Robert R
>> Cc: [hidden email]
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> Sorry for late reply, am indisposed and unable to help further. I feel
>> that there is so much noise in your data (differences in offers, rental
>> lengths, repeats or not, etc.), that you will certainly have to subset
>> vigorously first to isolate response cases that are comparable. What you
>> are trying to disentangle are the hedonic components in the bundle where
>> you just have price as response, but lots of other bundle characteristics
>> on the right hand side (days, etc.). I feel you'd need to try to get to a
>> response index of price per day per rental area or some such. I'd
>> certainly advise examining responses to a specific driver (major concert
>> or sports event) to get a feel for how the market responds, and return to
>> spatial hedonic after finding an approach that gives reasonable aspatial
>> outcomes.
>>
>> Roger
>>
>> On Sun, 17 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Thank you for your message and sorry for my late answer.
>>>
>>> Regarding the number of listings (lettings) for my data set (2.216.642 observations), each listing contains an individual id:
>>>
>>> unique ids: 180.004
>>> time periods: 54 (2015-01 to 2019-09)
>>> number of ids that appear only once: 28.486 (of 180.004 ids) (15,8%)
>>> number of ids that appear/repeat 2-10 times: 82.641 (of 180.004 ids) (45,9%)
>>> number of ids that appear/repeat 11-30 times: 46.465 (of 180.004 ids) (25,8%)
>>> number of ids that appear/repeat 31-54 times: 22.412 (of 180.004 ids) (12,5%)
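
A breakdown like the one above can be produced in base R by counting appearances per id and binning the counts. This is a sketch on a toy id vector (the ids and counts are invented); with the real data the vector would be the id column of the listings.

```r
# Toy id vector: "a" and "e" appear once, "b" 3 times, "c" 12, "d" 40
id <- rep(c("a", "b", "c", "d", "e"), times = c(1, 3, 12, 40, 1))

# Number of periods in which each id appears
n_appear <- table(id)

# Bin the counts into the same ranges as used above
bins <- cut(as.integer(n_appear),
            breaks = c(0, 1, 10, 30, 54),
            labels = c("1", "2-10", "11-30", "31-54"))

table(bins)                                    # ids per bin
round(100 * prop.table(table(bins)), 1)        # share of ids per bin (%)
```

The same per-id counts are what decide how repeats are handled in the model, for example through an id level random effect.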
>>>
>>> It is important to note that hosts can change the room_type (between entire home/apt, private room and shared room) while keeping the same listing id number. In my data, the number of unique ids that changed the room_type at some point is 7.204.
>>>
>>> --
>>>
>>> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>>>
>>> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>>>
>>>
>>> --
>>> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>>>
>>> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
>>> --
>>>
>>> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>>>
>>> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers the occurrence of spatial relations at the (lower) individual (geographical coordinates - in my case, the listing location) and (higher) group level (territorial units - in my case, zipcodes).
>>>
>>> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I only estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) in a "one-level" basis, i.e., at the higher-level.
>>>
>>> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>>
>>> (Where the "model" formula contains the 54 time dummy variables)
>>>
>>> Do you think I can proceed with this model? I was able to calculate it.
>>>
>>> If I remove all observations/rows with NAs in one of the chosen variables/observations, 884.183 observations remain. If I would create a W matrix for HSAR::hsar, I would have a gigantic 884.183 by 884.183 matrix. This is the reason why I put W = NULL.
>>>
>>>
>>> Thank you and best regards
>>>
>>> ________________________________________
>>> From: Roger Bivand <[hidden email]>
>>> Sent: Monday, November 11, 2019 11:31
>>> To: Robert R
>>> Cc: [hidden email]
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Sun, 10 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Again, thank you for your answer. I read the material provided and
>>>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>>>> right model for me.
>>>>
>>>> I indeed have the precise latitude and longitude information for all my
>>>> listings for NYC.
>>>>
>>>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>>>> observations called listings_sample and tried to replicate the hsar
>>>> model, please see below.
>>>>
>>>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>>>
>>> Unless you know definitely that you want to relate the response to its
>>> lagged value, you do not need this. Do note that the matrix is very
>>> sparse, so could be fitted without difficulty with ML in a cross-sectional
>>> model.
>>>
>>>>
>>>> You recommended then to introduce a Markov random field (MRF) random
>>>> effect (RE) at the zipcode level, but I did not understand it so well.
>>>> Could you elaborate a little more?
>>>>
>>>
>>> Did you read the development in
>>> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
>>> includes code for fitting the Beijing housing parcels data set from HSAR
>>> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
>>> try to create a model that works on a single borough, using the zipcodes
>>> in that borough as a proxy for unobserved neighbourhood effects. Try for
>>> example using lme4::lmer() with only a zipcode IID random effect, see if
>>> the hedonic estimates are similar to lm(), and leave adding an MRF RE
>>> (with for example mgcv::gam() or hglm::hglm()) until you have a working
>>> testbed. Then advance step-by-step from there.
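
For the later step, a hedged sketch of an MRF random effect with mgcv::gam(). Everything here is simulated: the zip codes sit on a toy ring so that each has two neighbours, and the variable names are invented. With real data the neighbour list would be derived from the zip code polygons and passed as a plain named list whose names match the levels of the zipcode factor.

```r
library(mgcv)

set.seed(42)
n_zip <- 30   # number of zip codes (toy value)
n_per <- 40   # listings per zip code (toy value)
zips  <- sprintf("Z%02d", seq_len(n_zip))

# Toy neighbour list: zip codes arranged on a ring, two neighbours each
nb <- lapply(seq_len(n_zip),
             function(i) c((i - 2) %% n_zip + 1, i %% n_zip + 1))
names(nb) <- zips

# Simulated data with a spatially smooth zip code effect
u        <- sin(2 * pi * seq_len(n_zip) / n_zip)
zipcode  <- factor(rep(zips, each = n_per), levels = zips)
bedrooms <- sample(1:4, n_zip * n_per, replace = TRUE)
log_price <- 4 + 0.3 * bedrooms + u[as.integer(zipcode)] +
  rnorm(n_zip * n_per, sd = 0.2)
toy <- data.frame(log_price, bedrooms, zipcode)

# MRF smooth over zip codes; names(nb) must match the factor levels
fit_mrf <- gam(log_price ~ bedrooms +
                 s(zipcode, bs = "mrf", xt = list(nb = nb)),
               data = toy, method = "REML")
coef(fit_mrf)["bedrooms"]
```

The neighbour structure lives at the zip code level only (30 by 30 here), so nothing of listing-by-listing size is ever built.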
>>>
>>> You still have not said how many repeat lettings you see - it will affect
>>> the way you specify your model.
>>>
>>> Roger
>>>
>>>> ##############
>>>> library(spdep)
>>>> library(HSAR)
>>>> library(dplyr)
>>>> library(sf) # for st_drop_geometry()
>>>> library(splitstackshape)
>>>>
>>>>
>>>> # Stratified sample per zipcode (size = 1%)
>>>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>>>
>>>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>>>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% c(unique(as.character(listings_sample$zipcode))))
>>>>
>>>>
>>>> ## Random effect matrix (N by J)
>>>>
>>>> # N: 22172
>>>> # J: 154
>>>>
>>>> # Arrange listings_sample by zipcode (ascending)
>>>> listings_sample <- listings_sample %>% arrange(zipcode)
>>>>
>>>> # Count number of listings per zipcode
>>>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>>>> # sum(MM$count)
>>>>
>>>> # N by J nulled matrix creation
>>>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>>>
>>>> # Neighbourhood (zipcode) index of each listing
>>>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>>>
>>>> for(i in 1:dim(MM)[1]) {
>>>>  Delta[Uid==i,i] <- 1
>>>> }
>>>> rm(i)
>>>>
>>>> Delta <- as(Delta,"dgCMatrix")
>>>>
>>>>
>>>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>>>
>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)
>>>>
>>>> # Include neighbour itself as a neighbour
>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>
>>>> # Spatial weights matrix for nb
>>>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>>>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>>>
>>>>
>>>> ## Fit HSAR SAR upper level random effect
>>>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>>>
>>>> betas = coef(lm(formula = model, data = listings_sample))
>>>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>>>
>>>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>>>
>>>> ##############
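
A side note on the Delta construction in the code above: the N by J membership matrix can be built directly in sparse form, without first allocating a dense N by J matrix of zeros. This is a sketch with the Matrix package on a toy zipcode vector (values invented); it produces one 1 per row, in the column of that listing's zipcode.

```r
library(Matrix)

# Toy zipcode column: six listings in three zip codes
zipcode <- c("10001", "10001", "10002", "10003", "10003", "10003")
zip_f   <- factor(zipcode)

# Row = listing, column = zip code, entry 1 where the listing belongs
Delta <- sparseMatrix(
  i = seq_along(zip_f),
  j = as.integer(zip_f),
  x = rep(1, length(zip_f)),
  dims = c(length(zip_f), nlevels(zip_f)),
  dimnames = list(NULL, levels(zip_f))
)

class(Delta)      # a column-compressed sparse matrix
colSums(Delta)    # listings per zip code
```

Because the column index comes from the factor, this construction itself does not need the rows to be sorted by zipcode. The column order follows levels(zip_f), so it should be checked against the row order of M.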
>>>>
>>>> Thank you and best regards
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <[hidden email]>
>>>> Sent: Friday, November 8, 2019 13:29
>>>> To: Robert R
>>>> Cc: [hidden email]
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Fri, 8 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Thank you for your answer.
>>>>>
>>>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>>>
>>>>> But for a dataset of over 2 million observations, I get the following
>>>>> error: "Error: cannot allocate vector of size 840 Kb".
>>>>
>>>> I don't think the observations are helpful. If you have repeat lets in the
>>>> same property in a given month, you need to handle that anyway. I'd go for
>>>> making the modelling exercise work (we agree that this is not panel data,
>>>> right?) on a small subset first. I would further argue that you need a
>>>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>>>> RE. You could very well take (stratified) samples per zipcode to represent
>>>> your data. Once that works, introduce an MRF RE at the zipcode level,
>>>> where you do know relative position. Using SARAR is going to be a waste of
>>>> time unless you can geocode the letting addresses. A multi-level approach
>>>> will work. Having big data in your case with no useful location
>>>> information per observation is just adding noise and over-smoothing, I'm
>>>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>>>> will work, also when you sample the within zipcode lets, given a split
>>>> into training and test sets, and making CV possible.
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> I am expecting that at least 500.000 observations will be dropped due
>>>>> to the lack of values for the chosen variables for the regression model,
>>>>> so I will probably filter and remove the observations/rows that will not
>>>>> be used anyway - do you know if there is any package that does this
>>>>> automatically, given the variables/columns chosen by me?
>>>>>
>>>>> Or would you recommend me another approach to avoid the above mentioned
>>>>> error?
>>>>>
>>>>> Thank you and best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <[hidden email]>
>>>>> Sent: Thursday, November 7, 2019 10:13
>>>>> To: Robert R
>>>>> Cc: [hidden email]
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>>>
>>>>>> Dear Roger,
>>>>>>
>>>>>> Many thanks for your help.
>>>>>>
>>>>>> I have an additional question:
>>>>>>
>>>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>>>> merging with the sf object polygon_nyc with the function
>>>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>>>> a huge n x n matrix (depending on the size of my data set).
>>>>>>
>>>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>>>> object has only n = 177.
>>>>>>
>>>>>> Of course running
>>>>>>
>>>>>> spatialreg::lagsarlm(formula = model, data = listings_sample,
>>>>>> listw = polygon_nyc_lw, tol.solve = 1.0e-10)
>>>>>>
>>>>>> does not work ("Input data and weights have different dimensions").
>>>>>>
>>>>>> Is the only option to take my data set, merge it with my polygon_nyc (by
>>>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>>>
>>>>> I think we are getting more clarity. You do not know the location of the
>>>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>>>> areas, and can create a neighbour object from these boundaries. You then
>>>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>>>> lettings in i. This is the data structure that motivated the
>>>>> spdep::nb2blocknb() function:
>>>>>
>>>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>>>
>>>>> Try running the examples to get a feel for what is going on.
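A minimal sketch of the block structure described above, assuming a toy 2 x 2 grid of areas built with `spdep::cell2nb()` in place of the real zipcode polygons, and made-up counts of lettings per area (the names `zip_nb`, `listing_zip` and `listing_lw` are illustrative):

```r
library(spdep)

# Toy area-level neighbour object standing in for the zipcode polygons
zip_nb  <- cell2nb(2, 2, type = "queen")
zip_ids <- attr(zip_nb, "region.id")

# One area ID per letting (made-up counts: 3, 2, 4 and 1 lettings per area)
listing_zip <- rep(zip_ids, times = c(3, 2, 4, 1))

# Lettings in the same area, or in neighbouring areas, become neighbours
listing_nb <- nb2blocknb(zip_nb, listing_zip)

# Weights object with one row per letting rather than one per area
listing_lw <- nb2listw(listing_nb, style = "W")

length(listing_nb)
```

With one neighbour entry per letting, the weights object has as many rows as the data, which is what the dimension check in the model-fitting functions expects. Whether the resulting model is useful is a separate question, given the over-smoothing concern.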
>>>>>
>>>>> I feel that most of the variability will vanish in the very large numbers
>>>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>>>> for the lettings themselves, I don't think you can make much progress.
>>>>>
>>>>> You could try a linear mixed model (or gam with a spatially structured
>>>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>>>> package, articles by Dong et al., and maybe
>>>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>>>> effects at the zipcode level might be a way forward, together with an IID
>>>>> random effect at the same level (equivalent to self-neighbours).
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Roger
>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Robert
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Roger Bivand <[hidden email]>
>>>>>> Sent: Wednesday, November 6, 2019 15:07
>>>>>> To: Robert R
>>>>>> Cc: [hidden email]
>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>>
>>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>>
>>>>>>> Dear Roger,
>>>>>>>
>>>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>>>> plain text.
>>>>>>>
>>>>>>> I will give a better context for my desired outcome.
>>>>>>>
>>>>>>> I am taking Airbnb's listings information for New York City available
>>>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>>>
>>>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>>>> set, I create a new "date_compiled" variable/column.
>>>>>>>
>>>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>>>> million observations.
>>>>>>
>>>>>> You have repeat lettings for some, but not all properties. So this is at
>>>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>>>> see temporal movement (trend/seasonal).
>>>>>>
>>>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>>>> hundreds of properties, and working from there. Do not include the
>>>>>> observation as its own neighbour, perhaps identify repeats and handle them
>>>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>>>
>>>>>> So this, although promising, isn't simple, and getting to a hedonic model
>>>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>>>> necessarily trust OLS output either, partly because of the repeat property
>>>>>> issue.
>>>>>>
>>>>>> Roger
>>>>>>
>>>>>>>
>>>>>>> I created 54 timedummy variables for each time period available.
>>>>>>>
>>>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>>>> a variety of characteristics which potentially determine the daily rate
>>>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>>>> of the listing such as number of bedrooms, whether the host is
>>>>>>> professional, proximity to downtown (New York City Hall) and to the
>>>>>>> nearest subway station from the listing, income per capita, etc.).
>>>>>>>
>>>>>>> My dependent variable is price (log price, common in the related
>>>>>>> literature for hedonic prices).
>>>>>>>
>>>>>>> The OLS model is done.
>>>>>>>
>>>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>>>> pricing of their listings, take into account not only their structural
>>>>>>> and location characteristics, but also the prices charged by nearby
>>>>>>> listings with similar characteristics - spatial autocorrelation is then
>>>>>>> present, at least in the form of spatial dependence in the dependent variable.
>>>>>>>
>>>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>>>> itself as a neighbor.
>>>>>>>
>>>>>>> Parts of my code can be found below:
>>>>>>>
>>>>>>> ########
>>>>>>>
>>>>>>> ## packages
>>>>>>>
>>>>>>> packages_install <- function(packages){
>>>>>>>   new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>>>>   if (length(new.packages))
>>>>>>>     install.packages(new.packages, dependencies = TRUE)
>>>>>>>   sapply(packages, require, character.only = TRUE)
>>>>>>> }
>>>>>>>
>>>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>>>> packages_install(packages_required)
>>>>>>>
>>>>>>> # Working directory
>>>>>>> setwd("C:/Users/User/R")
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## shapefile_us
>>>>>>>
>>>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>>>
>>>>>>> # Columns removal
>>>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>>>
>>>>>>> # Column rename: ZCTA5CE10
>>>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>>>
>>>>>>> # Column class change: zipcode
>>>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## polygon_nyc
>>>>>>>
>>>>>>> # Zip code not available in shapefile: 11695
>>>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## weight_matrix
>>>>>>>
>>>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>>>
>>>>>>> # Include neighbour itself as a neighbour
>>>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>>>
>>>>>>> # Weights to each neighboring polygon
>>>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## listings
>>>>>>>
>>>>>>> # Data import: one CSV per scrape date, named YYYY-MM-DD.csv
>>>>>>> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
>>>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>>>>> # SIMPLIFY = FALSE keeps a list of data frames (the default can collapse it to a matrix when all have the same number of columns)
>>>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>>>>>>> listings <- listings %>% bind_rows
>>>>>>>
>>>>>>> # Characters removal (fixed = TRUE matches the strings literally, not as regular expressions)
>>>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled, fixed = TRUE)
>>>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled, fixed = TRUE)
>>>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## timedummy
>>>>>>>
>>>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## OLS regression
>>>>>>>
>>>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>>>
>>>>>>> ########
>>>>>>>
>>>>>>> Some of my id's repeat in multiple time periods.
>>>>>>>
>>>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>>>
>>>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>>>
>>>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>>>
>>>>>>> Again, thank you very much for the help provided until now.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Robert
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Roger Bivand <[hidden email]>
>>>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>>>> To: Robert R
>>>>>>> Cc: [hidden email]
>>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>>>
>>>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>>>
>>>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>>>> for now that spatial dependence is present in both the dependent
>>>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>>>> monthly data (54 periods). For this means, I've created a time dummy
>>>>>>>> variable for each time period. I also created a weight matrix using the
>>>>>>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>>>>>>> to estimate my model which contains a really big data set. Basically, my
>>>>>>>> model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions are: 1)
>>>>>>>> My spatial weight matrix for the whole data set will be probably a
>>>>>>>> enormous matrix with submatrices for each time period itself. I don't
>>>>>>>> think it would be possible to calculate this. What I would like to know
>>>>>>>> is a way to estimate each time dummy/period separately (to compare
>>>>>>>> different periods alone). How to do it? 2) Which package to use: spdep
>>>>>>>> or splm? Thank you and best regards, Robert
>>>>>>>
>>>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>>>> (preferably with affiliation; your email and user name are not
>>>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>>>> subject to very large temporal autocorrelation.
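The sparse Kronecker product remark can be illustrated with the Matrix package. A minimal sketch, assuming a balanced panel (the same N units in every period, which the data described in this thread are not) and a toy 4 x 4 weights matrix; the names `W`, `n_periods` and `W_big` are illustrative:

```r
library(Matrix)

# Toy row-standardised weights for N = 4 units on a line (1-2-3-4)
W <- sparseMatrix(
  i = c(1, 2, 2, 3, 3, 4),
  j = c(2, 1, 3, 2, 4, 3),
  x = c(1, 0.5, 0.5, 0.5, 0.5, 1),
  dims = c(4, 4)
)

n_periods <- 54

# Block-diagonal weights: one copy of W per period, no cross-period links
W_big <- kronecker(Diagonal(n_periods), W)

dim(W_big)      # (N * T) x (N * T)
nnzero(W_big)   # non-zeros occur only inside the diagonal blocks
```

Storage grows with the number of non-zero entries rather than with (N * T)^2, which is why the full matrix need not be very big in memory.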
>>>>>>>
>>>>>>> If this is a continuation of your previous question about using
>>>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>>>
>>>>>>> Roger
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Roger Bivand
>>>>>>> Department of Economics, Norwegian School of Economics,
>>>>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>>>>> https://orcid.org/0000-0003-2392-6140
>>>>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>>>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo