Running huge dataset with dnearneigh


amoroso
Dear Roger,

How can we deal with a huge dataset when using dnearneigh?

Here is my code:

d <- dnearneigh(spdf, 0, 22000)
all_listw <- nb2listw(d, style = "W")

where the spdf object is in the British National Grid CRS
(+init=epsg:27700) and has 227,973 observations/points. The distance of
22,000 m was chosen from a training set of 214 observations; the spdf
object contains both the training set and the testing set.

I am using a Mac with a 2.3 GHz Intel Core i5 processor and 8 GB of
memory. When dnearneigh was run on all observations, the rsession used
around 6.9 of the 8 GB and about 98% CPU, although another indicator
showed that my computer was around 60% idle. After the command had run
for a day, RStudio reported that the connection to the rsession could
not be established, so I aborted the entire process. I think the problem
may be the size of the dataset and perhaps the limitations of my laptop
specs.

Do you have any advice on how I can go about making a neighbours list with
dnearneigh for 227,973 observations in a successful and efficient way?
Also, would you foresee any problems in the next steps, especially when I
will be using the neighbourhood listw object as an input in fitting and
predicting using the spatial lag/error models? (see code below)

model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
model_pred <- spatialreg::predict.sarlm(model, test, all_listw)

I think the predicting part may take some time, since my test set
consists of 227,973 - 214 = 227,759 observations.

Here are some solutions that I have thought of:

1. Interpolate the test-set point data of 227,759 observations onto a
more manageable spatial pixel data frame with a cell size of perhaps
10,000 m by 10,000 m, which would give around 4,900 points. Instead of
227,759 observations, I could then build the listw object from just
4,900 + 214 training points and predict on only 4,900 observations.

2. Get hold of better performance machines through cloud computing such as
AWS EC2 services and try running the commands and models there.

3. Parallel computing, using the parallel package from R (although I am
not sure whether dnearneigh can be parallelised).

I believe option 1 would be the most manageable, but I am not sure how,
and by how much, this would affect the accuracy of the predictions, as
interpolating the dataset would introduce additional estimation error
into the prediction. However, I am also grappling with the trade-off
between accuracy and computation time. Hence, if options 2 or 3 can
offer a reasonable computation time (1-2 hours), I would forgo option 1.

What do you think? Is it possible to make a neighbourhood listw object out
of 227,973 observations efficiently?

Thank you for reading to the end! Apologies for the lengthy message; I
just wanted to describe fully what I am facing, and I hope I have not
missed anything crucial.

Thank you so much once again!

jiawen


_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Re: Running huge dataset with dnearneigh

Kent Johnson
>
> Date: Sat, 29 Jun 2019 00:36:22 +0100
> From: Jiawen Ng <[hidden email]>
> To: [hidden email]
> Subject: [R-sig-Geo] Running huge dataset with dnearneigh
>
> How can we deal with a huge dataset when using dnearneigh?
>
> Here is my code:
>
> d <- dnearneigh(spdf,0, 22000)
> all_listw <- nb2listw(d, style = "W")
>
> where the spdf object is in the british national grid CRS:
> +init=epsg:27700, with 227,973 observations/points. The distance of 22,000
> was decided by a training set that had 214 observations and the spdf object
> contains both the training set and the testing set.


I have had good results using the rtree package to compute nearest
neighbors. It is very fast with relatively low memory requirements. I have
not tried it with so many points but it works well up to 10,000 or so. If I
understand the dnearneigh docs, the rtree::withinDistance function is
similar.
https://github.com/hunzikp/rtree
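
For what it's worth, a minimal sketch of that approach (the RTree() and
withinDistance() calls below are my reading of the rtree README, so
treat the signatures as assumptions; I have not run this at 227,973
points):

```r
# Hypothetical sketch using the rtree package (github.com/hunzikp/rtree).
# An R-tree index is built once over the coordinate matrix, then queried
# for all points within 22,000 m of each point. Function signatures are
# assumptions based on the package README.
library(rtree)
library(sp)

pts <- coordinates(spdf)                  # two-column coordinate matrix
rt <- RTree(pts)                          # build the spatial index once
nb_idx <- withinDistance(rt, pts, 22000)  # neighbour indices per point
```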

Kent Johnson



Re: Running huge dataset with dnearneigh

Roger Bivand
In reply to this post by amoroso
On Sat, 29 Jun 2019, Jiawen Ng wrote:

> Dear Roger,

Postings go to the whole list ...

>
> How can we deal with a huge dataset when using dnearneigh?
>

First, why distance neighbours? What is the support of the data, point or
polygon? If polygon, contiguity neighbours are preferred. If not, and the
intensity of observations is similar across the whole area, distance may
be justified, but if the intensity varies, some observations will have
very many neighbours. In that case, unless you have a clear ecological or
environmental reason for knowing that a known distance threshold binds, it
is not a good choice.

> Here is my code:
>
> d <- dnearneigh(spdf,0, 22000)
> all_listw <- nb2listw(d, style = "W")
>
> where the spdf object is in the british national grid CRS:
> +init=epsg:27700, with 227,973 observations/points. The distance of 22,000
> was decided by a training set that had 214 observations and the spdf object
> contains both the training set and the testing set.
>

This is questionable. You train on 214 observations - does their areal
intensity match that of the whole data set? If chosen at random, you run
into the spatial sampling problems discussed in:

https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author

Are 214 observations for training representative of 227,973 prediction
sites? Do you only have observations on the response for 214, and an
unobserved response otherwise? What are the data, what are you trying to
do and why? This is not a sensible setting for models using weights
matrices for prediction (I think), because we do not have estimates of the
prediction error in general.

> I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
> memory. My laptop showed that when dnearneigh command was run on all
> observations, around 6.9 out of 8GB was used by the rsession and that the
> %CPU used by the rsession was stated to be around 98%, although another
> indicator showed that my computer was around 60% idle. After running the
> command for a day, rstudio alerted me that the connection to the rsession
> could not be established, so I aborted the entire process altogether. I
> think the problem here may be the size of the dataset and perhaps the
> limitations of my laptop specs.
>

On planar data, there is no good reason for this, as each observation is
treated separately: distances are found and sorted, and those under the
threshold are chosen. It will undoubtedly slow down if there are more
than a few neighbours within the threshold, but I have already covered
the inadvisability of defining neighbours in that way.

Using an rtree might help, but you get hit badly if there are many
neighbours within the threshold you have chosen anyway.

On most 8GB hardware and a modern OS, you do not have more than 3-4GB
for work. So something was swapping on your laptop.

> Do you have any advice on how I can go about making a neighbours list with
> dnearneigh for 227,973 observations in a successful and efficient way?
> Also, would you foresee any problems in the next steps, especially when I
> will be using the neighbourhood listw object as an input in fitting and
> predicting using the spatial lag/error models? (see code below)
>
> model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
> model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>

Why would using a spatial lag model make sense? Why are you suggesting
this model? Do you have a behavioural reason for why only the spatially
lagged response should be included?

Why do you think that this is sensible? You are predicting 1000 times
for each observation - this is not what the prediction methods are
written for. Most involve inverting an n x n matrix - did you refer to
Goulard et al. (2017) to get a good understanding of the underlying
methods?

> I think the predicting part may take some time, since my test set consists
> of 227,973 - 214 observations = 227,759 observations.
>
> Here are some solutions that I have thought of:
>
> 1. Interpolate the test set point data of 227,759 observations over a more
> manageable spatial pixel dataframe with cell size of perhaps 10,000m by
> 10,000m which would give me around 4900 points. So instead of 227,759
> observations, I can make the listw object based on just 4900 + 214 training
> points and predict just on 4900 observations.

But what are you trying to do? Are the observations output areas? House
sales? If you are not filling in missing areal units (the Goulard et al.
case), couldn't you simply use geostatistical methods, which seem to
match your support better and can be fitted and used to predict with a
local neighbourhood? While you are doing that, you could switch to INLA
with SPDE, which interposes a mesh like the one you suggest. But in that
case, beware of the mesh choice issue in:

https://doi.org/10.1080/03610926.2018.1536209

>
> 2. Get hold of better performance machines through cloud computing such as
> AWS EC2 services and try running the commands and models there.
>

What you need are methods, not wasted money on hardware as a service.

> 3. Parallel computing using the parallel package from r (although I am not
> sure whether dnearneigh can be parallelised).
>

This could easily be implemented if it were really needed, which I don't
think it is; better understanding of methods lets one do more with less.

> I believe option 1 would be the most manageable but I am not sure how and
> by how much this would affect the accuracy of the predictions as
> interpolating the dataset would be akin to introducing more estimations in
> the prediction. However, I am also grappling with the trade-off between
> accuracy and computation time. Hence, if options 2 and 3 can offer a
> reasonable computation time (1-2 hours) then I would forgo option 1.
>
> What do you think? Is it possible to make a neighbourhood listw object out
> of 227,973 observations efficiently?

Yes, but only if the numbers of neighbours are very small. Look in Bivand
et al. (2013) to see the use of some fairly large n, but only with few
neighbours for each observation. You seem to be getting average neighbour
counts in the thousands, which makes no sense.
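
One way to check this before committing to the full run (a sketch,
assuming spdep is attached and spdf is the SpatialPointsDataFrame from
the original post) is to estimate the average neighbour count from a
random subsample:

```r
library(spdep)

# Estimate neighbour density at the 22,000 m threshold from a random
# subsample instead of all 227,973 points.
set.seed(42)
idx <- sample(nrow(spdf), 2000)
d_sub <- dnearneigh(spdf[idx, ], 0, 22000)

# card() returns the neighbour count of each observation; scaling the
# subsample mean by the inverse sampling fraction approximates the
# average count in the full data set.
mean(card(d_sub)) * nrow(spdf) / 2000
```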

>
> Thank you for reading to the end! Apologies for writing a lengthy one, just
> wanted to fully describe what I am facing, I hope I didn't miss out
> anything crucial.
>

Long is OK, but there is no motivation here for why you want to make 200K
predictions from 200 observations with point support (?) using weights
matrices.

Hope this clarifies,

Roger

> Thank you so much once again!
>
> jiawen
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en


Re: Running huge dataset with dnearneigh

amoroso
Dear Roger,

Thank you so much for your detailed response and for pointing out
potential pitfalls! It has prompted me to re-evaluate my approach.

Here is the context: I have some stores' sales data (this is my training
set of 214 points), and I would like to find out where best to set up
new stores in the UK. I am using a geodemographics approach: perform a
regression of sales against census data, then predict sales on UK output
areas (by centroids), and finally identify new areas with
location-allocation models. As the stores are points, I have defined UK
output areas by their population-weighted centroids, resulting in
prediction by points rather than by areas. Tests for spatial
relationships among the points in my training set (Moran's I and
Lagrange multiplier tests) were significant, which has led me to
implement some spatial models (specifically spatial lag, error and
Durbin models) to account for the spatial relationships in the data.

I am actually quite unsettled and unclear as to which neighbourhood
definition to go for. I thought of IDW at first, as I thought this would
summarise each point's relationship with its neighbours very precisely,
thus making the predictions more accurate. Upon your advice (don't use
IDW or other general weights for predictions), I decided not to use IDW
and changed to dnearneigh instead (although now I am questioning my
understanding of what is meant by general weights; perhaps I have the
definition wrong, if dnearneigh is still considered a 'general weights'
method). Why is the use of IDW not advisable, however? Is it due to
computational reasons? Also, why would having thousands of neighbours
make no sense? Apologies for asking so many questions; I'd just like to
really understand the concepts!

I believe that both the training and test sets have varying intensities.
I was weighing the different neighbourhood methods (dnearneigh,
knearneigh, IDW, etc.) and felt that each would have its disadvantages;
it's difficult to pinpoint which neighbourhood definition would be best.
If one were to go for knearneigh, for example, results may not be fair
due to the inhomogeneity of the points: point A's nearest neighbours may
be within a few hundred kilometres while point B's may be in the
thousands. I feel that the choice of any neighbourhood definition can be
highly debatable... What do you think?

After analysing my problem again, I think that predicting by output
areas (points) would be best for my case, as I would have to make use of
the population data after building the model. Interpolating the census
data of the output-area points would cause me to lose that information.

Thank you for the comments and the advice so far; I would greatly
welcome and appreciate additional feedback!

Thank you so much once again!

Jiawen

Re: Running huge dataset with dnearneigh

Roger Bivand
On Mon, 1 Jul 2019, Jiawen Ng wrote:

> Dear Roger,
>
> Thank you so much for your detailed response and pointing out potential
> pitfalls! It has prompted me to re-evalutate my approach.
>
> Here is the context: I have some stores' sales data (this is my training
> set of 214 points), I would like to find out where best to set up new
> stores in UK. I am using a geodemographics approach to do this: Perform a
> regression of sales against census data, then predict sales on UK output
> areas (by centroids) and finally identify new areas with
> location-allocation models. As the stores are points, this has led me to
> define UK output areas by its population-weighted centroids, thus resulting
> in the prediction by points rather than by areas. Tests (like moran's I and
> lagrange multiplier) for spatial relationships among the points in my
> training set were significant hence this has led me to implement some
> spatial models (specifically spatial lag, error and durbin models) to
> account for the spatial relationships in the data.

I'm afraid that my retail geography is not very up to date, but also that
your approach is most unlikely to yield constructive results.

Most retail stores are organised in large chains, and so optimise costs
between wholesale and retail. Independent retail stores depend crucially
on access to wholesale stores, so they cannot in any case locate without
regard to supply costs. Some service activities without wholesale
dependencies are less tied.

Most chains certainly behave strategically with regard to each other,
sometimes locating toe-to-toe to challenge a competing chain
(Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
competing chain locations to establish a local monopoly (think Hotelling).

Population density doesn't express demand, especially unmet demand, well
at all. Think food deserts: maybe plenty of people but little disposable
income. Look at the food desert literature, or the US food stamp
literature.

Finally (all bad news), retail is challenged not only by location
shifting from high streets to malls, but critically by online shopping,
which, once the buyer is engaged at a proposed price, shifts the cost
structures to logistics, to complete the order at the highest margin
including returns. That relates only marginally to population density.

So you'd need more data than you have, a model that explicitly handles
competition between chains as well as market gaps, and some way of
handling online leakage to move forward.

If population density was a proxy for accessibility (most often it isn't),
it might look like the beginnings of a model, but most often we don't know
what bid-rent surfaces look like, and then, most often different
activities sort differently across those surfaces.

>
> I am quite unsettled and unclear as to which neighbourhood definition to go
> for actually. I thought of IDW at first as I thought this would summarise
> each point's relationship with their neighbours very precisely thus making
> the predictions more accurate. Upon your advice (don't use IDW or other
> general weights for predictions), I decided not to use IDW, and changed it
> to dnearneigh instead (although now I am questioning myself on the
> definition of what is meant by general weights. Perhaps I am understanding
> the definition of general weights wrong, if dnearneigh is still considered
> to be a 'general weights' method) Why is the use of IDW not advisable
> however? Is it due to computational reasons? Also, why would having
> thousands of neighbours be making no sense? Apologies for asking so many
> questions, I'd just like to really understand the concepts!
>

The model underlying spatial regressions using neighbours tapers
dependency through the pairwise elements of (I - \rho W)^{-1}
(conditional) and [(I - \rho W) (I - \rho W')]^{-1} (see Wall 2004).
These are N x N dense matrices. (I - \rho W) is typically sparse, and
under certain conditions leads to (I - \rho W)^{-1} =
\sum_{i=0}^{\infty} \rho^i W^i, the sum of a power series in \rho and W.
\rho is typically bounded above by 1, so \rho^i declines as i increases.
This dampens \rho^i W^i, so that higher-order neighbours influence an
observation less and less. So in the general case IDW simply replicates
what simple contiguity gives you anyway. So the sparser W is (within
reason), the better. Unless you really know that the physics, chemistry
or biology of your system gives you a known systematic relationship like
IDW, you may as well stay with contiguity.
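
The damping can be seen numerically in plain R (an illustrative sketch;
rho = 0.7 is an arbitrary value, not an estimate):

```r
# Coefficients of the power series (I - rho W)^{-1} = sum_i rho^i W^i.
# With |rho| < 1 they decline geometrically, so higher-order neighbours
# contribute less and less, whatever the exact weights in W.
rho <- 0.7
coefs <- rho^(0:10)
all(diff(coefs) < 0)  # strictly decreasing
sum(rho^(0:1000))     # partial sums approach 1 / (1 - rho)
```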

However, this isn't any use in solving a retail location problem at all.

> I believe that both the train and test set has varying intensities. I was
> weighing the different neighbourhood methods: dnearneigh, knearneigh, using
> IDW etc. and I felt like each method would have its disadvantages -- its
> difficult to pinpoint which neighbourhood definition would be best. If one
> were to go for knearneigh for example, results may not be fair due to the
> inhomogeneity of the points -- for instance, point A's nearest neighbours
> may be within a few hundreds of kilometres while point B's nearest
> neighbours may be in the thousands. I feel like the choice of any
> neighbourhood definition can be highly debateable... What do you think?
>

When in doubt, use contiguity for polygons and similar graph-based
methods for points. Try to keep the graphs planar (as few intersecting
edges as possible, as a rule of thumb).
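
A sketch of graph-based neighbours for points with spdep (assuming
coords is a two-column matrix of planar coordinates; tri2nb also needs
the deldir package installed):

```r
library(spdep)

# Delaunay triangulation neighbours form a planar graph; the
# sphere-of-influence subgraph then prunes the long edges that the
# triangulation creates at the edge of the study area.
nb_tri <- tri2nb(coords)
nb_soi <- graph2nb(soi.graph(nb_tri, coords))

# Gabriel graph neighbours are another planar alternative; sym = TRUE
# makes the resulting neighbour list symmetric.
nb_gab <- graph2nb(gabrielneigh(coords), sym = TRUE)
```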


> After analysing my problem again, I think that predicting by output areas
> (points) would be best for my case as I would have to make use of the
> population data after building the model. Interpolating census data of the
> output area (points) would cause me to lose that information.
>

Baseline: this is not going anywhere constructive, and simply
approaching retail location in this way is unhelpful - there is far too
little information in your model.

If you really must, first find a fully configured retail model with the
complete data set needed to replicate the results achieved, and use that
to benchmark how far your approach succeeds in reaching a similar result
for that restricted area. I think you'll find that the retail model is
much more successful; but if not, there is less structure in
contemporary retail than I thought.

Best wishes,

Roger

> Thank you for the comments and the advice so far,  I would greatly welcome
> and appreciate additional feedback!
>
> Thank you so much once again!
>
> Jiawen
>
>
>
>
>
>
>
>
> On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
>
>> On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>
>>> Dear Roger,
>>
>> Postings go to the whole list ...
>>
>>>
>>> How can we deal with a huge dataset when using dnearneigh?
>>>
>>
>> First, why distance neighbours? What is the support of the data, point or
>> polygon? If polygon, contiguity neighbours are preferred. If not, and the
>> intensity of observations is similar across the whole area, distance may
>> be justified, but if the intensity varies, some observations will have
>> very many neighbours. In that case, unless you have a clear ecological or
>> environmental reason for knowing that a known distance threshold binds, it
>> is not a good choice.
>>
>>> Here is my code:
>>>
>>> d <- dnearneigh(spdf,0, 22000)
>>> all_listw <- nb2listw(d, style = "W")
>>>
>>> where the spdf object is in the british national grid CRS:
>>> +init=epsg:27700, with 227,973 observations/points. The distance of
>> 22,000
>>> was decided by a training set that had 214 observations and the spdf
>> object
>>> contains both the training set and the testing set.
>>>
>>
>> This is questionable. You train on 214 observations - do their areal
>> intensity match those of the whole data set? If chosen at random, you run
>> into the spatial sampling problems discussed in:
>>
>>
>> https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>
>> Are 214 observations for training representative of 227,973 prediction
>> sites? Do you only have observations on the response for 214, and an
>> unobserved response otherwise? What are the data, what are you trying to
>> do and why? This is not a sensible setting for models using weights
>> matrices for prediction (I think), because we do not have estimates of the
>> prediction error in general.
>>
>>> I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
>>> memory. My laptop showed that when dnearneigh command was run on all
>>> observations, around 6.9 out of 8GB was used by the rsession and that the
>>> %CPU used by the rsession was stated to be around 98%, although another
>>> indicator showed that my computer was around 60% idle. After running the
>>> command for a day, rstudio alerted me that the connection to the rsession
>>> could not be established, so I aborted the entire process altogether. I
>>> think the problem here may be the size of the dataset and perhaps the
>>> limitations of my laptop specs.
>>>
>>
>> On planar data, there is no good reason for this, as each observation is
>> treated separately, finding and sorting distances, and choosing those
>> under the threshold. It will undoubtedly slow down if there are more than
>> a few neighbours within the threshold, but I already covered the
>> inadvisability of defining neighbours in that way.
>>
>> Using an rtree might help, but you get hit badly if there are many
>> neighbours within the threshold you have chosen anyway.
>>
>> On most 8GB hardware and a modern OS, you do not have more than 3-4GB for
>> work. So something was swapping on your laptop.
>>
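A minimal sketch of the sparser kind of neighbour definition suggested above: k nearest neighbours instead of a distance band. This assumes `spdf` is the SpatialPointsDataFrame from the original post, and k = 5 is purely illustrative, not a recommendation:

```r
library(sp)
library(spdep)

## Sketch only: k nearest neighbours instead of a 22 km distance band.
## 'spdf' is the poster's SpatialPointsDataFrame; k = 5 is illustrative.
coords <- coordinates(spdf)
knn <- knearneigh(coords, k = 5)     # k nearest neighbours per point
nb <- knn2nb(knn)                    # convert the knn object to an nb list
nb <- make.sym.nb(nb)                # symmetrise: i ~ j implies j ~ i
all_listw <- nb2listw(nb, style = "W")
```

With k kept small, both the neighbour search and any later sparse-matrix work stay tractable even at n = 227,973.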
>>> Do you have any advice on how I can go about making a neighbours list with
>>> dnearneigh for 227,973 observations in a successful and efficient way?
>>> Also, would you foresee any problems in the next steps, especially when I
>>> will be using the neighbourhood listw object as an input in fitting and
>>> predicting using the spatial lag/error models? (see code below)
>>>
>>> model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
>>> model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>>>
>>
>> Why would using a spatial lag model make sense? Why are you suggesting
>> this model? Do you have a behavioural argument for why only the spatially
>> lagged response should be included?
>>
>> Why do you think that this is sensible? You are predicting 1000 times for
>> each observation - this is not what the prediction methods are written
>> for. Most involve inverting an n x n matrix - did you refer to
>> Goulard et al. (2017) to get a good understanding of the underlying
>> methods?
>>
>>> I think the predicting part may take some time, since my test set consists
>>> of 227,973 - 214 observations = 227,759 observations.
>>>
>>> Here are some solutions that I have thought of:
>>>
>>> 1. Interpolate the test set point data of 227,759 observations over a more
>>> manageable spatial pixel dataframe with cell size of perhaps 10,000m by
>>> 10,000m which would give me around 4900 points. So instead of 227,759
>>> observations, I can make the listw object based on just 4900 + 214 training
>>> points and predict just on 4900 observations.
>>
>> But what are you trying to do? Are the observations output areas? House
>> sales? If you are not filling in missing areal units (the Goulard et al.
>> case), couldn't you simply use geostatistical methods which seem to match
>> your support better, and can be fitted and can predict using a local
>> neighbourhood? While you are doing that, you could switch to INLA with
>> SPDE, which interposes a mesh like the one you suggest. But in that case,
>> beware of the mesh choice issue in:
>>
>> https://doi.org/10.1080/03610926.2018.1536209
>>
>>>
>>> 2. Get hold of better performance machines through cloud computing such as
>>> AWS EC2 services and try running the commands and models there.
>>>
>>
>> What you need are methods, not wasted money on hardware as a service.
>>
>>> 3. Parallel computing using the parallel package from R (although I am not
>>> sure whether dnearneigh can be parallelised).
>>>
>>
>> This could easily be implemented if it was really needed, which I don't
>> think it is; better methods understanding lets one do more with less.
>>
>>> I believe option 1 would be the most manageable but I am not sure how and
>>> by how much this would affect the accuracy of the predictions as
>>> interpolating the dataset would be akin to introducing more estimations in
>>> the prediction. However, I am also grappling with the trade-off between
>>> accuracy and computation time. Hence, if options 2 and 3 can offer a
>>> reasonable computation time (1-2 hours) then I would forgo option 1.
>>>
>>> What do you think? Is it possible to make a neighbourhood listw object out
>>> of 227,973 observations efficiently?
>>
>> Yes, but only if the numbers of neighbours are very small. Look in Bivand
>> et al. (2013) to see the use of some fairly large n, but only with few
>> neighbours for each observation. You seem to be getting average neighbour
>> counts in the thousands, which makes no sense.
>>
>>>
>>> Thank you for reading to the end! Apologies for writing a lengthy one, just
>>> wanted to fully describe what I am facing, I hope I didn't miss out
>>> anything crucial.
>>>
>>
>> Long is OK, but there is no motivation here for why you want to make 200K
>> predictions from 200 observations with point support (?) using weights
>> matrices.
>>
>> Hope this clarifies,
>>
>> Roger
>>
>>> Thank you so much once again!
>>>
>>> jiawen
>>>
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> [hidden email]
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; e-mail: [hidden email]
>> https://orcid.org/0000-0003-2392-6140
>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en


Re: Running huge dataset with dnearneigh

amoroso
Dear Roger,

Thanks for your reply and explanation!

I am just exploring the aspect of geodemographics in store locations. There
are many factors that can be considered, as you have highlighted!

Thank you so much for taking the time to write back to me! I will study and
consider your advice! Thank you!

Jiawen

On Mon, 1 Jul 2019 at 19:12, Roger Bivand <[hidden email]> wrote:

> On Mon, 1 Jul 2019, Jiawen Ng wrote:
>
> > Dear Roger,
> >
> > Thank you so much for your detailed response and pointing out potential
> > pitfalls! It has prompted me to re-evaluate my approach.
> >
> > Here is the context: I have some stores' sales data (this is my training
> > set of 214 points), I would like to find out where best to set up new
> > stores in the UK. I am using a geodemographics approach to do this: Perform a
> > regression of sales against census data, then predict sales on UK output
> > areas (by centroids) and finally identify new areas with
> > location-allocation models. As the stores are points, this has led me to
> > define UK output areas by their population-weighted centroids, thus
> > resulting in the prediction by points rather than by areas. Tests (like
> > Moran's I and Lagrange multiplier) for spatial relationships among the
> > points in my training set were significant, hence this has led me to
> > implement some spatial models (specifically spatial lag, error and Durbin
> > models) to account for the spatial relationships in the data.
>
> I'm afraid that my retail geography is not very up to date, but also that
> your approach is most unlikely to yield constructive results.
>
> Most retail stores are organised in large chains, and so optimise costs
> between wholesale and retail. Independent retail stores depend crucially
> on access to wholesale stores, so anyway cannot locate without regard to
> supply costs. Some service activities without wholesale dependencies are
> less tied.
>
> Most chains certainly behave strategically with regard to each other,
> sometimes locating toe-to-toe to challenge a competing chain
> (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
> competing chain locations to establish a local monopoly (think Hotelling).
>
> Population density doesn't express demand, especially unmet demand well at
> all. Think food deserts - maybe plenty of people but little disposable
> income. Look at the food desert literature, or the US food stamp
> literature.
>
> Finally (all bad news) retail is not only challenged by location shifting
> from high streets to malls, but critically by online shopping, which
> shifts the cost structures, once the buyer is engaged at a proposed price, to
> logistics, to complete the order at the highest margin including returns.
> That only marginally relates to population density.
>
> So you'd need more data than you have, a model that explicitly handles
> competition between chains as well as market gaps, and some way of
> handling online leakage to move forward.
>
> If population density was a proxy for accessibility (most often it isn't),
> it might look like the beginnings of a model, but most often we don't know
> what bid-rent surfaces look like, and then, most often different
> activities sort differently across those surfaces.
>
> >
> > I am quite unsettled and unclear as to which neighbourhood definition to
> > go for, actually. I thought of IDW at first, as I thought this would
> > summarise each point's relationship with their neighbours very precisely,
> > thus making the predictions more accurate. Upon your advice (don't use IDW
> > or other general weights for predictions), I decided not to use IDW, and
> > changed it to dnearneigh instead (although now I am questioning myself on
> > the definition of what is meant by general weights; perhaps I am
> > understanding the definition of general weights wrong, if dnearneigh is
> > still considered to be a 'general weights' method). Why is the use of IDW
> > not advisable, however? Is it due to computational reasons? Also, why
> > would having thousands of neighbours make no sense? Apologies for asking
> > so many questions, I'd just like to really understand the concepts!
> >
>
> The model underlying spatial regressions using neighbours tapers
> dependency as the pairwise elements of (I - \rho W)^{-1} (conditional) and
> [(I - \rho W) (I - \rho W')]^{-1} (see Wall 2004). These are NxN dense
> matrices. (I - \rho W) is typically sparse, and under certain conditions
> leads to (I - \rho W)^{-1} = \sum_{i=0}^{\infty} \rho^i W^i, the sum of a
> power series in \rho and W. \rho is typically upward bounded < 1, so
> \rho^i declines as i increases. This dampens \rho^i W^i, so that
> observation i influences observation j less and less as the order of the
> path between them increases. So in the general case IDW
> is simply replicating what simple contiguity gives you anyway. So the
> sparser W is (within reason), the better. Unless you really know that the
> physics, chemistry or biology of your system give you a known systematic
> relationship like IDW, you may as well stay with contiguity.
>
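The tapering described above can be checked numerically. A toy sketch of the power-series identity, with a 3 x 3 grid and rho = 0.5 chosen arbitrarily (none of this is from the original exchange):

```r
library(spdep)

## Toy check of (I - rho W)^{-1} = sum_{i >= 0} rho^i W^i
nb <- cell2nb(3, 3)                  # rook contiguity on a 3 x 3 grid
W <- nb2mat(nb, style = "W")         # 9 x 9 row-standardised weights
rho <- 0.5                           # arbitrary, |rho| < 1

direct <- solve(diag(9) - rho * W)   # dense inverse

series <- diag(9)                    # i = 0 term
term <- diag(9)
for (i in 1:60) {                    # each rho^i W^i term is damped as i grows
  term <- rho * term %*% W
  series <- series + term
}
max(abs(direct - series))            # effectively zero: the series converges
```

Because W is row-standardised and |rho| < 1, the terms shrink geometrically, which is why a sparse contiguity W already captures most of the dependency structure.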
> However, this isn't any use in solving a retail location problem at all.
>
> > I believe that both the train and test sets have varying intensities. I
> > was weighing the different neighbourhood methods: dnearneigh, knearneigh,
> > using IDW etc., and I felt like each method would have its disadvantages
> > -- it's difficult to pinpoint which neighbourhood definition would be
> > best. If one were to go for knearneigh, for example, results may not be
> > fair due to the inhomogeneity of the points -- for instance, point A's
> > nearest neighbours may be within a few hundred kilometres while point B's
> > may be in the thousands. I feel like the choice of any neighbourhood
> > definition can be highly debatable... What do you think?
> >
>
> When in doubt use contiguity for polygons and similar graph based methods
> for points. Try to keep the graphs planar (as few intersecting edges as
> possible - rule of thumb).
>
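For point support, the graph-based definitions mentioned above might look like this in spdep (a sketch; `coords` stands for a hypothetical two-column coordinate matrix, not an object from the thread):

```r
library(spdep)

## Sketch: graph-based neighbour definitions for point support.
## 'coords' is a hypothetical two-column coordinate matrix.
nb_tri <- tri2nb(coords)                              # Delaunay triangulation
nb_gab <- graph2nb(gabrielneigh(coords), sym = TRUE)  # Gabriel graph
nb_rel <- graph2nb(relativeneigh(coords), sym = TRUE) # relative neighbour graph
## The Gabriel and relative neighbour graphs are subgraphs of the
## triangulation, so they stay sparse and close to planar.
```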
>
> > After analysing my problem again, I think that predicting by output areas
> > (points) would be best for my case as I would have to make use of the
> > population data after building the model. Interpolating census data of the
> > output area (points) would cause me to lose that information.
> >
>
> Baseline, this is not going anywhere constructive, and simply approaching
> retail location in this way is unhelpful - there is far too little
> information in your model.
>
> If you really must, first find a fully configured retail model with the
> complete data set needed to replicate the results achieved, and use that
> to benchmark how far your approach succeeds in reaching a similar result
> for that restricted area. I think that you'll find that the retail model
> is much more successful, but if not, there is less structure in
> contemporary retail than I thought.
>
> Best wishes,
>
> Roger
>
> > Thank you for the comments and the advice so far, I would greatly welcome
> > and appreciate additional feedback!
> >
> > Thank you so much once again!
> >
> > Jiawen
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
> >
> >> On Sat, 29 Jun 2019, Jiawen Ng wrote:
> >>
> >>> Dear Roger,
> >>
> >> Postings go to the whole list ...
> >>
> >>>
> >>> How can we deal with a huge dataset when using dnearneigh?
> >>>
> >>
> >> First, why distance neighbours? What is the support of the data, point or
> >> polygon? If polygon, contiguity neighbours are preferred. If not, and the
> >> intensity of observations is similar across the whole area, distance may
> >> be justified, but if the intensity varies, some observations will have
> >> very many neighbours. In that case, unless you have a clear ecological or
> >> environmental reason for knowing that a known distance threshold binds, it
> >> is not a good choice.
> >>


Re: Running huge dataset with dnearneigh

Roger Bivand
Administrator
On Tue, 2 Jul 2019, Jiawen Ng wrote:

> Dear Roger,
>
> Thanks for your reply and explanation!
>
> I am just exploring the aspect of geodemographics in store locations. There
> are many factors that can be considered, as you have highlighted!

OK, so I suggest choosing a modest-sized case until a selection of working
models emerges. Once you reach that stage, you can return to scaling up. I
think you need much more data on the customer behaviour around the stores
you use to train your models, particularly customer flows associated with
actual purchases. Firms used to do this through loyalty programmes and
> cards, but these data are not open, so you'd need proxies which, say, city
> bikes will not give you.

Geodemographics (used for direct mailing as a marketing tool) have largely
been eclipsed by profiling in social media with the exception of segments
without social media profiles. This is because postcode or OA profiling is
> often too noisy, and expensive because there are many false hits.
> Retail is interesting but very multi-faceted; some personal services
are more closely related to population as they are hard to digitise.

Hope this helps,

Roger

>
> Thank you so much for taking the time to write back to me! I will study and
> consider your advice! Thank you!
>
> Jiawen
>
> On Mon, 1 Jul 2019 at 19:12, Roger Bivand <[hidden email]> wrote:
>
>> On Mon, 1 Jul 2019, Jiawen Ng wrote:
>>
>>> Dear Roger,
>>>
>>> Thank you so much for your detailed response and pointing out potential
>>> pitfalls! It has prompted me to re-evalutate my approach.
>>>
>>> Here is the context: I have some stores' sales data (this is my training
>>> set of 214 points), I would like to find out where best to set up new
>>> stores in UK. I am using a geodemographics approach to do this: Perform a
>>> regression of sales against census data, then predict sales on UK output
>>> areas (by centroids) and finally identify new areas with
>>> location-allocation models. As the stores are points, this has led me to
>>> define UK output areas by its population-weighted centroids, thus
>> resulting
>>> in the prediction by points rather than by areas. Tests (like moran's I
>> and
>>> lagrange multiplier) for spatial relationships among the points in my
>>> training set were significant hence this has led me to implement some
>>> spatial models (specifically spatial lag, error and durbin models) to
>>> account for the spatial relationships in the data.
>>
>> I'm afraid that my retail geography is not very up to date, but also that
>> your approach is most unlikely to yield constructive results.
>>
>> Most retail stores are organised in large chains, so optimise costs
>> between wholesale and retail. Independent retail stores depend crucially
>> on access to wholesale stores, so anyway cannot locate without regard to
>> supply costs. Some service activities without wholesale dependencies are
>> less tied.
>>
>> Most chains certainly behave strategically with regard to each other,
>> sometimes locating toe-to-toe to challenge a competing chain
>> (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
>> competing chain locations to establish a local monopoly (think Hotelling).
>>
>> Population density doesn't express demand, especially unmet demand well at
>> all. Think food deserts - maybe plenty of people but little disposable
>> income. Look at the food desert literature, or the US food stamp
>> literature.
>>
>> Finally (all bad news) retail is not only challenged by location shifting
>> from high streets to malls, but critically by online shopping, which
>> shifts the cost structures one the buyer is engaged at a proposed price to
>> logistics, to complete the order at the highest margin including returns.
>> That only marginally relates to population density.
>>
>> So you'd need more data than you have, a model that explicitly handles
>> competition between chains as well as market gaps, and some way of
>> handling online leakage to move forward.
>>
>> If population density was a proxy for accessibility (most often it isn't),
>> it might look like the beginnings of a model, but most often we don't know
>> what bid-rent surfaces look like, and then, most often different
>> activities sort differently across those surfaces.
>>
>>>
>>> I am quite unsettled and unclear as to which neighbourhood definition to
>> go
>>> for actually. I thought of IDW at first as I thought this would summarise
>>> each point's relationship with their neighbours very precisely thus
>> making
>>> the predictions more accurate. Upon your advice (don't use IDW or other
>>> general weights for predictions), I decided not to use IDW, and changed
>> it
>>> to dnearneigh instead (although now I am questioning myself on the
>>> definition of what is meant by general weights. Perhaps I am
>> understanding
>>> the definition of general weights wrong, if dnearneigh is still
>> considered
>>> to be a 'general weights' method) Why is the use of IDW not advisable
>>> however? Is it due to computational reasons? Also, why would having
>>> thousands of neighbours be making no sense? Apologies for asking so many
>>> questions, I'd just like to really understand the concepts!
>>>
>>
>> The model underlying spatial regressions using neighbours tapers
>> dependency as the pairwise elements of (I - \rho W)^{-1} (conditional) and
>> [(I - \rho W) (I - \rho W')]^{-1} (see Wall 2004). These are NxN dense
>> matrices. (I - \rho W) is typically sparse, and under certain conditions
>> leads to (I - \rho W)^{-1} = \sum_{i=0}^{\inf} \rho^i W^i, the sum of a
>> power series in \rho and W. \rho is typically upward bounded < 1, so
>> \rho^i declines as i increases. This dampens \rho^i W^i, so that i
>> influences j less and less with increasing i. So in the general case IDW
>> is simply replicating what simple contiguity gives you anyway. So the
>> sparser W is (within reason), the better. Unless you really know that the
>> physics, chemistry or biology of your system give you a known systematic
>> relationship like IDW, you may as well stay with contiguity.
>>
>> However, this isn't any use in solving a retail location problem at all.
>>
>>> I believe that both the train and test set has varying intensities. I was
>>> weighing the different neighbourhood methods: dnearneigh, knearneigh,
>> using
>>> IDW etc. and I felt like each method would have its disadvantages -- its
>>> difficult to pinpoint which neighbourhood definition would be best. If
>> one
>>> were to go for knearneigh for example, results may not be fair due to the
>>> inhomogeneity of the points -- for instance, point A's nearest neighbours
>>> may be within a few hundreds of kilometres while point B's nearest
>>> neighbours may be in the thousands. I feel like the choice of any
>>> neighbourhood definition can be highly debateable... What do you think?
>>>
>>
>> When in doubt use contiguity for polygons and similar graph based methods
>> for points. Try to keep the graphs planar (as few intersecting edges as
>> possible - rule of thumb).
>>
>>
>>> After analysing my problem again, I think that predicting by output areas
>>> (points) would be best for my case as I would have to make use of the
>>> population data after building the model. Interpolating census data of
>> the
>>> output area (points) would cause me to lose that information.
>>>
>>
>> Baseline, this is not going anywhere constructive, and simply approaching
>> retail location in this way is unhelpful - there is far too little
>> information in your model.
>>
>> If you really must, first find a fully configured retail model with the
>> complete data set needed to replicate the results achieved, and use that
>> to benchmark how far your approach succeeds in reaching a similar result
>> for that restricted area. I think that you'll find that the retail model
>> is much more successful, but if not, there is less structure in
>> contemporary retail than I though.
>>
>> Best wishes,
>>
>> Roger
>>
>>> Thank you for the comments and the advice so far,  I would greatly
>> welcome
>>> and appreciate additional feedback!
>>>
>>> Thank you so much once again!
>>>
>>> Jiawen
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
>>>
>>>> On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>>>
>>>>> Dear Roger,
>>>>
>>>> Postings go to the whole list ...
>>>>
>>>>>
>>>>> How can we deal with a huge dataset when using dnearneigh?
>>>>>
>>>>
>>>> First, why distance neighbours? What is the support of the data, point
>> or
>>>> polygon? If polygon, contiguity neighbours are preferred. If not, and
>> the
>>>> intensity of observations is similar across the whole area, distance may
>>>> be justified, but if the intensity varies, some observations will have
>>>> very many neighbours. In that case, unless you have a clear ecological
>> or
>>>> environmental reason for knowing that a known distance threshold binds,
>> it
>>>> is not a good choice.
>>>>
>>>>> Here is my code:
>>>>>
>>>>> d <- dnearneigh(spdf,0, 22000)
>>>>> all_listw <- nb2listw(d, style = "W")
>>>>>
>>>>> where the spdf object is in the british national grid CRS:
>>>>> +init=epsg:27700, with 227,973 observations/points. The distance of
>>>> 22,000
>>>>> was decided by a training set that had 214 observations and the spdf
>>>> object
>>>>> contains both the training set and the testing set.
>>>>>
>>>>
>>>> This is questionable. You train on 214 observations - do their areal
>>>> intensity match those of the whole data set? If chosen at random, you
>> run
>>>> into the spatial sampling problems discussed in:
>>>>
>>>>
>>>>
>> https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>>>
>>>> Are 214 observations for training representative of 227,973 prediction
>>>> sites? Do you only have observations on the response for 214, and an
>>>> unobserved response otherwise? What are the data, what are you trying to
>>>> do and why? This is not a sensible setting for models using weights
>>>> matrices for prediction (I think), because we do not have estimates of
>> the
>>>> prediction error in general.
>>>>
>>>>> I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
>>>>> memory. My laptop showed that when dnearneigh command was run on all
>>>>> observations, around 6.9 out of 8GB was used by the rsession and that
>> the
>>>>> %CPU used by the rsession was stated to be around 98%, although another
>>>>> indicator showed that my computer was around 60% idle. After running
>> the
>>>>> command for a day, rstudio alerted me that the connection to the
>> rsession
>>>>> could not be established, so I aborted the entire process altogether. I
>>>>> think the problem here may be the size of the dataset and perhaps the
>>>>> limitations of my laptop specs.
>>>>>
>>>>
>>>> On planar data, there is no good reason for this, as each observation is
>>>> treated separately, finding and sorting distances, and choosing those
>>>> under the threshold. It will undoubtedly slow if there are more than a
>> few
>>>> neighbours within the threshold, but I already covered the
>> inadvisability
>>>> of defining neighbours in that way.
>>>>
>>>> Using an rtree might help, but you get hit badly if there are many
>>>> neighbours within the threshold you have chosen anyway.
>>>>
>>>> On most 8GB hardware and modern OS, you do not have more than 3-4GB for
>>>> work. So something was swapping on your laptop.
>>>>
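[A sketch, not from the thread: one memory-light alternative is k-nearest neighbours, which cap the link count per point so storage stays around O(n * k) even for hundreds of thousands of points. The coordinates below are simulated stand-ins for the British National Grid data.]

```r
# Sketch (assumed setup): k nearest neighbours scale to large n because each
# point keeps only k links, unlike a distance band in a dense area.
library(spdep)
set.seed(1)
n <- 10000                                         # stand-in for the ~228k points
xy <- cbind(runif(n, 0, 7e5), runif(n, 0, 1.2e6))  # BNG-like extent in metres
nb <- knn2nb(knearneigh(xy, k = 5))                # 5 nearest neighbours each
lw <- nb2listw(nb, style = "W")                    # row-standardised weights
table(card(nb))                                    # every point has exactly 5 links
```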
>>>>> Do you have any advice on how I can go about making a neighbours list
>>>> with
>>>>> dnearneigh for 227,973 observations in a successful and efficient way?
>>>>> Also, would you foresee any problems in the next steps, especially
>> when I
>>>>> will be using the neighbourhood listw object as an input in fitting and
>>>>> predicting using the spatial lag/error models? (see code below)
>>>>>
>>>>> model <- spatialreg::lagsarlm(rest_formula, data=train, train_listw)
>>>>> model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>>>>>
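[For reference, a runnable sketch of the quoted fit/predict pattern on the small built-in Columbus data (this assumes the spData package is installed); the discussion here questions whether the pattern is appropriate at 228k points.]

```r
# Sketch: spatial lag model fit and in-sample prediction on 49 areas,
# not a recommendation for the 228k-point setting discussed above.
library(spdep)
library(spatialreg)
data(columbus, package = "spData")       # columbus data frame + col.gal.nb
lw <- nb2listw(col.gal.nb, style = "W")
fit <- lagsarlm(CRIME ~ INC + HOVAL, data = columbus, listw = lw)
pr <- predict(fit)                       # predict.sarlm, in-sample
length(pr)                               # one prediction per observation
```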
>>>>
>>>> Why would using a spatial lag model make sense? Why are you suggesting
>>>> this model? Do you have a behavioural argument for why only the spatially lagged
>>>> response should be included?
>>>>
>>>> Why do you think that this is sensible? You are predicting 1000 times
>> for
>>>> each observation - this is not what the prediction methods are written
>>>> for. Most involve inverting an n x n matrix - did you refer to
>>>> Goulard et al. (2017) to get a good understanding of the underlying
>>>> methods?
>>>>
>>>>> I think the predicting part may take some time, since my test set
>>>> consists
>>>>> of 227,973 - 214 observations = 227,759 observations.
>>>>>
>>>>> Here are some solutions that I have thought of:
>>>>>
>>>>> 1. Interpolate the test set point data of 227,759 observations over a
>>>> more
>>>>> manageable spatial pixel dataframe with cell size of perhaps 10,000m by
>>>>> 10,000m which would give me around 4900 points. So instead of 227,759
>>>>> observations, I can make the listw object based on just 4900 + 214
>>>> training
>>>>> points and predict just on 4900 observations.
>>>>
>>>> But what are you trying to do? Are the observations output areas? House
>>>> sales? If you are not filling in missing areal units (the Goulard et al.
>>>> case), couldn't you simply use geostatistical methods which seem to
>> match
>>>> your support better, and can be fitted and can predict using a local
>>>> neighbourhood? While you are doing that, you could switch to INLA with
>>>> SPDE, which interposes a mesh like the one you suggest. But in that
>> case,
>>>> beware of the mesh choice issue in:
>>>>
>>>> https://doi.org/10.1080/03610926.2018.1536209
>>>>
>>>>>
>>>>> 2. Get hold of better performance machines through cloud computing such
>>>> as
>>>>> AWS EC2 services and try running the commands and models there.
>>>>>
>>>>
>>>> What you need are methods, not wasted money on hardware as a service.
>>>>
>>>>> 3. Parallel computing using the parallel package from r (although I am
>>>> not
>>>>> sure whether dnearneigh can be parallelised).
>>>>>
>>>>
>>>> This could easily be implemented if it was really needed, which I don't
>>>> think it is; better methods understanding lets one do more with less.
>>>>
>>>>> I believe option 1 would be the most manageable but I am not sure how
>> and
>>>>> by how much this would affect the accuracy of the predictions as
>>>>> interpolating the dataset would be akin to introducing more estimations
>>>> in
>>>>> the prediction. However, I am also grappling with the trade-off between
>>>>> accuracy and computation time. Hence, if options 2 and 3 can offer a
>>>>> reasonable computation time (1-2 hours) then I would forgo option 1.
>>>>>
>>>>> What do you think? Is it possible to make a neighbourhood listw object
>>>> out
>>>>> of 227,973 observations efficiently?
>>>>
>>>> Yes, but only if the numbers of neighbours are very small. Look in
>> Bivand
>>>> et al. (2013) to see the use of some fairly large n, but only with few
>>>> neighbours for each observation. You seem to be getting average
>> neighbour
>>>> counts in the thousands, which makes no sense.
>>>>
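[A sketch of checking this cheaply: card() gives each point's neighbour count from the nb object, so an over-wide distance band shows up before nb2listw() and any modelling.]

```r
# Sketch: inspect neighbour cardinalities before committing to a listw.
library(spdep)
set.seed(42)
xy <- cbind(runif(500, 0, 1e5), runif(500, 0, 1e5))  # 500 points, 100 km square
d <- dnearneigh(xy, 0, 22000)   # same call pattern as in the thread
summary(card(d))                # distribution of neighbour counts
mean(card(d))                   # dozens of neighbours each, even at n = 500
```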
>>>>>
>>>>> Thank you for reading to the end! Apologies for writing a lengthy one,
>>>> just
>>>>> wanted to fully describe what I am facing, I hope I didn't miss out
>>>>> anything crucial.
>>>>>
>>>>
>>>> Long is OK, but there is no motivation here for why you want to make
>> 200K
>>>> predictions from 200 observations with point support (?) using weights
>>>> matrices.
>>>>
>>>> Hope this clarifies,
>>>>
>>>> Roger
>>>>
>>>>> Thank you so much once again!
>>>>>
>>>>> jiawen
>>>>>
>>>>> _______________________________________________
>>>>> R-sig-Geo mailing list
>>>>> [hidden email]
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>
>>>>
>>>> --
>>>> Roger Bivand
>>>> Department of Economics, Norwegian School of Economics,
>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>> voice: +47 55 95 93 55; e-mail: [hidden email]
>>>> https://orcid.org/0000-0003-2392-6140
>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>
>>>
>


Re: Running huge dataset with dnearneigh

Roger Bivand
Administrator
Follow-up: maybe read: https://geocompr.robinlovelace.net/location.html 
for a geomarketing case.

Roger

On Tue, 2 Jul 2019, Roger Bivand wrote:

> On Tue, 2 Jul 2019, Jiawen Ng wrote:
>
>>  Dear Roger,
>>
>>  Thanks for your reply and explanation!
>>
>>  I am just exploring the aspect of geodemographics in store locations.
>>  There
>>  are many factors that can be considered, as you have highlighted!
>
> OK, so I suggest choosing a modest sized case until a selection of working
> models emerges. Once you reach that stage, you can return to scaling up. I
> think you need much more data on the customer behaviour around the stores you
> use to train your models, particularly customer flows associated with actual
> purchases. Firms used to do this through loyalty programmes and cards, but
> this data is not open, so you'd need proxies, which, say, city bike data
> will not give you.
>
> Geodemographics (used for direct mailing as a marketing tool) have largely
> been eclipsed by profiling in social media with the exception of segments
> without social media profiles. This is because postcode or OA profiling is
> often too noisy and so is expensive because there are many false hits. Retail
> is interesting but very multi-faceted; some personal services are more
> closely related to population, as they are hard to digitise.
>
> Hope this helps,
>
> Roger
>
>>
>>  Thank you so much for taking the time to write back to me! I will study
>>  and
>>  consider your advice! Thank you!
>>
>>  Jiawen
>>
>>  On Mon, 1 Jul 2019 at 19:12, Roger Bivand <[hidden email]> wrote:
>>
>>>  On Mon, 1 Jul 2019, Jiawen Ng wrote:
>>>
>>>>  Dear Roger,
>>>>
>>>>  Thank you so much for your detailed response and pointing out potential
>>>>  pitfalls! It has prompted me to re-evalutate my approach.
>>>>
>>>>  Here is the context: I have some stores' sales data (this is my training
>>>>  set of 214 points), I would like to find out where best to set up new
>>>>  stores in UK. I am using a geodemographics approach to do this: Perform
>>>>  a
>>>>  regression of sales against census data, then predict sales on UK output
>>>>  areas (by centroids) and finally identify new areas with
>>>>  location-allocation models. As the stores are points, this has led me to
>>>>  define UK output areas by their population-weighted centroids, thus
>>>  resulting
>>>>  in the prediction by points rather than by areas. Tests (like Moran's I
>>>  and
>>>>  Lagrange multiplier) for spatial relationships among the points in my
>>>>  training set were significant hence this has led me to implement some
>>>>  spatial models (specifically spatial lag, error and Durbin models) to
>>>>  account for the spatial relationships in the data.
>>>
>>>  I'm afraid that my retail geography is not very up to date, but also that
>>>  your approach is most unlikely to yield constructive results.
>>>
>>>  Most retail stores are organised in large chains, and so optimise costs
>>>  between wholesale and retail. Independent retail stores depend crucially
>>>  on access to wholesale stores, so anyway cannot locate without regard to
>>>  supply costs. Some service activities without wholesale dependencies are
>>>  less tied.
>>>
>>>  Most chains certainly behave strategically with regard to each other,
>>>  sometimes locating toe-to-toe to challenge a competing chain
>>>  (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
>>>  competing chain locations to establish a local monopoly (think
>>>  Hotelling).
>>>
>>>  Population density doesn't express demand, especially unmet demand,
>>>  well at all. Think food deserts - maybe plenty of people but little
>>>  disposable
>>>  income. Look at the food desert literature, or the US food stamp
>>>  literature.
>>>
>>>  Finally (all bad news) retail is not only challenged by location shifting
>>>  from high streets to malls, but critically by online shopping, which
>>>  shifts the cost structures, once the buyer is engaged at a proposed
>>>  price, to logistics, to complete the order at the highest margin
>>>  including returns.
>>>  That only marginally relates to population density.
>>>
>>>  So you'd need more data than you have, a model that explicitly handles
>>>  competition between chains as well as market gaps, and some way of
>>>  handling online leakage to move forward.
>>>
>>>  If population density was a proxy for accessibility (most often it
>>>  isn't),
>>>  it might look like the beginnings of a model, but most often we don't
>>>  know
>>>  what bid-rent surfaces look like, and then, most often different
>>>  activities sort differently across those surfaces.
>>>
>>>>
>>>>  I am quite unsettled and unclear as to which neighbourhood definition to
>>>  go
>>>>  for actually. I thought of IDW at first as I thought this would
>>>>  summarise
>>>>  each point's relationship with their neighbours very precisely thus
>>>  making
>>>>  the predictions more accurate. Upon your advice (don't use IDW or other
>>>>  general weights for predictions), I decided not to use IDW, and changed
>>>  it
>>>>  to dnearneigh instead (although now I am questioning myself on the
>>>>  definition of what is meant by general weights. Perhaps I am
>>>  understanding
>>>>  the definition of general weights wrong, if dnearneigh is still
>>>  considered
>>>>  to be a 'general weights' method) Why is the use of IDW not advisable
>>>>  however? Is it due to computational reasons? Also, why would having
>>>>  thousands of neighbours be making no sense? Apologies for asking so many
>>>>  questions, I'd just like to really understand the concepts!
>>>>
>>>
>>>  The model underlying spatial regressions using neighbours tapers
>>>  dependency as the pairwise elements of (I - \rho W)^{-1} (conditional)
>>>  and [(I - \rho W)(I - \rho W')]^{-1} (see Wall 2004). These are NxN
>>>  dense matrices. (I - \rho W) is typically sparse, and under certain
>>>  conditions leads to (I - \rho W)^{-1} = \sum_{i=0}^{\infty} \rho^i W^i,
>>>  the sum of a power series in \rho and W. \rho is typically bounded
>>>  above by 1, so \rho^i declines as i increases. This dampens \rho^i W^i,
>>>  so that higher-order (more distant) neighbours influence an observation
>>>  less and less as i increases. So in the general case IDW is simply
>>>  replicating what simple contiguity gives you anyway. So the sparser W
>>>  is (within reason), the better. Unless you really know that the
>>>  physics, chemistry or biology of your system gives you a known
>>>  systematic relationship like IDW, you may as well stay with contiguity.
>>>
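[A numerical check of the power-series identity described above, a base-R sketch on a toy chain-contiguity W:]

```r
# Verify that (I - rho*W)^{-1} matches sum_{i>=0} rho^i W^i for a small
# row-standardised W and |rho| < 1.
n <- 5
W <- matrix(0, n, n)
W[abs(row(W) - col(W)) == 1] <- 1   # chain contiguity: neighbours differ by 1
W <- W / rowSums(W)                 # row-standardise ("W" style)
rho <- 0.5
inv <- solve(diag(n) - rho * W)     # (I - rho W)^{-1}
S <- diag(n)                        # running sum of the series, i = 0 term
term <- diag(n)                     # holds rho^i W^i
for (i in 1:50) {
  term <- rho * term %*% W          # next power-series term
  S <- S + term
}
max(abs(inv - S))                   # effectively zero: the series converges
```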
>>>  However, this isn't any use in solving a retail location problem at all.
>>>
>>>>  I believe that both the train and test set has varying intensities. I
>>>>  was
>>>>  weighing the different neighbourhood methods: dnearneigh, knearneigh,
>>>  using
>>>>  IDW etc. and I felt like each method would have its disadvantages -- its
>>>>  difficult to pinpoint which neighbourhood definition would be best. If
>>>  one
>>>>  were to go for knearneigh for example, results may not be fair due to
>>>>  the
>>>>  inhomogeneity of the points -- for instance, point A's nearest
>>>>  neighbours
>>>>  may be within a few hundred kilometres while point B's nearest
>>>>  neighbours may be in the thousands. I feel like the choice of any
>>>>  neighbourhood definition can be highly debateable... What do you think?
>>>>
>>>
>>>  When in doubt, use contiguity for polygons and similar graph-based methods
>>>  for points. Try to keep the graphs planar (as few intersecting edges as
>>>  possible - rule of thumb).
>>>
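[A sketch of graph-based neighbours for points along these lines: the Gabriel graph (spdep::gabrielneigh) gives sparse, near-planar link sets; tri2nb() and soi.graph() in spdep are related alternatives.]

```r
# Sketch: Gabriel-graph neighbours for point support, symmetrised for use
# as spatial weights.
library(spdep)
set.seed(7)
xy <- cbind(runif(200), runif(200))
nb <- graph2nb(gabrielneigh(xy), sym = TRUE)  # symmetrise the graph
mean(card(nb))                                # a few links per point, not thousands
```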
>>>
>>>>  After analysing my problem again, I think that predicting by output
>>>>  areas
>>>>  (points) would be best for my case as I would have to make use of the
>>>>  population data after building the model. Interpolating census data of
>>>  the
>>>>  output area (points) would cause me to lose that information.
>>>>
>>>
>>>  Baseline, this is not going anywhere constructive, and simply approaching
>>>  retail location in this way is unhelpful - there is far too little
>>>  information in your model.
>>>
>>>  If you really must, first find a fully configured retail model with the
>>>  complete data set needed to replicate the results achieved, and use that
>>>  to benchmark how far your approach succeeds in reaching a similar result
>>>  for that restricted area. I think that you'll find that the retail model
>>>  is much more successful, but if not, there is less structure in
>>>  contemporary retail than I thought.
>>>
>>>  Best wishes,
>>>
>>>  Roger
>>>
>>>>  Thank you for the comments and the advice so far,  I would greatly
>>>  welcome
>>>>  and appreciate additional feedback!
>>>>
>>>>  Thank you so much once again!
>>>>
>>>>  Jiawen
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>  On Sun, 30 Jun 2019 at 16:57, Roger Bivand <[hidden email]> wrote:
>>>>
>>>>>  On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>>>>
>>>>>>  Dear Roger,
>>>>>
>>>>>  Postings go to the whole list ...
>>>>>
>>>>>>
>>>>>>  How can we deal with a huge dataset when using dnearneigh?
>>>>>>
>>>>>
>>>>>  First, why distance neighbours? What is the support of the data, point
>>>  or
>>>>>  polygon? If polygon, contiguity neighbours are preferred. If not, and
>>>  the
>>>>>  intensity of observations is similar across the whole area, distance
>>>>>  may
>>>>>  be justified, but if the intensity varies, some observations will have
>>>>>  very many neighbours. In that case, unless you have a clear ecological
>>>  or
>>>>>  environmental reason for knowing that a known distance threshold binds,
>>>  it
>>>>>  is not a good choice.
>>>>>
>>>>>>  Here is my code:
>>>>>>
>>>>>>  d <- dnearneigh(spdf, 0, 22000)
>>>>>>  all_listw <- nb2listw(d, style = "W")
>>>>>>
>>>>>>  where the spdf object is in the british national grid CRS:
>>>>>>  +init=epsg:27700, with 227,973 observations/points. The distance of
>>>>>  22,000
>>>>>>  was decided by a training set that had 214 observations and the spdf
>>>>>  object
>>>>>>  contains both the training set and the testing set.
>>>>>>
>>>>>
>>>>>  This is questionable. You train on 214 observations - does their areal
>>>>>  intensity match that of the whole data set? If chosen at random, you
>>>  run
>>>>>  into the spatial sampling problems discussed in:
>>>>>
>>>>>
>>>>>
>>>  https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>>>>
>>>>>  Are 214 observations for training representative of 227,973 prediction
>>>>>  sites? Do you only have observations on the response for 214, and an
>>>>>  unobserved response otherwise? What are the data, what are you trying
>>>>>  to
>>>>>  do and why? This is not a sensible setting for models using weights
>>>>>  matrices for prediction (I think), because we do not have estimates of
>>>  the
>>>>>  prediction error in general.
>>>>>
>>>>>>  I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
>>>>>>  memory. My laptop showed that when dnearneigh command was run on all
>>>>>>  observations, around 6.9 out of 8GB was used by the rsession and that
>>>  the
>>>>>>  %CPU used by the rsession was stated to be around 98%, although
>>>>>>  another
>>>>>>  indicator showed that my computer was around 60% idle. After running
>>>  the
>>>>>>  command for a day, rstudio alerted me that the connection to the
>>>  rsession
>>>>>>  could not be established, so I aborted the entire process altogether.
>>>>>>  I
>>>>>>  think the problem here may be the size of the dataset and perhaps the
>>>>>>  limitations of my laptop specs.
>>>>>>
>>>>>
>>>>>  On planar data, there is no good reason for this, as each observation
>>>>>  is
>>>>>  treated separately, finding and sorting distances, and choosing those
>>>>>  under the threshold. It will undoubtedly slow if there are more than a
>>>  few
>>>>>  neighbours within the threshold, but I already covered the
>>>  inadvisability
>>>>>  of defining neighbours in that way.
>>>>>
>>>>>  Using an rtree might help, but you get hit badly if there are many
>>>>>  neighbours within the threshold you have chosen anyway.
>>>>>
>>>>>  On most 8GB hardware and modern OS, you do not have more than 3-4GB for
>>>>>  work. So something was swapping on your laptop.
>>>>>
>>>>>>  Do you have any advice on how I can go about making a neighbours list
>>>>>  with
>>>>>>  dnearneigh for 227,973 observations in a successful and efficient way?
>>>>>>  Also, would you foresee any problems in the next steps, especially
>>>  when I
>>>>>>  will be using the neighbourhood listw object as an input in fitting
>>>>>>  and
>>>>>>  predicting using the spatial lag/error models? (see code below)
>>>>>>
>>>>>>  model <- spatialreg::lagsarlm(rest_formula, data=train, train_listw)
>>>>>>  model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>>>>>>
>>>>>
>>>>>  Why would using a spatial lag model make sense? Why are you suggesting
>>>>>  this model? Do you have a behavioural argument for why only the spatially lagged
>>>>>  response should be included?
>>>>>
>>>>>  Why do you think that this is sensible? You are predicting 1000 times
>>>  for
>>>>>  each observation - this is not what the prediction methods are written
>>>>>  for. Most involve inverting an n x n matrix - did you refer to
>>>>>  Goulard et al. (2017) to get a good understanding of the underlying
>>>>>  methods?
>>>>>
>>>>>>  I think the predicting part may take some time, since my test set
>>>>>  consists
>>>>>>  of 227,973 - 214 observations = 227,759 observations.
>>>>>>
>>>>>>  Here are some solutions that I have thought of:
>>>>>>
>>>>>>  1. Interpolate the test set point data of 227,759 observations over a
>>>>>  more
>>>>>>  manageable spatial pixel dataframe with cell size of perhaps 10,000m
>>>>>>  by
>>>>>>  10,000m which would give me around 4900 points. So instead of 227,759
>>>>>>  observations, I can make the listw object based on just 4900 + 214
>>>>>  training
>>>>>>  points and predict just on 4900 observations.
>>>>>
>>>>>  But what are you trying to do? Are the observations output areas? House
>>>>>  sales? If you are not filling in missing areal units (the Goulard et
>>>>>  al.
>>>>>  case), couldn't you simply use geostatistical methods which seem to
>>>  match
>>>>>  your support better, and can be fitted and can predict using a local
>>>>>  neighbourhood? While you are doing that, you could switch to INLA with
>>>>>  SPDE, which interposes a mesh like the one you suggest. But in that
>>>  case,
>>>>>  beware of the mesh choice issue in:
>>>>>
>>>>>  https://doi.org/10.1080/03610926.2018.1536209
>>>>>
>>>>>>
>>>>>>  2. Get hold of better performance machines through cloud computing
>>>>>>  such
>>>>>  as
>>>>>>  AWS EC2 services and try running the commands and models there.
>>>>>>
>>>>>
>>>>>  What you need are methods, not wasted money on hardware as a service.
>>>>>
>>>>>>  3. Parallel computing using the parallel package from r (although I am
>>>>>  not
>>>>>>  sure whether dnearneigh can be parallelised).
>>>>>>
>>>>>
>>>>>  This could easily be implemented if it was really needed, which I don't
>>>>>  think it is; better methods understanding lets one do more with less.
>>>>>
>>>>>>  I believe option 1 would be the most manageable but I am not sure how
>>>  and
>>>>>>  by how much this would affect the accuracy of the predictions as
>>>>>>  interpolating the dataset would be akin to introducing more
>>>>>>  estimations
>>>>>  in
>>>>>>  the prediction. However, I am also grappling with the trade-off
>>>>>>  between
>>>>>>  accuracy and computation time. Hence, if options 2 and 3 can offer a
>>>>>>  reasonable computation time (1-2 hours) then I would forgo option 1.
>>>>>>
>>>>>>  What do you think? Is it possible to make a neighbourhood listw object
>>>>>  out
>>>>>>  of 227,973 observations efficiently?
>>>>>
>>>>>  Yes, but only if the numbers of neighbours are very small. Look in
>>>  Bivand
>>>>>  et al. (2013) to see the use of some fairly large n, but only with few
>>>>>  neighbours for each observation. You seem to be getting average
>>>  neighbour
>>>>>  counts in the thousands, which makes no sense.
>>>>>
>>>>>>
>>>>>>  Thank you for reading to the end! Apologies for writing a lengthy one,
>>>>>  just
>>>>>>  wanted to fully describe what I am facing, I hope I didn't miss out
>>>>>>  anything crucial.
>>>>>>
>>>>>
>>>>>  Long is OK, but there is no motivation here for why you want to make
>>>  200K
>>>>>  predictions from 200 observations with point support (?) using weights
>>>>>  matrices.
>>>>>
>>>>>  Hope this clarifies,
>>>>>
>>>>>  Roger
>>>>>
>>>>>>  Thank you so much once again!
>>>>>>
>>>>>>  jiawen
>>>>>>
>>>>>
>>>>
>>
>
>
