spatial autocorrelation in GAM residuals for large data set


Elizabeth Webb
Hello,

I have a large data set (~100k rows) containing observations at points (MODIS pixels) across the northern hemisphere. I have fitted a GAM using the bam() function in mgcv and would like to check the model residuals for spatial autocorrelation.

One idea is to use the DHARMa package (https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html#spatial-autocorrelation).  The code looks something like this:

    simulationOutput <- simulateResiduals(fittedModel = mymodel)  # point at which R runs into memory problems
    testSpatialAutocorrelation(simulationOutput = simulationOutput, x = data$latitude, y = data$longitude)

However, this runs into memory problems. 

Another idea is to use the following code, after this tutorial (http://www.flutterbys.com.au/stats/tut/tut8.4a.html):
    library(ape)
    library(fields)
    coords <- cbind(data$longitude, data$latitude)
    w <- rdist(coords)  # dense 100,000 x 100,000 distance matrix (~80 GB of doubles); point at which R runs into memory problems
    Moran.I(x = residuals(mymodel), w = w)

But this also runs into memory problems.  I have tried increasing the amount of memory allotted to R, but that just means R works for longer before timing out.  

So, two questions: (1) Is there a memory efficient way to check for spatial autocorrelation using Moran's I in large data sets? or (2) Is there another way to check for spatial autocorrelation (besides Moran's I) that won't have such memory problems?

Thanks in advance,

Elizabeth

Re: spatial autocorrelation in GAM residuals for large data set

Roger Bivand
On Tue, 20 Aug 2019, Elizabeth Webb wrote:

> Hello,
>
> I have a large data set (~100k rows) containing observations at points
> (MODIS pixels) across the northern hemisphere.  I have created a GAM
> using the bam command in mgcv and I would like to check the model
> residuals for spatial autocorrelation.
>
> One idea is to use the DHARMa package
> (https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html#spatial-autocorrelation).
> The code looks something like this:
>
>     simulationOutput  <-   simulateResiduals(fittedModel = mymodel) # point at which R runs into memory problems
>     testSpatialAutocorrelation(simulationOutput = simulationOutput, x =  data$latitude, y= data$longitude)
>
> However, this runs into memory problems.
>
> Another idea is to use the following code, after this tutorial
> (http://www.flutterbys.com.au/stats/tut/tut8.4a.html):
>     library(ape)
>     library(fields)
>     coords = cbind(data$longitude, data$latitude)     
>    w = rdist(coords)  # point at which R runs into memory problems
>     Moran.I(x = residuals(mymodel), w = w)
>
> But this also runs into memory problems.  I have tried increasing the
> amount of memory allotted to R, but that just means R works for longer
> before timing out.
I do hope that you read

https://cran.r-project.org/web/packages/ape/vignettes/MoranI.pdf

first, because the approach used in ape has been revised.

The main problem is that ape uses a dense square weights matrix by default, and it
is uncertain whether sparse matrices are accepted. This means that completely
unneeded computations are carried out; dense matrices should never be used unless
there is a convincing scientific argument (see
https://edzer.github.io/UseR2019/part2.html#exercise-review-1 for a discussion of
why distance matrices are wasteful when edge counts on a graph do the same thing
sparsely).

Use one of the approaches described in the tutorial and you may be OK, but you
should not trust the outcome of Moran's I on residuals without using an
appropriate variant. If you can represent your GAM as a linear model with, say,
spline terms, you can use Moran's I for regression residuals. Take care that the
average number of neighbours is very small (6-10); then large numbers of
observations should not be a problem.
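
A minimal sketch of this sparse-neighbour approach, assuming the data frame data
and fitted model mymodel from your post (the choice of k = 8 neighbours and the
use of great-circle distances are illustrative assumptions only):

    library(spdep)
    # build a sparse k-nearest-neighbour graph instead of a dense distance matrix
    coords <- cbind(data$longitude, data$latitude)
    knn <- knearneigh(coords, k = 8, longlat = TRUE)  # ~8 neighbours per point
    nb <- knn2nb(knn)
    lw <- nb2listw(nb, style = "W")                   # row-standardised weights
    # Moran's I on the GAM residuals; indicative only, see the caveat below
    moran.test(residuals(mymodel), listw = lw, randomisation = FALSE)

With only a handful of neighbours per point, the weights object stays small even
for ~100k observations.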

A larger problem is that Moran's I (also for residuals) responds to
mis-specifications other than spatial autocorrelation, in particular missing
variables and spatial processes operating at a different scale from the chosen
units of observation.


>
> So, two questions: (1) Is there a memory efficient way to check for
> spatial autocorrelation using Moran's I in large data sets? or (2) Is
> there another way to check for spatial autocorrelation (besides Moran's
> I) that won't have such memory problems?

1) Yes, see above: do not use dense matrices.

2) Consider a higher-level MRF (Markov random field) smooth term in your GAM for
aggregates of your observations, if such aggregation makes sense for your data;
see the sketch below.
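
A minimal sketch of such an MRF term, assuming a hypothetical factor region that
aggregates your pixels into larger cells and a named neighbour list region_nb for
those regions (the response and covariate names are placeholders, not from your
model):

    library(mgcv)
    # region must be a factor; names(region_nb) must match its levels
    mymodel2 <- bam(response ~ s(covariate) +
                      s(region, bs = "mrf", xt = list(nb = region_nb), k = 50),
                    data = data, method = "fREML")

The MRF penalty encodes the regional neighbour structure directly in the model,
so spatial structure at that scale is absorbed rather than tested for afterwards.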

Hope this clarifies,

Roger

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

Re: spatial autocorrelation in GAM residuals for large data set

Elizabeth Webb
Thank you, Roger, for your help. A quick follow-up:

What do you mean when you say "Use one of the approaches described in the tutorial and you may be OK, but you should not trust the outcome of Moran's I on residuals without using an appropriate variant"? In other words, what is an appropriate variant in this context?

Elizabeth

Re: spatial autocorrelation in GAM residuals for large data set

Roger Bivand
On Wed, 21 Aug 2019, Elizabeth Webb wrote:

> Thank you, Roger for your help.   A quick follow-up:
>
> What do you mean when you say "Use one of the approaches described in
> the tutorial and you may be OK, but you should not trust the outcome of
> Moran's I on residuals without using an appropriate variant." ?  Or in
> other words, what is an appropriate variant in this context?

From Cliff & Ord (1973), we know that using the standard version of Moran's I on
OLS residuals is flawed; they therefore proposed an alternative that takes into
account the fact that the fitted values (needed to calculate the residuals)
depend on the fitted model. See the differences that appear between running
spdep::moran.test(..., randomisation = FALSE) and lm.morantest(mod, ...) as one
adds covariates in addition to the intercept: with just the intercept the two are
identical, but as covariates are added they diverge.
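
A minimal sketch of that comparison, assuming a hypothetical data frame dat with
response z, covariates x1 and x2, and a sparse listw weights object lw (all names
here are illustrative):

    library(spdep)
    mod0 <- lm(z ~ 1, data = dat)        # intercept only
    mod1 <- lm(z ~ x1 + x2, data = dat)  # with covariates
    # naive test on residuals vs. the Cliff & Ord regression-residual variant
    moran.test(residuals(mod1), listw = lw, randomisation = FALSE)
    lm.morantest(mod1, listw = lw)
    # for mod0 the two agree; for mod1 they diverge as covariates enter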

As of now, appropriate tests are not available for models other than OLS, so one
cannot take the results as more than indicative. There has been work extending
this from lm() to glm(), because the right-hand side may be seen as linear in the
covariates, but there we do not have control of the distribution of the
residuals.

The problem is not widely discussed because it is intractable.

Had your pixels formed a regular rectangle, it might have been possible to use a
Moran eigenvector approach, as such approaches can use analytical eigenvectors,
but I am not aware of such work. Proponents of Moran eigenvectors claim that
their use with larger data sets is possible; I do not know what data sizes are
workable in the spmoran package using ME.

Hope this helps,

Roger

--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: [hidden email]
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en