Dear R-Geo community:

This is perhaps a bit off topic (really a geostats question), but we've been struggling with these questions for about 6 months now, and they reared their head again when I posed a question on one of ESRI's discussion groups. I realized this community would be much more experienced with the issue and might be interested in offering advice. And it might be something others are struggling with as well.

Background: We are creating predictive habitat distribution models (a good synthesis paper is Guisan and Zimmermann 2000, Ecological Modelling 135:147-186): using known locations of a species to create a model (and a subsequent raster) that predicts the locations of additional similar habitats. Many different algorithms have been used for the actual predictive modeling (e.g. logistic regression, classification and regression trees [CART], ordination, maximum entropy), but in all cases the user needs full-coverage rasters of all the environmental variables to be used, plus known locations for the element (species) being modeled. Known locations must be attributed with values from all the environmental layers. We've been using two different algorithms, one parametric (MaxEnt - a separate package) and one non-parametric (the Random Forests implementation of CART). I'm using the R implementation of randomForest, but (embarrassingly) I'm currently doing it by dumping the data to ASCII and then reading the tables back into R (looping through them to keep the size manageable). I saw the recent post on this listserv (https://stat.ethz.ch/pipermail/r-sig-geo/2005-March/000338.html) that mentions Furlanello et al. (wow!), and someday we'll strive to integrate in a similar way.
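
In case a concrete picture helps, here is roughly what that loop looks like (a minimal sketch only; the chunk file names and the 'presence' column are placeholders, not our real layer names):

library(randomForest)

## hypothetical chunk files: one whitespace-delimited table per chunk,
## one column per environmental variable plus a presence/absence column
chunk_files <- sprintf("habitat_chunk_%02d.txt", 1:10)
train <- do.call(rbind, lapply(chunk_files, read.table, header = TRUE))

## presence/absence as a factor so randomForest does classification
train$presence <- factor(train$presence)
rf <- randomForest(presence ~ ., data = train, ntree = 500, importance = TRUE)
print(rf)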

This Furlanello paper is a good example of how we diverge, and of our resulting dilemma. Furlanello et al. (and most others, as far as I can tell, but please enlighten me if I'm wrong) begin with known POINT locations. Point locations are easy to attribute with data from your environmental layers. We have good-quality POLYGON data, and we want to be sure the environmental variability captured by each polygon is represented in the 'presence' environmental data we pass on for analysis. How best to capture that full variability? Note that one of our algorithms is parametric (MaxEnt) and one non-parametric (randomForest)... and I have no problem with deviating from traditional parametric sampling methodologies and leaving many formerly required parametric assumptions behind (or is that a big mistake?). We've thought long and hard about random sampling, regular sampling, and grabbing all the points.

We currently have about 36 environmental variables, all resampled to one cell size (30 m). That meant bringing all the rasters derived from a 10 m DEM up to 30 m and all the coarser rasters down to 30 m. I'm being transparent about breaking the 'rules' here just so you know we are aware of the issues surrounding scale. Our rasters are pretty big (22578 rows x 17160 columns; a single ASCII grid is about 2 GB).
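
Back-of-the-envelope, those dimensions are consistent with that file size (assuming roughly 5 characters per cell in the ASCII dump):

22578 * 17160             # ~387 million cells per layer
22578 * 17160 * 5 / 2^30  # roughly 1.8 GB per ASCII grid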

Again, our goal is to represent, as fully as possible, the environment within each polygon. For random (or regular) sampling: if you take a consistent number of samples for every polygon, small polygons are overweighted; if you sample on a per-area basis, larger polygons get more 'weight'. We settled on grabbing ALL the raster cells in each polygon, sticking a point in the center of each cell, and attributing those points. This obviously blew up with larger polygons (far too many cells), and we realized we'd need to go back to subsampling. I thought the best step would be to sample randomly without replacement so that I'd get as full a representation as possible (e.g. if a sample of 90 cells/points is requested, return 90 different cells/points). One problem that's been pointed out is that I'm treating rasters as discrete data when most are really a representation of continuous data, and interpolating between cell centers would be more appropriate. Unfortunately I don't have an easy way of doing this, but I'm trying to look into it.
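
For the subsampling step itself, what I have in mind is along these lines (a minimal sketch; 'cell_values' stands in for the table of environmental values attributed to every 30 m cell whose center falls inside one polygon):

## return at most n rows, sampled without replacement (sample()'s default),
## so a request for 90 cells gives 90 different cells whenever they exist
sample_polygon_cells <- function(cell_values, n = 90) {
    n <- min(n, nrow(cell_values))
    cell_values[sample(nrow(cell_values), n), , drop = FALSE]
}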

My questions:

Has anyone ever used *all* the cells in a polygon representation when attributing data for later statistical analyses?

Would sampling without replacement be an appropriate approach for returning as much within-polygon variability as possible? If so, how would you do it?

If randomly sampling, has anyone sampled relatively more cells in smaller polygons and relatively fewer cells in larger polygons, so that you reach an asymptote in the number sampled per polygon? (A rough sketch of what I mean follows the last question.)

And, related: if randomly sampling, do you tend to sample a fixed number of cells per polygon, or a per-area number of cells?
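
For the asymptote idea, the kind of rule I have in mind looks like this (n_max and k are made-up placeholder values, not anything from the literature):

## sample size grows with the number of cells in a polygon but levels off near n_max
cells_to_sample <- function(n_cells, n_max = 500, k = 200) {
    pmin(n_cells, ceiling(n_max * (1 - exp(-n_cells / k))))
}
cells_to_sample(c(10, 100, 1000, 100000))
## small polygons keep all their cells; large polygons level off near n_max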

Thank you for your time.

Tim Howard