best practice for reading large shapefiles?

best practice for reading large shapefiles?

Vinh Nguyen
Hi,

I have a very large shapefile that I would like to read into R
(dbf = 5.6 GB, shp = 2.3 GB).

For reference, I downloaded the 30 shapefiles of the [Public Land
Survey System](http://www.geocommunicator.gov/GeoComm/lsis_home/home/)
and combined them into a single national file via GDAL (ogr2ogr) as
described [here](http://www.northrivergeographic.com/ogr2ogr-merge-shapefiles).
I originally attempted to combine the files in R as described
[here](https://stat.ethz.ch/pipermail/r-sig-geo/2011-May/011814.html),
but ran out of memory about 80% of the way in; luckily I then
discovered ogr2ogr.
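For the record, the merge step was along these lines, driven from R
with system() calls (file and layer names illustrative, untested as
written here):

    ## merge the 30 state shapefiles into one national file via ogr2ogr
    shps <- list.files("plss", pattern = "\\.shp$", full.names = TRUE)
    system(sprintf("ogr2ogr merged.shp %s", shps[1]))   # seed the output file
    for (f in shps[-1])                                 # append the remaining 29
      system(sprintf("ogr2ogr -update -append merged.shp %s -nln merged", f))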

I'm reading the combined file into R via readOGR, and after more than
an hour R appears to hang.  When I check the task manager, the R
session consumes <10% CPU and 245 MB of memory, so I'm not sure whether
any productive activity is going on; I'm just waiting it out.
[This](http://r-sig-geo.2731867.n2.nabble.com/Long-time-to-load-shapefiles-td7584869.html)
thread describes that readOGR can be slow for large shapefiles and
suggests saving the resulting Spatial*DataFrame in an R format.  My
problem is getting the entire shapefile read in the first place,
before I can save it as an R object.
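For concreteness, what I'm running is roughly the following (layer
name illustrative):

    ## read the merged shapefile, then cache it as a native R object
    library(rgdal)
    plss <- readOGR(dsn = ".", layer = "merged")  # this is the step that hangs
    saveRDS(plss, "plss.rds")                     # later sessions: readRDS("plss.rds")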

Does anyone have any suggestions for reading this large shapefile into
R?  Thank you for your help.

-- Vinh

Re: best practice for reading large shapefiles?

Vinh Nguyen
Would loading the shapefile into PostgreSQL first, and then using
readOGR to read from Postgres, be a recommended approach?  That is,
would the bottleneck still occur?  Thank you.

-- Vinh


Re: best practice for reading large shapefiles?

Roger Bivand
On Tue, 26 Apr 2016, Vinh Nguyen wrote:

> Would loading the shapefile into PostgreSQL first, and then using
> readOGR to read from Postgres, be a recommended approach?  That is,
> would the bottleneck still occur?  Thank you.

Most likely, as both use the respective OGR drivers. With data this
size, you'll need a competent platform (probably Linux, say 128 GB
RAM), as everything is held in memory. I find it hard to grasp what
the point of doing this might be: visualization won't work, since none
of the considerable detail certainly present in these files will be
visible. Can you put the lot into an SQLite file and access the
attributes as SQL queries? I don't see the analysis or statistics
here.
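An untested sketch of that SQLite route, with illustrative table and
column names:

    ## one-off load from the shell: ogr2ogr -f SQLite plss.sqlite merged.shp
    library(RSQLite)
    con <- dbConnect(SQLite(), "plss.sqlite")
    res <- dbGetQuery(con,
      "SELECT * FROM merged WHERE statecode = 'CA'")  # names illustrative
    dbDisconnect(con)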

Roger


--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 91 00
e-mail: [hidden email]
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
http://depsy.org/person/434412

Re: best practice for reading large shapefiles?

Vinh Nguyen
- I can't tell from your response whether or not you are recommending
PostGIS.  Could you clarify?

- I am working on a Windows server with 64 GB RAM, so not too weak,
especially for files that are a few GB in size.  Again, I'm not sure
whether the job halted or was still running, just rather slowly; I've
killed it for now, as the memory usage still had not grown after a few
hours.

- Yes, the shapes are quite granular and many in quantity.  The use
case was not to visualize them all at once.  I wanted a master file so
that when I get a data set of interest, I could intersect the two,
subset the areas of interest (e.g., within a state or county), and
then visualize/analyze from there.  The master shapefile was meant to
make things easy (reading in one file) as opposed to deciding which
shapefile to read in depending on the project.

- I just looked back at the 30 PLSS zip files: they provide shapes at
3 levels of granularity, and I went with the smallest.  I just
realized that the mid-size one would be sufficient for now, which
results in dbf = 138 MB and shp = 501 MB.  I'm attempting to read it
in now (~30 minutes so far), and I assume it will load after some
time.  I will respond to this thread if that is not the case.

Thanks for responding Roger.

-- Vinh


Re: best practice for reading large shapefiles?

Chris Reudenbach
Vinh

Even if it might be off-topic for this list, IMHO R is not the best
tool for dealing with this amount of vector data.  I agree completely
with Roger's remarks, and regarding the "competent platform" you may
also want to think about software built for big data...

As Roger has already made clear, the recommendation of what might be
best depends highly on your questions and on the type of analysis you
need to run, and cannot be answered in a straightforward way.

I think Edzer can clarify up to which size sp objects are still
"usable"; from my experience I would guess something like 500K
polygons, 1M lines, and up to 5M points, though it depends highly on
the number of attributes.  So you are far beyond this.


If you want to deal with this amount of spatial vector data using R,
it is well worth having a look at one of the mature GIS packages like
GRASS or QGIS, which you can use from R via their APIs (see the sketch
below).  Alternatively, if you are an experienced PostGIS user, you
can easily put the data into PostgreSQL/PostGIS and perform all
operations/analysis there, using its spatial capabilities and built-in
functions.
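A rough sketch of the GRASS route via the rgrass7 package (GRASS
installation path and map names illustrative; untested):

    ## import the merged shapefile into GRASS and subset it there, not in R
    library(rgrass7)
    initGRASS("/usr/lib/grass70", home = tempdir(), override = TRUE)
    execGRASS("v.in.ogr", input = "merged.shp", output = "plss")
    execGRASS("v.select", ainput = "plss", binput = "county",
              output = "plss_sub", operator = "overlap")  # keep overlapping features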

cheers
Chris


--
Dr Christoph Reudenbach, Philipps-University of Marburg, Faculty of
Geography, GIS and Environmental Modeling, Deutschhausstr. 10, D-35032
Marburg, fon: ++49.(0)6421.2824296, fax: ++49.(0)6421.2828950, web:
gis-ma.org, giswerk.org, moc.environmentalinformatics-marburg.de


Re: best practice for reading large shapefiles?

Alex Mandel
So the trick I use is to load the vector data into PostGIS or
Spatialite, then do basic spatial filtering with SQL queries in the
database.  Once I've subset and manipulated what I want, I either
create a new table or a view with the results, then read those results
into R, roughly as sketched below.
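A rough sketch of that workflow (database, table, and geometry/column
names illustrative):

    ## one-off load from the shell: ogr2ogr -f PostgreSQL PG:dbname=gis merged.shp
    ## then subset in SQL, e.g. in psql:
    ##   CREATE VIEW plss_oc AS
    ##     SELECT m.* FROM merged m JOIN counties c
    ##       ON ST_Intersects(m.wkb_geometry, c.geom)
    ##     WHERE c.name = 'Orange';
    library(rgdal)
    plss_oc <- readOGR("PG:dbname=gis host=localhost", layer = "plss_oc")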

The bottleneck you have is likely the reading of everything into
memory in R, which usually takes more memory than the original file
size.  So changing sources won't help; only subsetting prior to
loading will.

Enjoy,
Alex

Re: best practice for reading large shapefiles?

Edzer Pebesma


On 26/04/16 22:33, Vinh Nguyen wrote:

> - I can't tell from your response whether or not you are recommending
> PostGIS.  Could you clarify?
Roger said the bottleneck would most likely still occur, but couldn't
make much of a recommendation because you had not revealed the purpose
of reading this data into R.

>
> - I am working on a Windows server with 64 GB RAM, so not too weak,
> especially for files that are a few GB in size.  Again, I'm not sure
> whether the job halted or was still running, just rather slowly; I've
> killed it for now, as the memory usage still had not grown after a few
> hours.

Reports that certain things do not work are often helpful and lead to
improvements in the software.  With your report, however, we can't
really do much.

>
> - Yes, the shapes are quite granular and many in quantity.  The use
> case was not to visualize them all at once.  I wanted a master file so
> that when I get a data set of interest, I could intersect the two,
> subset the areas of interest (e.g., within a state or county), and
> then visualize/analyze from there.  The master shapefile was meant to
> make things easy (reading in one file) as opposed to deciding which
> shapefile to read in depending on the project.

Using PostGIS for this use case may make sense, since PostGIS creates
and stores spatial indexes with its geometry data and does everything
in the database rather than in memory.  In R, you'd probably do
intersections with rgeos::gIntersects, which creates a spatial index
on the fly but doesn't store it.  Only experimentation can tell you
the magnitude of the difference.
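For instance, a minimal sketch of the in-memory route (object names
illustrative; plss is the big SpatialPolygonsDataFrame, aoi a single
county polygon):

    library(rgeos)
    hits <- gIntersects(plss, aoi, byid = TRUE)  # logical matrix, one entry per plss geometry
    plss_sub <- plss[as.vector(hits), ]          # keep only the intersecting features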

>
> - I just looked back at the 30 PLSS zip files: they provide shapes at
> 3 levels of granularity, and I went with the smallest.  I just
> realized that the mid-size one would be sufficient for now, which
> results in dbf = 138 MB and shp = 501 MB.  I'm attempting to read it
> in now (~30 minutes so far), and I assume it will load after some
> time.  I will respond to this thread if that is not the case.
>
(see my 2nd comment)

Best regards,
--
Edzer Pebesma
Institute for Geoinformatics  (ifgi),  University of Münster
Heisenbergstraße 2, 48149 Münster, Germany; +49 251 83 33081
Journal of Statistical Software:   http://www.jstatsoft.org/
Computers & Geosciences:   http://elsevier.com/locate/cageo/
Spatial Statistics Society http://www.spatialstatistics.info

