parallelize projectRaster()

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

parallelize projectRaster()

Yerguner
Hi All,

Below sample script do the job with a small multi-layered netcdf file (200M) but it takes ~8 hrs to complete. I need to do the same process with a large netcdf file(5.5 GB -that I cut this sample file from). I am wondering if there is way to parallelize this process using beginCluster()-counting on the build-in capacity of projectRaster()- OR using doParallel().

Thanks,
Yasemin

library(raster)
library(ncdf4)
library(rgdal)
rasterOptions(tmpdir=“/my_tmp_path/“)
ifile <- "cold2002_s1.nc"
colds1 <- brick(ifile)

> NAvalue(colds1)
[1] -9999
colds1[colds1 == -9999] <- NA
> NAvalue(colds1)
[1] -1.7e+308

newproj <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"
prjs1<- projectExtent(colds1, newproj)
res(prjs1) <- c(231.6564, 231.6563)
prjs1 <- projectRaster(colds1, prjs1, method='ngb')
ifile2 <- "/mod2002.tif"
mod02 <- raster(ifile2)
submod02 <- crop(mod02,prj1)
bb <- extent(581811.7, 791924.1, 347931.5, 454493.4)
extent(submod02) <- bb
submod02f <- setExtent(submod02, bb, keepres=FALSE)
stackSelect(prj1,submod02f,filename="modcold02.tif",options="INTERLEAVE=BAND")
Reply | Threaded
Open this post in threaded view
|

Re: parallelize projectRaster()

Jonathan Greenberg-3
Yerguner:

You might want to consider using gdalUtils
(http://cran.r-project.org/web/packages/gdalUtils/index.html) --
reprojecting via gdal warp is really fast and you can even get it
cranking on multiple processors.

--j

On Wed, Jun 18, 2014 at 12:24 AM, Yerguner <[hidden email]> wrote:

> Hi All,
>
> Below sample script do the job with a small multi-layered netcdf file (200M)
> but it takes ~8 hrs to complete. I need to do the same process with a large
> netcdf file(5.5 GB -that I cut this sample file from). I am wondering if
> there is way to parallelize this process using beginCluster()-counting on
> the build-in capacity of projectRaster()- OR using doParallel().
>
> Thanks,
> Yasemin
>
> library(raster)
> library(ncdf4)
> library(rgdal)
> rasterOptions(tmpdir=“/my_tmp_path/“)
> ifile <- "cold2002_s1.nc"
> colds1 <- brick(ifile)
>
>> NAvalue(colds1)
> [1] -9999
> colds1[colds1 == -9999] <- NA
>> NAvalue(colds1)
> [1] -1.7e+308
>
> newproj <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997
> +b=6370997 +units=m +no_defs"
> prjs1<- projectExtent(colds1, newproj)
> res(prjs1) <- c(231.6564, 231.6563)
> prjs1 <- projectRaster(colds1, prjs1, method='ngb')
> ifile2 <- "/mod2002.tif"
> mod02 <- raster(ifile2)
> submod02 <- crop(mod02,prj1)
> bb <- extent(581811.7, 791924.1, 347931.5, 454493.4)
> extent(submod02) <- bb
> submod02f <- setExtent(submod02, bb, keepres=FALSE)
> stackSelect(prj1,submod02f,filename="modcold02.tif",options="INTERLEAVE=BAND")
>
>
>
> --
> View this message in context: http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598.html
> Sent from the R-sig-geo mailing list archive at Nabble.com.
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo



--
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL  61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: [hidden email], Gchat: jgrn307, Skype: jgrn3007

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: parallelize projectRaster()

Yerguner
Hi Jonathan,

Thanks a lot for your reply. I've already tried your gdalUtils() package but it creates a empty .tif file (all zero values) and continues to run several (>14) hrs. May be I'm not using the function correctly:

library(raster)
rasterOptions(tmpdir="/pegasus/yb3/modvar2000/")
library(ncdf4)
library(rgdal)
library(gdalUtils)
gdal_setInstallation()
ifile2 <- "cold00_s2.nc"
newproj <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs"
gdalwarp(srcfile= ifile2, dstfile= "ohey_file.tif", t_srs= newproj, output_Raster=TRUE, multi=TRUE, tr=c(231.6564, 231.6563),r="near")
class       : RasterBrick
dimensions  : 2049, 4417, 9050433, 365  (nrow, ncol, ncell, nlayers)
resolution  : 231.6564, 231.6563  (x, y)
extent      : -1075057, -51831.17, 241865.9, 716529.7  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
data source : /pegasus/yb3/daymet2000/ohey_file.tif
names       : ohey_file.1, ohey_file.2, ohey_file.3, ohey_file.4, ohey_file.5, ohey_file.6, ohey_file.7, ohey_file.8, ohey_file.9, ohey_file.10, ohey_file.11, ohey_file.12, ohey_file.13, ohey_file.14, ohey_file.15, ...

Thanks,
Yasemin

Reply | Threaded
Open this post in threaded view
|

Re: parallelize projectRaster()

Jonathan Greenberg-3
Yerguner:

Have you tried first converting the NetCDF to another "friendlier"
raster format (netcdf isn't really a raster format, keep in mind)--j
first via gdal_translate?  Then you can follow up with a gdalwarp.
Keep in mind gdalUtils is merely a wrapper for the GDAL Utilities --
from their documentation on NetCDF
(http://www.gdal.org/frmt_netcdf.html):

"If the file contains only one netCDF array which appears to be an
image, it may be accessed directly, but if the file contains multiple
images it may be necessary to import the file via a two step process.

The first step is to get a report of the components images (dataset)
in the file using gdalinfo, and then to import the desired images
using gdal_translate. The gdalinfo utility lists all multidimensional
subdatasets from the input netCDF file.

The name of individual images are assigned to the SUBDATASET_n_NAME
metadata item. The description for each image is found in the
SUBDATASET_n_DESC metadata item. For netCDF images will follow this
format: NETCDF:filename:variable_name"



On Wed, Jun 18, 2014 at 10:44 AM, Yerguner <[hidden email]> wrote:

> Hi Jonathan,
>
> Thanks a lot for your reply. I've already tried your gdalUtils() package but
> it creates a empty .tif file (all zero values) and continues to run several
> (>14) hrs. May be I'm not using the function correctly:
>
> library(raster)
> rasterOptions(tmpdir="/pegasus/yb3/modvar2000/")
> library(ncdf4)
> library(rgdal)
> library(gdalUtils)
> gdal_setInstallation()
> ifile2 <- "cold00_s2.nc"
> newproj <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997
> +b=6370997 +units=m +no_defs"
> gdalwarp(srcfile= ifile2, dstfile= "ohey_file.tif", t_srs= newproj,
> output_Raster=TRUE, multi=TRUE, tr=c(231.6564, 231.6563),r="near")
> class       : RasterBrick
> dimensions  : 2049, 4417, 9050433, 365  (nrow, ncol, ncell, nlayers)
> resolution  : 231.6564, 231.6563  (x, y)
> extent      : -1075057, -51831.17, 241865.9, 716529.7  (xmin, xmax, ymin,
> ymax)
> coord. ref. : +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997
> +b=6370997 +units=m +no_defs
> data source : /pegasus/yb3/daymet2000/ohey_file.tif
> names       : ohey_file.1, ohey_file.2, ohey_file.3, ohey_file.4,
> ohey_file.5, ohey_file.6, ohey_file.7, ohey_file.8, ohey_file.9,
> ohey_file.10, ohey_file.11, ohey_file.12, ohey_file.13, ohey_file.14,
> ohey_file.15, ...
>
> Thanks,
> Yasemin
>
>
>
>
>
> --
> View this message in context: http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598p7586604.html
> Sent from the R-sig-geo mailing list archive at Nabble.com.
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo



--
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL  61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: [hidden email], Gchat: jgrn307, Skype: jgrn3007

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: parallelize projectRaster()

Yerguner
Hi Jonathan,

Believe me I tried that and all other different combinations but starting with the gdal_translate step only adds extra hrs to finish the whole process. Actually, when I run the below gdalwarp commanline outside R, it completes the same small netcdf file in ~ 2hrs but I just do the reprojection and then do the resampling step in R in ~25 min. So, this was the my first approach alternative to projectRaster() approach and 2,5 hrs is better than 8 hrs. But again, unless there is a solution to speed up the process (at least the reprojection part), both approaches do not work my case. That's why I'm looking for way to do this all process in R to be able to parallelize the functions as for the first approach. And I have also many large files (each about the same size 5.5 GB) to repeat this process. So, both the file sizes are large and many, in addition to doParallel(), foreach() package might also be a solution?

gdalwarp and resample() approach:

gdalwarp -t_srs "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +no_defs +a=6370997 +b=6370997 +to_meter=1.0" NETCDF:"colds1.nc":tmin colds1.tif      ## ~2 hrs

library(raster)
library(rgdal)
rasterOptions(tmpdir="some_tmp_path")
source_file1 <- "mod2000.tif"
source_file2 <- "colds1.tif"
mod00 <- raster(source_file1)
cold00s1 <- brick(source_file2)
submod00 <- crop(mod00,cold00s1)
rsmp1 <- resample(cold00s1,submod00,method="ngb")     ## 25 min
stackSelect (rsmp1,submod00,filename="modcold00.tif",options="INTERLEAVE=BAND")

Thank you so much and looking forward to hearing from you,
Yasemin
Reply | Threaded
Open this post in threaded view
|

Re: parallelize projectRaster()

Jonathan Greenberg-3
Can you make your test file available via google drive or dropbox or
something?  I'm really surprised GDAL is taking more time than the R
version.  It is hard to replicate your example without the test file.

As an FYI, to enable parallel processing of a single gdalwarp, use
(from: http://lists.osgeo.org/pipermail/gdal-dev/2012-June/033084.html):

gdalwarp -multi src.tif dst.tif -wo NUM_THREADS=ALL_CPUS

You can use these options from gdalUtils if you'd like.

--j

On Wed, Jun 18, 2014 at 2:32 PM, Yerguner <[hidden email]> wrote:

> Hi Jonathan,
>
> Believe me I tried that and all other different combinations but starting
> with the gdal_translate step only adds extra hrs to finish the whole
> process. Actually, when I run the below gdalwarp commanline outside R, it
> completes the same small netcdf file in ~ 2hrs but I just do the
> reprojection and then do the resampling step in R in ~25 min. So, this was
> the my first approach alternative to projectRaster() approach and 2,5 hrs is
> better than 8 hrs. But again, unless there is a solution to speed up the
> process (at least the reprojection part), both approaches do not work my
> case. That's why I'm looking for way to do this all process in R to be able
> to parallelize the functions as for the first approach. And I have also many
> large files (each about the same size 5.5 GB) to repeat this process. So,
> both the file sizes are large and many, in addition to doParallel(),
> foreach() package might also be a solution?
>
> gdalwarp and resample() approach:
>
> gdalwarp -t_srs "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +no_defs
> +a=6370997 +b=6370997 +to_meter=1.0" NETCDF:"colds1.nc":tmin colds1.tif
> ## ~2 hrs
>
> library(raster)
> library(rgdal)
> rasterOptions(tmpdir="some_tmp_path")
> source_file1 <- "mod2000.tif"
> source_file2 <- "colds1.tif"
> mod00 <- raster(source_file1)
> cold00s1 <- brick(source_file2)
> submod00 <- crop(mod00,cold00s1)
> rsmp1 <- resample(cold00s1,submod00,method="ngb")     ## 25 min
> stackSelect
> (rsmp1,submod00,filename="modcold00.tif",options="INTERLEAVE=BAND")
>
> Thank you so much and looking forward to hearing from you,
> Yasemin
>
>
>
>
> --
> View this message in context: http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598p7586608.html
> Sent from the R-sig-geo mailing list archive at Nabble.com.
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo



--
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL  61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: [hidden email], Gchat: jgrn307, Skype: jgrn3007

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Reply | Threaded
Open this post in threaded view
|

Re: parallelize projectRaster()

Michael Sumner-2
In reply to this post by Yerguner
Do you really need to warp this file? There may be better options to
improve the workflow if you are only warping as part of another goal.  If
warping is the goal then carry on and ignore me.
There are myriad excellent ways of extracting data from NetCDF in R, and
reprojecting a grid should be a last resort.

Cheers, Mike.


On Thu, Jun 19, 2014 at 5:32 AM, Yerguner <[hidden email]> wrote:

> Hi Jonathan,
>
> Believe me I tried that and all other different combinations but starting
> with the gdal_translate step only adds extra hrs to finish the whole
> process. Actually, when I run the below gdalwarp commanline outside R, it
> completes the same small netcdf file in ~ 2hrs but I just do the
> reprojection and then do the resampling step in R in ~25 min. So, this was
> the my first approach alternative to projectRaster() approach and 2,5 hrs
> is
> better than 8 hrs. But again, unless there is a solution to speed up the
> process (at least the reprojection part), both approaches do not work my
> case. That's why I'm looking for way to do this all process in R to be able
> to parallelize the functions as for the first approach. And I have also
> many
> large files (each about the same size 5.5 GB) to repeat this process. So,
> both the file sizes are large and many, in addition to doParallel(),
> foreach() package might also be a solution?
>
> gdalwarp and resample() approach:
>
> gdalwarp -t_srs "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +no_defs
> +a=6370997 +b=6370997 +to_meter=1.0" NETCDF:"colds1.nc":tmin colds1.tif
> ## ~2 hrs
>
> library(raster)
> library(rgdal)
> rasterOptions(tmpdir="some_tmp_path")
> source_file1 <- "mod2000.tif"
> source_file2 <- "colds1.tif"
> mod00 <- raster(source_file1)
> cold00s1 <- brick(source_file2)
> submod00 <- crop(mod00,cold00s1)
> rsmp1 <- resample(cold00s1,submod00,method="ngb")     ## 25 min
> stackSelect
> (rsmp1,submod00,filename="modcold00.tif",options="INTERLEAVE=BAND")
>
> Thank you so much and looking forward to hearing from you,
> Yasemin
>
>
>
>
> --
> View this message in context:
> http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598p7586608.html
> Sent from the R-sig-geo mailing list archive at Nabble.com.
>
> _______________________________________________
> R-sig-Geo mailing list
> [hidden email]
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>



--
Michael Sumner
Software and Database Engineer
Australian Antarctic Division
Hobart, Australia
e-mail: [hidden email]

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo