Subsetting dataframe by all factor levels

Rich Shepard
   I need to learn geospatial analyses in R to complement my GIS knowledge.
I've just re-read the subsetting chapter in Hadley's 'Advanced R' without
seeing how to create, in one step, separate data frames by extracting all
rows for each site name in the parent data frame. I believe that what I
need to do is create a list of the factor levels and feed them to a loop
that subsets each to a new data frame. Perhaps there's a better way unknown
to me, and I need advice, suggestions, and recommendations on how to proceed.

   The inclusive data frame has this structure:

str(rainfall)
'data.frame': 113569 obs. of  6 variables:
  $ name    : Factor w/ 58 levels "Blazed Alder",..: 20 20 20 20 20 20 20 ...
  $ easting : num  2370575 2370575 2370575 2370575 2370575 ...
  $ northing: num  199338 199338 199338 199338 199338 ...
  $ elev    : num  228 228 228 228 228 228 228 228 228 228 ...
  $ sampdate: Date, format: "2005-01-01" "2005-01-02" ...
  $ prcp    : num  0.59 0.08 0.1 0 0 0.02 0.05 0.1 0 0.02 ...

   My goal is to use the monthly mean rainfall at each of the 58 reporting
stations to interpolate/extrapolate rainfall over the entire county for
selected years to show variability. The data points are not evenly
distributed but clustered in more populated areas and dispersed in rural
areas. My geochemical data typically are like this and I need to also learn
how this distribution affects how the data are analyzed.

TIA,

Rich

_______________________________________________
R-sig-Geo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Re: Subsetting dataframe by all factor levels

jhilliard
Hi Rich,

For the sake of example, here's a solution for a simple aggregation.

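# Mean of every numeric column per station; you're left with 58 rows.
# (The factor and Date columns are left out, since mean() of a factor is NA.)
> aggregate(rainfall[, c("easting", "northing", "elev", "prcp")],
+           list(name = rainfall$name), mean)
# In case you only want to aggregate over selected columns, e.g. prcp alone:
> aggregate(rainfall["prcp"], list(name = rainfall$name), mean)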
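
The same per-station aggregation can also be written with the formula interface of aggregate(). This is only a minimal sketch on a small made-up data frame; the station names "A" and "B" and all values are invented for illustration and are not from the rainfall data.

```r
# Toy stand-in for the rainfall data frame; all values are invented.
toy <- data.frame(
  name = factor(c("A", "A", "B", "B")),
  elev = c(100, 100, 250, 250),
  prcp = c(0.2, 0.4, 0.0, 0.1)
)

# Mean precipitation per station: one row per level of 'name'.
station_means <- aggregate(prcp ~ name, data = toy, FUN = mean)
print(station_means)
```

Note that the formula interface drops rows with NA in the variables used, so missing precipitation values are skipped rather than propagated.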


I am assuming you want rows with every combination of year and station with
their average precipitations. To aggregate it in that way you will need to
create a new column that represents the year (or month/year if the data are
appropriate for that resolution).

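# This does the aggregation (it assumes a 'year' column has already been added).
> rainfall.year <- with(rainfall, tapply(prcp, list(name, year), mean))
# However, tapply() gives you a "wide" station-by-year matrix. This makes it
# "long", as you probably want it, with columns name, year and prcp.
> rainfall.year <- as.data.frame(as.table(rainfall.year), responseName = "prcp")
> names(rainfall.year)[1:2] <- c("name", "year")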

A for loop option.

for (i in levels(rainfall.year$year)) {
  print(i)
  print(mean(rainfall.year$prcp[rainfall.year$year == i]))
}

The loop prints the mean rainfall per year, averaged over stations. It
assumes the long data frame has a factor column named year and a numeric
column named prcp; substitute your own column names or numbers if they differ.
Try running that loop to see that it is properly looping through the factor
you want and then stick in the interpolation function.

I hope that helps!

Cheers,
Justin

Re: Subsetting dataframe by all factor levels

Rich Shepard
In reply to this post by Rich Shepard
On Fri, 14 Sep 2018, Phil Radtke wrote:

> would something as simple as this do what you need?
> rainfall_by_site <- split(rainfall,rainfall$name)

Phil,

   Yes, it certainly would do the job.
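
For readers following along, here is a minimal sketch of what split() returns and how the pieces can be used afterwards. The toy data frame and its station names are invented for illustration; only split(), names(), and sapply() from base R are used.

```r
# Toy stand-in for the rainfall data frame; values are invented.
toy <- data.frame(
  name = factor(c("A", "B", "A", "B", "C")),
  prcp = c(0.2, 0.0, 0.4, 0.1, 0.3)
)

# One data frame per factor level, collected in a named list.
by_site <- split(toy, toy$name)
names(by_site)              # "A" "B" "C"

# Pull out a single station's data frame by name.
site_a <- by_site[["A"]]

# Apply the same function to every station without writing a loop.
site_means <- sapply(by_site, function(d) mean(d$prcp))
```

Keeping the per-station data frames in one list, rather than as 58 separate objects, makes it straightforward to process them all with lapply() or sapply().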

> I found this on stackoverflow by the web search: R create separate data
> frame based on factor
> https://stackoverflow.com/questions/9713294/split-data-frame-based-on-levels-of-a-factor-into-new-data-frames

   Either I missed this hit when I tried the same search string or DuckDuckGo
missed it.

   Thanks very much.

Regards,

Rich

Re: Subsetting dataframe by all factor levels

jhilliard
In reply to this post by jhilliard
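One more thing. Let's still assume you want to interpolate it yearly. The
code below assigns each year's output to its own named object during the loop.

for (i in levels(rainfall.year$year)) {
  # interpolation_function() is a placeholder for your interpolation call.
  # Putting the year last keeps the object name syntactically valid.
  assign(paste("interpolation_output", i, sep = "_"),
         interpolation_function())
}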
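
An alternative sketch, not from the thread, is to collect the per-year results in a named list instead of creating separate objects with assign(). Here interpolate_year() is a hypothetical placeholder that simply averages precipitation, and the toy data are invented.

```r
# Hypothetical placeholder standing in for a real interpolation routine.
interpolate_year <- function(d) mean(d$prcp)

# Toy long-format data: one row per station and year; values are invented.
toy_year <- data.frame(
  name = factor(c("A", "B", "A", "B")),
  year = factor(c("2005", "2005", "2006", "2006")),
  prcp = c(0.3, 0.1, 0.5, 0.2)
)

# Split by year, then apply the function to each year's rows.
results <- lapply(split(toy_year, toy_year$year), interpolate_year)

# Results are retrieved by year name.
results[["2005"]]
```

A list keeps the outputs together and avoids having to reconstruct object names with get() later.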


Cheers,
Justin

Re: Subsetting dataframe by all factor levels

Rich Shepard
On Fri, 14 Sep 2018, Justin H. wrote:

> One more thing. Let's still assume you want to interpolate it yearly. The
> code below assigns each year's output to its own named object during the loop.
>
> for (i in levels(rainfall.year$year)) {
>   assign(paste("interpolation_output", i, sep = "_"),
>          interpolation_function())
> }

Justin,

   Thanks again.

Regards,

Rich

Re: Subsetting dataframe by all factor levels

Rich Shepard
In reply to this post by jhilliard
On Fri, 14 Sep 2018, Justin H. wrote:

> For the sake of example, here's a solution for a simple aggregation.
> rainfall.year<-with(rainfall, tapply(prcp, list(name, year), mean))  #This
> does the aggregation.

Justin,

   The results have sites with 'NA' for the mean precipitation despite having
recorded values on some days. For example,

> rainfall.year
                                            0
Blazed Alder                              NA

while

> head(rainfall_by_site[[1]])
              name easting northing   elev   sampdate prcp
7741 Blazed Alder 2393589 196840.8 1112.5 2005-01-01  0.2
7742 Blazed Alder 2393589 196840.8 1112.5 2005-01-02  0.2
7743 Blazed Alder 2393589 196840.8 1112.5 2005-01-03  0.4
7744 Blazed Alder 2393589 196840.8 1112.5 2005-01-04  0.0
7745 Blazed Alder 2393589 196840.8 1112.5 2005-01-05  0.0
7746 Blazed Alder 2393589 196840.8 1112.5 2005-01-06  0.0

   I'll read up on tapply and see if I can identify the reason for the
discrepancy.
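
One possible explanation, not confirmed anywhere in this thread, is that the station has missing precipitation values on some days: mean() returns NA as soon as any input is NA unless na.rm = TRUE is passed. The sketch below uses invented values to show the difference when the extra argument is handed through tapply().

```r
# Toy data with one missing daily value for station "A"; values are invented.
toy <- data.frame(
  name = factor(c("A", "A", "B", "B")),
  year = c(2005, 2005, 2005, 2005),
  prcp = c(0.2, NA, 0.1, 0.3)
)

# Without na.rm, a single NA makes the whole group mean NA.
with_na <- with(toy, tapply(prcp, list(name, year), mean))

# Extra arguments to tapply() are passed on to mean().
no_na <- with(toy, tapply(prcp, list(name, year), mean, na.rm = TRUE))

print(with_na)
print(no_na)
```

A quick check on the real data would be sum(is.na(rainfall$prcp)), which counts the missing precipitation values.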

Regards,

Rich

Re: Subsetting dataframe by all factor levels

Rich Shepard
On Fri, 14 Sep 2018, Justin H. wrote:

> I'm not sure how it handles your date format. It's probably grouping
> things weirdly. Try splitting that column into two columns, one for
> year and one for month; it should play nicer then. I'd have to tinker
> with it later as I can't think of code off the top of my head.

Justin,

   The rainfall data.frame structure:
'data.frame':   113569 obs. of  6 variables:
  $ name    : Factor w/ 58 levels "Blazed Alder",..: 20 20 20 20 20 20 20 20 ...
  $ easting : num  2370575 2370575 2370575 2370575 2370575 ...
  $ northing: num  199338 199338 199338 199338 199338 ...
  $ elev    : num  228 228 228 228 228 228 228 228 228 228 ...
  $ sampdate: Date, format: "2005-01-01" "2005-01-02" ...
  $ prcp    : num  0.59 0.08 0.1 0 0 0.02 0.05 0.1 0 0.02 ...

After splitting by name (only the first one shown):
str(rainfall_by_site)
List of 58
  $ Blazed Alder                 :'data.frame':      4900 obs. of  6 variables:
   ..$ name    : Factor w/ 58 levels "Blazed Alder",..: 1 1 1 1 1 1 1 1 1 1 ...
   ..$ easting : num [1:4900] 2393589 2393589 2393589 2393589 2393589 ...
   ..$ northing: num [1:4900] 196841 196841 196841 196841 196841 ...
   ..$ elev    : num [1:4900] 1112 1112 1112 1112 1112 ...
   ..$ sampdate: Date[1:4900], format: "2005-01-01" "2005-01-02" ...
   ..$ prcp    : num [1:4900] 0.2 0.2 0.4 0 0 0 0.1 0.1 0.1 0.2 ...

Adding a year column to the end:
     $ year    : num  0 0 0 0 0 0 0 0 0 0 ...

I've not separated the sampdate column into years and months; I can, and
that might make the difference. I'll try to find time this weekend to do so.
Otherwise, it'll be next week.
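
A minimal sketch of separating a Date column into year and month with base R's format(); the three dates are invented for illustration. A year column built this way holds the calendar years rather than zeros.

```r
# Toy dates standing in for rainfall$sampdate.
sampdate <- as.Date(c("2005-01-01", "2005-02-15", "2006-12-31"))

# format() extracts date parts as character; convert them to integers.
year  <- as.integer(format(sampdate, "%Y"))
month <- as.integer(format(sampdate, "%m"))

print(data.frame(sampdate, year, month))
```

On the full data frame the same idea would read rainfall$year <- as.integer(format(rainfall$sampdate, "%Y")), and likewise for the month.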

Regards,

Rich

Re: Subsetting dataframe by all factor levels

Gabriel Gaona
Another option is the tidyverse way. I assume that when you say "My goal is
to use the monthly mean rainfall at each of the 58 reporting stations..." you
mean you want, for each year and station, an average value of monthly prcp.
From your input data frame, it would be something like:

library(dplyr)
library(lubridate)
rainfall_yearly <- rainfall %>%
    # Round dates down to the first of the month for grouping
    group_by(name, Month = floor_date(sampdate, "month")) %>%
    # Total precipitation per station and month
    summarise(prcp = sum(prcp, na.rm = TRUE)) %>%
    # Round dates down to the first of the year for grouping
    group_by(name, Year = floor_date(Month, "year")) %>%
    # Mean of the monthly totals per station and year
    summarise(prcp = mean(prcp, na.rm = TRUE))
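
For comparison, the same two-step calculation (monthly totals, then the mean of those totals per year) can be sketched in base R with aggregate(). The toy data frame is invented, and the year and month columns are helper columns added here for illustration.

```r
# Toy stand-in for the rainfall data frame; values are invented.
toy <- data.frame(
  name = factor(rep("A", 4)),
  sampdate = as.Date(c("2005-01-01", "2005-01-02", "2005-02-01", "2006-01-01")),
  prcp = c(0.2, 0.4, 0.1, 0.5)
)

# Helper columns derived from the Date column.
toy$year  <- as.integer(format(toy$sampdate, "%Y"))
toy$month <- as.integer(format(toy$sampdate, "%m"))

# Step 1: total precipitation per station, year and month.
monthly <- aggregate(prcp ~ name + year + month, data = toy, FUN = sum)

# Step 2: mean of the monthly totals per station and year.
yearly <- aggregate(prcp ~ name + year, data = monthly, FUN = mean)
print(yearly)
```

The formula interface of aggregate() omits rows with missing values by default, which plays a similar role to na.rm = TRUE in the pipeline above.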

Re: Subsetting dataframe by all factor levels

Rich Shepard
On Sun, 16 Sep 2018, Gabriel Gaona wrote:

> Another option is the tidyverse way. I assume that when you say "My goal is
> to use the monthly mean rainfall at each of the 58 reporting stations..." you
> mean you want, for each year and station, an average value of monthly prcp.
> From your input data frame, it would be something like:
>
> library(dplyr)
> library(lubridate)
> rainfall_yearly <- rainfall %>%
>     # Round dates down to the first of the month for grouping
>     group_by(name, Month = floor_date(sampdate, "month")) %>%
>     # Total precipitation per station and month
>     summarise(prcp = sum(prcp, na.rm = TRUE)) %>%
>     # Round dates down to the first of the year for grouping
>     group_by(name, Year = floor_date(Month, "year")) %>%
>     # Mean of the monthly totals per station and year
>     summarise(prcp = mean(prcp, na.rm = TRUE))

Gabriel,

   Thank you very much. I'm learning a lot from all the responses.

Best regards,

Rich
