

Greetings,
I am testing whether linear relationships exist between my x and y variables. I ran various diagnostics in R to check the normality of the x variable data, using qqnorm, qqline and histograms of the data. If the data appear normally distributed in either the normal quantile plots or the histograms (i.e. a bell-curve-shaped distribution), I assume normality and fit a linear regression model using "lm". However, in some cases my distributions do not satisfy the normality criterion, so I feel that using the linear regression model in those cases would not be appropriate. Could you suggest an alternative to the linear regression model in R? Maybe a nonparametric counterpart to it?
Thank you, and any help would be greatly appreciated!
_______________________________________________
RsigGeo mailing list
[hidden email]
https://stat.ethz.ch/mailman/listinfo/rsiggeo


Note that the normality assumptions are about the residuals (or about
y conditional on x), not about the x variable(s) or the unconditional
distribution of y. If x is highly skewed and the residuals are normal,
then diagnostics on y alone will also show skewness (if there is a
relationship between x and y).
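
A minimal sketch of this point on simulated data, assuming a skewed (exponential) x and normal errors. The variable names and the small `skewness` helper are made up for illustration and are not from any package:

```r
# Simulated data (hypothetical): skewed x, normal errors around a linear trend.
set.seed(1)
n <- 500
x <- rexp(n)                       # strongly right-skewed predictor
y <- 2 + 3 * x + rnorm(n)          # residuals are normal by construction
fit <- lm(y ~ x)

# Simple moment-based sample skewness (helper defined here, not from a package).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_y   <- skewness(y)            # y inherits the skewness of x
skew_res <- skewness(resid(fit))   # residuals are roughly symmetric
c(y = skew_y, residuals = skew_res)

# Diagnostic plots belong on the residuals, not on y or x alone.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  qqnorm(y, main = "y (unconditional)");  qqline(y)
  qqnorm(resid(fit), main = "residuals"); qqline(resid(fit))
  par(op)
}
```

Here a Q-Q plot of y alone would suggest non-normality even though the model assumptions hold exactly.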

Also, the normality assumptions matter for the tests and confidence
intervals; the least squares fit is legitimate (but possibly not the
most interesting fit) whether the residuals are normal or not. The
Central Limit Theorem also applies in regression, so if the residuals
are non-normal but you have a large sample size, then the tests and
intervals will still be approximately correct (with the quality of the
approximation depending on the degree of non-normality and the sample
size).
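
One way to check this for a given situation is a small simulation. The sketch below assumes centred exponential errors, n = 200 and a true slope of 0.5 (all values chosen only for illustration) and estimates how often the nominal 95% interval from `lm` covers the true slope:

```r
set.seed(42)
n     <- 200      # "large" sample (assumed value)
reps  <- 1000     # number of simulated datasets
slope <- 0.5      # true slope used to generate the data

covered <- replicate(reps, {
  x  <- runif(n)
  e  <- rexp(n) - 1                  # skewed errors with mean 0
  y  <- 1 + slope * x + e
  ci <- confint(lm(y ~ x))["x", ]    # 95% interval for the slope
  ci[1] <= slope && slope <= ci[2]
})

coverage <- mean(covered)
coverage   # should be close to the nominal 0.95
```

Changing `n` or the error distribution shows how the quality of the approximation depends on both.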

There are many alternative tools. The CRAN task view for Robust
Statistical Methods summarizes many packages and tools for robust
regression (and other things as well) that do not depend on the
normality assumptions.
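
As one illustration of such a tool, the sketch below compares `lm` with `rlm` from the MASS package (a recommended package bundled with most R installations; that it is installed is an assumption here). The data and the contamination scheme are simulated and purely hypothetical:

```r
library(MASS)   # assumed to be installed; provides rlm()

set.seed(7)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)          # true slope is 2

# Contaminate the 20 responses with the largest x by adding big outliers.
bad <- order(x, decreasing = TRUE)[1:20]
y[bad] <- y[bad] + 40

ols <- lm(y ~ x)
rob <- rlm(y ~ x)                  # Huber M-estimator by default

slopes <- c(ols = unname(coef(ols)["x"]), robust = unname(coef(rob)["x"]))
slopes
```

In this setup the robust slope should stay much closer to the simulated value than the least squares slope, though other estimators from the task view may suit real data better.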
Gregory (Greg) L. Snow Ph.D.
[hidden email]


Hi Greg and others,
Thank you for your very informative response! I made a mistake in my initial message: I was actually testing the y variable, not the x. I will look into those packages on CRAN, but even if there is some skewness in y, because my sample size is much larger than 30 (N>30), might it be safe to apply a linear regression analysis, if we can assume linearity?
Another option would be to use correlation coefficients to measure the degree of association between the x and y variables; specifically, the Pearson correlation coefficient, since both x and y are quantitative. Does that make sense?
Thanks again,


First, please expunge the "(N>30)" concept from your mind. This is an
oversimplified rule of thumb used in introductory statistics courses
(I am guilty of using it in intro stat as well, but I try to
emphasize to my students that it is only a rule of thumb for that
class and that the truth is more complex once you are in the real
world, so consult with a statistician). There is nothing magical about
a sample size of 30; I have seen cases where n=6 was large enough for
the CLT and cases where n=10,000 was not big enough.
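
A rough simulation sketch of why n=30 is not a guarantee. It assumes a log-normal population with sdlog = 2 as the "very skewed" case (an arbitrary choice for illustration) and compares the coverage of nominal 95% t-intervals for the mean against the normal case:

```r
set.seed(123)
reps <- 2000
n    <- 30

# Proportion of nominal 95% t-intervals that contain the true mean.
coverage <- function(rgen, true_mean) {
  mean(replicate(reps, {
    ci <- t.test(rgen(n))$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  }))
}

cover_normal <- coverage(function(k) rnorm(k), 0)
# Log-normal with meanlog = 0, sdlog = 2 is extremely right-skewed;
# its true mean is exp(0 + 2^2 / 2) = exp(2).
cover_skewed <- coverage(function(k) rlnorm(k, 0, 2), exp(2))

c(normal = cover_normal, skewed = cover_skewed)
```

The normal case should sit near 0.95, while the skewed case falls well short at the same sample size.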

If the data are not overly skewed and your sample size is large, then
you can just use regression as is and the inference will be
approximately correct (with a really good approximation). But with
skewness we often prefer the median over the mean. Least squares
regression is equivalent to fitting a mean, while some of the robust
regression options are equivalent to fitting a median, so they may be
preferable on that count.
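
The mean/median correspondence can be seen in the simplest case, a model with only a constant. This sketch uses base R's `optimize` on a simulated skewed sample (names and data are illustrative only): minimising squared error recovers the mean, minimising absolute error recovers the median.

```r
set.seed(99)
y <- rlnorm(501)     # right-skewed sample (odd n, so the median is unique)

# Loss for predicting every observation with a single constant m.
sq_loss  <- function(m) sum((y - m)^2)
abs_loss <- function(m) sum(abs(y - m))

c_ls  <- optimize(sq_loss,  range(y))$minimum   # least squares constant
c_lad <- optimize(abs_loss, range(y))$minimum   # least absolute deviation constant

rbind(least_squares = c(fit = c_ls,  target = mean(y)),
      least_abs     = c(fit = c_lad, target = median(y)))
```

With a predictor in the model the same idea carries over: least squares targets the conditional mean, least absolute deviations the conditional median.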

Note that Pearson's correlation does not test linearity; it assumes
linearity (and bivariate normality). Most issues with regression will
be the same for the correlation.
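
Two constructed examples (not from any real data) make this concrete: a perfect quadratic relationship with a Pearson correlation of essentially zero, and a monotone but curved relationship where the rank-based Spearman correlation is 1 while Pearson's r is not.

```r
# Perfect but non-linear dependence: Pearson's r is essentially zero.
x1 <- seq(-3, 3, length.out = 61)
y1 <- x1^2
r_quadratic <- cor(x1, y1)

# Monotone but curved relationship: rank correlation is 1, Pearson's r is not.
x2 <- seq(0, 10, length.out = 50)
y2 <- exp(x2)
r_pearson  <- cor(x2, y2)
r_spearman <- cor(x2, y2, method = "spearman")

c(quadratic = r_quadratic, pearson = r_pearson, spearman = r_spearman)
```

A scatter plot, or a plot of residuals against fitted values, is the more direct check on linearity.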

Gregory (Greg) L. Snow Ph.D.
[hidden email]


Hi Greg and others,
Thank you for these explanations and clarifications, as they are much appreciated!
Indeed, I do have some datasets that exhibit distinct skewness. Simple scatter plots show at least some linearity between my x and y variables (albeit weak, given how scattered the data points are), but could this be sufficient to try simple linear regression? Also, if the data are overly skewed, could transforming them (for example, with a log transform) justify the use of simple linear regression and/or correlation, if the transformed data are only mildly skewed? I have large sample sizes for all of my datasets, and the variables are continuous.
That would pretty much cover all of my questions concerning this!
Thank you, once again, for your time!


Yes, linear regression is a good place to start. But I would consider
the robust regressions as well, since they answer a different question
from regular linear regression, and that question may be more
appropriate for the skewed data.
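
On the transformation question raised earlier in the thread, the same "different question" caveat applies. The sketch below uses simulated log-normal responses (all parameter values are assumptions for illustration) and a hand-written `skewness` helper. It compares the residuals of a fit on the raw scale with those of a fit on the log scale:

```r
set.seed(2019)
n <- 500
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.8))   # right-skewed response

# Simple moment-based sample skewness (helper defined here, not from a package).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

fit_raw <- lm(y ~ x)         # models the conditional mean of y
fit_log <- lm(log(y) ~ x)    # models the conditional mean of log(y)

skew_raw <- skewness(resid(fit_raw))
skew_log <- skewness(resid(fit_log))
c(raw = skew_raw, log = skew_log)
coef(fit_log)                # slope should be near the simulated 0.5
```

The log-scale residuals are far closer to symmetric, but the two fits describe different quantities: under this log-normal setup, back-transforming the log-scale fit gives the conditional median of y rather than its mean.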

Gregory (Greg) L. Snow Ph.D.
[hidden email]

