ANOVA analysis to compare mean values

According to my findings, we can use ANOVA to compare a set of mean values. ANOVA depends on three main assumptions: normality, homogeneity of variance, and independent observations.

According to the central limit theorem, when the sample size is large, the sample mean, mean(x), has an approximately normal distribution, even if the distribution of x itself is not normal.
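
Stated a little more precisely, as a sketch that assumes independent, identically distributed observations with finite variance (the symbols $X_i$, $\mu$, $\sigma$ and $n$ are notation introduced here, not taken from any source):

$$\sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{\;d\;} N(0,1), \qquad \text{so that } \bar X_n \text{ is approximately } N\!\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n,$$

where $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Note that this concerns the distribution of the sample mean only; the individual observations keep whatever distribution they had.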

My question is: can we use ANOVA to compare means even if the original distribution of each data set is not normal, given that the size of each data set is greater than 1000?

edited Aug 6 at 7:41

asked Aug 6 at 7:33

Pasindu

163

I think this depends on the size of your dataset.
– pointguard0
Aug 6 at 7:39

@pointguard0 Size is greater than 1000.
– Pasindu
Aug 6 at 7:42

Are the conditions for a Kruskal-Wallis test met? K-W is for differences in population medians, but if your distributions are symmetrical that wouldn't matter.
– BruceET
Aug 8 at 15:31

1 Answer

Shifted-exponential data: Here is a demonstration, for one particular simulated dataset, that ANOVA
can have relatively poor power to distinguish among samples of size 1000
from slightly shifted exponential distributions, all with population SD
$\sigma = 1$.

I'm not saying that ANOVA never works on exponential data;
I am saying that there is good reason for the normality assumption.
(Unless the data are normal, the F-statistic does not have exactly the
expected F distribution under $H_0$.)

set.seed(1888); x = rexp(3000)            # 3000 draws from Exp(rate = 1), so SD = 1
d = rep((1:3)/20, each=1000); x = x + d   # shift the three groups by 1/20, 2/20, 3/20
g = as.factor(d*20)                       # group labels 1, 2, 3

Sample means are slightly different:

mean(x[1:1000]); mean(x[1001:2000]); mean(x[2001:3000])
[1] 1.088035
[1] 1.089778
[1] 1.166204

ANOVA not significant:

anova(lm(x~g))
Analysis of Variance Table

Response: x
            Df Sum Sq Mean Sq F value Pr(>F)
g            2    4.0  1.9924  1.8201 0.1622
Residuals 2997 3280.6  1.0946

Kruskal-Wallis detects different shifts:

kruskal.test(x~g)

 Kruskal-Wallis rank sum test

data: x by g
Kruskal-Wallis chi-squared = 7.4152, df = 2, p-value = 0.02454
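
A single seed gives only one dataset, so the comparison above could be repeated over many simulated datasets to estimate how often each test rejects at the 5% level. The following is a minimal sketch under the same setup (three groups of 1000, shifts of 1/20, 2/20 and 3/20); the seed, the number of replications `n_rep`, and the names `p_vals` and `reject_rate` are assumptions chosen for illustration, not part of the original demonstration.

```r
# Sketch: estimated rejection rates of ANOVA and Kruskal-Wallis for
# shifted-exponential groups, over repeated simulated datasets.
set.seed(2024)                           # arbitrary seed (assumption)
n_rep <- 200                             # number of simulated datasets (assumption)
n     <- 1000                            # observations per group, as above
shift <- rep((1:3) / 20, each = n)       # same shifts as above
g     <- factor(rep(1:3, each = n))      # group labels 1, 2, 3

p_vals <- replicate(n_rep, {
  x <- rexp(3 * n) + shift               # fresh shifted-exponential data
  c(anova   = unname(anova(lm(x ~ g))[["Pr(>F)"]][1]),
    kruskal = unname(kruskal.test(x ~ g)$p.value))
})

# Proportion of datasets in which each test rejects at the 5% level
reject_rate <- rowMeans(p_vals < 0.05)
print(reject_rate)
```

The two entries of `reject_rate` are simulation estimates of power for this particular alternative; they carry Monte Carlo error, so a larger `n_rep` gives a steadier comparison.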

The boxplots at the left below show the three shifted-exponential samples, each of size 1000.

par(mfrow=c(1,2))
boxplot(x~g, col="skyblue2")

Shifted-normal data: By contrast, both tests detect the same shifts when the populations are normal.

set.seed(1888); x = rnorm(3000); d = rep((1:3)/20, each=1000); 
g=as.factor(d*20); x =x+d
anova(lm(x~g))
Analysis of Variance Table

Response: x
            Df  Sum Sq Mean Sq F value  Pr(>F)
g            2    8.27  4.1346  4.1808 0.01538 *
Residuals 2997 2963.87  0.9889
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

kruskal.test(x~g)

 Kruskal-Wallis rank sum test

data: x by g
Kruskal-Wallis chi-squared = 8.7825, df = 2, p-value = 0.01239

boxplot(x~g, col="skyblue2")
par(mfrow=c(1,1))

Boxplots at right below show the three normal samples.

[Figure: side-by-side boxplots of the three shifted-exponential samples (left) and the three shifted-normal samples (right).]
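
The parenthetical remark about the null distribution of the F-statistic can also be probed by simulation. The sketch below assumes unshifted exponential data, so that $H_0$ is true, and records how often the usual F test rejects at the 5% level; the seed, `n_rep`, `p_null` and `type1_rate` are names and values introduced here for illustration only.

```r
# Sketch: estimated type I error rate of the ANOVA F test when all three
# groups come from the same exponential population (H0 true).
set.seed(2025)                           # arbitrary seed (assumption)
n_rep <- 200                             # number of simulated datasets (assumption)
n     <- 1000                            # observations per group
g     <- factor(rep(1:3, each = n))      # group labels 1, 2, 3

p_null <- replicate(n_rep, {
  x <- rexp(3 * n)                       # no shift: equal population means
  unname(anova(lm(x ~ g))[["Pr(>F)"]][1])
})

# Proportion of datasets rejected at the 5% level
type1_rate <- mean(p_null < 0.05)
print(type1_rate)
```

If the F reference distribution were adequate for these sample sizes, `type1_rate` should be close to 0.05, up to Monte Carlo error; a clear departure would indicate that the nominal level is not being held.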

answered Aug 8 at 17:26

BruceET

33.3k61440
