Formula of combined variance of two data sets yields wrong output
I have some distribution from which I sample two datasets, x1 and x2. I wanted to calculate their combined mean and variance by using these two formulas:
$$\bar X_c = \frac{n_1 \overline{X}_1 + n_2 \overline{X}_2}{n_1 + n_2}$$
$$S_c^2 = \frac{n_1 S_1^2 + n_2 S_2^2 + n_1\left(\overline{X}_1 - \overline{X}_c\right)^2 + n_2\left(\overline{X}_2 - \overline{X}_c\right)^2}{n_1 + n_2}$$
where $n_i$ is the number of samples in dataset $i$, and the subscript $c$ indicates the combined values.
For testing purposes, I wanted to check whether the formulas yield the same result as stacking the two datasets to create $x3 = x1+x2$ (where $+$ means concatenation) and calculating its mean and variance directly. So I created a dummy dataset like this:
x1    x2
98    69
49    54
33    38
73     9
51
I calculated the means and variances for $x1$ and $x2$ individually, and then for the two combination methods (x3 is the stacked dataset, xC is the result of the formulas above). It yielded:
        x1       x2       x3       xC
mean    60.80    42.50    52.66    52.66
var     635.2    659.0    657.75   728.47
As you can see, the formula worked for the means, but fails to reproduce the correct variance (x3).
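The comparison in the table can be reproduced with a short sketch. This assumes Python's standard `statistics` module rather than the tool originally used; `statistics.variance` normalizes with $n-1$, and the variable names are chosen here for illustration only.

```python
from statistics import mean, variance

x1 = [98, 49, 33, 73, 51]
x2 = [69, 54, 38, 9]
x3 = x1 + x2  # list concatenation, i.e. the stacked dataset

n1, n2 = len(x1), len(x2)
m1, m2 = mean(x1), mean(x2)

# Assumption: variances normalized with n - 1 (Bessel's correction),
# which is what statistics.variance computes.
s1, s2 = variance(x1), variance(x2)

# Combined mean and variance using the two formulas as written above
m_c = (n1 * m1 + n2 * m2) / (n1 + n2)
s_c = (n1 * s1 + n2 * s2
       + n1 * (m1 - m_c) ** 2 + n2 * (m2 - m_c) ** 2) / (n1 + n2)

# The two means agree
print(mean(x3), m_c)
# The two variances do not: 657.75 for x3 versus 728.47 for xC
print(round(variance(x3), 2), round(s_c, 2))
```

The means match, while the variances differ exactly as in the `var` row of the table.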
Can somebody tell me what I am doing wrong? Simple answers would be nice, as I am not a great mathematician.
Thank you!
Tags: variance, means
asked Jul 31 at 11:25 by DocDriven (edited Jul 31 at 11:48)
You can get displayed equations by enclosing them in double instead of single dollar signs. Especially when you mix fractions, subscripts and superscripts, that makes them a lot easier to read.
– joriki, Jul 31 at 11:30
What definition are you using for sample variance? $\frac{1}{n}\sum\dots$ or $\frac{1}{n-1}\sum\cdots$?
– drhab, Jul 31 at 11:38
@drhab: I normalized with $n-1$. I also edited the question to make it more readable.
– DocDriven, Jul 31 at 11:55
From where did you get the formula of $S_c^2$? It does not look okay to me.
– drhab, Jul 31 at 12:10
@drhab: I got it from here: LINK. When using it with the biased variance, it produces the correct result though.
– DocDriven, Jul 31 at 12:17
1 Answer
Accepted answer (2 votes)
This formula is for the sample variances (normalized by $n$). What you wrote in the row labeled var in the table is the estimate for the population variance based on the sample variance (normalized by $n-1$), which differs from the sample variance due to Bessel's correction.
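As a minimal sketch of this distinction, assuming Python's standard `statistics` module (where `pvariance` divides by $n$ and `variance` divides by $n-1$), the formula from the question can be checked against the stacked data. The helper name `combined_variance` is hypothetical and exists only for this illustration.

```python
from statistics import mean, pvariance, variance

x1 = [98, 49, 33, 73, 51]
x2 = [69, 54, 38, 9]
x3 = x1 + x2  # stacked dataset

n1, n2 = len(x1), len(x2)
n = n1 + n2
m1, m2 = mean(x1), mean(x2)
m_c = (n1 * m1 + n2 * m2) / n  # combined mean


def combined_variance(v1, v2):
    """Hypothetical helper: the question's formula.

    v1 and v2 are expected to be divide-by-n (uncorrected) variances.
    """
    between = n1 * (m1 - m_c) ** 2 + n2 * (m2 - m_c) ** 2
    return (n1 * v1 + n2 * v2 + between) / n


# With uncorrected variances the formula matches the stacked data
biased = combined_variance(pvariance(x1), pvariance(x2))
print(round(biased, 2), round(pvariance(x3), 2))  # 584.67 584.67

# Rescaling by n / (n - 1) recovers the Bessel-corrected value
print(round(biased * n / (n - 1), 2), round(variance(x3), 2))  # 657.75 657.75
```

Feeding `variance(x1)` and `variance(x2)` into the same helper instead gives roughly 728.47, the mismatching `xC` value from the question.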
answered Jul 31 at 11:37 by joriki
P.S.: The terminology section of that article says that the corrected estimate for the population variance is also sometimes called the "sample variance". I wasn't aware of that (and it seems like an extremely bad choice of terminology to me). So with that in mind, perhaps replace "sample variance" and "estimate for the population variance" by "uncorrected estimate for the population variance" and "corrected estimate for the population variance", respectively, in the above answer, to avoid ambiguity.
– joriki, Jul 31 at 11:43
Thank you for the clarification. It turned out to be due to me using the unbiased variance in my calculations. Shame on me for not checking what Excel uses!
– DocDriven, Jul 31 at 12:15
@DocDriven: Yes, you could also put "biased" and "unbiased" above for "uncorrected" and "corrected", respectively.
– joriki, Jul 31 at 12:21
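For readers who want to keep working with the unbiased (corrected) variances, the same identity can be rewritten as a sketch. Here $s_i^2$ denotes the variance normalized by $n_i - 1$; this lowercase notation is an assumption introduced for illustration and is not used in the thread. Since $(n_i - 1)s_i^2 = n_i S_i^2$ is the sum of squared deviations within dataset $i$, the combined corrected variance is

$$s_c^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + n_1\left(\overline{X}_1 - \overline{X}_c\right)^2 + n_2\left(\overline{X}_2 - \overline{X}_c\right)^2}{n_1 + n_2 - 1}$$

With the numbers from the question, the two between-group terms sum to $744.2$, so

$$s_c^2 = \frac{4 \cdot 635.2 + 3 \cdot 659.0 + 744.2}{8} = \frac{5262}{8} = 657.75$$

which agrees with the variance of the stacked dataset x3 in the table.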