Formula of combined variance of two data sets yields wrong output


I have some distribution from which I sample two datasets x1 and x2. I wanted to calculate their combined mean and variance by using these two formulas:

$$\overline{X}_c = \frac{n_1 \overline{X}_1 + n_2 \overline{X}_2}{n_1 + n_2}$$

$$S_c^2 = \frac{n_1 S_1^2 + n_2 S_2^2 + n_1\left(\overline{X}_1 - \overline{X}_c\right)^2 + n_2\left(\overline{X}_2 - \overline{X}_c\right)^2}{n_1 + n_2}$$

where $n_1$ and $n_2$ are the numbers of samples in the two datasets. The subscript $c$ indicates the combined values.
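As a minimal sketch, the two formulas can be written as plain functions of the summary statistics. The function names and argument order are illustrative assumptions, not taken from any library, and the variance inputs are used exactly as the formula uses $S_1^2$ and $S_2^2$.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Size-weighted mean of two groups, given their sizes and means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


def combined_variance(n1, mean1, var1, n2, mean2, var2):
    """Combined variance, term by term as in the formula for S_c^2.

    var1 and var2 play the role of S_1^2 and S_2^2; the result is
    normalized by n1 + n2, as in the formula.
    """
    mean_c = combined_mean(n1, mean1, n2, mean2)
    # Spread inside each group, weighted by group size.
    within = n1 * var1 + n2 * var2
    # Spread of the group means around the combined mean.
    between = n1 * (mean1 - mean_c) ** 2 + n2 * (mean2 - mean_c) ** 2
    return (within + between) / (n1 + n2)
```

The split into a "within" and a "between" term mirrors the two pairs of summands in the numerator of $S_c^2$.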

For testing purposes, I wanted to check whether the formulas yield the same result as stacking the two datasets into $x3$ (the concatenation of $x1$ and $x2$) and calculating its mean and variance directly. So I created a dummy dataset like this:

I calculated the means and variances just for $x1$ and $x2$, and then for the 2 combination methods. It yielded:

| | x1 | x2 | x3 | xC |
|---|---|---|---|---|
| mean | 60.80 | 42.50 | 52.66 | 52.66 |
| var | 635.2 | 659.0 | 657.75 | 728.47 |

As you can see, the formula works for the means but fails to reproduce the correct variance (the x3 column).

Can somebody tell me what I am doing wrong? Simple answers would be nice, as I am not a great mathematician.

Thank you!

edited Jul 31 at 11:48

asked Jul 31 at 11:25

DocDriven


You can get displayed equations by enclosing them in double instead of single dollar signs. Especially when you mix fractions, subscripts and superscripts, that makes them a lot easier to read.
– joriki
Jul 31 at 11:30

What definition are you using for sample variance? $\frac{1}{n}\sum\dots$ or $\frac{1}{n-1}\sum\cdots$?
– drhab
Jul 31 at 11:38

@drhab: I normalized with $n-1$. I also edited the question to make it more readable.
– DocDriven
Jul 31 at 11:55

From where did you get the formula of $S_c^2$? It does not look okay to me.
– drhab
Jul 31 at 12:10

@drhab: I got it from here: LINK. When using it with the biased variance, it produces the correct result though.
– DocDriven
Jul 31 at 12:17


1 Answer

accepted

This formula is for the sample variances. What you wrote in the row labeled var in the table is the estimate for the population variance based on the sample variance, which differs from the sample variance due to Bessel's correction.
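A minimal sketch of the point above, using the summary statistics from the question's table. The group sizes `n1 = 5` and `n2 = 4` are an assumption: they are not given in the question, but they are consistent with the reported means. The helper names are illustrative. The idea is to convert each $n-1$ variance to its $1/n$ form, apply the formula, and convert the result back.

```python
def combined_variance(n1, mean1, var1, n2, mean2, var2):
    """Combine two biased (1/n) variances into the biased variance of the stacked data."""
    mean_c = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    between = n1 * (mean1 - mean_c) ** 2 + n2 * (mean2 - mean_c) ** 2
    return (n1 * var1 + n2 * var2 + between) / (n1 + n2)


def to_biased(var_unbiased, n):
    """Undo Bessel's correction: 1/(n-1) normalization -> 1/n."""
    return var_unbiased * (n - 1) / n


def to_unbiased(var_biased, n):
    """Apply Bessel's correction: 1/n normalization -> 1/(n-1)."""
    return var_biased * n / (n - 1)


# Means and (n-1)-normalized variances from the question's table.
# ASSUMPTION: the sizes n1 = 5 and n2 = 4 are inferred, not stated in the question.
n1, mean1, var1_unbiased = 5, 60.80, 635.2
n2, mean2, var2_unbiased = 4, 42.50, 659.0

# Feeding the (n-1) variances straight into the formula mixes conventions.
naive = combined_variance(n1, mean1, var1_unbiased, n2, mean2, var2_unbiased)

# Convert to 1/n, combine, then convert the combined value back to 1/(n-1).
biased_c = combined_variance(
    n1, mean1, to_biased(var1_unbiased, n1),
    n2, mean2, to_biased(var2_unbiased, n2),
)
corrected = to_unbiased(biased_c, n1 + n2)

print(round(naive, 2), round(corrected, 2))
```

Under the assumed sizes, `naive` lands on the table's xC entry (about 728.47) and `corrected` on its x3 entry (about 657.75), which is what one would expect if the only problem is the mixed normalization.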

answered Jul 31 at 11:37

joriki


P.S.: The terminology section of that article says that the corrected estimate for the population variance is also sometimes called the "sample variance". I wasn't aware of that (and it seems like an extremely bad choice of terminology to me). So with that in mind, perhaps replace "sample variance" and "estimate for the population variance" by "uncorrected estimate for the population variance" and "corrected estimate for the population variance", respectively, in the above answer, to avoid ambiguity.
– joriki
Jul 31 at 11:43

Thank you for the clarification. Turned out it was due to me using the unbiased variance in my calculations. Shame on me for not checking what Excel uses!
– DocDriven
Jul 31 at 12:15

@DocDriven: Yes, you could also put "biased" and "unbiased" above for "uncorrected" and "corrected", respectively.
– joriki
Jul 31 at 12:21
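For reference, the combination can also be written directly in terms of the corrected ($n-1$) variances. This is a standard identity rather than something stated in the thread, and the notation $\tilde S^2$ for a corrected variance is introduced here only for illustration. Substituting $S_i^2 = \frac{n_i - 1}{n_i}\tilde S_i^2$ into the formula for $S_c^2$ and multiplying the result by $\frac{n_1 + n_2}{n_1 + n_2 - 1}$ gives

$$\tilde S_c^2 = \frac{(n_1 - 1)\tilde S_1^2 + (n_2 - 1)\tilde S_2^2 + n_1\left(\overline{X}_1 - \overline{X}_c\right)^2 + n_2\left(\overline{X}_2 - \overline{X}_c\right)^2}{n_1 + n_2 - 1}$$

Only the weights on the two variances and the denominator change; the terms involving the means are the same as in the uncorrected formula.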
