Formula of combined variance of two data sets yields wrong output

I have some distribution from which I sample two datasets x1 and x2. I wanted to calculate their combined mean and variance by using these two formulas:



$$\overline{X}_c = \frac{n_1 \overline{X}_1 + n_2 \overline{X}_2}{n_1 + n_2}$$



$$S_c^2 = \frac{n_1 S_1^2 + n_2 S_2^2 + n_1\left( \overline{X}_1 - \overline{X}_c \right)^2 + n_2\left( \overline{X}_2 - \overline{X}_c \right)^2}{n_1 + n_2}$$



where $n_1$ and $n_2$ are the numbers of samples in the two datasets. The subscript $c$ indicates the combined values.
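For readers who want to try the two formulas, here is a minimal Python sketch of them. The function name `combine_mean_var` and its parameter names are made up for illustration and do not come from the question. It assumes the variances passed in are the divide-by-$n$ kind, which is what the accepted answer further down says the formula expects.

```python
def combine_mean_var(n1, mean1, var1, n2, mean2, var2):
    """Combine the sizes, means and variances of two groups.

    Assumption: var1 and var2 are divide-by-n variances.
    """
    n = n1 + n2
    mean_c = (n1 * mean1 + n2 * mean2) / n
    # Spread inside each group, weighted by group size.
    within = n1 * var1 + n2 * var2
    # Spread of the group means around the combined mean.
    between = n1 * (mean1 - mean_c) ** 2 + n2 * (mean2 - mean_c) ** 2
    var_c = (within + between) / n
    return mean_c, var_c


# Toy check with the groups [1, 3] and [5]:
# stacked they are [1, 3, 5], with mean 3 and divide-by-n variance 8/3.
mean_c, var_c = combine_mean_var(2, 2.0, 1.0, 1, 5.0, 0.0)
print(mean_c, var_c)
```

The toy groups are chosen so the result can be checked by hand against the stacked data.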



For testing purposes, I wanted to check if the formulas yield the same result as when stacking the two datasets to create $x3 = x1+x2$ and calculating the mean and variance of it. So I created a dummy dataset like this:



    x1    x2
    98    69
    49    54
    33    38
    73     9
    51


I calculated the means and variances just for $x1$ and $x2$, and then for the 2 combination methods. It yielded:



            x1       x2       x3       xC
    mean    60.80    42.50    52.66    52.66
    var     635.2    659.0    657.75   728.47


As you can see, the formula worked for the means, but fails to reproduce the correct variance (x3).
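The numbers in the table can be reproduced with a short Python sketch. It uses the standard library's `statistics.variance`, which divides by $n-1$ (the normalization the asker confirms using in the comments). The variable names are illustrative only.

```python
from statistics import mean, variance

x1 = [98, 49, 33, 73, 51]
x2 = [69, 54, 38, 9]
x3 = x1 + x2  # list concatenation, i.e. stacking the two datasets

n1, n2 = len(x1), len(x2)
m1, m2 = mean(x1), mean(x2)
v1, v2 = variance(x1), variance(x2)  # n - 1 in the denominator

# The two formulas from the question, applied to the n - 1 variances.
mean_c = (n1 * m1 + n2 * m2) / (n1 + n2)
var_c = (n1 * v1 + n2 * v2
         + n1 * (m1 - mean_c) ** 2
         + n2 * (m2 - mean_c) ** 2) / (n1 + n2)

print("mean:", m1, m2, mean(x3), mean_c)
print("var: ", v1, v2, variance(x3), var_c)
```

Rounded to two decimals this matches the table, except that the combined mean 52.666... rounds to 52.67 rather than 52.66. The mismatch between 657.75 and 728.47 shows up exactly as described.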



Can somebody tell me what I am doing wrong? Simple answers would be nice, as I am not a great mathematician.



Thank you!







  • You can get displayed equations by enclosing them in double instead of single dollar signs. Especially when you mix fractions, subscripts and superscripts, that makes them a lot easier to read.
    – joriki
    Jul 31 at 11:30










  • What definition are you using for sample variance? $\frac1n\sum\dots$ or $\frac1{n-1}\sum\cdots$?
    – drhab
    Jul 31 at 11:38










  • @drhab: I normalized with $n-1$. I also edited the question to make it more readable.
    – DocDriven
    Jul 31 at 11:55










  • From where did you get the formula of $S_c^2$? It does not look okay to me.
    – drhab
    Jul 31 at 12:10










  • @drhab: I got it from here: LINK. When using it with the biased variance, it produces the correct result though.
    – DocDriven
    Jul 31 at 12:17














asked Jul 31 at 11:25 by DocDriven

edited Jul 31 at 11:48

1 Answer
accepted










This formula is for the sample variances. What you wrote in the row labeled var in the table is the estimate for the population variance based on the sample variance, which differs from the sample variance due to Bessel's correction.
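To make this concrete, here is a hedged Python sketch using the data from the question. It feeds the posted formula with divide-by-$n$ variances (`statistics.pvariance`) and compares the result with the stacked data. It also shows a variant for $n-1$ variances that weights each group by $n_i - 1$ and divides by $n_1 + n_2 - 1$. That variant is not part of this thread; it follows from splitting the total sum of squares into within-group and between-group parts. All variable names are illustrative.

```python
from statistics import mean, pvariance, variance

x1 = [98, 49, 33, 73, 51]
x2 = [69, 54, 38, 9]
n1, n2 = len(x1), len(x2)
m1, m2 = mean(x1), mean(x2)

mean_c = (n1 * m1 + n2 * m2) / (n1 + n2)
# Between-group part, shared by both versions of the formula.
between = n1 * (m1 - mean_c) ** 2 + n2 * (m2 - mean_c) ** 2

# Formula as posted, fed with divide-by-n variances.
biased_c = (n1 * pvariance(x1) + n2 * pvariance(x2) + between) / (n1 + n2)

# Variant for n - 1 variances (an addition, not from the thread):
# weight each group by n_i - 1 and divide by n1 + n2 - 1.
unbiased_c = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)
              + between) / (n1 + n2 - 1)

print(biased_c, pvariance(x1 + x2))   # both about 584.67
print(unbiased_c, variance(x1 + x2))  # both about 657.75
```

Either route agrees with the stacked data, as long as the same normalization is used on both sides of the comparison.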






  • P.S.: The terminology section of that article says that the corrected estimate for the population variance is also sometimes called the "sample variance". I wasn't aware of that (and it seems like an extremely bad choice of terminology to me). So with that in mind, perhaps replace "sample variance" and "estimate for the population variance" by "uncorrected estimate for the population variance" and "corrected estimate for the population variance", respectively, in the above answer, to avoid ambiguity.
    – joriki
    Jul 31 at 11:43











  • Thank you for the clarification. Turned out it was due to me using the unbiased variance in my calculations. Shame on me for not checking what Excel uses!
    – DocDriven
    Jul 31 at 12:15










  • @DocDriven: Yes, you could also put "biased" and "unbiased" above for "uncorrected" and "corrected", respectively.
    – joriki
    Jul 31 at 12:21










answered Jul 31 at 11:37 by joriki
