Unbiased pooled estimator of variance
I'm not sure I'm calculating the unbiased pooled estimator for the variance correctly.
Assuming 2 samples where $\sigma_1 = \sigma_2 = \sigma$ and $\sigma$ is unknown, these are my definitions:
Sample variance: $S^2 = \frac{1}{n}\sum (X_i - \bar X)^2$
Unbiased estimator: $\hat S^2 = \frac{n}{n-1}S^2 = \frac{1}{n-1}\sum (X_i - \bar X)^2$
Unbiased pooled variance: $\frac{(n_1 - 1)\hat S_1^2 + (n_2 - 1)\hat S_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}$
The last equation, which should give the unbiased pooled estimate, reduces to:
$$\frac{\sum (X_{1i} - \bar X_1)^2 + \sum (X_{2i} - \bar X_2)^2}{n_1 + n_2 - 2}$$
Is that correct? Should I expect that the unbiased pooled estimate's variance will be lower than the estimated variance of each individual data set ($\underline X_1$ or $\underline X_2$)?
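A quick numerical check in R (a minimal sketch; the sample sizes, means, and seed below are arbitrary choices) confirms that the weighted-average form and the reduced form agree:
set.seed(1); x1 = rnorm(8, 0, 2); x2 = rnorm(12, 0, 2)  # two samples, common sigma
n1 = length(x1); n2 = length(x2)
p.a = ((n1-1)*var(x1) + (n2-1)*var(x2))/(n1 + n2 - 2)   # var() uses denom n-1
p.b = (sum((x1-mean(x1))^2) + sum((x2-mean(x2))^2))/(n1 + n2 - 2)
all.equal(p.a, p.b)  # TRUE: the two forms agree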
statistics estimation hypothesis-testing estimation-theory
asked Aug 5 at 23:51 by s5s
Shouldn't it be $n_1+n_2-1$ in the denominator?
– joriki
Aug 6 at 4:33
@joriki No, -2 is correct.
– s5s
Aug 6 at 10:05
1 Answer
First, your notation for the sample variance seems to be muddled. The sample variance is ordinarily defined as $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,$ which makes it an unbiased estimator of the population variance $\sigma^2.$
Perhaps the most common context for an 'unbiased pooled estimator' of variance is the 'pooled t test': Suppose you have two random samples $X_i$ of size $n$ and $Y_i$ of size $m$ from populations with the same variance $\sigma^2.$ Then
the pooled estimator of $\sigma^2$ is
$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{m+n-2}.$$
This estimator is unbiased.
Because the samples have respective 'degrees of freedom' $n-1$ and $m-1,$ one sometimes says that $S_p^2$ is a 'degrees-of-freedom' weighted average of
the two sample variances. If $n = m,$ then $S_p^2 = 0.5S_X^2 + 0.5S_Y^2.$
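A brief simulation is consistent with unbiasedness (a sketch only; the sizes $n = 5,$ $m = 8,$ the group means, and the seed are arbitrary choices, with $\sigma = 10$):
set.seed(2025); B = 10^5; n = 5; m = 8; sigma = 10
sp.sq = replicate(B, {
  x = rnorm(n, 0, sigma); y = rnorm(m, 50, sigma)  # means may differ
  ((n-1)*var(x) + (m-1)*var(y))/(n + m - 2) })     # pooled estimate
mean(sp.sq)  # close to 100 = sigma^2, consistent with unbiasedness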
Note: Some authors do define the sample variance as $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2,$ but then the sample variance is not an unbiased estimator of $\sigma^2,$ even though it might have other properties desirable for the author's task at hand. However, most agree that the notation $S^2$ is reserved for the version with $n-1$ in the denominator, unless a specific warning is given otherwise.
Example: One common measure of the 'goodness' of an estimator is that it have a small
'root mean squared error'. If $T$ is an estimate of $\tau$ then
$\text{MSE}_T(\tau) = E[(T-\tau)^2]$ and RMSE is its square root.
The simulation below illustrates, for normal data with $n = 5$ and $\sigma^2 = 10^2 = 100,$ that
the version of the sample variance with $n$ in the denominator has smaller
RMSE than the version with $n-1$ in the denominator. (A formal proof for
$n > 1$ is not difficult.)
set.seed(1888); m = 10^6; n = 5; sigma = 10; sg.sq = 100
v.a = replicate(m, var(rnorm(n, 100, sigma))) # denom n-1
v.b = (n-1)*v.a/n # denom n
mean(v.a); RMS.a = sqrt(mean((v.a-sg.sq)^2)); RMS.a
[1] 100.0564 # unbiased
[1] 70.81563 # larger RMSE
mean(v.b); RMS.b = sqrt(mean((v.b-sg.sq)^2)); RMS.b
[1] 80.0451 # biased
[1] 60.06415 # smaller RMSE
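For reference, an outline of that proof for normal data: since $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1},$ we have $\operatorname{Var}(S^2) = 2\sigma^4/(n-1),$ and the biased version $V = \frac{n-1}{n}S^2$ has bias $-\sigma^2/n.$ Then
$$\operatorname{MSE}(S^2) = \frac{2\sigma^4}{n-1}, \qquad \operatorname{MSE}(V) = \frac{2(n-1)\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2} < \frac{2\sigma^4}{n-1} \quad (n > 1).$$
With $n = 5$ and $\sigma = 10$ these give RMSEs $\sqrt{5000} \approx 70.7$ and $\sqrt{3600} = 60,$ matching the simulated values above.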
edited Aug 9 at 15:44
answered Aug 8 at 19:34 by BruceET