Minimizing RSS by taking partial derivative

I am learning about linear regression, and the goal is to find the parameters $\beta$ that minimize the RSS. My textbook accomplishes this by solving $\partial\,\text{RSS}/\partial\beta = 0$. However, I am slightly stuck on the following step.

They define

$\text{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y}-\mathbf{X}\beta)$,

where $\beta$ are scalars, $y$ is a column vector, and $X$ is a matrix.

They find that

$\frac{\partial\,\text{RSS}}{\partial\beta} = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$.

I tried deriving this result. I first wrote
$(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y}-\mathbf{X}\beta) = (\mathbf{y}^T - \mathbf{X}^T\beta)(\mathbf{y} - \mathbf{X}\beta)$.

I then expanded the two terms in brackets:
$\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\beta - \mathbf{y}\mathbf{X}^T\beta + \mathbf{X}^T\mathbf{X}\beta^2$.

Now, I differentiate this with respect to $\beta$:
$-\mathbf{y}^T\mathbf{X} - \mathbf{y}\mathbf{X}^T + 2\beta\,\mathbf{X}^T\mathbf{X}$.

This is where I get stuck: comparing my result with the derived result, we both have the $2\beta\,\mathbf{X}^T\mathbf{X}$ term, but I don't know how my first two terms should simplify to give $-2\mathbf{X}^T\mathbf{y}$.
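For readers who want a quick numerical sanity check of the textbook formula, here is a minimal sketch (randomly generated, purely illustrative data; all names and dimensions are made up) that compares a finite-difference gradient of the RSS with $-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$, treating $\beta$ as a vector:

```python
import numpy as np

# Hypothetical example data (not from the question): N = 6 observations, p = 3 coefficients.
rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def rss(b):
    r = y - X @ b
    return r @ r

# Gradient claimed by the textbook: -2 X^T (y - X beta).
grad_formula = -2 * X.T @ (y - X @ beta)

# Central finite-difference approximation of the same gradient.
eps = 1e-6
grad_fd = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_formula, grad_fd, atol=1e-5))  # True
```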







asked Jul 31 at 15:29 by Thomas Moore, edited Jul 31 at 16:09 by Foobaz John







  • Did you appreciate any of the answers? You should accept the best answer to mark this question as answered.
    – LinAlg
    Jul 31 at 19:46












4 Answers






Accepted answer (score 1) – answered Jul 31 at 17:12 by Clarinetist

Note that $\beta$ is not a scalar, but a vector.

Let
$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
$$\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Np}
\end{bmatrix}$$
and
$$\beta = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix}.$$
Then $\mathbf{X}\beta \in \mathbb{R}^N$ and
$$\mathbf{X}\beta = \begin{bmatrix}
\sum_{j=1}^{p} b_j x_{1j} \\
\sum_{j=1}^{p} b_j x_{2j} \\
\vdots \\
\sum_{j=1}^{p} b_j x_{Nj}
\end{bmatrix}
\implies
\mathbf{y}-\mathbf{X}\beta = \begin{bmatrix}
y_1 - \sum_{j=1}^{p} b_j x_{1j} \\
y_2 - \sum_{j=1}^{p} b_j x_{2j} \\
\vdots \\
y_N - \sum_{j=1}^{p} b_j x_{Nj}
\end{bmatrix}.$$
Therefore,
$$(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta) = \|\mathbf{y}-\mathbf{X}\beta\|^2 = \sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)^2.$$
We have, for each $k = 1, \dots, p$,
$$\dfrac{\partial\,\text{RSS}}{\partial b_k} = 2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)(-x_{ik}) = -2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{ik}.$$
Then
$$\begin{align}
\dfrac{\partial\,\text{RSS}}{\partial \beta} &= \begin{bmatrix}
\dfrac{\partial\,\text{RSS}}{\partial b_1} \\
\dfrac{\partial\,\text{RSS}}{\partial b_2} \\
\vdots \\
\dfrac{\partial\,\text{RSS}}{\partial b_p}
\end{bmatrix} \\
&= \begin{bmatrix}
-2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{i1} \\
-2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{i2} \\
\vdots \\
-2\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{ip}
\end{bmatrix} \\
&= -2\begin{bmatrix}
\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{i1} \\
\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{i2} \\
\vdots \\
\sum_{i=1}^{N}\left(y_i-\sum_{j=1}^{p} b_j x_{ij}\right)x_{ip}
\end{bmatrix} \\
&= -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta).
\end{align}$$
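If it helps to see the index bookkeeping concretely, here is a small NumPy sketch (made-up data, chosen only for illustration) that computes each partial $\partial\,\text{RSS}/\partial b_k$ from the double sum above and checks that stacking them reproduces $-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$:

```python
import numpy as np

# Made-up data purely for illustration: N observations, p coefficients.
rng = np.random.default_rng(1)
N, p = 5, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
b = rng.normal(size=p)

# Component-wise partials, written exactly as the double sum in the answer:
# dRSS/db_k = -2 * sum_i (y_i - sum_j b_j x_ij) * x_ik
grad_components = np.array([
    -2 * sum((y[i] - sum(b[j] * X[i, j] for j in range(p))) * X[i, k] for i in range(N))
    for k in range(p)
])

# Matrix form from the last line of the answer.
grad_matrix = -2 * X.T @ (y - X @ b)

print(np.allclose(grad_components, grad_matrix))  # True
```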






  • Hi. Thanks for this very detailed derivation! Just a quick question: you have denoted $x_{ip}$ above, which would mean that the transpose is $x_{pi}$, but in your final line you don't have $x_{pi}$ but $x_{ip}$, so I'm wondering where the $\mathbf{X}^T$ comes from in the last equality.
    – Thomas Moore
    Aug 1 at 16:09






  • @ThomasMoore I would recommend that you do the multiplication yourself to see that the above is true, but at a very simplistic level: remember that all you're doing in matrix multiplication is taking dot products of rows of the first matrix with columns of the second matrix. The $p$th column of $\mathbf{X}$ becomes the $p$th row of $\mathbf{X}^T$ when you take the transpose. Thus, when you form the dot product between the $p$th row of $\mathbf{X}^T$ and $\mathbf{y} - \mathbf{X}\beta$, you are ultimately using all of the entries of the $p$th column of $\mathbf{X}$.
    – Clarinetist
    Aug 1 at 16:28
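A two-line numerical check of that point, with arbitrary illustrative values: the $k$th row of $\mathbf{X}^T$ is the $k$th column of $\mathbf{X}$, so the $k$th entry of $\mathbf{X}^T r$ is the dot product of the $k$th column of $\mathbf{X}$ with $r$.

```python
import numpy as np

# Arbitrary illustrative matrix and residual vector.
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
r = rng.normal(size=4)          # stands in for y - X @ beta

k = 2
print(np.allclose(X.T[k], X[:, k]))           # k-th row of X^T is the k-th column of X
print(np.isclose((X.T @ r)[k], X[:, k] @ r))  # so entry k of X^T r is that column dotted with r
```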


















Answer (score 1) – answered Jul 31 at 15:48 by LinAlg

The correct transpose (see property 3) is $(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y}-\mathbf{X}\beta) = (\mathbf{y}^T - \beta^T\mathbf{X}^T)(\mathbf{y} - \mathbf{X}\beta)$.

The correct expansion is $\mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\beta - \beta^T \mathbf{X}^T \mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta$.

You can simplify the expansion to
$$\mathbf{y}^T\mathbf{y} + (-\mathbf{X}^T \mathbf{y})^T \beta + (-\mathbf{X}^T \mathbf{y})^T \beta + \beta^T\mathbf{X}^T\mathbf{X}\beta$$
and the result readily follows.
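For a concrete check (random, made-up vectors, just for illustration), the following NumPy snippet confirms the expansion above and that the two cross terms are the same scalar:

```python
import numpy as np

# Random illustrative data.
rng = np.random.default_rng(3)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
beta = rng.normal(size=3)

rss_direct = (y - X @ beta) @ (y - X @ beta)

# The correct expansion: y'y - y'X b - b'X'y + b'X'X b.
rss_expanded = (y @ y
                - y @ X @ beta
                - beta @ X.T @ y
                + beta @ X.T @ X @ beta)

print(np.isclose(rss_direct, rss_expanded))      # True
print(np.isclose(y @ X @ beta, beta @ X.T @ y))  # the two cross terms are the same scalar
```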






  • Hi. Thanks for this. But since $\beta$ is a scalar, isn't $\beta^T = \beta$?
    – Thomas Moore
    Jul 31 at 15:49







  • @ThomasMoore my derivation applies both to scalars and to vectors; if you focus on scalars, your derivation goes wrong where you write $\mathbf{y}\mathbf{X}^T$ in the expansion: that is a matrix and should be $\mathbf{X}^T \mathbf{y}$. You can then use that $\mathbf{X}^T\mathbf{y} = \mathbf{y}^T \mathbf{X}$.
    – LinAlg
    Jul 31 at 15:52


















Answer (score 1) – answered Jul 31 at 15:50 by Foobaz John, edited Jul 31 at 16:06

Expand the brackets to write
$$
\begin{align}
RSS(\beta)&=y'y-y'X\beta-\beta'X'y+\beta'X'X\beta\\
&=y'y-2\beta'X'y+\beta'X'X\beta
\end{align}
$$
where primes denote the transpose and $y'X\beta=\beta'X'y=(y'X\beta)'$ since $y'X\beta$ is a $1\times 1$ matrix (a scalar). Now we can differentiate to get
$$
\frac{\partial RSS(\beta)}{\partial \beta}=-2X'y+2X'X\beta=-2X'(y-X\beta).
$$
Here we used two properties. First, if $u=\alpha'x$ where $\alpha,x\in\mathbb{R}^n$, then
$$
\frac{\partial u}{\partial x_j}=\alpha_j\implies \frac{\partial u}{\partial x}=\alpha.
$$
Note that $\frac{\partial u}{\partial x}$ in this case is the gradient. Second, if $u=x'Ax=\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij} x_i x_j$ where $A\in M_{n\times n}(\mathbb{R})$ and $x\in\mathbb{R}^n$, then
$$
\frac{\partial u}{\partial x_\ell}=\sum_{i=1}^{n}a_{i\ell}x_i+\sum_{i=1}^{n}a_{\ell i}x_i=[(A'+A)x]_\ell
\implies
\frac{\partial u}{\partial x}=(A'+A)x.
$$
In particular, if $A$ is symmetric (like $X'X$ above), we have $\frac{\partial u}{\partial x}=2Ax$.
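If you want to verify the two differentiation rules numerically, here is a small sketch using central finite differences on arbitrary $\alpha$, $A$ and $x$ (illustrative values only, not tied to the regression problem):

```python
import numpy as np

# Finite-difference check of the two rules, with arbitrary alpha, A and x.
rng = np.random.default_rng(4)
n = 4
alpha = rng.normal(size=n)
A = rng.normal(size=(n, n))      # not symmetric in general
x = rng.normal(size=n)
eps = 1e-6

def fd_grad(f, x):
    """Central finite-difference gradient of a scalar function f at x."""
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(len(x))])

print(np.allclose(fd_grad(lambda v: alpha @ v, x), alpha))          # d(a'x)/dx = a
print(np.allclose(fd_grad(lambda v: v @ A @ v, x), (A.T + A) @ x))  # d(x'Ax)/dx = (A' + A)x
```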






Answer (score 1) – answered Jul 31 at 17:02 by David, edited Jul 31 at 17:16

Remark: $\beta$ is a vector.

In multiple regression, if you have $n$ independent variables, then you have $n+1$ parameters to estimate (including the intercept), that is,
$$y_t=\beta_0+\beta_1 X_{1t}+\dots+\beta_n X_{nt}+e_t,$$
where each $\beta_i$ is a scalar. We can write the above in matrix notation (your problem is stated in matrix notation):
$$y=X\beta+e,$$
where $X$ is a matrix and $y$, $\beta$ and $e$ are vectors. More precisely, each $\beta_i$ is a scalar, but $\beta$ is a vector. Furthermore, note that the unique solution of the problem you mentioned is
$$\beta=(X^TX)^{-1}X^Ty,$$
from which you can easily see that $\beta$ is a vector.
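As a concrete illustration of that closed form (random data, and assuming $X^TX$ is invertible), the normal-equations estimate below matches NumPy's least-squares solver, and the gradient $-2X^T(y-X\hat\beta)$ vanishes at it:

```python
import numpy as np

# Random, well-conditioned illustrative data; assumes X^T X is invertible.
rng = np.random.default_rng(5)
N, p = 20, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X^T X)^{-1} X^T y via a linear solve
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # NumPy's least-squares solver

print(np.allclose(beta_hat, beta_lstsq))              # same estimate
print(np.allclose(-2 * X.T @ (y - X @ beta_hat), 0))  # gradient is zero at the minimizer
```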





