Transpose appearing for the chain rule in the matrix form

I am trying to understand how to derive the derivative of a matrix equation of the form
$$a = \tanh(WX + b)$$
in which $W$ is an $M\times N$ matrix, $X$ is $N\times 1$, and $b$ is $M\times 1$. I'm trying to take the derivative of $a$ with respect to $W$, $X$, and $b$. I already have the final answers:

  1. $$\frac{\partial a}{\partial X} = W^T\bigl(1 - \tanh(WX+b)^2\bigr)$$
     I don't understand how $W$ moves to the left-hand side of $\bigl(1 - \tanh(WX+b)^2\bigr)$ and gets transposed. I understand that the chain rule is
     $$\frac{\partial f(u)}{\partial x} = f'(u)\,\frac{\partial u}{\partial x},$$
     so in my example $\frac{\partial u}{\partial x}$ is on the right-hand side of the equation.

  2. $$\frac{\partial a}{\partial W} = \bigl(1 - \tanh(WX+b)^2\bigr)X^T$$
     I don't understand how $X$ gets transposed and moves to the left-hand side.

asked Aug 3 at 20:46, edited Aug 3 at 21:09 – Amir Hossein F
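Before getting to the comments, it may help to check the claimed formulas numerically. Here is a small NumPy sketch (my illustration, not part of the original post; the dimensions are arbitrary) that compares a finite-difference Jacobian of $a$ with respect to $X$ against both layout conventions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4
W = rng.standard_normal((M, N))
X = rng.standard_normal((N, 1))
b = rng.standard_normal((M, 1))

a = lambda X: np.tanh(W @ X + b)  # tanh applied component-wise

# Finite-difference Jacobian J[i, j] = d a_i / d X_j
eps = 1e-6
J = np.zeros((M, N))
for j in range(N):
    dX = np.zeros((N, 1))
    dX[j] = eps
    J[:, [j]] = (a(X + dX) - a(X - dX)) / (2 * eps)

s = 1 - np.tanh(W @ X + b) ** 2   # component-wise tanh'(WX + b), shape (M, 1)

# Numerator layout: J equals diag(s) @ W, written s * W via broadcasting
print(np.allclose(J, s * W, atol=1e-8))        # True
# Denominator layout (the posted formula): the transpose, W^T diag(s)
print(np.allclose(J.T, W.T * s.T, atol=1e-8))  # True
```

So both stated answers are correct; the transpose is purely a matter of which layout convention the source uses.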






  • $WX+b$ seems to be an $m\times 1$ vector. How do you define the hyperbolic tangent of a vector? Defining powers of a vector is tough.
    – DinosaurEgg
    Aug 3 at 20:52

  • @DinosaurEgg I think it is applied component-wise.
    – angryavian
    Aug 3 at 20:55

  • But then the derivatives would have to be taken component-wise as well! Notation seems a little vague...
    – DinosaurEgg
    Aug 3 at 21:06

  • Yes, $\tanh$ and the power of 2 are component-wise.
    – Amir Hossein F
    Aug 3 at 21:08
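As the comments establish, both $\tanh$ and the square are applied component-wise. A minimal NumPy illustration of that convention (added for clarity, not part of the thread):

```python
import numpy as np

v = np.array([0.0, 1.0, -2.0])
print(np.tanh(v))           # tanh of each entry
print(1 - np.tanh(v) ** 2)  # component-wise derivative tanh'(v)
```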














1 Answer
Define the variables
$$\begin{aligned}
y &= Wx+b \\
a &= \tanh(y) \implies A = \operatorname{Diag}(a)
\end{aligned}$$
Now calculate the differential and gradient of $a$ with respect to $x$:
$$\begin{aligned}
da &= (1-a\odot a)\odot dy \\
&= (I-A^2)\,dy \\
&= (I-A^2)W\,dx \\
\frac{\partial a}{\partial x} &= (I-A^2)W
\end{aligned}$$
Depending on your layout convention, you might prefer the transpose of this result:
$$\frac{\partial a}{\partial x} = W^T(I-A^2)$$
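To see in components why the transpose appears (this expansion is my addition, not part of the answer as posted): component-wise, $a_i = \tanh\bigl(\sum_k W_{ik}x_k + b_i\bigr)$, so the scalar chain rule gives
$$\frac{\partial a_i}{\partial x_j} = \bigl(1-\tanh^2(y_i)\bigr)\,\frac{\partial y_i}{\partial x_j} = (1-a_i^2)\,W_{ij}$$
Numerator layout stores this entry at row $i$, column $j$, giving $(I-A^2)W$; denominator layout stores the transpose, which is exactly how $W^T$ ends up on the left.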
Note that the gradient of a vector with respect to a vector produces a matrix result. Your second question is about the gradient of a vector with respect to a matrix, which will produce a third-order tensor:
$$\begin{aligned}
da &= (1-a\odot a)\odot dy \\
&= (I-A^2)\,dW\,x \\
&= (I-A^2)\mathcal{H}x : dW \\
\frac{\partial a}{\partial W} &= (I-A^2)\mathcal{H}x
\end{aligned}$$
The above steps use several different product notations which you may not be familiar with:
$$\begin{aligned}
\lambda &= A:B &&\implies \lambda = \sum_i\sum_j A_{ij}B_{ij} \\
L &= A\odot B &&\implies L_{ij} = A_{ij}B_{ij} \\
C &= AB &&\implies C_{ij} = \sum_k A_{ik}B_{kj}
\end{aligned}$$
The symbol $\mathcal{H}$ is a fourth-order tensor with components
$$\mathcal{H}_{ijkl} = \delta_{ik}\,\delta_{jl}$$
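The third-order-tensor result can still be checked numerically through its action on a direction $dW$, since the differential collapses to $da = (I-A^2)\,dW\,x$. A hedged NumPy sketch (my dimensions, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4
W = rng.standard_normal((M, N))
x = rng.standard_normal((N, 1))
b = rng.standard_normal((M, 1))

f = lambda W: np.tanh(W @ x + b)

dW = rng.standard_normal((M, N))       # random perturbation direction
eps = 1e-6
fd = (f(W + eps * dW) - f(W - eps * dW)) / (2 * eps)

s = 1 - f(W) ** 2                      # component-wise tanh'(y), shape (M, 1)
print(np.allclose(fd, s * (dW @ x), atol=1e-8))  # True: da = (I - A^2) dW x
```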






answered Aug 3 at 21:13, edited Aug 4 at 0:04 – greg
  • Just out of curiosity, and from the learning perspective: the first derivative, i.e., $\frac{\partial a}{\partial x}$, is comprehensible. However, the solution for the second derivative, i.e., $\frac{\partial a}{\partial W}$, is tricky. I hope someday I will get these tensors. Can we instead vectorize, $\operatorname{vec}(Wx + b)$, such that the differential would be $(x^T \otimes I)\,\operatorname{vec}(dW)$? What do you say?
    – user550103
    yesterday

  • Vectorization is certainly one way to handle things, but the real solution is to avoid higher-order tensors entirely. Nobody really cares about them, but they need them as an intermediate quantity in a chain-rule calculation. Here's an example of my approach to such problems: math.stackexchange.com/questions/2391112/…
    – greg
    yesterday

  • Thanks for your enlightenment, greg! I have already liked your solution. I am learning a lot of matrix derivatives from you, I must admit. Thank you.
    – user550103
    yesterday
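The vectorization identity raised in the first comment, $\operatorname{vec}(dW\,x) = (x^T\otimes I)\,\operatorname{vec}(dW)$, is easy to confirm numerically; a small sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 4
x = rng.standard_normal((N, 1))
dW = rng.standard_normal((M, N))

vec = lambda A: A.reshape(-1, 1, order="F")  # stack columns (column-major)

lhs = vec(dW @ x)
rhs = np.kron(x.T, np.eye(M)) @ vec(dW)
print(np.allclose(lhs, rhs))  # True: vec(dW x) = (x^T kron I) vec(dW)
```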









