Transpose appearing for the chain rule in the matrix form
I am trying to understand how to derive the derivative of a matrix equation of the form:
$$a = \tanh(WX + b)$$
in which $W$ is an $M \times N$ matrix, $X$ is $N \times 1$, and $b$ is $M \times 1$. I'm trying to take the derivative of $a$ with respect to $W$, $X$, and $b$. I already have the final answers as:
$$\partial a/\partial X = W^T(1 - \tanh(WX+b)^2)$$
I don't understand how $W$ moves to the left-hand side of $(1 - \tanh(WX+b)^2)$ and gets transposed. I understand that the chain rule is:
$$\partial f(u)/\partial x = f'(u)\,\partial u/\partial x$$ so in my example $\partial u/\partial x$ is on the right-hand side of the equation.
$$\partial a/\partial W = (1 - \tanh(WX+b)^2)X^T$$
in which I don't understand how $X$ gets transposed and moves to the left-hand side.
calculus derivatives matrix-calculus
$WX+b$ seems to be an $m \times 1$ vector. How do you define the hyperbolic tangent of a vector? Defining powers of a vector is tough.
– DinosaurEgg
Aug 3 at 20:52
@DinosaurEgg I think it is applied component-wise.
– angryavian
Aug 3 at 20:55
But then the derivatives would have to be taken component-wise as well! Notation seems a little vague...
– DinosaurEgg
Aug 3 at 21:06
Yes, tanh and the power of 2 are applied component-wise.
– Amir Hossein F
Aug 3 at 21:08
edited Aug 3 at 21:09
asked Aug 3 at 20:46
Amir Hossein F
1 Answer
Define the variables
$$\eqalign{
y &= Wx+b \cr
a &= \tanh(y) \implies A = {\rm Diag}(a) \cr
}$$
Now calculate the differential and gradient of $a$ wrt $x$
$$\eqalign{
da &= (1-a\odot a)\odot dy \cr
&= (I-A^2)\,dy \cr
&= (I-A^2)W\,dx \cr
\frac{\partial a}{\partial x} &= (I-A^2)W \cr
}$$
Depending on your layout convention, you might prefer the transpose of this result
$$\eqalign{
\frac{\partial a}{\partial x} &= W^T(I-A^2) \cr
}$$
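As a sanity check, the Jacobian above can be compared against central finite differences. This is only a minimal pure-Python sketch: the sizes ($M=2$, $N=3$), the numeric values, and the helper names `forward`, `analytic_jacobian` and `numeric_jacobian` are assumptions made for illustration and are not part of the original answer.

```python
import math

def forward(W, x, b):
    """a = tanh(W x + b), with tanh applied component-wise."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def analytic_jacobian(W, x, b):
    """J = (I - A^2) W, i.e. J[i][j] = (1 - a_i^2) * W[i][j] (shape M x N)."""
    a = forward(W, x, b)
    return [[(1.0 - ai * ai) * w for w in row] for ai, row in zip(a, W)]

def numeric_jacobian(W, x, b, h=1e-6):
    """Central differences: J[i][j] ~ (a_i(x + h e_j) - a_i(x - h e_j)) / (2h)."""
    M, N = len(W), len(x)
    J = [[0.0] * N for _ in range(M)]
    for j in range(N):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        ap, am = forward(W, xp, b), forward(W, xm, b)
        for i in range(M):
            J[i][j] = (ap[i] - am[i]) / (2 * h)
    return J

# Arbitrary example values (assumed for illustration only).
W = [[0.5, -1.0, 0.3], [0.2, 0.7, -0.4]]
x = [0.1, -0.2, 0.4]
b = [0.05, -0.1]

J = analytic_jacobian(W, x, b)        # M x N layout: (I - A^2) W
J_T = [list(col) for col in zip(*J)]  # N x M layout: W^T (I - A^2)
J_num = numeric_jacobian(W, x, b)

max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(2) for j in range(3))
print("max abs difference:", max_err)
```

Because $A$ is diagonal, $(I-A^2)$ is symmetric, so transposing $(I-A^2)W$ gives exactly $W^T(I-A^2)$; the two layouts hold the same numbers arranged as $M\times N$ or $N\times M$.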
Note that the gradient of a vector wrt a vector produces a matrix result. Your second question is about the gradient of a vector wrt a matrix, which will produce a third-order tensor.
$$\eqalign{
da &= (1-a\odot a)\odot dy \cr
&= (I-A^2)\,dW\,x \cr
&= (I-A^2){\mathcal H}x:dW \cr
\frac{\partial a}{\partial W} &= (I-A^2){\mathcal H}x \cr
}$$
The above steps use several different product notations which you may not be familiar with
$$\eqalign{
\lambda &= A:B &\implies \lambda = \sum_i\sum_j A_{ij} B_{ij} \cr
L &= A\odot B &\implies L_{ij} = A_{ij} B_{ij} \cr
C &= AB &\implies C_{ij} = \sum_k A_{ik} B_{kj} \cr
}$$
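For readers who prefer code to index notation, the three products can be spelled out on small matrices. This is an illustrative sketch only; the function names `frobenius`, `hadamard` and `matmul` and the sample matrices are assumptions, not something taken from the answer.

```python
def frobenius(A, B):
    """lambda = A : B = sum_i sum_j A_ij * B_ij (a scalar)."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def hadamard(A, B):
    """L = A (.) B with L_ij = A_ij * B_ij (same shape as A and B)."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(A, B):
    """C = A B with C_ij = sum_k A_ik * B_kj."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

lam = frobenius(A, B)  # 1*5 + 2*6 + 3*7 + 4*8 = 70
L = hadamard(A, B)     # [[5, 12], [21, 32]]
C = matmul(A, B)       # [[19, 22], [43, 50]]
print(lam, L, C)
```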
The symbol $\mathcal H$ is a fourth-order tensor with components
$$\mathcal H_{ijkl} = \delta_{ik}\,\delta_{jl}$$
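Written out in components, the result above reads $\partial a_i/\partial W_{kl} = (1-a_i^2)\,\delta_{ik}\,x_l$. The sketch below checks this against finite differences and then contracts the tensor with an upstream vector `g`. Everything here is an assumption for illustration (the sizes, the values, the names `analytic_grad_W` and `numeric_grad_W`, and the vector `g`, which stands in for whatever gradient arrives from later steps of a chain rule).

```python
import math

def forward(W, x, b):
    """a = tanh(W x + b), with tanh applied component-wise."""
    return [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def analytic_grad_W(W, x, b):
    """G[i][k][l] = da_i/dW_kl = (1 - a_i^2) * delta_ik * x_l."""
    a = forward(W, x, b)
    M, N = len(W), len(x)
    return [[[(1.0 - a[i] ** 2) * (1.0 if i == k else 0.0) * x[l]
              for l in range(N)] for k in range(M)] for i in range(M)]

def numeric_grad_W(W, x, b, h=1e-6):
    """Central differences, perturbing one entry W_kl at a time."""
    M, N = len(W), len(x)
    G = [[[0.0] * N for _ in range(M)] for _ in range(M)]
    for k in range(M):
        for l in range(N):
            Wp = [row[:] for row in W]
            Wm = [row[:] for row in W]
            Wp[k][l] += h
            Wm[k][l] -= h
            ap, am = forward(Wp, x, b), forward(Wm, x, b)
            for i in range(M):
                G[i][k][l] = (ap[i] - am[i]) / (2 * h)
    return G

# Arbitrary example values (assumed for illustration only).
W = [[0.5, -1.0, 0.3], [0.2, 0.7, -0.4]]
x = [0.1, -0.2, 0.4]
b = [0.05, -0.1]
M, N = 2, 3

G = analytic_grad_W(W, x, b)
G_num = numeric_grad_W(W, x, b)
max_err = max(abs(G[i][k][l] - G_num[i][k][l])
              for i in range(M) for k in range(M) for l in range(N))

# Contract the third-order tensor with an assumed upstream vector g ...
g = [1.0, -2.0]
contracted = [[sum(g[i] * G[i][k][l] for i in range(M)) for l in range(N)]
              for k in range(M)]
# ... and compare with the M x N outer product ((1 - a*a) * g) x^T.
a = forward(W, x, b)
outer = [[(1.0 - a[k] ** 2) * g[k] * x[l] for l in range(N)] for k in range(M)]
gap = max(abs(contracted[k][l] - outer[k][l])
          for k in range(M) for l in range(N))
print("finite-difference error:", max_err, "contraction gap:", gap)
```

The contraction collapses the tensor to an ordinary $M\times N$ matrix of the form $\big((1-a\odot a)\odot g\big)x^T$, which has the same shape as the $X^T$ expression quoted in the question; whether that is how the quoted formula was obtained is not stated in the thread.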
Just out of curiosity and from the learning perspective: the first derivative, i.e., $\frac{\partial a}{\partial x}$, is comprehensible. However, the solution for the second derivative, i.e., $\frac{\partial a}{\partial W}$, is tricky. I hope someday I will get these tensors. Can we instead vectorize ${\rm vec}(Wx + b)$ such that the differential would be $(x^T \otimes I)\,{\rm vec}(dW)$? What do you say?
– user550103
yesterday
Vectorization is certainly one way to handle things, but the real solution is to avoid higher-order tensors entirely. Nobody really cares about them, but they need them as an intermediate quantity in a chain rule calculation. Here's an example of my approach to such problems: math.stackexchange.com/questions/2391112/…
– greg
yesterday
Thanks for your enlightenment, greg! I have already liked your solution. I must admit I am learning a lot about matrix derivatives from you. Thank you.
– user550103
yesterday
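The vectorization identity raised in the comments, ${\rm vec}(dW\,x) = (x^T \otimes I)\,{\rm vec}(dW)$, can be checked numerically. The sketch below assumes the column-stacking convention for ${\rm vec}$; the helper names `vec` and `kron_xT_I` and the sample values are hypothetical and chosen only for illustration.

```python
def vec(mat):
    """Column-stacking vectorization: vec(mat)[c * rows + r] = mat[r][c]."""
    rows, cols = len(mat), len(mat[0])
    return [mat[r][c] for c in range(cols) for r in range(rows)]

def kron_xT_I(x, M):
    """(x^T kron I_M): an M x (N*M) matrix with entry [i][l*M + k] = x_l * delta_ik."""
    return [[x[l] * (1.0 if i == k else 0.0)
             for l in range(len(x)) for k in range(M)]
            for i in range(M)]

# Arbitrary example values with M = 2, N = 3 (assumed for illustration only).
dW = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
x = [1.0, 2.0, -1.0]
M = len(dW)

# Left-hand side: dW x is already a column vector, so vec() leaves it unchanged.
lhs = [sum(w * xj for w, xj in zip(row, x)) for row in dW]

# Right-hand side: (x^T kron I) vec(dW).
K = kron_xT_I(x, M)
v = vec(dW)
rhs = [sum(kij * vj for kij, vj in zip(row, v)) for row in K]

diff = max(abs(l - r) for l, r in zip(lhs, rhs))
print("max abs difference:", diff)
```

With this identity, $da = (I-A^2)\,(x^T \otimes I)\,{\rm vec}(dW)$, so the derivative with respect to ${\rm vec}(W)$ is an ordinary $M \times MN$ matrix rather than a third-order tensor.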
edited Aug 4 at 0:04
answered Aug 3 at 21:13
greg