Transpose appearing in the matrix form of the chain rule

I am trying to understand how to take the derivative of a matrix equation of the form

$$a = \tanh(WX + b)$$

in which $W$ is an $M \times N$ matrix, $X$ is $N \times 1$, and $b$ is $M \times 1$. I'm trying to take the derivative of $a$ with respect to $W$, $X$, and $b$. I already have the final answers:

  1. $$\frac{\partial a}{\partial X} = W^T\left(1 - \tanh(WX+b)^2\right)$$
     I don't understand how $W$ moves to the left of $\left(1 - \tanh(WX+b)^2\right)$ and gets transposed. I understand the chain rule as
     $$\frac{\partial f(u)}{\partial x} = f'(u)\,\frac{\partial u}{\partial x},$$
     so in my example $\frac{\partial u}{\partial x}$ should appear on the right-hand side of the product.

  2. $$\frac{\partial a}{\partial W} = \left(1 - \tanh(WX+b)^2\right)X^T$$
     Here I don't understand how $X$ gets transposed and moves outside of $\left(1 - \tanh(WX+b)^2\right)$.






asked Aug 3 at 20:46 by Amir Hossein F, edited Aug 3 at 21:09
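The following is my own sanity check, not part of the original post: a minimal NumPy sketch, with arbitrary sizes $M=4$, $N=3$, that compares the Jacobian $\mathrm{Diag}(1-a^2)\,W$ (whose transpose is the quoted $W^T(1-\tanh(WX+b)^2)$ form, as the answer below explains) against finite differences.

```python
import numpy as np

# Hypothetical sizes, chosen only for the check
M, N = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((M, N))
X = rng.standard_normal((N, 1))
b = rng.standard_normal((M, 1))

a = np.tanh(W @ X + b)                  # shape (M, 1)

# Component form of the quoted result:
# J[i, j] = d a_i / d X_j = (1 - a_i^2) * W[i, j],
# i.e. J = Diag(1 - a^2) @ W; its transpose is W^T @ Diag(1 - a^2).
J_analytic = np.diagflat(1 - a**2) @ W  # shape (M, N)

# Finite-difference Jacobian, one column per component of X
eps = 1e-6
J_numeric = np.zeros((M, N))
for j in range(N):
    dX = np.zeros((N, 1))
    dX[j] = eps
    J_numeric[:, [j]] = (np.tanh(W @ (X + dX) + b) - a) / eps

print(np.allclose(J_analytic, J_numeric, atol=1e-4))  # expect True
```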





















  • $WX+b$ seems to be an $m\times 1$ vector. How do you define the hyperbolic tangent of a vector? Defining powers of a vector is tough.
    – DinosaurEgg
    Aug 3 at 20:52










  • @DinosaurEgg I think it is applied component-wise.
    – angryavian
    Aug 3 at 20:55










  • But then the derivatives would have to be taken component-wise as well! Notation seems a little vague...
    – DinosaurEgg
    Aug 3 at 21:06










  • Yes, the tanh and the power of 2 are applied component-wise.
    – Amir Hossein F
    Aug 3 at 21:08
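To make the component-wise convention from this comment thread concrete, here is a tiny illustrative snippet (my addition, not from the thread); NumPy applies both operations entry-wise, matching the notation used in the question.

```python
import numpy as np

v = np.array([[0.5], [-1.0], [2.0]])  # an m x 1 vector, m = 3
print(np.tanh(v))      # tanh of each entry separately
print(np.tanh(v)**2)   # the "power of 2" is also entry-wise
```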














1 Answer
Define the variables
$$\eqalign{
y &= Wx+b \cr
a &= \tanh(y) \implies A = {\rm Diag}(a) \cr
}$$
Now calculate the differential and gradient of $a$ wrt $x$
$$\eqalign{
da &= (1-a\odot a)\odot dy \cr
   &= (I-A^2)\,dy \cr
   &= (I-A^2)W\,dx \cr
\frac{\partial a}{\partial x} &= (I-A^2)W \cr
}$$
Depending on your layout convention, you might prefer the transpose of this result
$$\frac{\partial a}{\partial x} = W^T(I-A^2)$$
Note that the gradient of a vector wrt a vector produces a matrix result. Your second question is about the gradient of a vector wrt a matrix, which will produce a third-order tensor.
$$\eqalign{
da &= (1-a\odot a)\odot dy \cr
   &= (I-A^2)\,dW\,x \cr
   &= (I-A^2){\mathcal H}x:dW \cr
\frac{\partial a}{\partial W} &= (I-A^2){\mathcal H}x \cr
}$$
The above steps use several different product notations which you may not be familiar with
$$\eqalign{
\lambda &= A:B &\implies \lambda = \sum_i\sum_j A_{ij} B_{ij} \cr
L &= A\odot B &\implies L_{ij} = A_{ij} B_{ij} \cr
C &= AB &\implies C_{ij} = \sum_k A_{ik} B_{kj} \cr
}$$
The symbol $\mathcal H$ is a fourth-order tensor with components
$$\mathcal H_{ijkl} = \delta_{ik}\,\delta_{jl}$$
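As a numerical cross-check (my addition, not part of this answer), the sketch below verifies the Hadamard/diagonal identity $(1-a\odot a)\odot dy=(I-A^2)\,dy$ and probes $da=(I-A^2)W\,dx$ with a finite difference; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal((n, 1))
b = rng.standard_normal((m, 1))

y = W @ x + b
a = np.tanh(y)
A = np.diagflat(a)                 # A = Diag(a)
I = np.eye(m)

# (1 - a o a) o dy  ==  (I - A^2) dy   for any dy
dy = rng.standard_normal((m, 1))
lhs = (1 - a * a) * dy             # Hadamard form
rhs = (I - A @ A) @ dy             # matrix form
print(np.allclose(lhs, rhs))       # expect True

# directional (finite-difference) check of  da = (I - A^2) W dx
eps = 1e-6
dx = rng.standard_normal((n, 1))
da_numeric = (np.tanh(W @ (x + eps * dx) + b) - a) / eps
da_analytic = (I - A @ A) @ W @ dx
print(np.allclose(da_numeric, da_analytic, atol=1e-4))  # expect True
```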






answered Aug 3 at 21:13 by greg, edited Aug 4 at 0:04























  • Just out of curiosity and from the learning perspective: the first derivative, i.e., $\frac{\partial a}{\partial x}$, is comprehensible. However, the solution for the second derivative, i.e., $\frac{\partial a}{\partial W}$, is tricky. I hope someday I will get these tensors. Can we instead vectorize, $\mathrm{vec}(Wx + b)$, such that the differential would be $(x^T \otimes I)\,\mathrm{vec}(dW)$? What do you say?
    – user550103
    yesterday






  • Vectorization is certainly one way to handle things, but the real solution is to avoid higher-order tensors entirely. Nobody really cares about them, but they are needed as an intermediate quantity in a chain-rule calculation. Here's an example of my approach to such problems: math.stackexchange.com/questions/2391112/…
    – greg
    yesterday











  • Thanks for your enlightenment, greg! I have already liked your solution. I am learning a lot about matrix derivatives from you, I must admit. Thank you.
    – user550103
    yesterday
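For completeness, a small sketch of the vectorization route discussed in this comment thread (my addition): it checks the standard identity $\mathrm{vec}(dW\,x)=(x^T\otimes I)\,\mathrm{vec}(dW)$, which yields $\partial\,\mathrm{vec}(a)/\partial\,\mathrm{vec}(W)=(I-A^2)(x^T\otimes I)$ without any higher-order tensor.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
x = rng.standard_normal((n, 1))
dW = rng.standard_normal((m, n))

# vec() below stacks columns (Fortran order), matching the usual
# convention in which vec(AXB) = (B^T kron A) vec(X).
vec = lambda M: M.reshape(-1, 1, order="F")

lhs = vec(dW @ x)
rhs = np.kron(x.T, np.eye(m)) @ vec(dW)
print(np.allclose(lhs, rhs))  # expect True
```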









