Hessian on linear least squares problem
I tried to calculate the Hessian matrix of the linear least squares problem ($L_2$ norm), in particular
$$f(X) = \| AX - B \|_2$$
where $f:\mathbb{R}^{11\times 2}\rightarrow \mathbb{R}$.
Can someone help me?
Thanks a lot.
Tags: linear-algebra, least-squares, hessian-matrix
asked Jul 27 at 16:51 by S-F, edited Jul 30 at 16:28
Comments:

It makes more sense to work with the norm squared, since that is a smooth function (while the norm itself fails to be differentiable at one point). – hardmath, Jul 27 at 16:55

It's not twice differentiable? – S-F, Jul 27 at 17:01

The Hessian is the matrix of second partial derivatives. My point is that using the norm squared as your objective function, rather than the $L^2$ norm itself, avoids problems with taking derivatives. Consider the one-dimensional case $f(x) = |x|$ and take the second derivative; it will not be as useful as it is for $f(x) = x^2$. – hardmath, Jul 27 at 17:06

Yes, the squared norm is better: $\|AX-B\|_F^2 = \operatorname{tr}\big((AX-B)^T(AX-B)\big)$. – mathreadler, Jul 30 at 18:20

Are you using the spectral norm or the Frobenius norm? – Rodrigo de Azevedo, Jul 30 at 19:17
5 Answers
First calculate the gradient vector: use the chain rule and compute the partial derivatives of $f(x)$ with respect to $x \in \mathbb{R}^n$. You will get a function that eats a vector and produces another "vector" $g(x) \in \mathbb{R}^n$ (this is an abuse of notation and terminology: $g(x)$ produces a vector of functions, not a vector in $\mathbb{R}^n$, so it is really a "vector operator").
Then take the partial derivatives of $g(x)$ with respect to $x$, again applying the chain rule. For that you can view $g(x)$ as a vector of simpler functions $g_i(x) \in \mathbb{R}$, each of which eats a vector and produces a scalar value.
So for each component of $g(x)$ you have a function $g_i(x) \in \mathbb{R}$, and taking the partial derivatives of $g(x)$ with respect to $x$ amounts to taking the partial derivatives of each $g_i(x)$ with respect to $x$ and putting them together. That is the Hessian matrix.
In the same way that the derivative of $f(x)$ with respect to $x$ produces a vector operator, the derivative of each $g_i(x)$ with respect to $x$ produces a vector operator, and hence the derivative of $g(x)$ with respect to $x$ produces a matrix operator, namely the Hessian matrix.
answered Jul 27 at 17:32 by Mauricio Cele Lopez Belon
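A minimal numerical sketch of this recipe (added for illustration, not part of the original answer; the matrix sizes and data below are arbitrary): take the squared objective $\tfrac12\|Ax-b\|^2$, form its gradient $g(x)$, and assemble the Hessian column by column by finite-differencing each component of $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(11, 2))    # illustrative sizes only
b = rng.normal(size=11)

def g(x):
    # gradient of f(x) = 0.5 * ||A x - b||^2
    return A.T @ (A @ x - b)

def hessian_from_gradient(x, eps=1e-6):
    # differentiate each component of g numerically and stack the results
    n = x.size
    H = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        H[:, j] = (g(x + e) - g(x - e)) / (2 * eps)
    return H

x0 = rng.normal(size=2)
print(np.allclose(hessian_from_gradient(x0), A.T @ A, atol=1e-5))  # True
```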
Let $f:\mathbb{R}^n \to \mathbb{R}$ be defined by
$$
f(x)=\frac12 \|Ax-b\|^2.
$$
Notice that $f(x)=g(h(x))$, where $h(x)=Ax-b$ and $g(y) = \frac12 \|y\|^2$. The derivatives of $g$ and $h$ are given by
$$
g'(y)=y^T, \quad h'(x)=A.
$$
The chain rule tells us that
$$
f'(x)=g'(h(x))\,h'(x) = (Ax-b)^T A.
$$
If we use the convention that the gradient is a column vector, then
$$
\nabla f(x)=f'(x)^T=A^T(Ax-b).
$$
The Hessian $Hf(x)$ is the derivative of the function $x \mapsto \nabla f(x)$, so
$$
Hf(x)= A^T A.
$$
answered Jul 30 at 18:49 by littleO, edited Jul 30 at 18:54
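A quick numerical sanity check of the formula $\nabla f(x) = A^T(Ax-b)$ (added here for illustration; the sizes are arbitrary), comparing it against a central finite-difference gradient of $f$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(7, 3))     # illustrative sizes only
b = rng.normal(size=7)
x = rng.normal(size=3)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)

# central finite-difference gradient of f at x
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

print(np.allclose(num_grad, A.T @ (A @ x - b), atol=1e-5))  # True
```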
Let $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ be defined by
$$f (\mathrm X) := \frac 12 \| \mathrm A \mathrm X - \mathrm B \|_{\mathrm F}^2 = \frac 12 \| (\mathrm I_n \otimes \mathrm A) \, \operatorname{vec} (\mathrm X) - \operatorname{vec} (\mathrm B) \|_2^2$$
where $\operatorname{vec}$ is the vectorization operator and $\otimes$ is the Kronecker product. Thus, the Hessian of $f$ is
$$(\mathrm I_n \otimes \mathrm A)^\top (\mathrm I_n \otimes \mathrm A) = \mathrm I_n \otimes \mathrm A^\top \mathrm A$$
answered Jul 31 at 5:31 by Rodrigo de Azevedo
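A small numpy check of the vectorized form (added for illustration; dimensions are arbitrary). Here $\operatorname{vec}$ stacks columns, which is column-major (Fortran-order) flattening:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 4))        # illustrative sizes only
X = rng.normal(size=(4, 3))
B = rng.normal(size=(5, 3))
n = X.shape[1]

vec = lambda M: M.flatten(order="F")   # column-stacking vectorization

# vec(A X - B) == (I_n kron A) vec(X) - vec(B)
lhs = vec(A @ X - B)
rhs = np.kron(np.eye(n), A) @ vec(X) - vec(B)
print(np.allclose(lhs, rhs))           # True

# Hessian of 0.5 * ||A X - B||_F^2 in vectorized coordinates
K = np.kron(np.eye(n), A)
print(np.allclose(K.T @ K, np.kron(np.eye(n), A.T @ A)))  # True
```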
Yes, the squared norm is better:
$$\|AX-B\|_F^2 = \operatorname{tr}\big((AX-B)^T(AX-B)\big) = \Big/\text{simplify}\Big/ = \operatorname{tr}\big(X^TA^TAX\big) + \text{linear and constant terms}$$
Now you should see what the Hessian is. If you still don't, you can check out Hessian matrix – use in optimization.
For a linear problem the Hessian sits directly in the second-order term; for a non-linear problem solved by a trust-region approach it is the matrix in the second-order term of the Taylor expansion around the trust-region center.
answered Jul 30 at 18:22 by mathreadler, edited Aug 1 at 9:28
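To make the "read it off the second-order term" point concrete, here is a short sketch (added for illustration; sizes and data are made up): for the quadratic objective $\|Ax-b\|^2$ the Hessian is $2A^TA$, so a single Newton step from any starting point lands on the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(11, 2))             # illustrative sizes only
b = rng.normal(size=11)

grad = lambda x: 2 * A.T @ (A @ x - b)   # gradient of ||A x - b||^2
H = 2 * A.T @ A                          # Hessian, read off the quadratic term

x0 = rng.normal(size=2)
x1 = x0 - np.linalg.solve(H, grad(x0))   # one Newton step

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x1, x_ls))             # True: the problem is quadratic,
                                         # so Newton solves it in one step
```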
Define a new matrix $P=(AX-B)$ and write the function as
$$f=\|P\|_F^2 = P:P$$
where the colon denotes the trace/Frobenius product, i.e. $\,\,A:B=\operatorname{tr}(A^TB)$.
Find the differential and gradient of $f$:
$$\eqalign{
df &= 2P:dP = 2P:A\,dX = 2A^TP:dX \cr
G &= \frac{\partial f}{\partial X} = 2A^TP \cr
}$$
Now find the differential and gradient of $G$:
$$\eqalign{
dG &= 2A^T\,dP = 2A^TA\,dX = 2A^TA\,\mathcal E:dX \cr
\mathcal H &= \frac{\partial G}{\partial X} = 2A^TA\,\mathcal E \cr
}$$
Note that both $(\mathcal H,\mathcal E)$ are fourth-order tensors, the latter having components
$$\mathcal E_{ijkl} = \delta_{ik} \delta_{jl}$$
So far everyone has answered a modified form of your question by squaring the function.
If you truly need the Hessian of your original function $f=\|P\|_F$, here it is
$$\mathcal H = \frac{A^TA\,\mathcal E}{\|P\|_F} - \frac{(A^TP)\star(A^TP)}{\|P\|_F^3}$$
where $\star$ is the tensor product, i.e.
$$\mathcal M=B\star C \implies \mathcal M_{ijkl} = B_{ij}\,C_{kl}$$
answered Aug 5 at 20:12 by lynn
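A numpy sketch of the fourth-order objects above for the squared case (added for illustration; sizes are arbitrary): in components, $\mathcal H_{ijkl} = 2\,(A^TA)_{ik}\,\delta_{jl}$, which can be built with `einsum` and checked against finite differences of the gradient $G = 2A^T(AX-B)$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 4))             # illustrative sizes only
X = rng.normal(size=(4, 3))
B = rng.normal(size=(5, 3))
p, q = X.shape

G = lambda X: 2 * A.T @ (A @ X - B)     # gradient of ||A X - B||_F^2

# H_ijkl = 2 * (A^T A)_ik * delta_jl
H = 2 * np.einsum("ik,jl->ijkl", A.T @ A, np.eye(q))

# finite-difference check: H_ijkl ~= dG_ij / dX_kl
eps = 1e-6
H_num = np.empty((p, q, p, q))
for k in range(p):
    for l in range(q):
        dX = np.zeros((p, q)); dX[k, l] = eps
        H_num[:, :, k, l] = (G(X + dX) - G(X - dX)) / (2 * eps)

print(np.allclose(H, H_num, atol=1e-4))  # True
```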