Why does the projected gradient method work?
























Consider the problem
\begin{align*}
\min_{x \in \mathbb{R}^n} &\quad f(x) \\
\text{s.t.:} &\quad x \in C,
\end{align*}
where $C$ is a convex set. As $C$ is convex, the projection onto $C$, $P_C$, is well defined for every $x \in \mathbb{R}^n$. The projected gradient method proposes to solve the above optimization problem by taking steps of the form $x_{t+1} = P_C[x_t - \eta \nabla f(x_t)]$.
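To make the iteration concrete, here is a minimal Python sketch of the update $x_{t+1} = P_C[x_t - \eta \nabla f(x_t)]$, assuming for illustration that $C$ is a Euclidean ball (so the projection has a closed form); the objective, step size, and radius below are arbitrary choices, not part of the question.

```python
import numpy as np

def project_ball(x, r=1.0):
    """Euclidean projection onto C = {x : ||x|| <= r}."""
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def projected_gradient(grad_f, x0, eta=0.1, r=1.0, iters=200):
    """Iterate x_{t+1} = P_C[x_t - eta * grad_f(x_t)] with C the r-ball."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project_ball(x - eta * grad_f(x), r)
    return x

# Example: minimize f(x) = ||x - b||^2 over the unit ball, with b outside it.
b = np.array([2.0, 1.0])
grad_f = lambda x: 2.0 * (x - b)
print(projected_gradient(grad_f, x0=np.zeros(2)))   # ~ b/||b|| = P_C[b]
```

For this particular $f$ the constrained minimizer is exactly $P_C[b]$, which makes the result easy to check by hand.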



It is well known that for unconstrained problems the negative gradient is the direction of steepest descent, but why does the method still work after projecting? Namely, is there a result along the following lines?



"Exists $varepsilon > 0$ so that for $0 < delta < varepsilon$, we have $f(P_C[x - deltanabla f(x)]) le f(x)$"



As a bonus question, when there is more than one convex constraint (say $x \in C$ and $x \in D$, with both sets convex), there is an alternating projection method that consists of projecting repeatedly onto $C$ and then onto $D$. The sequence of projections converges to a point in $C \cap D$, if $C \cap D \ne \emptyset$. But that limit is not necessarily the projection of the initial point onto $C \cap D$, so why can alternating projections work with gradient descent?
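For the bonus part, here is a small sketch of alternating projections between two convex sets chosen purely for illustration (the unit ball $C$ and a half-space $D$; neither the sets nor the starting point come from the question). The limit lands in $C \cap D$, but the final check exhibits a feasible point strictly closer to the start, so the limit is not $P_{C\cap D}[x_0]$.

```python
import numpy as np

def proj_ball(x, r=1.0):
    """Projection onto C = {x : ||x|| <= r}."""
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

def proj_halfspace(x, a=0.5):
    """Projection onto D = {x : x[0] >= a}."""
    y = x.copy()
    y[0] = max(y[0], a)
    return y

x0 = np.array([-2.0, 1.0])
x = x0.copy()
for _ in range(50):                      # alternate P_C and P_D
    x = proj_halfspace(proj_ball(x))

print(np.linalg.norm(x) <= 1 + 1e-9, x[0] >= 0.5)   # limit lies in C ∩ D
# ...but it is not P_{C∩D}[x0]: the feasible point (0.5, sqrt(0.75))
# is strictly closer to x0 than the limit is.
cand = np.array([0.5, np.sqrt(0.75)])
print(np.linalg.norm(cand - x0) < np.linalg.norm(x - x0))   # True
```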







asked Jul 19 at 12:14 by gtf, edited Jul 19 at 16:22 by Rodrigo de Azevedo







  • The projected gradient method is a special case of the proximal gradient method, and you can find a convergence proof for the proximal gradient method in many places, for example in Vandenberghe's UCLA 236c course notes.
    – littleO
    Jul 19 at 14:41










  • Do you have an example where the repeated projection does not converge to $P_{C\cap D}[x]$?
    – M. Winter
    Jul 20 at 13:19











  • I think that an example can be a square inside a circle in $\mathbb{R}^2$. Supposing the square's edges are aligned with the axes, the projection of a point onto the square is found by intersecting the vertical or horizontal line through the point with the square, if they intersect, or is the nearest corner of the square, if they don't. But the projection onto the circle may not fall on the same vertical or horizontal line, while the intersection of both sets is just the square, so composing the two projections need not give the projection onto the intersection.
    – gtf
    Jul 20 at 13:56
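A quick numerical check of the square-inside-a-circle example from the last comment (the concrete square $[-1,1]^2$, circle of radius $2$, and test point are choices made here for illustration, not taken from the thread):

```python
import numpy as np

def proj_square(x):            # projection onto C = [-1, 1]^2
    return np.clip(x, -1.0, 1.0)

def proj_circle(x, r=2.0):     # projection onto D = disc of radius 2
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

x0 = np.array([3.0, 0.5])
via_circle = proj_square(proj_circle(x0))   # project onto D, then onto C
direct     = proj_square(x0)                # P_{C∩D}[x0], since C∩D = C here

print(via_circle)   # ~ [1.0, 0.329]
print(direct)       # [1.0, 0.5]  -- different, as the comment predicts
```

Since the square lies inside the circle, $C \cap D$ is the square itself, so the direct clip is the true projection onto the intersection, and the composed projections land somewhere else.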












2 Answers






I think a rigorous proof of convergence is not so easy but can certainly be found in several books on the topic. I will give you a heuristic approach to see the plausibility of the algorithm. I concentrate on the first part of your question for now.



Heuristic arguments



  • While moving in the direction of the negative gradient $-\nabla f(x)$ leads to descent, this is not the only direction with this property. Actually, for every direction $d\in\Bbb R^n$ with $\langle d,\nabla f(x)\rangle < 0$ there is some $\epsilon > 0$ so that
    $$f(x+\eta d)< f(x)\qquad\text{for all } 0<\eta<\epsilon,$$
    which follows from the first-order expansion $f(x+\eta d)=f(x)+\eta\langle d,\nabla f(x)\rangle+o(\eta)$. However, if $\langle d,\nabla f(x)\rangle\ge 0$, then this direction leads to ascent (or at least no descent) to first order.



  • I assume by projection you mean
    $$P_C[x] := \mathop{\mathrm{argmin}}\limits_{c\in C}\, d(x,c).$$
    There is an insightful way to find this projection: every $x\in\Bbb R^n$ can be written as $x=c+v$ in a unique way, where $c\in C$ and $v\in N_C(c)$ is from the normal cone (also called polar cone) of $C$ at $c$. We then have $P_C[x]=c$. Or to put it another way:
    $$P_C[x]=c\quad\Longleftrightarrow\quad x-c\in N_C(c).$$



  • We now show that if $x_{t+1}=P_C[x_t-\eta \nabla f(x_t)]$ is different from the current point $x_t$, then it lies in a direction of descent. In other words: $x_{t+1}-x_t$ is a direction of descent, or
    $$\langle x_{t+1}-x_t,\nabla f(x_t)\rangle< 0.$$
    This is very evident from a picture, but it's a bit messy to prove. Also, this does not show that $f(x_{t+1})<f(x_t)$, but it makes it plausible that this will happen if $\eta$ is sufficiently small (a numerical sanity check is sketched after this list). So let's assume the opposite, that $x_{t+1}$ lies in a direction of ascent:
    $$\langle x_{t+1}-x_t,\nabla f(x_t)\rangle\ge 0\qquad\implies\qquad (*)\;\; \langle x_t-x_{t+1},\eta\nabla f(x_t)\rangle\le 0,$$
    where introducing the factor $\eta>0$ does not hurt the inequality. Because the unconstrained gradient descent step $x_t - \eta\nabla f(x_t)$ projects to $x_{t+1}$, it can be written as $x_{t+1}+v$ for some $v\in N_C(x_{t+1})$. Since $v$ is in the normal cone, this gives
    $$\langle c-x_{t+1},v\rangle\le 0\;\text{ for all } c\in C\qquad\implies\qquad (**)\;\; \langle x_t-x_{t+1},v\rangle\le 0,$$
    using that $x_t\in C$. By adding $(*)$ and $(**)$ we obtain
    $$(\times)\;\; \langle x_t-x_{t+1},\,v+\eta\nabla f(x_t)\rangle \le 0.$$
    But actually both factors of this inner product are equal, as is evident from
    $$x_t-\eta\nabla f(x_t) = x_{t+1}+v\qquad\implies\qquad x_t-x_{t+1} = v+\eta\nabla f(x_t).$$
    So $(\times)$ gives us $\|x_t-x_{t+1}\|^2\le 0\implies x_t=x_{t+1}$. Thus, the only way to not make a move in a descent direction is to not move at all.



  • It remains to show that not moving only happens if there is no descent direction available, hence that we are in a local minimum. If $x_t-\eta\nabla f(x_t)$ projects to $x_t$ again, then this means
    \begin{align}
    x_t-\eta\nabla f(x_t)-x_t\in N_C(x_t) \quad
    &\implies\quad -\nabla f(x_t)\in N_C(x_t) \\
    &\implies\quad \langle c-x_t,\nabla f(x_t)\rangle \ge 0\;\text{ for all } c\in C,
    \end{align}
    where the first implication uses that $N_C(x_t)$ is a cone, so the factor $\eta>0$ can be dropped. The last line means that moving in the direction of any point $c\in C$ (and in a convex set these are all the available directions) will lead to ascent, or at least not to first-order descent, hence that we are indeed in a local minimum of the constrained problem.
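As a sanity check of the bullets above, here is a minimal numerical sketch (not part of the original answer; the quadratic objective, box constraint, step size, and sampling are arbitrary choices). It takes one projected gradient step on a quadratic over a box and verifies the normal-cone characterization of the projection, the descent-direction claim, and the first-order optimality condition at the limit point, tested on randomly sampled $c \in C$.

```python
import numpy as np

rng = np.random.default_rng(0)

# C = [0, 1]^2 (a box), f(x) = ||x - b||^2 with b outside the box.
b = np.array([2.0, 0.5])
f = lambda x: np.sum((x - b) ** 2)
grad_f = lambda x: 2.0 * (x - b)
proj = lambda x: np.clip(x, 0.0, 1.0)       # projection onto C

eta = 0.2
x = np.array([0.9, 0.9])                     # current (feasible) iterate
y = x - eta * grad_f(x)                      # unconstrained gradient step
x_next = proj(y)                             # projected gradient step
v = y - x_next                               # should lie in N_C(x_next)

C_samples = rng.random((1000, 2))            # random points of C

# (i) normal cone: <c - x_next, v> <= 0 for sampled c in C
print(np.all(C_samples @ v - x_next @ v <= 1e-12))

# (ii) x_next - x is a descent direction, and f decreases for this eta
print(np.dot(x_next - x, grad_f(x)) < 0, f(x_next) < f(x))

# (iii) run to (approximate) convergence and test first-order optimality
for _ in range(500):
    x = proj(x - eta * grad_f(x))
print(x)                                      # ~ [1.0, 0.5], the constrained minimizer
print(np.all((C_samples - x) @ grad_f(x) >= -1e-9))
```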





– M. Winter, answered Jul 20 at 11:23, edited Jul 20 at 13:16 (accepted answer)
There is a formal way to prove it; you can look at many convex optimization course notes to see the full proof.

In my intuition, the reason it works is that projected gradient descent is actually an alternating projection scheme.

This becomes clear when you view it through the proximal gradient method (framework); see the sketch below.
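To spell out the proximal-gradient view mentioned above: with $g = \iota_C$, the indicator function of $C$, the proximal operator of $g$ is exactly the projection $P_C$, so the proximal gradient iteration $x_{t+1} = \operatorname{prox}_{\eta g}(x_t - \eta\nabla f(x_t))$ reduces to projected gradient descent. A minimal sketch (the quadratic, box, and step size are illustrative choices, not from the answer):

```python
import numpy as np

# Proximal gradient step: x+ = prox_{eta*g}(x - eta * grad_f(x)).
# When g is the indicator function of a convex set C, prox_{eta*g} = P_C,
# so the iteration below is exactly projected gradient descent.

def prox_indicator_box(x, lo=0.0, hi=1.0):
    # prox of the indicator of the box [lo, hi]^n = Euclidean projection onto it
    return np.clip(x, lo, hi)

def proximal_gradient(grad_f, prox_g, x0, eta=0.2, iters=300):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox_g(x - eta * grad_f(x))
    return x

b = np.array([2.0, 0.5])
grad_f = lambda x: 2.0 * (x - b)             # f(x) = ||x - b||^2
x_star = proximal_gradient(grad_f, prox_indicator_box, np.zeros(2))
print(x_star)                                 # ~ [1.0, 0.5], the minimizer over the box
```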






– Royi, answered Jul 19 at 14:30