How to improve the numerical stability of the inverse rank-one Cholesky update?

I am trying to use the inverse Cholesky update from page 10 of the paper "Efficient covariance matrix update for variable metric evolution strategies" as part of the optimization step in a neural network, and I am struggling significantly because it is so unstable. There is nothing wrong with its logic, but I've found that it requires really low learning rates $\beta$, and even then it works quite poorly. The full reasons for that are unknown to me, but there is some indication that the expression as originally shown is quite numerically unstable. I tend to set $\alpha = 1 - \beta$.



$$
A^{-1}_{t+1} = \frac{1}{\sqrt{\alpha}} A^{-1}_t - \frac{1}{\sqrt{\alpha}\left\|z_t\right\|^2}\left(1 - \frac{1}{\sqrt{1 + \frac{\beta}{\alpha}\left\|z_t\right\|^2}}\right) z_t \left[z^T_t A^{-1}_t\right]
$$
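
For concreteness, here is a minimal NumPy sketch of the update exactly as written. The names are my own, and I take $z_t = A^{-1}_t v_t$ for an update direction $v_t$, following the paper's setup; this is how I read the formula, not code from the paper:

```python
import numpy as np

def inv_chol_update_naive(A_inv, v, beta):
    """Direct transcription of the update above.

    A_inv : current inverse Cholesky factor A_t^{-1}
    v     : update direction, with z_t = A_t^{-1} v
    beta  : learning rate; I use alpha = 1 - beta
    """
    alpha = 1.0 - beta
    z = A_inv @ v                      # z_t
    z2 = z @ z                         # ||z_t||^2
    # The scalar below is where things go wrong: for tiny beta * ||z||^2 the
    # sqrt argument is 1 plus something below machine eps, so 1 - 1/sqrt(...)
    # cancels to exactly zero.
    factor = (1.0 - 1.0 / np.sqrt(1.0 + (beta / alpha) * z2)) / (np.sqrt(alpha) * z2)
    return A_inv / np.sqrt(alpha) - factor * np.outer(z, z @ A_inv)
```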



By distributing the $\sqrt{\alpha}$, I think I've found the first place where this expression can be improved.



$$
A^{-1}_{t+1} = \frac{1}{\sqrt{\alpha}} A^{-1}_t - \frac{1}{\left\|z_t\right\|^2}\left(\frac{1}{\sqrt{\alpha}} - \frac{1}{\sqrt{\alpha + \beta\left\|z_t\right\|^2}}\right) z_t \left[z^T_t A^{-1}_t\right]
$$
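
In code the only thing that changes is how the scalar coefficient is computed; a sketch under the same assumptions as the one above:

```python
import numpy as np

def inv_chol_update_distributed(A_inv, v, beta):
    """Same update, with sqrt(alpha) distributed into the bracket."""
    alpha = 1.0 - beta
    z = A_inv @ v                      # z_t
    z2 = z @ z                         # ||z_t||^2
    # (1/sqrt(alpha) - 1/sqrt(alpha + beta * ||z||^2)) / ||z||^2
    factor = (1.0 / np.sqrt(alpha) - 1.0 / np.sqrt(alpha + beta * z2)) / z2
    return A_inv / np.sqrt(alpha) - factor * np.outer(z, z @ A_inv)
```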



I've yet to test this, but I have reason to expect it will be better. While testing the back-whitening in the last layer, I had a situation where the inverse Cholesky factor was not updating at all for some reason. Looking into it, the squared L2 norm $\left\|z_t\right\|^2$ was around $10^{-3}$, while the learning rate $\beta$ was really low, around $10^{-5}$, due to higher ones diverging. That makes $\frac{\beta}{\alpha}\left\|z_t\right\|^2 \approx 10^{-8}$, so what happened was that $1 - \frac{1}{\sqrt{1 + \frac{\beta}{\alpha}\left\|z_t\right\|^2}}$ always evaluated to zero and no updates ever took place, because $1 + 10^{-8} = 1$ with float32 numbers.
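
The failure is easy to reproduce in isolation with those numbers:

```python
import numpy as np

beta  = np.float32(1e-5)
z2    = np.float32(1e-3)                # ||z_t||^2
alpha = np.float32(1.0) - beta

x = (beta / alpha) * z2                  # ~1e-8, well below float32 eps (~1.19e-7)
print(np.float32(1.0) + x == np.float32(1.0))    # True: 1 + x rounds back to 1
print(1.0 - 1.0 / np.sqrt(np.float32(1.0) + x))  # 0.0 -> the update term vanishes
```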



Distributing the $\sqrt{\alpha}$ definitely feels right here, but I am hardly an expert in numerical optimization and am just going off my intuition as a programmer.



Are there any more moves I could take here to make the expression behave better?







asked Jul 15 at 16:38, edited Jul 16 at 12:43, by Marko Grdinic