What's wrong with my derivation of the gradient of KL divergence?

I'm reading the paper Visualizing Data using t-SNE, and I'm stuck on the gradient of the KL divergence.



In the paper, the similarity between datapoints, $p_{j|i}$, is defined as

$$p_{j|i}=\frac{\exp\left(-\left\|x_i-x_j\right\|^2/2\sigma_i^2\right)}{\sum_{k\ne i}\exp\left(-\left\|x_i-x_k\right\|^2/2\sigma_i^2\right)}.$$



The similarity between map points is given by

$$q_{j|i}=\frac{\exp\left(-\left\|y_i-y_j\right\|^2\right)}{\sum_{k\ne i}\exp\left(-\left\|y_i-y_k\right\|^2\right)},$$



and finally the cost function, which is the sum of KL divergences over all datapoints, is defined by
$$C=\sum_i KL(P_i\,\|\,Q_i)=\sum_i\sum_j p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}.$$
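To make the notation concrete, here is a small NumPy sketch that computes $p_{j|i}$, $q_{j|i}$ and $C$ for a handful of random points (the toy data, the fixed $\sigma_i=1$, and helper names such as `conditional` are my own choices for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_hi, d_lo = 6, 5, 2
X = rng.normal(size=(n, d_hi))   # datapoints x_i
Y = rng.normal(size=(n, d_lo))   # map points y_i
sigma = np.ones(n)               # fixed sigma_i, just to keep the example small

def sq_dists(Z):
    """Pairwise squared Euclidean distances ||z_i - z_j||^2."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(-1)

def conditional(dist2, scale):
    """Entry (i, j) is exp(-d_ij / scale_i) / sum_{k != i} exp(-d_ik / scale_i)."""
    W = np.exp(-dist2 / scale[:, None])
    np.fill_diagonal(W, 0.0)              # exclude k = i from the normalisation
    return W / W.sum(axis=1, keepdims=True)

P = conditional(sq_dists(X), 2 * sigma ** 2)      # P[i, j] = p_{j|i}

def Q_of(Y):
    return conditional(sq_dists(Y), np.ones(n))   # Q[i, j] = q_{j|i}

def cost(Y):
    """C = sum_i KL(P_i || Q_i), skipping the zero diagonal."""
    Q = Q_of(Y)
    mask = ~np.eye(n, dtype=bool)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```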



Then it says the gradient is



$$\frac{\partial C}{\partial y_i}=2\sum_j\left(p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j}\right)(y_i-y_j).$$



However, I derive a different result



$$\begin{align}
\frac{\partial C}{\partial y_i}&=-\sum_j\left[p_{j|i}\frac{1}{q_{j|i}}\nabla_{y_i}q_{j|i}+p_{i|j}\frac{1}{q_{i|j}}\nabla_{y_i}q_{i|j}\right]
\end{align}$$
and
$$\begin{align}
\nabla_{y_i}q_{j|i}&=q_{j|i}\left(2(y_j-y_i)-2\sum_{k\ne i}q_{k|i}(y_k-y_i)\right)\\
\nabla_{y_i}q_{i|j}&=q_{i|j}\left(2(y_j-y_i)-2q_{i|j}(y_j-y_i)\right)
\end{align}$$
so, using $\sum_j p_{j|i}=1$,
$$\frac{\partial C}{\partial y_i}=2\sum_j(y_i-y_j)\left(p_{j|i}+p_{i|j}-q_{j|i}-p_{i|j}q_{i|j}\right).$$
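To compare the two expressions concretely, building on the sketch above, one can check both closed-form gradients against a central finite-difference gradient of $C$ (again only a sanity-check sketch; the function names are mine):

```python
def num_grad(Y, i, eps=1e-6):
    """Central finite-difference gradient of C with respect to y_i."""
    g = np.zeros(Y.shape[1])
    for a in range(Y.shape[1]):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, a] += eps
        Ym[i, a] -= eps
        g[a] = (cost(Yp) - cost(Ym)) / (2 * eps)
    return g

def paper_grad(Y, i):
    """Paper: 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    Q = Q_of(Y)
    coef = P[i] - Q[i] + P[:, i] - Q[:, i]
    return 2 * (coef[:, None] * (Y[i] - Y)).sum(axis=0)

def my_grad(Y, i):
    """My result: 2 * sum_j (p_{j|i} + p_{i|j} - q_{j|i} - p_{i|j} q_{i|j}) (y_i - y_j)."""
    Q = Q_of(Y)
    coef = P[i] + P[:, i] - Q[i] - P[:, i] * Q[:, i]
    return 2 * (coef[:, None] * (Y[i] - Y)).sum(axis=0)

i = 0
print("finite difference:", num_grad(Y, i))
print("paper's formula  :", paper_grad(Y, i))
print("my formula       :", my_grad(Y, i))
```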



I've checked my derivation several times, but I still have no clue where the error is. Could someone help me check it? I'd be very grateful. Thanks.
asked Jul 16 at 4:56 by Sherwin Chen, edited Jul 17 at 3:01