What happens when logistic regression does not quite capture the data?

I have modeled the probability of an aggressive (vs. indolent) form of recurrent respiratory papillomatosis as a function of age at diagnosis. Generally speaking, those diagnosed before the age of 5 have an 80% probability of running an aggressive course. Those diagnosed after the age of 10 have about a 30% chance. Between 5 and 10 years it is somewhere in between. Within each of the three age groups there does not seem to be a correlation with age.

Look at the curve that logistic regression wants to go with (open circles), and compare it with my manually drawn line (dotted line), which seems to describe the data better. The x-axis is the log of diagnostic age; the y-axis is the probability of aggressive disease. How do I model the dotted line? I thought of using my own logistic function, but I do not know how to make R find the best parameters.

Am I missing something in my understanding of the mathematics of the two graphs? How do I operationalize this in R? Or perhaps I am looking for the dashed green line, although I cannot believe the dashed line is correct: biologically speaking, there is little reason to imagine that the risk of someone diagnosed at age 9.9 is very different from that of someone diagnosed at age 10.1.



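[Figure: probability of aggressive disease (y-axis) against log of diagnostic age (x-axis), showing the logistic regression fit (open circles), the manually drawn dotted line, and the dashed green line.]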


















asked 2 days ago









Farrel





migrated from stats.stackexchange.com 2 days ago


This question came from our site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.






Logistic regression assumes that the probability eventually goes to 0 or 1 for sufficiently large/small values of the predictor. Yes, you probably need to fit your own custom logistic model. "How do I operationalize this in R" is off-topic for this site ... but you could ask on StackOverflow and/or wait for this to be migrated ...
– Ben Bolker
2 days ago














2 Answers
I agree that discontinuous or step functions typically make little ecological sense. Then again, your dotted line probably doesn't, either. If we can agree that the level won't make any discontinuous jumps (as in your green dashed line), then why should the regression coefficient of the response to age make discontinuous jumps to yield the "kinks" in your green line?

You could consider transforming your age using splines to model nonlinearities. Just make sure that you don't overfit. Logistic regression will never yield a perfect fit, so don't search for one.
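As a rough illustration of the spline suggestion, the sketch below fits an ordinary logistic regression and a natural-cubic-spline version to simulated data and compares them by AIC. The data, the variable names `dxage` and `aggressive`, and the choice of 3 degrees of freedom are all assumptions made for the example, not values from the question.

```r
library(splines)  # ships with base R

set.seed(1)

# Simulated stand-in for the real data (hypothetical): age at diagnosis in
# years and a 0/1 indicator of aggressive disease with plateaus near 0.8 and 0.3.
n <- 400
dxage <- runif(n, 1, 40)
p_true <- 0.3 + 0.5 / (1 + exp(4 * (log(dxage) - log(7))))
dat <- data.frame(dxage = dxage, aggressive = rbinom(n, 1, p_true))

# Ordinary logistic regression on log(age) versus a natural cubic spline
# (3 degrees of freedom, an arbitrary choice) on the same scale.
fit_linear <- glm(aggressive ~ log(dxage), family = binomial, data = dat)
fit_spline <- glm(aggressive ~ ns(log(dxage), df = 3),
                  family = binomial, data = dat)

# A lower AIC favours the spline only if its flexibility pays for the
# extra parameters, which is one guard against overfitting.
aic_table <- AIC(fit_linear, fit_spline)
print(aic_table)

# Predicted probabilities on a grid of ages, for plotting against the data.
grid <- data.frame(dxage = exp(seq(log(1), log(40), length.out = 100)))
grid$p_spline <- predict(fit_spline, newdata = grid, type = "response")
```

Keeping the degrees of freedom small limits how much the curve can wiggle, and the fitted curve stays smooth, so it has neither jumps nor kinks.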


















        answered 2 days ago









        Stephan Kolassa























The "standard" logistic function $\frac{1}{1+e^{-x}}$ approaches 0 as $x \to -\infty$ and 1 as $x \to +\infty$. This is not a great match for your data, which doesn't seem to approach either of those values, but instead approaches 0.8 on the left and 0.3 on the right.

You may want to add scale and offset parameters so that you can squash and shift that curve into that range. My guess is that, despite the extra parameters, the model will fit better (via AIC, etc.) and will end up resembling your dashed line.

Edit: You're on the right track. The next step would be to replace the hard-coded values of 0.5 and 0.3 with parameters to be fit. Your model would look something like

    dxage ~ gain * 1/(1 + exp(-tau*(x-shift))) + offset

You would then fit this with nls: simply pass in the formula (above) and the data. If you have reasonable guesses for the starting values (which you do here), providing them can help speed convergence.
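A minimal sketch of that `nls` call might look like the following. It runs on simulated data, and the names `aggressive` (0/1 outcome) and `x` (log age at diagnosis) are assumptions for the example; here the 0/1 outcome is placed on the left-hand side of the formula. The starting values are simply the plateaus and midpoint one would read off the plot.

```r
set.seed(42)

# Simulated stand-in for the real data (hypothetical): x is log age at
# diagnosis, aggressive is a 0/1 outcome with plateaus near 0.8 and 0.3.
n <- 2000
x <- runif(n, 0, log(40))
p_true <- 0.5 / (1 + exp(4 * (x - log(7)))) + 0.3
dat <- data.frame(x = x, aggressive = rbinom(n, 1, p_true))

# Scaled and shifted logistic curve fitted by nonlinear least squares.
# gain, tau, shift and offset are the parameters named in the start list.
fit_nls <- nls(
  aggressive ~ gain / (1 + exp(-tau * (x - shift))) + offset,
  data    = dat,
  start   = list(gain = 0.5, tau = -3, shift = log(7), offset = 0.3),
  control = nls.control(maxiter = 200, warnOnly = TRUE)
)

print(coef(fit_nls))

# Fitted probabilities for diagnosis at ages 2 and 30.
p_hat <- predict(fit_nls, newdata = data.frame(x = log(c(2, 30))))
```

Note that `nls` minimises squared error, so it treats the 0/1 outcome as if it had constant variance. That is usually fine for locating the curve, but its AIC is not on the same likelihood scale as the AIC of a binomial `glm`.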















            edited yesterday


























            answered 2 days ago









            Matt Krause












• By changing the numerator from 1 to 0.5, the maximum becomes 0.5; by adding +0.3 to the equation, the lowest value is lifted from 0 to 0.3. I guess that is what you mean by scale and offset parameters. I already have a logistic regression in which I just used a trichotomous classification of age at diagnosis. Yes, AIC was lower. Nevertheless, I want to get R to find the best numerator, the best constant, and all the other parameters. I wanted to check that what I was doing was statistically sensible. Perhaps, to operationalize it, I will follow the advice from @benb and ask on Stack Overflow.
  – Farrel
  2 days ago
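One way to let R find the numerator, the constant, and the other parameters while staying within a binomial likelihood is to maximise the likelihood directly with `optim`. The sketch below is only an illustration on simulated data: the parameterisation (plateaus estimated on the logit scale so they stay between 0 and 1), the function names `curve_prob` and `negloglik`, and the starting values are all assumptions for the example.

```r
set.seed(7)

# Simulated stand-in for the real data (hypothetical): x is log age at
# diagnosis, y is a 0/1 outcome with plateaus near 0.8 and 0.3.
n <- 2000
x <- runif(n, 0, log(40))
y <- rbinom(n, 1, 0.3 + 0.5 / (1 + exp(4 * (x - log(7)))))

# Logistic curve with free lower and upper plateaus; both are kept inside
# (0, 1) by estimating them on the logit scale.
curve_prob <- function(par, x) {
  lower <- plogis(par[1])
  upper <- lower + (1 - lower) * plogis(par[2])
  lower + (upper - lower) * plogis(par[3] * (x - par[4]))
}

# Negative Bernoulli log-likelihood of the 0/1 outcomes.
negloglik <- function(par, x, y) {
  -sum(dbinom(y, size = 1, prob = curve_prob(par, x), log = TRUE))
}

# Start from plateaus of 0.3 and 0.8, a negative slope, and a midpoint at age 7.
start <- c(qlogis(0.3), qlogis(5 / 7), -3, log(7))
fit_ml <- optim(start, negloglik, x = x, y = y, control = list(maxit = 5000))

lower_hat <- plogis(fit_ml$par[1])
upper_hat <- lower_hat + (1 - lower_hat) * plogis(fit_ml$par[2])

# Both AIC values are based on the same Bernoulli likelihood.
fit_glm <- glm(y ~ x, family = binomial)
aic_custom <- 2 * fit_ml$value + 2 * length(fit_ml$par)
aic_glm <- AIC(fit_glm)
print(c(lower = lower_hat, upper = upper_hat,
        aic_custom = aic_custom, aic_glm = aic_glm))
```

Because both models use the same likelihood, their AIC values can be compared directly, which makes this a natural check on whether the extra plateau parameters are worth having.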















