What happens when logistic regression does not quite capture the data?
I have modeled the probability of an aggressive (vs. indolent) form of recurrent respiratory papillomatosis as a function of age at diagnosis. Generally speaking, those diagnosed before the age of 5 have an 80% probability of running an aggressive course. Those diagnosed after the age of 10 have about a 30% chance. Between 5 and 10 years it is somewhere in between. Within each of the three age groups there does not seem to be a correlation with age.
Look at the curve that logistic regression wants to fit (open circles), and then at my manual line (dotted line), which seems to describe what is going on better. The x-axis is the log of age at diagnosis. The y-axis is the probability of aggressive disease. How do I model the dotted line? I thought of using my own logistic function, but I do not know how to make R find the best parameters.
Am I missing something in my understanding of the mathematics of the two graphs?
How do I operationalize this in R? Or perhaps I am looking for the dashed green line. I cannot believe the dashed line is correct, though. Biologically speaking, there is little reason to imagine that the risk for someone diagnosed at age 9.9 is very different from that of someone diagnosed at age 10.1 years.
r
migrated from stats.stackexchange.com 2 days ago
This question came from our site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
asked 2 days ago
Farrel
Logistic regression assumes that the probability eventually goes to 0 or 1 for sufficiently large/small values of the predictor. Yes, you probably need to fit your own custom logistic model. "How do I operationalize this in R" is off-topic for this site ... but you could ask on StackOverflow and/or wait for this to be migrated ...
– Ben Bolker
2 days ago
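One way to read "fit your own custom logistic model" is to write down a logistic-shaped curve with free lower and upper plateaus and estimate it by maximum likelihood. The following is only a minimal sketch on simulated data: the names `log_age`, `aggressive`, `curve_p` and `negloglik`, the parameterisation, and the starting values are all assumptions made for illustration, not anything taken from the question's data.

```r
# Simulated stand-in data; 'log_age' and 'aggressive' are hypothetical names.
set.seed(1)
n <- 500
log_age <- runif(n, log(1), log(40))
p_true <- 0.3 + 0.5 / (1 + exp(4 * (log_age - log(7))))
aggressive <- rbinom(n, 1, p_true)

# Logistic-shaped curve that falls from 'upper' to 'lower' when par[3] > 0.
# par[1], par[2]: plateaus on a logit scale, so both stay between 0 and 1
# par[3]: steepness; par[4]: midpoint on the log-age axis
curve_p <- function(par, x) {
  lower <- plogis(par[1])
  upper <- lower + (1 - lower) * plogis(par[2])
  lower + (upper - lower) / (1 + exp(par[3] * (x - par[4])))
}

# Negative Bernoulli log-likelihood; probabilities are clamped away from
# 0 and 1 so the log-likelihood stays finite during optimisation.
negloglik <- function(par, x, y) {
  p <- pmin(pmax(curve_p(par, x), 1e-12), 1 - 1e-12)
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

# Rough starting guesses: lower plateau 0.3, upper plateau about 0.8,
# transition near age 7.
start <- c(qlogis(0.3), qlogis(0.7), 2, log(7))
fit <- optim(start, negloglik, x = log_age, y = aggressive, method = "BFGS")

# Back-transform the fitted plateaus to the probability scale.
lower_hat <- plogis(fit$par[1])
upper_hat <- lower_hat + (1 - lower_hat) * plogis(fit$par[2])
c(lower = lower_hat, upper = upper_hat,
  steepness = fit$par[3], midpoint_age = exp(fit$par[4]))
```

Estimating the plateaus on a logit scale is just one convenient way to keep them inside (0, 1); box constraints via `method = "L-BFGS-B"` would be another option.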
2 Answers
I agree that discontinuous or step functions typically make little ecological sense. Then again, probably your dotted line doesn't, either. If we can agree that the level won't make any discontinuous jumps (as in your green dashed line), then why should the regression coefficient of the response to age make discontinuous jumps to yield the "kinks" in your green line?
You could consider transforming your age using splines to model nonlinearities. Just make sure that you don't overfit. Logistic regression will never yield a perfect fit, so don't search for one.
answered 2 days ago
Stephan Kolassa
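The spline suggestion above could look roughly like the sketch below. It uses simulated data, and the names `dat`, `log_age` and `aggressive`, as well as the choice of `df = 3`, are assumptions for illustration only. A small `df` is one simple guard against overfitting; comparing AIC against the plain logistic fit is another.

```r
library(splines)  # ships with base R; provides ns()

# Simulated stand-in data; column names are hypothetical.
set.seed(1)
n <- 500
log_age <- runif(n, log(1), log(40))
aggressive <- rbinom(n, 1, 0.3 + 0.5 / (1 + exp(4 * (log_age - log(7)))))
dat <- data.frame(log_age, aggressive)

# Ordinary logistic regression: linear in log age on the logit scale.
fit_linear <- glm(aggressive ~ log_age, family = binomial, data = dat)

# Same model, but with a natural cubic spline of log age (3 df).
fit_spline <- glm(aggressive ~ ns(log_age, df = 3), family = binomial, data = dat)

# Compare the two fits; lower AIC is better.
aic_table <- AIC(fit_linear, fit_spline)
aic_table

# Fitted probabilities on a grid, e.g. for overlaying on a plot.
grid <- data.frame(log_age = seq(min(dat$log_age), max(dat$log_age),
                                 length.out = 100))
grid$p_hat <- predict(fit_spline, newdata = grid, type = "response")
head(grid)
```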
The "standard" logistic function $\frac{1}{1+e^{-x}}$ approaches 0 as $x \to -\infty$ and 1 as $x \to +\infty$. This is not a great match for your data, which doesn't seem to approach either of those values, but instead approaches 0.8 from the left and 0.3 from the right.
You may want to add scale and offset parameters so that you can squash and shift that curve into that range. My guess is that, despite the extra parameters, the model will fit better (as judged by AIC, etc.) and will end up resembling your dashed line.
Edit: You're on the right track. The next step would be to replace the hard-coded values of 0.5 and 0.3 with parameters to be fit. Your model would look something like
dxage~gain * 1/(1 + exp(-tau*(x-shift))) + offset
You would then fit this with nls: simply pass in the formula (above) and the data. If you have reasonable guesses for the starting values (which you do here), providing them can help speed up convergence.
edited yesterday
answered 2 days ago
Matt Krause
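A call to `nls` along the lines described in this answer might look like the sketch below. The data are simulated, `aggressive` (a 0/1 outcome) and `x` (log age at diagnosis) are hypothetical names, and the starting values are simply the plateaus quoted in the question. Note that `nls` minimises squared error, which for a binary outcome is not the same as maximising the binomial likelihood, so this is best treated as an approximation.

```r
# Simulated stand-in data; 'x' is log age at diagnosis, 'aggressive' is 0/1.
set.seed(1)
n <- 2000
x <- runif(n, log(1), log(40))
aggressive <- rbinom(n, 1, 0.3 + 0.5 / (1 + exp(4 * (x - log(7)))))
dat <- data.frame(x, aggressive)

# Scaled and shifted logistic: 'offset' is the lower plateau, 'gain' the
# height above it, 'shift' the midpoint, 'tau' the steepness
# (negative here because the curve falls with age).
fit <- nls(aggressive ~ gain * 1 / (1 + exp(-tau * (x - shift))) + offset,
           data = dat,
           start = list(gain = 0.5, tau = -4, shift = log(7), offset = 0.3))

coef(fit)

# Implied plateaus: 'offset' (older ages) and 'offset + gain' (younger ages).
plateaus <- c(lower = unname(coef(fit)["offset"]),
              upper = unname(coef(fit)["offset"] + coef(fit)["gain"]))
plateaus
```

If `nls` reports a singular gradient or fails to converge on real data, trying different starting values or fixing one of the plateaus are common first things to check.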
Changing the numerator from 1 to 0.5 makes the height of the curve 0.5, and adding +0.3 to the equation lifts the lowest value from 0 to 0.3 (and so the highest to 0.8). I guess that is what you mean by scale and offset parameters. I already have a logistic regression in which I just used a trichotomous age-of-diagnosis classification. Yes, AIC was lower. Nevertheless, I want to get R to find the best numerator, the best constant, and all the other parameters. I wanted to check that what I was doing was statistically sensible. Perhaps to operationalize it I will follow the advice from @benb and ask on Stack Overflow.
– Farrel
2 days ago