What happens when logistic regression does not quite capture the data?

I have modeled the probability of an aggressive (vs. indolent) form of recurrent respiratory papillomatosis as a function of age at diagnosis. Generally speaking, those diagnosed before the age of 5 have an 80% probability of running an aggressive course. Those diagnosed after the age of 10 have about a 30% chance. Between 5 and 10 years it is somewhere in between. Within each of the three age groups there does not seem to be a correlation with age.

Look at the curve that logistic regression wants to fit (open circles), and compare it with my manual line (dotted), which seems to describe what is going on better. The x-axis is the log of diagnostic age; the y-axis is the probability of aggressive disease. How do I model the dotted line? I thought of using my own logistic function, but I do not know how to make R find the best parameters.

Am I missing something in my understanding of the mathematics of the two graphs?
How do I operationalize this in R? Or perhaps I am looking for the dashed green line. I cannot believe the dashed line is correct: biologically speaking, there is little reason to imagine that the risk of someone diagnosed at age 9.9 is very different from that of someone diagnosed at age 10.1.

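[Figure: probability of aggressive disease against log of diagnostic age, showing the logistic regression fit (open circles), the manual dotted line, and the dashed green line.]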

asked 2 days ago

Farrel


migrated from stats.stackexchange.com 2 days ago

This question came from our site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.

Logistic regression assumes that the probability eventually goes to 0 or 1 for sufficiently large/small values of the predictor. Yes, you probably need to fit your own custom logistic model. "How do I operationalize this in R" is off-topic for this site ... but you could ask on StackOverflow and/or wait for this to be migrated ...
– Ben Bolker
2 days ago

2 Answers

I agree that discontinuous or step functions typically make little ecological sense. Then again, probably your dotted line doesn't, either. If we can agree that the level won't make any discontinuous jumps (as in your green dashed line), then why should the regression coefficient of the response to age make discontinuous jumps to yield the "kinks" in your green line?

You could consider transforming your age using splines to model nonlinearities. Just make sure that you don't overfit. Logistic regression will never yield a perfect fit, so don't search for one.
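As a rough sketch of the spline idea, the following fits an ordinary logistic regression and a natural-spline version with `glm()` and `splines::ns()`. The data are simulated stand-ins, the column names `log_age` and `aggressive` are hypothetical, and `df = 3` is an arbitrary choice that you would want to justify (for example by AIC or cross-validation) rather than copy.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(1)

# Simulated stand-in data (NOT the asker's data): risk near 0.8 at young
# ages, near 0.3 at older ages, with a smooth transition in between.
n <- 400
log_age <- runif(n, 0, 4)
p_true <- 0.3 + 0.5 / (1 + exp(5 * (log_age - log(7))))
d <- data.frame(log_age = log_age,
                aggressive = rbinom(n, 1, p_true))

# Ordinary logistic regression: linear in log(age) on the logit scale.
fit_linear <- glm(aggressive ~ log_age, family = binomial, data = d)

# Natural cubic spline in log(age): still smooth, but free to flatten out.
fit_spline <- glm(aggressive ~ ns(log_age, df = 3), family = binomial, data = d)

# A lower AIC favours that model on these data.
aic_table <- AIC(fit_linear, fit_spline)
print(aic_table)

# Predicted probabilities at diagnosis ages 3, 7 and 20.
new_d <- data.frame(log_age = log(c(3, 7, 20)))
pred <- predict(fit_spline, newdata = new_d, type = "response")
print(pred)
```

Keeping the degrees of freedom small is one simple guard against the overfitting mentioned above.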

answered 2 days ago

Stephan Kolassa

The "standard" logistic function $\frac{1}{1+e^{-x}}$ approaches 0 and 1 at $\pm\infty$. This is not a great match for your data, which doesn't seem to approach either of those values, but instead approaches 0.8 on the left and 0.3 on the right.

You may want to add scale and offset parameters so that you can squash and shift that curve into that range. My guess is that, despite the extra parameters, the model will fit better (via AIC, etc) and will end up resembling your dashed line.
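To make the squash-and-shift idea concrete, here is one way to write the scaled logistic, borrowing the parameter names used in the formula later in this answer (gain, tau, shift, offset). The numbers plugged in are simply the plateaus read off the question, not fitted values.

$$
p(x) = \text{offset} + \frac{\text{gain}}{1 + e^{-\tau\,(x - \text{shift})}}
$$

For $\tau > 0$ the limits are

$$
\lim_{x \to -\infty} p(x) = \text{offset}, \qquad \lim_{x \to +\infty} p(x) = \text{offset} + \text{gain},
$$

so plateaus of 0.8 on the left and 0.3 on the right correspond to $\text{offset} = 0.8$ and $\text{gain} = -0.5$. Equivalently, flipping the sign of $\tau$ gives the same curve with $\text{offset} = 0.3$ and $\text{gain} = 0.5$.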

Edit: You're on the right track. The next step would be to replace the hard-coded values of 0.5 and 0.3 with parameters to be fit. Your model would look something like

dxage~gain * 1/(1 + exp(-tau*(x-shift))) + offset

You would then fit this with nls: simply pass in the formula (above) and the data. If you have reasonable guesses for the starting values (which you do here), providing them can help speed convergence.
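A minimal sketch of that `nls()` call, under stated assumptions: the data are simulated, and the names `x` (log of diagnostic age) and `aggressive` (0/1 outcome) are hypothetical placeholders for your own columns. Bear in mind that `nls()` minimises squared error; for a binary outcome a binomial likelihood would arguably be more natural, so treat this as an illustration of the mechanics rather than a recommended final analysis.

```r
set.seed(42)

# Simulated stand-in data (NOT the asker's data).
# True curve: about 0.8 on the left, about 0.3 on the right.
n <- 2000
x <- runif(n, 0, 4)                              # log of diagnostic age
p_true <- 0.5 / (1 + exp(5 * (x - log(7)))) + 0.3
d <- data.frame(x = x, aggressive = rbinom(n, 1, p_true))

# Scaled and shifted logistic with all four parameters free.
# Starting values are rough guesses read off the plot in the question.
fit <- nls(
  aggressive ~ gain / (1 + exp(-tau * (x - shift))) + offset,
  data  = d,
  start = list(gain = 0.5, tau = -3, shift = log(7), offset = 0.3)
)

est <- coef(fit)
print(est)

# Fitted probabilities at diagnosis ages 3, 7 and 20.
pred <- predict(fit, newdata = data.frame(x = log(c(3, 7, 20))))
print(pred)
```

If the fit fails to converge on real data, better starting values or bounded fitting via `algorithm = "port"` with `lower` and `upper` are the usual things to try.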

edited yesterday

answered 2 days ago

Matt Krause


Changing the numerator from 1 to 0.5 makes the maximum of the logistic term 0.5, and adding +0.3 to the equation lifts the lowest value from 0 to 0.3. I guess that is what you mean by scale and offset parameters. I already have a logistic regression in which I just used a trichotomous age-of-diagnosis classification. Yes, AIC was lower. Nevertheless, I want to get R to find the best numerator, the best constant, and all the other parameters. I wanted to check that what I was doing was statistically sensible. Perhaps to operationalize it I will follow the advice from @benb and ask on Stack Overflow.
– Farrel
2 days ago
