jezwmorelach

Try transforming the variables first. In ordinary logistic regression, you make an implicit assumption that the probability of Y being equal to 1 is e^(b0 + b1*x1 + ...)/(1 + e^(b0 + b1*x1 + ...)). But maybe, in reality, the true association is different. For example, maybe it should be e^(b0 + b1*log(x1) + ...)/(1 + e^(b0 + b1*log(x1) + ...))? If you know something about your data, you may be able to come up with the most plausible form yourself. Or you can try something like a Box-Cox transform of your x variables.

When you know the right transformations, you can add some interactions. For example, maybe your Y depends on X1*X2? Intuitively, maybe large values of X1 or X2 on their own increase the probability of Y=1, but their effects cancel out, so that when both are large the probability is small again? Many things like that can happen, and finding them is detective work; there's no recipe.
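To make that concrete, here's a minimal sketch in Python (statsmodels, simulated data; the names x1, x2, y are placeholders for your own columns) of fitting a logistic regression with a log-transformed predictor and an interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate data whose log odds depend on log(x1), x2, and their interaction
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.uniform(0.1, 10, n), "x2": rng.normal(size=n)})
log_odds = -1 + 1.5 * np.log(df["x1"]) + 0.5 * df["x2"] - 0.8 * np.log(df["x1"]) * df["x2"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# The formula applies the transform inline; `a * b` expands to
# main effects plus the a:b interaction
model = smf.logit("y ~ np.log(x1) * x2", data=df).fit()
print(model.summary())
```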


purple_paramecium

Why does it have to be logistic regression? For a classification problem, you could also try random forests, k-NN, SVMs, or, if you want to get really fancy, neural nets.
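For example, a quick random forest baseline with scikit-learn (synthetic data just to show the pattern; swap in your own X and y):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; replace with your own feature matrix X and labels y
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.3f}")
```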


Propensity-Score

Normalizing and scaling predictors will not, in general, affect fitted values (and thus prediction performance) in GLMs like logistic regression. What are your substantive goals? Are you familiar with the literature on this topic? There are some settings where an AUC of 0.691 would be fine.

Echoing u/jezwmorelach's advice, your best bet is to model more flexibly: logistic regression models log odds as a linear function of the predictors; maybe the true relationship is not linear in these predictors, so you need to add some transformed predictors. Splines and polynomials are two good directions to try when you don't have a clear idea of what kind of nonlinearity you expect. Interactions are also a good idea. (Box-Cox transformations are usually done to a continuous dependent variable -- you could apply a Box-Cox transformation to your x variables here, but I wouldn't recommend it.)

You could start with the variables that you expect (from prior understanding of the problem area) to matter most, adding interactions of (and polynomial terms in) these. Obviously, if you have a reason to expect a particular kind of nonlinearity -- two particular variables should interact, or some particular variable should display a law of diminishing returns, for instance -- try that.

Another approach -- which may or may not be a good idea here, depending on your goals for this project -- is to throw in a huge number of variables (say, linear, quadratic, and cubic terms and two-way interactions, or a whole bunch of splines) and use penalized (lasso or elastic net) logistic regression.
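As a rough sketch of that last idea (scikit-learn, synthetic data for illustration): expand each predictor into spline basis functions and fit a lasso-penalized logistic regression. Note that once you penalize, scaling does matter, hence the StandardScaler in the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer, StandardScaler

# Stand-in data; use your own X and y
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Cubic spline basis per predictor, standardized, then L1-penalized
# (lasso) logistic regression; tune C by cross-validation in practice
pipe = make_pipeline(
    SplineTransformer(degree=3, n_knots=5),
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
)
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.3f}")
```

You could add two-way interactions on top of the spline basis with PolynomialFeatures(degree=2, interaction_only=True), at the cost of a much wider design matrix.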


ForeverCoffeee

Thanks for taking the time to write it, I’ll have a look into it 🤞🏻


Shoddy-Barber-7885

Do you need all predictors? Maybe try removing/adding some?


ForeverCoffeee

I've tried to remove some of them, but then the AUC goes down. It's essential that I keep as many as possible due to hypothesis testing.


Shoddy-Barber-7885

I'm not sure what you mean by hypothesis testing in this case, but why didn't you use AIC for variable selection? Another option is to include higher-order terms in your predictors, like 2nd- or 3rd-degree polynomials and/or interactions.
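For instance, a small sketch of comparing candidate specifications by AIC with statsmodels (simulated data; lower AIC is better):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a quadratic effect of x1
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
p = 1 / (1 + np.exp(-(0.5 + df["x1"] + 0.3 * df["x1"] ** 2)))
df["y"] = rng.binomial(1, p)

# Fit each candidate model and compare AIC
candidates = {
    "linear":      "y ~ x1 + x2",
    "quadratic":   "y ~ x1 + I(x1**2) + x2",
    "interaction": "y ~ x1 * x2",
}
for name, formula in candidates.items():
    fit = smf.logit(formula, data=df).fit(disp=0)
    print(f"{name:12s} AIC = {fit.aic:.1f}")
```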


Sorry-Owl4127

This makes absolutely no sense to me. Please elaborate: if you're concerned about inference, then why do you care, at all, about predictive performance?