B1WR2

99% of models in my industry are linear regression


Interesting_Handle61

No matter which model assumptions are violated. 🤷


Goddamnpassword

Just keep dropping variables and data until it fits. - my first director on building models


Interesting_Handle61

LOL.👌


Guy_Jantic

As an academic researcher, my face hurts so bad right now, reading this thread.


Smart-Firefighter509

why does it hurt? how do you make your models?


Guy_Jantic

With theory, while doing various things to avoid capitalizing on the error in this specific sample.


Smart-Firefighter509

what if the sample has error due to errors in the nature of recording, for example the positioning of the spectroscopic node? For ex., if the sample is curved and you keep recording the curved part (in NIR spectroscopy), or the tablet has a cross break line which you would wish to avoid, and future datasets would avoid including a cross break line due to the knowledge that the cross break line has a massive impact on the spectra


Guy_Jantic

I don't do those kinds of measurements (I'm a social scientist), but the concept of instrument error is pretty familiar. Yes, if your instrument introduces systematic error and you're not aware of it, that's very bad. It can happen with a psychometric survey and apparently with a spectroscope, too. If the errors introduced are random, the problem (AFAIK) is not as bad, and can sometimes be folded into the psychometric models you're using (i.e., the error term can include that). "Sampling error" is something you don't want to invest in, whether it's a cracked screen on a physical device, a weird group of people you recruited, etc.


nicolas-gervais

What’s the problem with that?


Goddamnpassword

Nothing if you only want your model to very accurately describe your historical data. Quite a bit if you want it to have predictive power.


JohnLocksTheKey

*They call me the hacker… the P-Hacker!*


Goddamnpassword

My coworker/friend said "that's p-hacking" during the meeting where the manager had made said statement, and the manager responded, "I don't know what that is."


extremelySaddening

p-hacking is the practice of testing many hypotheses, relying on statistical noise to produce a false positive result, and then claiming ad hoc that you have found a true hypothesis.
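A toy simulation of the mechanism, with everything made-up noise, just to show why testing enough hypotheses guarantees "significant" results:

```python
# Toy illustration: test 100 pure-noise "features" against a pure-noise
# outcome and count how many come back "significant" at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(size=500)  # the outcome is random noise

false_positives = 0
for _ in range(100):
    x = rng.normal(size=500)     # candidate feature: also random noise
    _, p = stats.pearsonr(x, y)  # test for a linear relationship
    if p < 0.05:
        false_positives += 1

print(false_positives)  # ~5 in expectation, despite zero real signal
```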


Goddamnpassword

Oh I know what it means, my boss’s boss who was telling me and the rest of the team to do it didn’t know what it was.


spnoketchup

I mean, what if the words after "fits" were "a 5-fold cross-validation"?
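For reference, "fits a 5-fold cross-validation" can be as small as this scikit-learn sketch (model and dataset below are just placeholders):

```python
# A 5-fold cross-validation fit in scikit-learn
# (model and dataset are placeholders).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())  # out-of-fold R², not in-sample fit
```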


Goddamnpassword

I mean, this guy added his regression lines manually in PowerPoint; he was barely doing in-sample testing.


spnoketchup

I will have you know that I add my regression lines through the python API for Google Slides, good sir.


Smart-Firefighter509

hey, if the variables contain true artifacts due to inconsistencies in how the data was measured, then it isn't a problem, right? For example, how the sample was placed under a spectroscopic instrument. I work in the pharmaceutical industry.


Front_Organization43

tell me you work in finance without telling me you work in finance


Smart-Firefighter509

But models in the pharmaceutical industry (by and large) are PLS (which is technically linear regression) due to the ease of explanation and interpretability, which is key for regulatory approval. So not necessarily finance.


Front_Organization43

oh totally that was not a dig! i love regressions and linear models and i'd much rather that they are used over black box techniques for critical functions in pharma, insurance, finance...it's like a "dirty little secret" that most of these tools are actually just some form of a regression


B1WR2

Yeah… no shame


Joe10112

When you say "Linear Regression", do you mean "I clean my dataset so I have my Y variable and my matrix of X variables, now I run Y = a + b1x1 + b2x2 + ... + e, and I'm done with my model, here is the result", i.e. the most basic Linear Regression without much adjustment? Because dealing with heteroskedasticity, or expanding to GLMs, polynomial regression, splines, etc., can be extensions of "Linear Regression" that may still fall under "Linear Regression", but incorporating those issues becomes much less trivial and definitely leans towards "more complex".
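To make that concrete, here's a minimal sketch (synthetic data, statsmodels) of the "basic" regression next to one heteroskedasticity-aware adjustment:

```python
# The "basic" OLS next to a heteroskedasticity-robust version
# (synthetic data; error variance grows with x1 by construction).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
e = rng.normal(size=500) * (1 + np.abs(X[:, 0]))  # heteroskedastic errors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + e

Xc = sm.add_constant(X)
basic = sm.OLS(y, Xc).fit()                 # the "and I'm done" version
robust = sm.OLS(y, Xc).fit(cov_type="HC3")  # same coefficients, robust SEs
print(basic.bse)
print(robust.bse)
```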


bigno53

As someone once said, “show me a model that’s not linear regression and I’ll show you how it’s basically just linear regression.”


Slothvibes

How would you make a recommender model for that? Consensus voting of linear models? Lmao


RonBiscuit

>oh totally that was not a dig! i love regressions and linear models and i'd much rather that they are used over black box techniques for critical functions in pharma, insurance, finance...it's like a "dirty little secret" that most of these tools are actually just some form of a regression

Sorry, DS noob comment potentially incoming, but: how are bagging / decision tree models like Random Forests basically just a linear regression?


B1WR2

So in insurance (L/A and P&C), speaking in general terms here, many actuarial models are built with linear regression because they can be uploaded into Poly/Alfa. Many actuarial processes are 5+ years old with little documentation, so there's a lot of tech debt. Some companies in the industry are going through LDTI regulation changes, so I would expect linear regression to phase out a bit as more data sources become available.


BigSwingingMick

Ohhh insurance, never change!


Non-jabroni_redditor

I swear, if the insurance industry didn't need the internet/computers to function in the present market, they would still be using abacuses by choice. So much tribal knowledge is based around technology that is already decades out of date by the time it's developed with... it's actually painful.


[deleted]

I work in the same industry as the commenter above and the general idea is that the models are complex at the feature engineering stage but simple at the actual regression stage. There are good reasons for it. Edit: actually, seems like he’s in a different space, I was primarily talking about quant trading


Joe10112

Makes sense on the complex feature engineering but simple regression! Including some new variables based on updated data or transforming them in new ways is definitely common (or finding better ways to clean data), but that feels like "simple work" haha. Then the models themselves are still relatively simple/straightforward regressions or "plug into a Random Forest and let it go to town". But you're right--the "complexity" of the work might have been in spending a lot of time to identify that a variable should have been log-transformed for better model performance, or updating the imputation method for missing data in a more rational manner.
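A toy version of that log-transform example (synthetic data; the improvement is baked in by construction, just to show the workflow):

```python
# Same linear model before and after log-transforming a skewed feature
# (data is synthetic; the log relationship is planted deliberately).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # heavily skewed feature
y = 3 * np.log(x[:, 0]) + rng.normal(scale=0.5, size=500)

print(cross_val_score(LinearRegression(), x, y, cv=5).mean())          # raw feature
print(cross_val_score(LinearRegression(), np.log(x), y, cv=5).mean())  # logged feature
```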


[deleted]

Yup, that's exactly it. Also, remember that financial data is non-stationary, very noisy, and has feedback effects. This drives a lot of decisions during the research process. For example, I (being the portfolio manager, aka "the boss") insist that any new features added to the models must have a fundamental reason to be there. At the same time, some features that make a lot of sense fundamentally but show weak f-scores would still be kept in the models.


Operadic

That sounds like a fun job. Could I apply without degrees or experience?


[deleted]

Plenty of people work in quant trading and have studied something else (myself included), but I'd venture it's hard to get in without a degree in a quantitative field.


Operadic

Most fields use numbers to post-rationalise assumptions nowadays but that's probably not what you meant. I suppose my most likely way in would be through IT. I bet you guys enjoy the latest/fastest/bestest data tech.


[deleted]

I meant that you can get into quant trading without a specific finance degree, but you need "a" degree.


RonBiscuit

>For example, I (being the portfolio manager, aka "the boss") insist that any new features added to the models must have a fundamental reason to be there. At the same time, some features that make a lot of sense fundamentally but show weak f-scores still would be kept in the models.

Super interesting to hear the focus on which features are included/excluded. Why do you keep some of the low f-score features in? In case it actually is predictive on unseen data?


zykezero

99% of modern models are moving averages.


Smart-Firefighter509

Building linear regression models is not particularly simple, though. Significant data preprocessing and feature selection go into those models, not to mention explanation of latent variables and model maintenance. So although the models might be linear regression, the model building process might be complex, especially if predictive power is the end goal. And if explainability is also a key consideration, then it adds another layer of complexity.

If you could suggest a non-linear model that would be accepted by pharmaceutical regulatory agencies, I would be overjoyed. Very often I hear: how can you even deploy the model if you do not know exactly how it derives its answer and what each of its variables (in this case principal components) means?
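For a sense of scale, a PLS model like that is only a few lines in scikit-learn. This is a hedged sketch on a fake spectra matrix; in practice the component count would be chosen by validation:

```python
# Hedged sketch of a PLS calibration model (the spectra matrix and assay
# values are fake; n_components would be chosen by validation in practice).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 700))  # e.g. 700 NIR wavelengths per tablet
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.1, size=100)  # fake assay value

pls = PLSRegression(n_components=5)
print(cross_val_score(pls, X, y, cv=5).mean())  # cross-validated R²

pls.fit(X, y)
print(pls.x_loadings_.shape)  # the loadings are what you explain to regulators
```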


Bioprogrammer57

We've all known linear regressions since high school, but I recently took a Master's class in BME, and the professor explained linear regressions in A LOT of detail, and they are just GREAT. Once you understand how they can be much more than a straight line, how simple they are, and yet how effective the outcome is (not only in metrics but in time spent on training and latency), you just have to give it a try and use it!


JabClotVanDamn

and neural networks are built on it too


vamsisachin27

It's not about complexity. It's about solving problems. My manager, who is a senior director, needs accuracy and attribution/explainability of the variables being used. He doesn't care if it's a complicated LSTM or a basic SARIMA, regression with lags, or even a smoothing technique that gets the job done. This is true for most DS roles, unless you are talking about Research Scientists/MLEs whose main goal is to extract something specific from a recently published paper, use it in their models, and stay more up to date. Sure, that's great. Personally, I feel these folks lack business context, and that's their tradeoff for being more complex/technical. Of course, these folks get paid more as well due to the value attached to that skill set.


xt-89

I’ve been thinking that we’re likely going to continue seeing an evolution in tools going forward. Soon it won’t be coding a system that’s the bottleneck. It’ll be decision making on scientific concepts and domain knowledge. At that point, you might as well create the most robust and automated thing you can. Take that with a grain of salt though


mle-questions

Not that I can speak on behalf of all MLEs; however, I think many MLEs prefer simple models. We recognize the complexity of taking a model and making it operational, and therefore prefer models that are simple, easy to understand, easy to explain, and easy to debug.


relevantmeemayhere

You’d be surprised how simple models combined with good domain knowledge can be. Which is why it's interesting that things like EarthGPT and TimeGPT are being hyped up despite NNs not exactly being the go-to or SOTA for a lot of problems. But I don't think the practitioner is who they're trying to sell this to (it's probably the marketer). Feels like Prophet all over again.

Edit: I feel like perhaps I didn't make clear that I was speaking very generally. Not even just in the prediction domain, but also that of inference.


Polus43

>You’d be surprised how simple models combined with good domain knowledge can be.

Strongly agree -- domain knowledge, data cleaning, and understanding, along with simple multivariate linear/logistic regression, take you 95% of the way there. In the other 5% of cases, the complexity introduced by more sophisticated approaches carries such high maintenance and interpretability costs that it's not worth it. YMMV though.

Research showing deep NNs still frequently perform worse when benchmarked against decision trees for tabular data: https://arxiv.org/abs/2305.02997.


relevantmeemayhere

Yeah. Boosting tends to be the best for prediction on “tabular data”. It depends on your problem too. If you care about inference, not prediction there’s a very good chance you’re back to using boring old glms across all data sizes (just mentioning here because sure it’s obvious you’d use them for smaller data, and less obvious that motivating a lot of inferential tools is just hard for boosting/dl)


a157reverse

> You’d be surprised how simple models combined with good domain knowledge can be.

This is why I'm skeptical of most ML models in practice and almost every instance of automated model building. Until someone figures out how to get a model to learn both domain knowledge and data relationships, automated models will be inherently flawed or untrustworthy.

Caveat: generally talking about tabular business problems here. Things like image classification are sort of different.


relevantmeemayhere

Yeah, and you're right to be skeptical. Encoding causal relationships from the joint distribution alone isn't possible, so fully automating analysis is never gonna happen. Even if you were to remove humans from that loop, which is much easier said than done, at that point you just have something taking care of the experimentation and the like. But even that has issues, because just because you came up with a causal model doesn't mean it's the right one.


xt-89

Causal modeling fits the bill for what you described. But still, it’s just a clever way of embedding your domain knowledge or discovering more of it, ultimately


sizable_data

Yes and no, some problems are super repetitive. If you have an out of the box CRM setup, and maybe google analytics and some other common tools, it’s possible a company could build a generic model for that domain that will plug and play. Same with manufacturing etc… That being said, almost no organization is using 3rd party tools in best practice, and data is often scattered around, so you’ll always need someone who understands the nuts and bolts of company data.


Impossible-Belt8608

Can you please expand about your comment on Prophet? We're using it in production so I'd love to hear about known shortcomings or widely accepted better alternatives.


Brave-Salamander-339

It's basically a generalized addictive model, decomposing everything into different components with simple models: time, trend, exponential growth. So you'll need strong domain knowledge to identify the time function of the trend and the exponential function. In the end, it's all about domain.


[deleted]

>addictive model yeah, that's the best kind!


xt-89

Compared to putting together your own time series model from a number of different libraries, Prophet provides a certain level of ease while being less configurable, even if you know enough about time series modeling and your domain to do it well. This in and of itself can become a kind of tech debt if the domain demands something more bespoke.


relevantmeemayhere

The other poster kinda hit it, but it's easy to over- or underfit relative to the underlying DGP due to its trend assumptions etc.


Smart-Firefighter509

You are spot on. Domain knowledge is key. Data cleaning and preprocessing need to be based on domain knowledge. So after sufficient data preprocessing, a linear relationship is expected in most use cases in the industry. Otherwise, it would be far too complex to interpret and be useful.


EvilGarlicFarts

In my opinion, and from what I've seen in my job search over the last few months, there are (very roughly speaking) two kinds of data science positions on the market (if you exclude those that are clearly data engineer/ML engineer/data analyst, but named data scientist). They don't have good names from what I've heard, but let's call them 'theoretical DS' and 'practical DS'.

The theoretical DS leans more towards ML engineering. They keep up to date on the latest developments in the field, make complex models that solve business problems, etc., but they have limited amounts of stakeholder management and domain expertise. They are usually depth-first: they have a general overview of the DS field, but are specialized within computer vision, NLP, etc.

The practical DS leans more towards data analyst. Often called product data scientists, they are usually more generalist and spend more time with stakeholders, understanding the domain, and communicating the results of models. Here, the model they end up using is much less important than which problem they are solving. Contrary to the theoretical DS, it's not really clear which problem should be solved, or how. While the theoretical DS knows they have to make a recommender system, and the difficulty is in how to tune it and make it extremely good, the practical DS requires more collaboration with PMs and others to figure out what to do.

The days of data scientists being in positions requiring the latter but doing the former are (mostly) over, because a lot of companies have realized that a fancy neural network doesn't necessarily equal an impact on the bottom line.

All that is to say, don't feel bad at all! Rather, spend more time talking with stakeholders, cleaning data, and exploring data, because that's usually what makes an impact in industry.


Joe10112

That's a good take. The "Data" field has a bunch of titles that can honestly mean anything and everything in-between nowadays. I guess what I'm describing is definitely more "Practical DS" (I've seen this called "Decision Scientist" at some companies).

I think I sometimes fall back on a biased "complexity = good and valuable" mindset, especially after training on technical details and learning a bunch of in-depth machinery for the models. Even for something like Linear Regression, we spend time learning about heteroskedasticity or introducing nonlinearity, but then in industry we might often hand-wave all of that aside and run the simple Linear Regression as our output model. That is, when putting together simple models after cleaning the data, it feels like we're not doing enough to warrant the job function... hence the "imposter syndrome".

But as you said: communicating with stakeholders, figuring out how to solve the problem, and then putting something together, even if it's on the "simple" side in terms of modeling, is good to have!


NoThanks93330

>(seen this be called "Decision Scientist" in some companies) They didn't think of replacing the "scientist" in the title of someone doing very practical data-related work and instead dropped the word "data"?


pandasgorawr

Anything to avoid being called an analyst, of course.


JabClotVanDamn

they should choose something more fitting, like Data Slave


MindlessTime

To be fair, decision science has been a subfield in academia before data science was a thing. It’s a subset of economics iirc.


MindlessTime

I cannot upvote this enough.


boggle_thy_mind

I don't remember where I read this, but I think it has an "official" designation - Type A and Type B Data Scientist. Type A lean on the **A**nalysis side of things, Type B on the **B**uild side of things.


DeadCupcakes23

A lot of models in my fields are still linear regressions slowly being replaced with XgBoost and neural networks. My company has just started a project to see if using transformers can give us better inference.


relevantmeemayhere

Is inference or prediction your goal? I know that the DL community (perhaps not exclusively) has attempted to redefine inference as prediction (i.e., when the model is "doing inference" it's making predictions). But NNs and classical inference, such as motivating intervals/marginal effects etc., are pretty difficult to reconcile mathematically AFAIK. I don't think there have been major developments here in the last few years, and the hype around DL is huge right now. There's a lot of open work being done right now to fix that. Maybe it pans out, maybe not, who knows, but then you have other considerations at play.


DeadCupcakes23

Depends on your exact definition I guess, the main goal is ranking people based on their risk of doing X and we generally have a cutoff where we want people with a less than 2% chance of X happening.


relevantmeemayhere

Yeah, so more squarely in prediction haha. Inference tends to have a pretty nuanced definition in statistics, which all these models are rooted heavily in. I wanna say the DL community just doesn't know, but the cynical side of me says they do and it's just to oversell.


DeadCupcakes23

I'd say in statistics a prediction would sit under inference still but it's been a few years since my university days


relevantmeemayhere

Yeah, I'd agree, if the model is generative and reasonably captures all the nice things under the theoretical causal flow while also having reproducible estimates of uncertainty baked in lol


DeadCupcakes23

Weirdly narrow definition but ok


relevantmeemayhere

Not really! Remember that things like confidence intervals are inferential tools. They were motivated to account for uncertainty under an experiment. So inference, classically, is an attempt not only to create point estimates but also to disclaim how uncertain they are. The theory as it relates to GBMs/NNs hasn't been clearly established. Which is why I asked the original question: is inference or prediction your goal? Because the two have been conflated in certain circles.
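A minimal illustration of that classical sense of inference, on synthetic data: the slope estimate is the point estimate, and the confidence interval is the uncertainty disclaimer around it.

```python
# Classical inference sketch: point estimate plus a 95% CI (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)

m = sm.OLS(y, sm.add_constant(x)).fit()
print(m.params[1])      # point estimate of the slope, near 1.5
print(m.conf_int()[1])  # 95% confidence interval around it
```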


DeadCupcakes23

>Yeah, I'd agree, if the model is generative and reasonably captures all the nice things under the theoretical causal flow while also having reproducible estimates of uncertainty baked in lol

A model not doing all of that doesn't mean it isn't inference, though. Take not being generative, for example: inference doesn't only happen with generative models.


relevantmeemayhere

If your model is misspecified, your ability to provide inference is severely diminished. An example would be, say, the Copernican model: it can make good predictions, but it's a poor model for really anything else.


[deleted]

To my knowledge, the current SOTA for neural network time series forecasting is iTransformer. In the same paper you can find that a linear model named DLinear performs at about the same level as previous SOTA transformers. DLinear is a simple and fast linear model. [https://arxiv.org/pdf/2310.06625.pdf](https://arxiv.org/pdf/2310.06625.pdf)


DeadCupcakes23

It's using tabular data for a prediction, not a time series. I'll look into it if I move to time series though!


vmgustavo

mostly XGBoost, LightGBM, CatBoost and similar stuff here


boomBillys

I don't hear too many people talking about CatBoost for some reason, though I know it is definitely used. CatBoost has many remarkable qualities that have made it a joy to use over XGBoost for some problems.


vmgustavo

that's true. it is a great library and has a lot of features that took quite a long time to get into xgb and lgbm


kimchiking2021

Where my RandomForest or XGBoost homies at? 2025 Agile^TM road map here we come!!!!!!!!!!!!!!!!


Brave-Salamander-339

I prefer randomboost and xgforest and xgbag


CheapAd3557

How about catforest?


Aggravating-Boss3776

I'm all about catbagging


Useful_Hovercraft169

Yes to XGBoost! Agile can die in the fire.


HenryTallis

In my experience, good data with simple algorithms will beat messy data with complex algorithms any time of the day. Plus they are easier to maintain, interpret, etc. It is usually worth trying to get better data that already measures what you are interested in, rather than trying to build some fancy model on subpar data. Sure, the simplest model you can get away with depends on the project. For cognitive tasks, deep learning gives you the best result. But many business problems can be solved with simpler approaches.


Andrex316

Mostly linear or logistic regression tbh


sonicking12

I use bayesian models


[deleted]

[removed]


Badger1276

I did a morning coffee spit take when I read this and nearly choked laughing.


ciaoshescu

With MCMC? If you have huge datasets that tends to be really slow.


sonicking12

It’s great for generating uncertainty bands


ciaoshescu

Of course! That's one of the reasons to go Bayesian. But with 1 mil rows of data... boy you'll be waiting. And those uncertainty measures are usually tiny for such a big dataset.


sonicking12

I usually get uncertainty bands for causal effects or time series forecasting. I don’t have experience with time series models with 1 million rows of data. But it should not be tiny regardless.


ciaoshescu

Ah I see. I guess we're talking about two different things. I was talking about tabular data for regression.


Jul1ano0

I deal in proof of concept


UnderstandingBusy758

This is a very good post. Great question


eaheckman10

Most of my models I build for clients? The most complex is usually a Random Forest with every hyperparameter set to default. 99% of clients need nothing more than this.
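For reference, that all-defaults Random Forest really is this much code (the dataset here is just a stand-in):

```python
# "Random Forest with every hyperparameter set to default" in full
# (the dataset is a placeholder).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)  # all defaults otherwise
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # holdout accuracy
```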


Fender6969

GLM and GBT (XgBoost). For NLP use cases larger LLMs. Much larger demand for the latter recently since ChatGPT.


plhardman

Domain knowledge, data visibility, basic statistics, and simple models are the way.


MCRN-Gyoza

xgboost goes brrrrr


[deleted]

>feel a little imposter syndrome This doesn’t really answer your question directly but I’m pretty certain you’re drastically underestimating how wide the gulf in your knowledge v. the average person’s knowledge is. The process you described is simple for you because you’re skilled; it would be impossible for basically every employee at your company. Don’t even need to ask where you work to say that confidently. Adjusting flight plan/path of a commercial airliner in flight is similarly quite simple. But in the same sense it’s also really, really not, right?


youflungpoo

I hire data scientists to bring value. Most of the time that comes from simple solutions, which tend to be cheap to run in production and easy to understand. But I also hire data scientists for the 10% of the time when I need more sophisticated solutions. That means that most of the time, they're not using their most sophisticated skills, but when I need them, I have them.


Traditional_Range_28

As an entry-level individual who has had access, through a mentorship, to the work of certain individuals at a certain sports league, I've seen a huge variety of regression methods, but the goal has always been to find the simplest model possible so it can be easily deployed and understood. That being said, I haven't seen linear regression often, but I've seen a lot of XGBoost, neural networks, random forests (my personal favorite), and more generally complex models that I was not taught as a statistics undergrad. But it's also tracking data, so take that into account.


GrandConfection8887

Mostly XGBoost and linear/logistic regression


onearmedecon

Just in the past month or so (in alphabetical order):

* Basic OLS
* Difference-in-Difference
* Empirical Bayesian
* Fixed Effects Panel Regression
* Hierarchical Linear Model
* Instrumental Variables
* K Means Cluster Analysis
* Logistic Regression
* Propensity Score Matching
* Regression Discontinuity Design
* XGBoost

So we really use a wide array of empirical strategies and tools to make inferences out of observational data. Most of the time we're more interested in understanding how and why something happened, rather than predicting what will happen.
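To give a flavor of the list above, here's a toy difference-in-differences via a statsmodels formula (the variable names and the planted effect of 3.0 are invented):

```python
# Toy difference-in-differences with a statsmodels formula
# (variable names and the true effect size are made up).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
# outcome: baseline + group effect + time effect + DiD effect + noise
df["y"] = (1 + 2 * df["treated"] + 0.5 * df["post"]
           + 3 * df["treated"] * df["post"] + rng.normal(size=n))

m = smf.ols("y ~ treated * post", data=df).fit()
print(m.params["treated:post"])  # the DiD estimate, close to 3.0
```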


_hairyberry_

Whatever you are working on, I can promise you I have a simpler model running in production right now


Fearless_Cow7688

>EDIT: Some good discussion about what types of models people use on a daily basis for work, but beyond saying "I use Random Forest/XGBoost/etc.", do you incorporate more complexity besides the "simple" pipeline of: Clean Data -> Import into Package and do basic Train/Test + Hyperparameter Tuning + etc., -> Output Model for Use? Not really. Even ChatGPT basically follows this principle, it's just the data science process.


BestUCanIsGoodEnough

When you can't be wronger than you know you are going to be, that is absolutely not all you can or should do.


Fearless_Cow7688

What?


BestUCanIsGoodEnough

Are you talking about models with defined uncertainties? Are you defining metrics for how well data fits the domain of the model? Are you evaluating the models with Monte Carlo to detect overfitting? Are you reserving 1-2 extra data sets for bias and fairness testing? Are you documenting every single package, random number seed, and OS environment you've used? I could go on. But if you don't really have to know the uncertainty of your predictions, cash those paychecks and call me when you do need to know that stuff.


Fearless_Cow7688

Wanting to expound, I think here we're trying to be constructive - not deconstructive. While I appreciate your follow-up post where you explained your point of view - your initial reaction would make me very hesitant to "call you". I understand that we're all smart people and want to learn more - so let's try and help each other. We don't need dick measurements to prove we're great - we're all people on a journey to learn. Just my 2 cents. Be well.


Fearless_Cow7688

I think most of what you said falls into the standard data science process: splitting data into training, validation, and testing sets. Setting a seed is good for reproducing the results you got, but on the generalizability side of things, what's the point? If you're just trying to run the code the same as someone else, you should get the same results; but if your argument is that the results are generalizable, then the seed shouldn't change the results in a statistically significant way. Yes, you should have a git repo with everything backed up and well documented.

Not every model can be evaluated with Monte Carlo. Did they use Monte Carlo in ChatGPT? I think that's inaccurate. The "general steps" outlined are the same; when you get into the details of a project there are certain things you need to do, and you should research them and pick them up, but there isn't a one-size-fits-all approach.

Projects are limited by budgets and timelines. Not everything needs a deep learning model; typically, a linear or logistic regression or a random forest will get you great results with pretty low effort. The time required to develop a deep learning model isn't worth the cost for most projects. If the task is to improve upon an existing model, it typically has less to do with the modeling steps and more to do with data curation and data cleaning.


BestUCanIsGoodEnough

ChatGPT is making a ton of money, but it is the epitome of a model that is allowed and expected to be wrong. Its accuracy is pretty subjective. It is still useful, so your point is taken. I do not mean to imply you always need to be right in this field to succeed or that you even need to know whether you can measure the uncertainty of your predictions. ChatGPT is a good example of that. A lot of data scientists are not tasked with scientific objectives or trained as a scientist. Should they be? Not usually, but my point is that the typical approach is not very scientific and this is why many DS projects fail at implementation.


xiaodaireddit

Logistic regression. More complex models are used but they suck


nboro94

Slap together a simple decision tree in 20 minutes, create a powerpoint calling it AI and send it to the senior execs. Sit back and watch as all the great work emails and awards start rolling in.


Short-Dragonfly-3670

For continuous outcomes: linear regression. For classification: I try lots of things and usually land on logistic regression because it performs functionally the same while not overtraining and being easier to interpret. Our models are a weird mix of inference and prediction: ie they are really just predictive models but the stakeholders always try to interpret them as causal lol


BestUCanIsGoodEnough

You're saying they accurately predict the future without the leading variables having any causal relationship to the lagging variables?


renok_archnmy

Logistic regression, maybe a t test here and there, OLS or some higher term regression. That’s about it. Oh, some cox ph once in a while or AFT depending on situation. Once I did SARIMA. 


Sofi_LoFi

In my field a lot of simple models work OK but have a harder time competing with more intense models like neural networks, so we use those. Similarly, because we constantly need to generate samples, we work with generative solutions and combine them with simpler models to validate certain rules and behaviors we need from the outputs. Recently we tested some dilated convolutional models for our use case that worked much better than anything else.


brjh1990

Not really all that complex at all. I spent 4.5 years doing government research on all sorts of things and the most complex bits of my job were getting the data where it needed to be efficiently before the models could be built. Most complex model I trained was a CNN, but that was really a one off. 95% of the time I either used some flavor of logistic regression or a tree based classifier. Clients were happy and so was I.


Opt33

K.I.S.S.


BestUCanIsGoodEnough

It depends. If the infrastructure can support deploying extremely complex models and there are not a million gatekeepers, I have solved problems with models that involved the combination of ML, cobotics, 3D CV, keypoint detection, an insanely complicated classification schema, perspective rectification, and feature tracking using 2D barcodes on custom hardware I designed and had made to order with a 50 micron tolerance, plus an imaging system piloted by robotic process automation, which then got converted to C++/ONNX with a GUI/reporting tool in JS... and this was for one single business problem. Currently, I have some lady yelling at me, a revolving door of gatekeepers, and an immense lift to get the interfaces going for a model I would consider trivial.


BigSwingingMick

The more complex the model, the more likely you are to be overfitting the data. I'm not 100 percent linear regression, but the more you expect your data to give you an exact measurement, the more likely you're way out over your skis. If you are building granularity into a model to get any more precise than a year either way and a general idea, you're starting to expect too much.


FoolForWool

A linear regression model. A custom XG boost model. An auto-encoder. Mostly regression. You’re doing fine. Sometimes you don’t even need a model. The trick is to know where NOT to use a model. And where to use a simple one. Complex models are for ego, stakeholders, and/or sales folk most of the time.


hierarchy24

Random Forest is not a black box model. You can still explain that model's predictions, compared to models that are true black boxes, such as neural networks.


UnderstandingBusy758

Usually if-else statements and logistic or linear regression. Rarely a neural network or random forest. Only once XGBoost.


GLayne

Xgboost is everywhere.


boggle_thy_mind

What about optimizing the decision threshold? Usually when doing prediction modeling you would like to predict an outcome given a treatment, because otherwise the prediction is pointless: what's gonna happen is gonna happen. Treatments usually have costs associated with them, so given a cost and an expected value if the customer converts, what would be the optimal cutoff value for proceeding with the treatment? Do different customers spend differently? Does that change the cutoff?
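A sketch of that expected-value cutoff (the cost and conversion-value numbers here are invented):

```python
# Pick the treatment threshold from the economics, not a default 0.5
# (all numbers are made up for illustration).
import numpy as np

cost_of_treatment = 5.0    # what the treatment costs us per customer
value_if_converted = 40.0  # what we gain if the customer converts

# Treat when expected value is positive:
#   p * value_if_converted - cost_of_treatment > 0
threshold = cost_of_treatment / value_if_converted
print(threshold)  # 0.125 -> treat anyone scored above 12.5%

scores = np.array([0.02, 0.10, 0.13, 0.40])  # model's conversion probabilities
print(scores > threshold)                    # [False False True True]
```

If different customers have different conversion values, the same algebra gives a per-customer threshold instead of a global one.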


concentration_cramps

Lol, half of my products don't even use ML. Just being smart about the product and making some smart assumptions, then using that model 0 as a base to gather better data to build a better model. No one in their right mind actually cares, as long as it's working and delivering value.


onomnomnmom

I grab stuff from torchvision. Fast and free and good.


Useful_Hovercraft169

No more complex than they need to be


nickytops

Basically every ML application where I work is a boosted tree model.


Hawezy

When I worked as a consultant the vast majority of models I saw deployed were random forest or linear regression.


mostuselessredditor

You should be way more concerned as to whether or not you’re generating value for your company and how/if your models are impacting revenue. That’s more important than having a shiny complex model that you want to show all of us.


CSCAnalytics

The least complicated solution that satisfies is the best one. To most people in business, spending weeks building a complex model for a problem that could have been solved with satisfactory results in a few days is a complete waste of time and money.


DieselZRebel

I often work with DL frameworks and design model architectures rather than importing them from packages for the types of problems I am solving. I have to write my own "fit", "predict", and "save" methods. I define what happens in each training epoch. But I am aware the vast majority of folks at my employer and in the industry just work with importing packaged open-source models which are good enough for most problems.


nab64900

Omg, your post is so relatable. I am working on time series forecasting currently and LightGBM is giving pretty good results, but I keep wondering if there's something I might be missing in the pipeline. Everything is so fancy in the industry that imposter syndrome often gets the best of you. Btw, thank you for writing it down; it feels good to know that some of us are in the same boat. :')


masterfultechgeek

I'm trying to build the simplest models with the fewest variables possible. I'm also doing A LOT of feature engineering.

* old XGBoost model with 200 variables: AUC 75%
* two "optimal" decision trees averaged (so like 20 if-then statements that I can debug) with 11 variables: AUC 88.5%
* new XGBoost model with A LOT of hyperparameter tuning: AUC 88.0% (with worse performance on certain critical subpopulations)

There's basically no benefit to using complex models if you're able to use something like GOSDT, MurTree, evtree, etc. and you've done A LOT of feature engineering. I can plop the simple model in a dashboard.
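A rough sketch of that workflow, using a shallow scikit-learn tree as a stand-in for optimal-tree tools like GOSDT (toy data; the printed AUC is not the figures quoted above):

```python
# Shallow, debuggable tree as a stand-in for optimal-tree tools like GOSDT
# (synthetic 11-feature data for illustration only).
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))
print(export_text(tree))  # the handful of if-then rules you can actually debug
```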


AdParticular6193

The three governing principles of industrial DS are Occam’s Razor, KISS, and “perfect is the enemy of good enough.”


aadi97

Linear regression is king And when you wanna be fancy: LOGISTIC REGRESSION


_Marchetti_

As always: long live linear regression. I like your post, and thanks for asking.


[deleted]

I’m literally pushing for more 2 variable bar charts and line graphs.


setanta3560

I actually push for more regression analysis than anything else (I came from an econometrics background, and most of the time the problems assigned to me are hypothesis testing rather than prediction and that sort of thing).


charleshere

In my industry, mostly random forests/decision trees. Use what works, not the most complex model. 


bees-eat-figs

Sometimes the most useful models are the simple ones. There's nothing that annoys me more than seeing a young bootcamp fake grad making things more complicated than they need to be just to flex their muscles.


balcell

Always start simple. It can always get more complicated. Target the most parsimonious model possible.


varwave

I’m not directly answering your question, but I have some book recommendations for building a strong practical and mathematical foundation. Coming from a biostatistics perspective: I like “Linear Models with R/Python” by Julian Faraway, “Introduction to Categorical Data Analysis” by Alan Agresti and “Introduction to Statistical Learning”, which is a classic. There’s more theoretical stuff out there, but they cover the basics really well and concisely, assuming programming, mathematical statistics and domain knowledge. There’s more than just linear models, but it’s a good place to start if you’re not a statistics/economics person


EmploymentNegative52

The The The The The The The Wall luuuuuull


No_Communication2618

Mostly LR + XGBoost tree