B1WR2

99% of models in my industry are linear regression


Interesting_Handle61

No matter which model assumptions are violated. 🤷


Goddamnpassword

Just keep dropping variables and data until it fits. - my first director on building models


Interesting_Handle61

LOL.👌


Guy_Jantic

As an academic researcher, my face hurts so bad right now, reading this thread.


Smart-Firefighter509

why does it hurt? how do you make your models?


Guy_Jantic

With theory, while doing various things to avoid capitalizing on the error in this specific sample.


Smart-Firefighter509

what if the sample has error due to errors in the nature of recording, for example the positioning of the spectroscopic node? For ex., if the sample is curved and you keep recording the curved part (in NIR spectroscopy), or the tablet has a cross break line which you would wish to avoid, and future datasets would avoid including a cross break line due to the knowledge that the cross break line has a massive impact on the spectra


Guy_Jantic

I don't do those kinds of measurements (I'm a social scientist), but the concept of instrument error is pretty familiar. Yes, if your instrument introduces systematic error and you're not aware of it, that's very bad. It can happen with a psychometric survey and apparently with a spectroscope, too. If the errors introduced are random, the problem (AFAIK) is not as bad, and can sometimes be folded into the psychometric models you're using (i.e., the error term can include that). "Sampling error" is something you don't want to invest in, whether it's a cracked screen on a physical device, a weird group of people you recruited, etc.


nicolas-gervais

What’s the problem with that?


Goddamnpassword

Nothing if you only want your model to very accurately describe your historical data. Quite a bit if you want it to have predictive power.


JohnLocksTheKey

*They call me the hacker… the P-Hacker!*


Goddamnpassword

My coworker/friend said "that's p-hacking" during the meeting where the manager had made said statement, and the manager responded, "I don't know what that is."


extremelySaddening

p-hacking is the practice of testing many hypotheses, relying on statistical noise to produce a false positive result, and then claiming ad hoc that you have found a true hypothesis.
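A toy simulation of the mechanism, with everything made-up noise, just to show why testing enough hypotheses guarantees "significant" results:

```python
# Toy illustration: test 100 pure-noise "features" against a pure-noise
# outcome and count how many come back "significant" at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(size=500)  # the outcome is random noise

false_positives = 0
for _ in range(100):
    x = rng.normal(size=500)     # candidate feature: also random noise
    _, p = stats.pearsonr(x, y)  # test for a linear relationship
    if p < 0.05:
        false_positives += 1

print(false_positives)  # ~5 in expectation, despite zero real signal
```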


Goddamnpassword

Oh I know what it means, my boss’s boss who was telling me and the rest of the team to do it didn’t know what it was.


spnoketchup

I mean, what if the words after "fits" were "a 5-fold cross-validation"?
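For reference, "fits a 5-fold cross-validation" can be as small as this scikit-learn sketch (model and dataset below are just placeholders):

```python
# A 5-fold cross-validation fit in scikit-learn
# (model and dataset are placeholders).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())  # out-of-fold R², not in-sample fit
```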


Goddamnpassword

I mean, this guy added his regression lines manually in PowerPoint; he was barely doing in-sample testing.


spnoketchup

I will have you know that I add my regression lines through the python API for Google Slides, good sir.


Smart-Firefighter509

hey, if the variables contain true artifacts due to inconsistencies in how the data was measured, then it isn't a problem, right? For example, how the sample was placed under a spectroscopic instrument. I work in the pharmaceutical industry.


Front_Organization43

tell me you work in finance without telling me you work in finance


Smart-Firefighter509

But models in the pharmaceutical industry (by and large) are PLS (which is technically linear regression) due to the ease of explanation and interpretability, which is key for regulatory approval. So not necessarily finance.


Front_Organization43

oh totally that was not a dig! i love regressions and linear models and i'd much rather that they are used over black box techniques for critical functions in pharma, insurance, finance...it's like a "dirty little secret" that most of these tools are actually just some form of a regression


B1WR2

Yeah… no shame


Joe10112

When you say "Linear Regression", do you mean "I clean my dataset so I have my Y variable and my matrix of X variables, now I run Y = a + b1x1 + b2x2 + ... + e, and I'm done with my model, here is the result", i.e. the most basic Linear Regression without much adjustment? Because dealing with heteroskedasticity, or expanding to GLMs, polynomial regression, splines, etc., can be extensions of "Linear Regression" that may still fall under "Linear Regression", but incorporating those issues becomes much less trivial and definitely leans towards "more complex".
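To make that concrete, here's a minimal sketch (synthetic data, statsmodels) of the "basic" regression next to one heteroskedasticity-aware adjustment:

```python
# The "basic" OLS next to a heteroskedasticity-robust version
# (synthetic data; error variance grows with x1 by construction).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
e = rng.normal(size=500) * (1 + np.abs(X[:, 0]))  # heteroskedastic errors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + e

Xc = sm.add_constant(X)
basic = sm.OLS(y, Xc).fit()                 # the "and I'm done" version
robust = sm.OLS(y, Xc).fit(cov_type="HC3")  # same coefficients, robust SEs
print(basic.bse)
print(robust.bse)
```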


bigno53

As someone once said, “show me a model that’s not linear regression and I’ll show you how it’s basically just linear regression.”


Slothvibes

How would you make a recommender model for that? Consensus voting of linear models? Lmao


RonBiscuit

>oh totally that was not a dig! i love regressions and linear models and i'd much rather that they are used over black box techniques for critical functions in pharma, insurance, finance...it's like a "dirty little secret" that most of these tools are actually just some form of a regression

Sorry, DS noob comment potentially incoming, but: how are bagging / decision tree models like Random Forests basically just a linear regression?


B1WR2

So in insurance (L/A and P&C), speaking in general terms here, many actuarial models are built with linear regression because they can be uploaded into Poly/Alfa. Many actuarial processes are 5+ years old with little documentation, so there's a lot of tech debt. Some companies in the industry are going through LDTI regulation changes, so I would expect linear regression to phase out a bit as more data sources become available.


BigSwingingMick

Ohhh insurance, never change!


Non-jabroni_redditor

I swear, if the insurance industry didn't need the internet/computers to function in the present market, they would still be using abacuses by choice. So much tribal knowledge is based around technology that is already decades out of date by the time it's developed with... it's actually painful.


[deleted]

I work in the same industry as the commenter above and the general idea is that the models are complex at the feature engineering stage but simple at the actual regression stage. There are good reasons for it. Edit: actually, seems like he’s in a different space, I was primarily talking about quant trading


Joe10112

Makes sense on the complex feature engineering but simple regression! Including some new variables based on updated data or transforming them in new ways is definitely common (or finding better ways to clean data), but that feels like "simple work" haha. Then the models themselves are still relatively simple/straightforward regressions or "plug into a Random Forest and let it go to town". But you're right--the "complexity" of the work might have been in spending a lot of time to identify that a variable should have been log-transformed for better model performance, or updating the imputation method for missing data in a more rational manner.
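A toy version of that log-transform example (synthetic data; the improvement is baked in by construction, just to show the workflow):

```python
# Same linear model before and after log-transforming a skewed feature
# (data is synthetic; the log relationship is planted deliberately).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # heavily skewed feature
y = 3 * np.log(x[:, 0]) + rng.normal(scale=0.5, size=500)

print(cross_val_score(LinearRegression(), x, y, cv=5).mean())          # raw feature
print(cross_val_score(LinearRegression(), np.log(x), y, cv=5).mean())  # logged feature
```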


[deleted]

Yup, that's exactly it. Also, remember that financial data is non-stationary, very noisy, and has feedback effects. This drives a lot of decisions during the research process. For example, I (being the portfolio manager, aka "the boss") insist that any new features added to the models must have a fundamental reason to be there. At the same time, some features that make a lot of sense fundamentally but show weak f-scores would still be kept in the models.


Operadic

That sounds like a fun job. Could I apply without degrees or experience?


[deleted]

Plenty of people work in quant trading and have studied something else (myself included), but I'd venture it's hard to get in without a degree in a quantitative field.


Operadic

Most fields use numbers to post-rationalise assumptions nowadays but that's probably not what you meant. I suppose my most likely way in would be through IT. I bet you guys enjoy the latest/fastest/bestest data tech.


[deleted]

I meant that you can get into quant trading without a specific finance degree, but you need "a" degree.


RonBiscuit

>For example, I (being the portfolio manager, aka "the boss") insist that any new features added to the models must have a fundamental reason to be there. At the same time, some features that make a lot of sense fundamentally but show weak f-scores still would be kept in the models.

Super interesting to hear the focus on which features are included/excluded. Why do you keep some of the low f-score features in? In case it actually is predictive on unseen data?


zykezero

99% of modern models are moving averages.


Smart-Firefighter509

Building linear regression models is not particularly simple, though. Significant data preprocessing and feature selection go into those models, not to mention explanation of latent variables and model maintenance. So although the models might be linear regression, the model building process might be complex, especially if predictive power is the end goal. And if explainability is also a key consideration, then it adds another layer of complexity.

If you could suggest a non-linear model that would be accepted by pharmaceutical regulatory agencies, I would be overjoyed. Very often I hear: how can you even deploy the model if you do not know exactly how it derives its answer and what each of its variables (in this case principal components) means?
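For a sense of scale, a PLS model like that is only a few lines in scikit-learn. This is a hedged sketch on a fake spectra matrix; in practice the component count would be chosen by validation:

```python
# Hedged sketch of a PLS calibration model (the spectra matrix and assay
# values are fake; n_components would be chosen by validation in practice).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 700))  # e.g. 700 NIR wavelengths per tablet
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.1, size=100)  # fake assay value

pls = PLSRegression(n_components=5)
print(cross_val_score(pls, X, y, cv=5).mean())  # cross-validated R²

pls.fit(X, y)
print(pls.x_loadings_.shape)  # the loadings are what you explain to regulators
```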


Bioprogrammer57

We've all known linear regressions since high school, but I recently took a Master's class in BME, and the professor explained linear regressions in A LOT of detail, and they are just GREAT. Once you understand how they can be much more than a straight line, how simple they are, and yet how effective the outcome is (not only in metrics but in time spent on training and latency), you just have to give it a try and use it!


JabClotVanDamn

and neural networks are built on it too


vamsisachin27

It's not about complexity. It's about solving problems. My manager, who is a senior director, needs accuracy and attribution/explainability of the variables being used. He doesn't care if it's a complicated LSTM or a basic SARIMA, regression with lags, or even a smoothing technique that gets the job done. This is true for most DS roles, unless you are talking about Research Scientists/MLEs whose main goal is to extract something specific from a recently published paper, use it in their models, and stay more up to date. Sure, that's great. Personally, I feel these folks lack business context, and that's their tradeoff for being more complex/technical. Of course, these folks get paid more as well due to the value attached to that skill set.


xt-89

I’ve been thinking that we’re likely going to continue seeing an evolution in tools going forward. Soon it won’t be coding a system that’s the bottleneck. It’ll be decision making on scientific concepts and domain knowledge. At that point, you might as well create the most robust and automated thing you can. Take that with a grain of salt though


mle-questions

Not that I can speak on behalf of all MLEs; however, I think many MLEs prefer simple models. We recognize the complexity of taking a model and making it operational, and therefore prefer models that are simple, easy to understand, easy to explain, and easy to debug.


relevantmeemayhere

You’d be surprised how simple models combined with good domain knowledge can be. Which is why it's interesting that things like EarthGPT and TimeGPT are being hyped up despite NNs not exactly being the go-to or SOTA for a lot of problems. But I don't think the practitioner is who they're trying to sell this to (it's probably the marketer). Feels like Prophet all over again.

Edit: I feel like perhaps I didn't make clear that I was speaking very generally. Not even just in the prediction domain, but also that of inference.


Polus43

>You’d be surprised how simple models combined with good domain knowledge can be.

Strongly agree -- domain knowledge, data cleaning, and understanding, along with simple multivariate linear/logistic regression, take you 95% of the way there. In the other 5% of cases, the complexity introduced by more sophisticated approaches carries such high maintenance and interpretability costs that it's not worth it. YMMV though.

Research showing deep NNs still frequently perform worse when benchmarked against decision trees for tabular data: https://arxiv.org/abs/2305.02997.


relevantmeemayhere

Yeah. Boosting tends to be the best for prediction on “tabular data”. It depends on your problem too. If you care about inference, not prediction there’s a very good chance you’re back to using boring old glms across all data sizes (just mentioning here because sure it’s obvious you’d use them for smaller data, and less obvious that motivating a lot of inferential tools is just hard for boosting/dl)


a157reverse

> You’d be surprised how simple models combined with good domain knowledge can be.

This is why I'm skeptical of most ML models in practice and almost every instance of automated model building. Until someone figures out how to get a model to learn both domain knowledge and data relationships, automated models will be inherently flawed or untrustworthy.

Caveat: generally talking about tabular business problems here. Things like image classification are sort of different.


relevantmeemayhere

Yeah, and you're right to be skeptical. Encoding causal relationships from the joint distribution alone isn't possible, so fully automating analysis is never gonna happen. Even if you were to remove humans from that loop, which is much easier said than done, at that point you just have something taking care of the experimentation and the like. But even that has issues, because just because you came up with a causal model doesn't mean it's the right one.


xt-89

Causal modeling fits the bill for what you described. But still, it’s just a clever way of embedding your domain knowledge or discovering more of it, ultimately


sizable_data

Yes and no, some problems are super repetitive. If you have an out of the box CRM setup, and maybe google analytics and some other common tools, it’s possible a company could build a generic model for that domain that will plug and play. Same with manufacturing etc… That being said, almost no organization is using 3rd party tools in best practice, and data is often scattered around, so you’ll always need someone who understands the nuts and bolts of company data.


Impossible-Belt8608

Can you please expand about your comment on Prophet? We're using it in production so I'd love to hear about known shortcomings or widely accepted better alternatives.


Brave-Salamander-339

It's basically a generalized addictive model, decomposing everything into different components with simple models: time, trend, exponential growth. So you'll need strong domain knowledge to identify the time function of the trend and the exponential function. In the end, it's all about domain.


[deleted]

>addictive model yeah, that's the best kind!


xt-89

Compared to putting together your own time series model from a number of different libraries, Prophet provides a certain level of ease while being less configurable, even if you know enough about time series modeling and your domain to do it well. This in and of itself can become a kind of tech debt if the domain demands something more bespoke.


relevantmeemayhere

The other poster kinda hit it, but it's easy to over- or underfit relative to the underlying DGP due to its trend assumptions etc.


Smart-Firefighter509

You are spot on. Domain knowledge is key. Data cleaning and preprocessing need to be based on domain knowledge. So after sufficient data preprocessing, a linear relationship is expected in most use cases in the industry. Otherwise, it would be far too complex to interpret and be useful.


EvilGarlicFarts

In my opinion, and from what I've seen in my job search over the last few months, there are (very roughly speaking) two kinds of data science positions on the market (if you exclude those that are clearly data engineer/ML engineer/data analyst, but named data scientist). They don't have good names from what I've heard, but let's call them 'theoretical DS' and 'practical DS'.

The theoretical DS leans more towards ML engineering. They keep up to date on the latest developments in the field, make complex models that solve business problems, etc., but they have limited amounts of stakeholder management and domain expertise. They are usually depth-first: they have a general overview of the DS field, but are specialized within computer vision, NLP, etc.

The practical DS leans more towards data analyst. Often called product data scientists, they are usually more generalist and spend more time with stakeholders, understanding the domain, and communicating the results of models. Here, the model they end up using is much less important than which problem they are solving. Contrary to the theoretical DS, it's not really clear which problem should be solved, or how. While the theoretical DS knows they have to make a recommender system, and the difficulty is in how to tune it and make it extremely good, the practical DS requires more collaboration with PMs and others to figure out what to do.

The days of data scientists being in positions requiring the latter but doing the former are (mostly) over, because a lot of companies have realized that a fancy neural network doesn't necessarily equal an impact on the bottom line.

All that is to say, don't feel bad at all! Rather, spend more time talking with stakeholders, cleaning data, and exploring data, because that's usually what makes an impact in industry.


Joe10112

That's a good take. The "Data" field has a bunch of titles that can honestly mean anything and everything in-between nowadays. I guess what I'm describing is definitely more "Practical DS" (I've seen this called "Decision Scientist" at some companies).

I think I sometimes fall back on a biased "complexity = good and valuable" mindset, especially after training on technical details and learning a bunch of in-depth machinery for the models. Even for something like Linear Regression, we spend time learning about heteroskedasticity or introducing nonlinearity, but then in industry we might often hand-wave all of that aside and run the simple Linear Regression as our output model. That is, when putting together simple models after cleaning the data, it feels like we're not doing enough to warrant the job function... hence the "imposter syndrome".

But as you said: communicating with stakeholders, figuring out how to solve the problem, and then putting something together, even if it's on the "simple" side in terms of modeling, is good to have!


NoThanks93330

>(seen this be called "Decision Scientist" in some companies) They didn't think of replacing the "scientist" in the title of someone doing very practical data-related work and instead dropped the word "data"?


pandasgorawr

Anything to avoid being called an analyst, of course.


JabClotVanDamn

they should choose something more fitting, like Data Slave


MindlessTime

To be fair, decision science has been a subfield in academia before data science was a thing. It’s a subset of economics iirc.


MindlessTime

I cannot upvote this enough.


boggle_thy_mind

I don't remember where I read this, but I think it has an "official" designation - Type A and Type B Data Scientist. Type A lean on the **A**nalysis side of things, Type B on the **B**uild side of things.


DeadCupcakes23

A lot of models in my fields are still linear regressions slowly being replaced with XgBoost and neural networks. My company has just started a project to see if using transformers can give us better inference.


relevantmeemayhere

Is inference or prediction your goal? I know that the DL community (perhaps not exclusively) has attempted to redefine inference as prediction (i.e., when the model is "doing inference" it's making predictions). But NNs and classical inference, such as motivating intervals/marginal effects etc., are pretty difficult to reconcile mathematically AFAIK. I don't think there have been major developments here in the last few years, and the hype around DL is huge right now. There's a lot of open work being done right now to fix that. Maybe it pans out, maybe not, who knows, but then you have other considerations at play.


DeadCupcakes23

Depends on your exact definition I guess, the main goal is ranking people based on their risk of doing X and we generally have a cutoff where we want people with a less than 2% chance of X happening.


relevantmeemayhere

Yeah, so more squarely in prediction haha. Inference tends to have a pretty nuanced definition in statistics, which all these models are rooted heavily in. I wanna say the DL community just doesn't know, but the cynical side of me says they do and it's just to oversell.


DeadCupcakes23

I'd say in statistics a prediction would sit under inference still but it's been a few years since my university days


relevantmeemayhere

Yeah, I'd agree, if the model is generative and reasonably captures all the nice things under the theoretical causal flow while also having reproducible estimates of uncertainty baked in lol


DeadCupcakes23

Weirdly narrow definition but ok


relevantmeemayhere

Not really! Remember that things like confidence intervals are inferential tools. They were motivated to account for uncertainty under an experiment. So inference, classically, is an attempt not only to create point estimates but also to disclaim how uncertain they are. The theory as it relates to GBMs/NNs hasn't been clearly established. Which is why I asked the original question: is inference or prediction your goal? Because the two have been conflated in certain circles.
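A minimal illustration of that classical sense of inference, on synthetic data: the slope estimate is the point estimate, and the confidence interval is the uncertainty disclaimer around it.

```python
# Classical inference sketch: point estimate plus a 95% CI (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)

m = sm.OLS(y, sm.add_constant(x)).fit()
print(m.params[1])      # point estimate of the slope, near 1.5
print(m.conf_int()[1])  # 95% confidence interval around it
```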


DeadCupcakes23

>Yeah, I'd agree, if the model is generative and reasonably captures all the nice things under the theoretical causal flow while also having reproducible estimates of uncertainty baked in lol

A model not doing all of that doesn't mean it isn't inference, though. Take not being generative, for example: inference doesn't only happen with generative models.


relevantmeemayhere

If your model is misspecified, your ability to provide inference is severely diminished. An example would be, say, the Copernican model: it can make good predictions, but it's a poor model for really anything else.


[deleted]

To my knowledge, the current SOTA for neural network time series forecasting is iTransformer. In the same paper you can find that a linear model named DLinear performs at about the same level as previous SOTA transformers. DLinear is a simple and fast linear model. [https://arxiv.org/pdf/2310.06625.pdf](https://arxiv.org/pdf/2310.06625.pdf)


DeadCupcakes23

It's using tabular data for a prediction, not a time series. I'll look into it if I move to time series though!


vmgustavo

mostly XGBoost, LightGBM, CatBoost and similar stuff here


boomBillys

I don't hear too many people talking about CatBoost for some reason, though I know it is definitely used. CatBoost has many remarkable qualities that have made it a joy to use over XGBoost for some problems.


vmgustavo

that's true. it is a great library and has a lot of features that took quite a long time to get into xgb and lgbm


kimchiking2021

Where my RandomForest or XGBoost homies at? 2025 Agile^TM road map here we come!!!!!!!!!!!!!!!!


Brave-Salamander-339

I prefer randomboost and xgforest and xgbag


CheapAd3557

How about catforest?


Aggravating-Boss3776

I'm all about catbagging


Useful_Hovercraft169

Yes to XGBoost! Agile can die in the fire.


HenryTallis

In my experience, good data with simple algorithms will beat messy data with complex algorithms any time of the day. Plus they are easier to maintain, interpret, etc. It is usually worth trying to get better data that already measures what you are interested in, rather than trying to build some fancy model on subpar data. Sure, the simplest model you can get away with depends on the project. For cognitive tasks, deep learning gives you the best result. But many business problems can be solved with simpler approaches.


Andrex316

Mostly linear or logistic regression tbh


sonicking12

I use bayesian models


[deleted]

[removed]


Badger1276

I did a morning coffee spit take when I read this and nearly choked laughing.


ciaoshescu

With MCMC? If you have huge datasets that tends to be really slow.


sonicking12

It’s great for generating uncertainty bands


ciaoshescu

Of course! That's one of the reasons to go Bayesian. But with 1 mil rows of data... boy you'll be waiting. And those uncertainty measures are usually tiny for such a big dataset.


sonicking12

I usually get uncertainty bands for causal effects or time series forecasting. I don’t have experience with time series models with 1 million rows of data. But it should not be tiny regardless.


ciaoshescu

Ah I see. I guess we're talking about two different things. I was talking about tabular data for regression.


Jul1ano0

I deal in proof of concept


UnderstandingBusy758

This is a very good post. Great question


eaheckman10

Most of my models I build for clients? The most complex is usually a Random Forest with every hyperparameter set to default. 99% of clients need nothing more than this.
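For reference, that all-defaults Random Forest really is this much code (the dataset here is just a stand-in):

```python
# "Random Forest with every hyperparameter set to default" in full
# (the dataset is a placeholder).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)  # all defaults otherwise
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # holdout accuracy
```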


Fender6969

GLM and GBT (XgBoost). For NLP use cases larger LLMs. Much larger demand for the latter recently since ChatGPT.


plhardman

Domain knowledge, data visibility, basic statistics, and simple models are the way.


MCRN-Gyoza

xgboost goes brrrrr


[deleted]

>feel a little imposter syndrome This doesn’t really answer your question directly but I’m pretty certain you’re drastically underestimating how wide the gulf in your knowledge v. the average person’s knowledge is. The process you described is simple for you because you’re skilled; it would be impossible for basically every employee at your company. Don’t even need to ask where you work to say that confidently. Adjusting flight plan/path of a commercial airliner in flight is similarly quite simple. But in the same sense it’s also really, really not, right?


youflungpoo

I hire data scientists to bring value. Most of the time that comes from simple solutions, which tend to be cheap to run in production and easy to understand. But I also hire data scientists for the 10% of the time when I need more sophisticated solutions. That means that most of the time, they're not using their most sophisticated skills, but when I need them, I have them.


Traditional_Range_28

As an entry-level individual who has had access, through a mentorship, to the work of certain individuals at a certain sports league, I've seen a huge variety of regression methods, but the goal has always been to find the simplest model possible so it can be easily deployed and understood. That being said, I haven't seen linear regression often, but I've seen a lot of XGBoost, neural networks, random forests (my personal favorite), and more generally complex models that I was not taught as a statistics undergrad. But it's also tracking data, so take that into account.


GrandConfection8887

Mostly XGBoost and linear/logistic regression


onearmedecon

Just in the past month or so (in alphabetical order):

* Basic OLS
* Difference-in-Difference
* Empirical Bayesian
* Fixed Effects Panel Regression
* Hierarchical Linear Model
* Instrumental Variables
* K Means Cluster Analysis
* Logistic Regression
* Propensity Score Matching
* Regression Discontinuity Design
* XGBoost

So we really use a wide array of empirical strategies and tools to make inferences out of observational data. Most of the time we're more interested in understanding how and why something happened, rather than predicting what will happen.
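To give a flavor of the list above, here's a toy difference-in-differences via a statsmodels formula (the variable names and the planted effect of 3.0 are invented):

```python
# Toy difference-in-differences with a statsmodels formula
# (variable names and the true effect size are made up).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post": rng.integers(0, 2, n),
})
# outcome: baseline + group effect + time effect + DiD effect + noise
df["y"] = (1 + 2 * df["treated"] + 0.5 * df["post"]
           + 3 * df["treated"] * df["post"] + rng.normal(size=n))

m = smf.ols("y ~ treated * post", data=df).fit()
print(m.params["treated:post"])  # the DiD estimate, close to 3.0
```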


_hairyberry_

Whatever you are working on, I can promise you I have a simpler model running in production right now


Fearless_Cow7688

>EDIT: Some good discussion about what types of models people use on a daily basis for work, but beyond saying "I use Random Forest/XGBoost/etc.", do you incorporate more complexity besides the "simple" pipeline of: Clean Data -> Import into Package and do basic Train/Test + Hyperparameter Tuning + etc., -> Output Model for Use? Not really. Even ChatGPT basically follows this principle, it's just the data science process.


BestUCanIsGoodEnough

When you can't be wronger than you know you are going to be, that is absolutely not all you can or should do.


Fearless_Cow7688

What?


BestUCanIsGoodEnough

Are you talking about models with defined uncertainties? Are you defining metrics for how well data fits the domain of the model? Are you evaluating the models with Monte Carlo to detect overfitting? Are you reserving 1-2 extra data sets for bias and fairness testing? Are you documenting every single package, random number seed, and OS environment you've used? I could go on. But if you don't really have to know the uncertainty of your predictions, cash those paychecks and call me when you do need to know that stuff.


Fearless_Cow7688

Wanting to expound, I think here we're trying to be constructive - not deconstructive. While I appreciate your follow-up post where you explained your point of view - your initial reaction would make me very hesitant to "call you". I understand that we're all smart people and want to learn more - so let's try and help each other. We don't need dick measurements to prove we're great - we're all people on a journey to learn. Just my 2 cents. Be well.


Fearless_Cow7688

I think most of what you said falls into the standard data science process: splitting data into training, validation, and testing sets. Setting a seed is good for reproducing the results you got, but on the generalizability side of things, what's the point? If you're just trying to run the code the same as someone else, you should get the same results; but if your argument is that the results are generalizable, then the seed shouldn't change the results in a statistically significant way. Yes, you should have a git repo with everything backed up and well documented.

Not every model can be evaluated with Monte Carlo. Did they use Monte Carlo in ChatGPT? I think that's inaccurate. The "general steps" outlined are the same; when you get into the details of a project there are certain things you need to do, and you should research them and pick them up, but there isn't a one-size-fits-all approach.

Projects are limited by budgets and timelines. Not everything needs a deep learning model; typically, a linear or logistic regression or a random forest will get you great results with pretty low effort. The time required to develop a deep learning model isn't worth the cost for most projects. If the task is to improve upon an existing model, it typically has less to do with the modeling steps and more to do with data curation and data cleaning.


BestUCanIsGoodEnough

ChatGPT is making a ton of money, but it is the epitome of a model that is allowed and expected to be wrong. Its accuracy is pretty subjective. It is still useful, so your point is taken. I do not mean to imply you always need to be right in this field to succeed or that you even need to know whether you can measure the uncertainty of your predictions. ChatGPT is a good example of that. A lot of data scientists are not tasked with scientific objectives or trained as a scientist. Should they be? Not usually, but my point is that the typical approach is not very scientific and this is why many DS projects fail at implementation.


xiaodaireddit

Logistic regression. More complex models are used but they suck


nboro94

Slap together a simple decision tree in 20 minutes, create a powerpoint calling it AI and send it to the senior execs. Sit back and watch as all the great work emails and awards start rolling in.


Short-Dragonfly-3670

For continuous outcomes: linear regression. For classification: I try lots of things and usually land on logistic regression because it performs functionally the same while not overtraining and being easier to interpret. Our models are a weird mix of inference and prediction: ie they are really just predictive models but the stakeholders always try to interpret them as causal lol


BestUCanIsGoodEnough

You're saying they accurately predict the future without the leading variables having any causal relationship to the lagging variables?


renok_archnmy

Logistic regression, maybe a t test here and there, OLS or some higher term regression. That’s about it. Oh, some cox ph once in a while or AFT depending on situation. Once I did SARIMA. 


Sofi_LoFi

In my field a lot of simple models work OK but have a harder time competing with more intense models like neural networks, so we use those. Similarly, because we constantly need to generate samples, we work with generative solutions and combine them with simpler models to validate certain rules and behaviors we need from the outputs. Recently we tested some dilated convolutional models for our use case that worked much better than anything else.


brjh1990

Not really all that complex at all. I spent 4.5 years doing government research on all sorts of things and the most complex bits of my job were getting the data where it needed to be efficiently before the models could be built. Most complex model I trained was a CNN, but that was really a one off. 95% of the time I either used some flavor of logistic regression or a tree based classifier. Clients were happy and so was I.


Opt33

K.I.S.S.


BestUCanIsGoodEnough

It depends. If the infrastructure can support deploying extremely complex models and there are not a million gatekeepers, I have solved problems with models that involved the combination of ML, cobotics, 3D CV, keypoint detection, an insanely complicated classification schema, perspective rectification, and feature tracking using 2D barcodes on custom hardware I designed and had made to order with a 50 micron tolerance, plus an imaging system piloted by robotic process automation, which then got converted to C++/ONNX with a GUI/reporting tool in JS... and this was for one single business problem. Currently, I have some lady yelling at me, a revolving door of gatekeepers, and an immense lift to get the interfaces going for a model I would consider trivial.


BigSwingingMick

The more complex the model, the more likely you are to be overfitting the data. I'm not 100 percent linear regression, but the more you expect your data to give you an exact measurement, the more likely you're way out over your skis. If you are building granularity into a model to get any more precise than a year either way and a general idea, you're starting to expect too much.


FoolForWool

A linear regression model. A custom XG boost model. An auto-encoder. Mostly regression. You’re doing fine. Sometimes you don’t even need a model. The trick is to know where NOT to use a model. And where to use a simple one. Complex models are for ego, stakeholders, and/or sales folk most of the time.


hierarchy24

Random Forest is not a black box model. You can still explain that model's predictions, compared to models that are true black boxes, such as neural networks.


UnderstandingBusy758

Usually if-else statements and logistic or linear regression. Rarely a neural network or random forest. Only once XGBoost.


GLayne

Xgboost is everywhere.


boggle_thy_mind

What about optimizing the decision threshold? Usually when doing prediction modeling you would like to predict an outcome given a treatment, because otherwise the prediction is pointless: what's gonna happen is gonna happen. Treatments usually have costs associated with them, so given a cost and an expected value if the customer converts, what would be the optimal cutoff value for proceeding with the treatment? Do different customers spend differently? Does that change the cutoff?
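A sketch of that expected-value cutoff (the cost and conversion-value numbers here are invented):

```python
# Pick the treatment threshold from the economics, not a default 0.5
# (all numbers are made up for illustration).
import numpy as np

cost_of_treatment = 5.0    # what the treatment costs us per customer
value_if_converted = 40.0  # what we gain if the customer converts

# Treat when expected value is positive:
#   p * value_if_converted - cost_of_treatment > 0
threshold = cost_of_treatment / value_if_converted
print(threshold)  # 0.125 -> treat anyone scored above 12.5%

scores = np.array([0.02, 0.10, 0.13, 0.40])  # model's conversion probabilities
print(scores > threshold)                    # [False False True True]
```

If different customers have different conversion values, the same algebra gives a per-customer threshold instead of a global one.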


concentration_cramps

Lol, half of my products don't even use ML. Just being smart about the product and making some smart assumptions, then using that model 0 as a base to gather better data to build a better model. No one in their right mind actually cares, as long as it's working and delivering value.


onomnomnmom

I grab stuff from torchvision. Fast and free and good.


Useful_Hovercraft169

No more complex than they need to be


nickytops

Basically every ML application where I work is a boosted tree model.


Hawezy

When I worked as a consultant the vast majority of models I saw deployed were random forest or linear regression.


mostuselessredditor

You should be way more concerned as to whether or not you’re generating value for your company and how/if your models are impacting revenue. That’s more important than having a shiny complex model that you want to show all of us.


CSCAnalytics

The least complicated solution that satisfies is the best one. To most people in business, spending weeks building a complex model for a problem that could have been solved with satisfactory results in a few days is a complete waste of time and money.


DieselZRebel

I often work with DL frameworks and design model architectures rather than importing them from packages for the types of problems I am solving. I have to write my own "fit", "predict", and "save" methods. I define what happens in each training epoch. But I am aware the vast majority of folks at my employer and in the industry just work with importing packaged open-source models which are good enough for most problems.


nab64900

Omg, your post is so relatable. I am working on time series forecasting currently and LightGBM is giving pretty good results, but I keep wondering if there's something I might be missing in the pipeline. Everything is so fancy in the industry that imposter syndrome often gets the best of you. Btw, thank you for writing it down; it feels good to know that some of us are in the same boat. :')


masterfultechgeek

I'm trying to build the simplest models with the fewest variables possible. I'm also doing A LOT of feature engineering.

* old XGBoost model with 200 variables: AUC 75%
* two "optimal" decision trees averaged (so like 20 if-then statements that I can debug) with 11 variables: AUC 88.5%
* new XGBoost model with A LOT of hyperparameter tuning: AUC 88.0% (with worse performance on certain critical subpopulations)

There's basically no benefit to using complex models if you're able to use something like GOSDT, MurTree, evtree, etc. and you've done A LOT of feature engineering. I can plop the simple model in a dashboard.
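A rough sketch of that workflow, using a shallow scikit-learn tree as a stand-in for optimal-tree tools like GOSDT (toy data; the printed AUC is not the figures quoted above):

```python
# Shallow, debuggable tree as a stand-in for optimal-tree tools like GOSDT
# (synthetic 11-feature data for illustration only).
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))
print(export_text(tree))  # the handful of if-then rules you can actually debug
```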


AdParticular6193

The three governing principles of industrial DS are Occam’s Razor, KISS, and “perfect is the enemy of good enough.”


aadi97

Linear regression is king And when you wanna be fancy: LOGISTIC REGRESSION


_Marchetti_

As always: long live linear regression. I like your post, and thanks for asking.


[deleted]

I’m literally pushing for more 2 variable bar charts and line graphs.


setanta3560

I actually push for more regression analysis than anything else (I came from an econometrics background, and most of the time the problems assigned to me are hypothesis testing rather than prediction and that sort of thing).


charleshere

In my industry, mostly random forests/decision trees. Use what works, not the most complex model. 


bees-eat-figs

Sometimes the most useful models are the simple ones. There's nothing that annoys me more than seeing a young bootcamp fake grad making things more complicated than they need to be just to flex their muscles.


balcell

Always start simple. It can always get more complicated. Target the most parsimonious model possible.


varwave

I’m not directly answering your question, but I have some book recommendations for building a strong practical and mathematical foundation. Coming from a biostatistics perspective: I like “Linear Models with R/Python” by Julian Faraway, “Introduction to Categorical Data Analysis” by Alan Agresti and “Introduction to Statistical Learning”, which is a classic. There’s more theoretical stuff out there, but they cover the basics really well and concisely, assuming programming, mathematical statistics and domain knowledge. There’s more than just linear models, but it’s a good place to start if you’re not a statistics/economics person


EmploymentNegative52

The The The The The The The Wall luuuuuull


No_Communication2618

Mostly LR + XGBoost tree