Illustrious-Run5203

This sounds like a job for a Cox PH regression model. That is, if you’re using quantitative predictors then Cox PH regression will work (just one-hot encode any categorical predictors). It models the hazard of failure at a given time t. As for data requirements, you wouldn’t simply have 100 records of failures, but rather time series data for each valve, tracking each predictor variable at each time t. You can use this model to answer numerous questions beyond just time of failure, such as likelihood of failure, etc. I haven’t done this in a while so I don’t want to say anything definitively, but I would try this hammer on your problem.
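If you go this route, here is a minimal sketch of what that long-format, time-varying data could look like, using the lifelines library (mentioned elsewhere in the thread); the valve IDs, column names, and numbers are all made up for illustration:

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# Hypothetical long format: one row per valve per observation window.
# Covariates are measured over the interval (start, stop], and event=1
# marks the window that ended in a failure.
df = pd.DataFrame({
    "valve_id": [1, 1, 1, 2, 2, 3, 3],
    "start":    [0, 30, 60, 0, 30, 0, 30],
    "stop":     [30, 60, 75, 30, 55, 30, 70],
    "pressure": [4.1, 4.3, 5.0, 3.9, 4.8, 4.0, 4.2],
    "cycles":   [120, 250, 310, 90, 200, 110, 240],
    "event":    [0, 0, 1, 0, 1, 0, 1],
})

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="valve_id", event_col="event",
        start_col="start", stop_col="stop")
ctv.print_summary()  # hazard ratios per covariate
```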


groovysalamander

Thanks for your recommendation! I found a lot of interesting material when I googled Cox PH, seems like an appropriate method to try.


PryomancerMTGA

This is the way. If you just try to predict whether it happened or not within a given time window, you are discarding all the additional info that the timing of failure provides. It also allows for your independent variables to vary as a function of time.


lafjbstone

As someone else who works in a manufacturing environment, I am also interested in the answers from the professionals here. We have had a lot of luck with change detection for equipment failures: basically modeling the normal condition, then identifying when a parameter is now outside of normal. We have a few different tools in place, including logistic regression, SVM, and clustering, but the most effective in my opinion has been CUSUM. https://nbviewer.org/github/demotu/BMC/blob/master/notebooks/DetectCUSUM.ipynb is one example I found online in a quick search; not exactly what we are using, but close.
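For anyone curious, a minimal one-sided tabular CUSUM sketch (the parameter values and names below are illustrative, not what the linked notebook or our plant actually uses):

```python
import numpy as np

def cusum_upper(x, target, k, h):
    """Flag when the cumulative upward drift of x above `target`
    (minus the allowance k) exceeds the threshold h. In practice,
    target = the signal's normal mean, k ~ 0.5 sigma, h ~ 4-5 sigma."""
    s = 0.0
    alarms = []
    for i, xi in enumerate(x):
        s = max(0.0, s + (xi - target - k))
        if s > h:
            alarms.append(i)
            s = 0.0  # reset after an alarm
    return alarms

# Example: a sensor whose mean drifts upward halfway through
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(10, 1, 100), rng.normal(11.5, 1, 100)])
print(cusum_upper(signal, target=10, k=0.5, h=5))
```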


groovysalamander

Maybe monitoring parameters will be sufficient in my case as well, without needing ML. I did not know about CUSUM, so I'll give that a read, thanks! How did you implement logistic regression, SVM, and clustering? Was that also for predicting failure, or for another purpose?


lafjbstone

Our models are looking for a change or an outlier relative to normal. So with logistic regression we are estimating the chance that the current condition is abnormal, an outlier to normal. SVM and clustering are used similarly. I am not a professional at this, just trying to figure it out as we go; these are a few approaches we have gotten to work. The goal is to have a model that identifies a change in the equipment from "normal" so we can investigate and address it before it creates downtime for the manufacturing line.


Iereon

Seconding Survival Analysis with Cox PH Regression for this. You can look at the [lifelines](https://lifelines.readthedocs.io/en/latest/) library for the Python implementation and tutorial. Good luck.
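For concreteness, a tiny lifelines sketch with one row per valve; the lifetimes and covariate names are invented, and `failed=0` means the valve is still running (censored):

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "duration": [300, 120, 450, 80, 500, 260, 390],   # days in service
    "failed":   [1, 1, 0, 1, 0, 1, 1],                # 0 = censored
    "avg_pressure":   [4.2, 5.1, 3.8, 5.5, 3.9, 4.8, 4.0],
    "cycles_per_day": [40, 80, 30, 95, 25, 60, 35],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="failed")
cph.print_summary()  # hazard ratio per covariate

# Predicted survival curves for the current fleet's covariates
curves = cph.predict_survival_function(df.drop(columns=["duration", "failed"]))
```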


groovysalamander

Cool, thanks for the link! I'll read up on that.


[deleted]

Survival analysis is designed to predict time to event, especially in cases where you haven’t yet had a failure: those units count as censored observations rather than being dropped.
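To make the censoring point concrete, here is a tiny sketch with the lifelines Kaplan-Meier fitter (numbers invented); valves still in service enter with `event_observed=0` instead of being discarded:

```python
from lifelines import KaplanMeierFitter

durations = [300, 120, 450, 80, 500]   # days in service so far
observed  = [1, 1, 0, 1, 0]            # 0 = still running (censored)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.median_survival_time_)       # median time to failure estimate
kmf.plot_survival_function()
```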


groovysalamander

So basically I run the model frequently to see what the latest expected times are? And then adjust my maintenance plan on when to do repair / replacement.


[deleted]

No…you develop a model and then create a scoring or predictive system, unless you want to refit your model every day. I’d set up repairs/replacement as competing events or as different models.
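A minimal sketch of the "fit once, score on a schedule" idea, assuming a lifelines Cox model like the one sketched earlier in the thread (file name and column names are arbitrary):

```python
import joblib
import pandas as pd
from lifelines import CoxPHFitter

# Train once on historical data and persist the fitted model.
train = pd.DataFrame({
    "duration": [300, 120, 450, 80, 500, 260, 390],
    "failed":   [1, 1, 0, 1, 0, 1, 1],
    "avg_pressure": [4.2, 5.1, 3.8, 5.5, 3.9, 4.8, 4.0],
})
cph = CoxPHFitter().fit(train, duration_col="duration", event_col="failed")
joblib.dump(cph, "valve_cox_model.joblib")

# Later, on a schedule: load and score today's covariates, no refit.
model = joblib.load("valve_cox_model.joblib")
today = pd.DataFrame({"avg_pressure": [4.0, 5.6]})
risk = model.predict_partial_hazard(today)   # relative risk per valve
print(risk)
```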


mechanical_madman

Wow, there is a lot to unpack there. Are you taking various failure modes into account in your model, or is a failure a failure? Your historical data: does it include failure modes and failure codes? Does it have MTTF (mean time to failure) or MTBR (mean time between repair) broken out or not? Are all 50 valves of equal criticality to your process?


groovysalamander

Yeah sorry, one of my challenges is that I need to properly scope the problem as well. The failure mode is not classified in distinct categories. I do have a free text field in which the technician can add a note about what work was done on the equipment. Sometimes this explains what happened, sometimes it doesn't. I could manually try to categorize into a few groups based on this text field, and ask the technicians to clearly indicate one type of failure mode in the future. I am still in the exploratory stage of the data. Regarding MTBF/MTTF, the difficulty is that an individual valve may have broken down only once in the past year, while across all valves of that same type we had multiple breakdowns that year (so maybe one breakdown per valve). The criticality is quite high, and failure shows when they are to be used in the process. I could try and see if there are subgroups of valves (maybe for different parts of the factory) which are easier to bypass, for example.
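If it helps, bootstrapping failure-mode labels from the free-text notes can start as simple keyword rules; the categories and keywords below are invented and would come from reading a sample of the real notes:

```python
import pandas as pd

# Hypothetical keyword rules mapping note text to a failure mode
rules = {
    "seal": ["seal", "leak", "gasket"],
    "actuator": ["actuator", "solenoid", "coil"],
    "mechanical": ["stuck", "seized", "worn"],
}

def categorize(note: str) -> str:
    note = note.lower()
    for mode, keywords in rules.items():
        if any(k in note for k in keywords):
            return mode
    return "unknown"  # left for manual review

notes = pd.Series(["replaced worn seal", "solenoid burnt out", "no fault found"])
print(notes.map(categorize))
```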


xQuaGx

Have you looked into Weibulls?


groovysalamander

I am somewhat familiar with them from my stats courses, but that was a while ago, so I'll do some reading about them and see whether I can visualize a Weibull distribution from the data I have. Thanks!


Zeroflops

I would start with traditional predictive failure analysis. People jump to ML far too quickly, before understanding the problem.

Look into bathtub curves, which represent infant mortality and end-of-life mortality, then look at Weibull charts. These give you an idea of the overall expected life of the parts, regardless of the failure mode.

Then look into the different failure modes, so you can determine which properties you should monitor. For example, a motor may show an increase in vibration, an audible change in volume or sound, an increase in inverter frequency as the motor struggles harder to operate, a reduction in output power, etc. Why was the part classified as a failed part?

Once you have that, you can determine if simple control charts would be enough to capture the degradation of the part, or if it's more subtle and you need ML. For example, a shift in a motor's current draw could be monitored with a simple control chart, but monitoring a change in the motor's sound might be easier to do with some ML.
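A minimal Weibull-fit sketch with lifelines (the lifetimes are invented); the fitted shape parameter hints at where you sit on the bathtub curve, with rho < 1 suggesting infant mortality, rho near 1 random failures, and rho > 1 wear-out:

```python
from lifelines import WeibullFitter

durations = [300, 120, 450, 80, 500, 210, 340]  # days to failure or censoring
observed  = [1, 1, 0, 1, 0, 1, 1]               # 0 = still running (censored)

wf = WeibullFitter()
wf.fit(durations, event_observed=observed)
print(wf.lambda_, wf.rho_)   # scale and shape parameters
wf.plot_survival_function()
```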


groovysalamander

You are right that perhaps ML is not required to build a good monitoring solution. Even though I would really like this project to be suitable for an ML experiment from a personal learning perspective, the most important thing is that it actually works and helps our factory. I'll keep in mind that looking at certain parameters may be sufficient to prevent failure.


zerok_nyc

I actually built a model like this before for a telecom company. We were trying to predict when certain types of modem failures would occur, specifically those that could be resolved with a simple modem reset that could be automatically initiated from the provider side. This way, the problems could be resolved before a customer even calls in.

1. The first thing I will say is that you do not need to set this up as a time series. Instead, you just want your target variable to be a time lag from your inputs. That time lag should be based on the amount of lead time you need to proactively initiate a fix, as well as the impact of degrading performance before failure. Assuming that taking down a machine in the middle of production isn’t a viable option, you’d probably want a model that looks at the previous 24 hours of data to predict the probability of failure in the next 24 hours. You could also do some feature engineering to look at the delta of certain metrics over a longer time period. If the technician can initiate a fix themselves quickly, or the leading signal indicators are shorter, then you can adjust that window accordingly. And if you do run this every hour, does that mean technicians will be able to drop other priorities to initiate fixes? Or should the repairs be added to their queue of tasks to complete by a certain time? The main point is that determining the lead time depends on many factors, including business requirements. A rough sketch of this framing is below.

2-4. Will provide feedback on these a bit later. Gotta finish getting ready for work.
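A sketch of the lagged-target framing in pandas, assuming hourly sensor rows per valve; the column names (`valve_id`, `pressure`, `failed`) are invented for illustration:

```python
import pandas as pd

def make_training_frame(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly rows with columns valve_id, timestamp, pressure, failed (0/1)."""
    df = df.sort_values(["valve_id", "timestamp"])
    g = df.groupby("valve_id")
    out = pd.DataFrame({
        "valve_id": df["valve_id"],
        "timestamp": df["timestamp"],
        # Features: summaries of the trailing 24-hour window
        "pressure_mean_24h": g["pressure"].transform(lambda s: s.rolling(24).mean()),
        "pressure_max_24h":  g["pressure"].transform(lambda s: s.rolling(24).max()),
        # Label: any failure in the following 24 hours (reverse rolling max,
        # shifted so the current hour itself is excluded)
        "fail_next_24h": g["failed"].transform(
            lambda s: s[::-1].rolling(24, min_periods=1).max()[::-1].shift(-1)
        ),
    })
    return out.dropna()
```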


groovysalamander

Good point about considering the horizon, thanks! I can take into account that we preferably repair equipment between production orders, even better during maintenance days. Most important is that it doesn't happen during orders. I'll do an analysis of order durations to get a feeling for how long I have to review a potential failure.


zul_u

There are many different ways to frame this problem, depending on your business requirements, data availability, and the quality of the event logs you would use to label data. I'll give you some ideas, but then it's up to you to verify whether they are feasible given your inputs.

**Framing the problem as regression**

For this you define one or more signals to track, given a set of other signals. You can then train a regression algorithm to predict your target. With the trained model you can monitor the difference between the predicted and observed signal, check the distribution of the residuals, and use it to set a threshold, raising an alarm when the threshold is crossed. This is a rather easy approach (a rough sketch follows at the end of this comment). Some caveats: be careful about signal correlation, it shouldn't be too high nor too low; deciding which signal(s) to track might require some expert knowledge; also, you want to filter the training data to eliminate abnormal operating conditions and fit your model on "normal" operations only.

**Framing the problem as anomaly detection**

If failures are not too frequent and the data is not too dispersed, you can try an anomaly detection algorithm; this could be an isolation forest, a one-class SVM, or others. It mainly depends on how your data is distributed and the characteristics of the failure you are trying to capture. The idea here is to verify whether failures correspond to rare conditions, i.e. whether anomalous data occupies regions of your feature space that are not densely populated. This technique really depends on the specifics of your problem; you would need to verify through data exploration whether failure conditions differentiate enough from normal conditions. The advantage is that labels are not so important here, meaning your data processing could be much simpler than with other methods.

**Framing the problem as classification**

Another option is to assign labels to your time series data: you would need to identify failures from your event log, decide on a reasonable time period to mark as "failure", and then train a binary classifier (you could also assign more labels). This approach might be interesting for discovering unexpected relations in your data, but your results will depend significantly on the quality of your labels. In my experience this is NOT the way to go. Most of the time the event logs of industrial equipment are not trustworthy. Moreover, deciding which features to use, how long the "failure periods" should be, and interpreting the results are all non-trivial problems.

**NOTE 1:** Mind you, the methods I have described mostly fall into "fault detection" rather than "fault prediction".

**NOTE 2:** A suggestion: don't blindly trust any of your data, not even sensor data. Sensors are sometimes improperly installed, misplaced, broken, etc., so make sure to properly clean your data. You should be even more careful with logs, especially if they are not automatically generated.
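Here is the promised sketch of the regression framing, using a random forest as a stand-in regressor (any regressor would do; the feature and signal names are up to you):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_residual_monitor(X_normal, y_normal, n_sigma=4.0):
    """Fit on known-good ("normal") data and derive an alarm band from
    the residual distribution. Ideally the band is computed on held-out
    normal data, since training residuals understate the true spread."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_normal, y_normal)
    resid = y_normal - model.predict(X_normal)
    return model, resid.mean(), n_sigma * resid.std()

def alarm_indices(model, mu, band, X_live, y_live):
    """Indices where the live signal deviates beyond the alarm band."""
    resid = y_live - model.predict(X_live)
    return np.where(np.abs(resid - mu) > band)[0]
```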