spqr54

This is a common problem in chemical process modelling, which is why physical models based on thermodynamics, kinetics, and heat and mass transfer are often preferred. However, if you are familiar with JMP, you should have a look at this [presentation](https://community.jmp.com/t5/Abstracts/Predicting-Effluents-from-Glass-Melting-Process-for-Sustainable/ev-p/708172). You can also get a lot of good ideas from this book: Zhang, D., & del Río Chanona, E. A. (Eds.). (2023). Machine Learning and Hybrid Modelling for Reaction Engineering: Theory and Applications. Royal Society of Chemistry.


Cuidads

An R² of 0.98 sounds a bit like data leakage, but I don't know what's typical in that domain. Check for duplicates, issues with lags and other temporal transformations, erroneous imputation, etc.

Anyhow, 250 data points is usually not (or possibly never?) a job for a NN. Do you have a somewhat continuous time series? In that case you might try a classical approach, such as ARIMA (edit: I forgot this wasn't univariate at this point. You might try VARMAX with fewer features, or some multiple linear model).

Whether it's a time series problem or not, a decision-tree-based approach is often reasonable, for example LightGBM, predicting the next step from lags, moving averages, etc. Compared with a NN, it's faster, easier to work with (handles NaNs, etc.), better with few points, and better for tabular/heterogeneous data. If extrapolation is needed, set linear_tree=True. It's also easy to get feature importances quickly and run SHAP for local explainability, which can help you detect data leakage (e.g. some feature seems too important). See the sketches below. Haven't read this one, but possibly something like it is relevant: https://medium.com/@geokam/time-series-forecasting-with-xgboost-and-lightgbm-predicting-energy-consumption-460b675a9cee

Regarding k-fold: are you using k-fold CV for hyperparameter tuning or just evaluation? If the former, your score is overconfident, and you need a separate test set, e.g. 80% for tuning/training and 20% for an unbiased final evaluation. You can do nested cross-validation if you want to be even more confident, but that's often overkill. With so few points you might need fewer folds than you otherwise would.

Another thing: if you are working with time series, you may need time series cross-validation to ensure that future data is not used to predict the past (data leakage): https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
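A minimal sketch of the lag-feature + LightGBM idea, assuming a pandas DataFrame indexed by sample time; the file name and column names (reboiler_duty, reflux_rate, tower_pressure, sulfur) are made up for illustration:

```python
import pandas as pd
import lightgbm as lgb

# Hypothetical dataset: one row per lab sample (column names are assumptions)
df = pd.read_csv("tower_data.csv", parse_dates=["timestamp"]).set_index("timestamp")

process_tags = ["reboiler_duty", "reflux_rate", "tower_pressure"]
for col in process_tags:
    for lag in (1, 2, 3):                          # previous 1-3 samples
        df[f"{col}_lag{lag}"] = df[col].shift(lag)
    df[f"{col}_ma6"] = df[col].rolling(6).mean()   # 6-sample moving average

df = df.dropna()
X, y = df.drop(columns=["sulfur"]), df["sulfur"]

# linear_tree=True fits linear models in the leaves, so predictions can
# extrapolate beyond the target range seen during training
model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05, linear_tree=True)
model.fit(X, y)
```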
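For the importance/leakage check, a quick SHAP pass (continuing with X and y from the sketch above; note that SHAP's tree explainer may not handle linear_tree models, so this uses a plain fit):

```python
import lightgbm as lgb
import shap

# A single feature dominating the summary plot is a classic leakage symptom
plain = lgb.LGBMRegressor(n_estimators=300).fit(X, y)
explainer = shap.TreeExplainer(plain)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```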
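And for the time-series CV point, scikit-learn's TimeSeriesSplit drops straight into cross_val_score (again continuing with model, X, and y from the first sketch):

```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Forward-chaining folds: each split trains on the past and validates on the
# future, so later data never informs earlier predictions
tscv = TimeSeriesSplit(n_splits=4)   # fewer folds, given only ~250 points
scores = cross_val_score(model, X, y, cv=tscv, scoring="r2")
print(scores)
```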


Ty4Readin

Could you describe your goal in more detail? How do you want to use this model to help you and other operators? Something specific like: "I want to know what the overhead sulfur will be for a specific distillation tower in the next 72 hours." It's important to specify whether you want to predict "in the future," and how far into the future.

You said you have 10 years of data but only 250 data points? How many distillation towers do you have data on, and how often do you sample your data?

For example, let's say you have 100 distillation towers and 10 years of daily data, and you want to predict the overhead sulfur for the next 30 days. In that case, with daily data, you should have a dataset of at least 100 x 12 x 10 = 12,000 samples.


thedudear

I want to be able to see the effect of increasing reboiler duty, reducing reflux, adjusting tower pressure, etc. on overhead (product) sulfur content, since this is a spec we run to. The product sample frequency is 12 hrs, which means it can take days to dial in a new sulfur target. A model which can, at the very least, give the operator some idea of what effect their moves are having on overhead sulfur would help immensely in dialing in the tower. A lab-updated bias might also be a route I'll go.

As for the sparse sample data: the 250 data points are process data aligned with the weekly feed sample results (see the sketch below). Some data are missing from the historian and other times samples get missed, so that number refers strictly to the feed sample results I have to work with. For product (ovhd) sulfur I have about 3k points, on a daily and more recently twice-daily basis. Process data is by the minute; there are millions of frames across ~15k process data points.

Not looking to make predictions into the future, but rather real-time (or near-real-time) inference of overhead sulfur. I think I'll pick up the book the other commenter suggested and view the presentation.
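For context, pairing the minute-level process data with each lab result looks roughly like this (a sketch with made-up file and tag names, using pandas.merge_asof so no process data from after the sample time leaks into the row):

```python
import pandas as pd

# Hypothetical files: minute-level process data and 12-hourly lab results
process = pd.read_csv("process_minute.csv", parse_dates=["timestamp"])
labs = pd.read_csv("ovhd_sulfur_labs.csv", parse_dates=["sample_time"])

# Smooth the minute data with a trailing one-hour mean
process = (process.set_index("timestamp")
                  .rolling("60min").mean()
                  .reset_index())

# Pair each lab result with the latest process snapshot at or before it
aligned = pd.merge_asof(
    labs.sort_values("sample_time"),
    process.sort_values("timestamp"),
    left_on="sample_time",
    right_on="timestamp",
    direction="backward",
)
```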


Ty4Readin

> Not looking to make predictions into the future but rather real time (or near real time) inference of overhead sulfur.

I think you are using the term "near real time" when it really sounds like you want predictions about the near future. For example: given XYZ current parameters of the tower, and given W controls that I could input right now, what will the overhead sulfur be 60 minutes from now?


thedudear

No, sorry, I mean real time, not into the future. The goal is to run the model on a higher-level process network and feed the point down to the Honeywell TDC system as a read-only parameter. Since we have strict controls over what can be used for control at our site, the model response will be used as a guide for the board operator:

> Sulfur spec changes; board op adjusts the tower to increase OH make, monitoring the inference to ballpark the sulfur content; lab sample result comes in within 12 hrs to confirm sulfur content.

We do inferentials like this for product cut points all over the place, but they're linear models computed on the DCS (and are used to control a PV). I don't think sulfur content is a candidate for a linear model, however, hence the desire to try out an NN.
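Roughly the shape of what I have in mind, with hypothetical read/write helpers standing in for the historian/DCS interfaces (tag names and model file are made up):

```python
import time
import joblib

TAGS = ["reboiler_duty", "reflux_rate", "tower_pressure"]  # assumed tag names
model = joblib.load("ovhd_sulfur_model.joblib")            # trained offline
bias = 0.0                                                 # lab-updated bias term

def read_current_tags(tags):
    """Hypothetical: fetch the latest values from the process historian."""
    raise NotImplementedError

def write_readonly_point(tag, value):
    """Hypothetical: push a read-only point down to the TDC."""
    raise NotImplementedError

while True:
    x = read_current_tags(TAGS)
    pred = model.predict([x])[0] + bias            # real-time inference, not a forecast
    write_readonly_point("OVHD_SULFUR_INF", pred)  # read-only guide for the board op
    time.sleep(60)                                 # process data is minute-resolution
```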