Submitted by RAFisherman t3_114d166 in MachineLearning

I've been studying ARIMAX, XGBoost, MLForecast, and Prophet. As a newcomer to any method, I like to start with an exhaustive comparison of tools to understand where they succeed and fail. After exploring ARIMA/XGBoost, I came across MLForecast/Prophet, but I'm left with the following questions:

  1. Why is MLForecast better than out-of-the-box XGBoost? Sure, it does feature engineering and it appears to do dynamic predictions on your lagged features, but is that it? Does it do hyperparameter tuning? Does it model seasonal trends like Prophet does?
  2. I see that you can use exogenous features in Prophet, but how does this scale? Let's assume I have 50 predictors. How does Prophet handle these? I found this in the docs and this other person's post explaining how to do it, but I've largely come away with the impression that it's pretty hard compared to just doing it with XGBoost.
  3. Is ARIMAX still competitive? Are there any papers comparing out-of-sample predictions with ARIMAX vs. XGBoost vs. Prophet vs. Fable? Or does it just depend on your dataset, and I should try all four?

I have time series data with dozens of "known" inputs (such as ad spend) and a lot of external data (CPI, economic health, stocks, etc.). My goal is to use my model to optimize my target by "plugging in" ad spend and dynamically forecasting the economic data.
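
For concreteness, here's a minimal sketch of the MLForecast workflow in question 1, assuming a recent release of Nixtla's mlforecast (where the forecast horizon argument is `h`) and a hypothetical `panel.csv` in their long format (`unique_id`, `ds`, `y`); the lag choices are arbitrary:

```python
import pandas as pd
from lightgbm import LGBMRegressor
from mlforecast import MLForecast

# Hypothetical long-format frame in Nixtla's convention:
# unique_id (series), ds (timestamp), y (target).
df = pd.read_csv("panel.csv", parse_dates=["ds"])

fcst = MLForecast(
    models=[LGBMRegressor()],
    freq="D",
    lags=[1, 7, 28],   # the lagged features MLForecast engineers for you
)
fcst.fit(df)
preds = fcst.predict(h=14)  # recursive 14-step forecast; lags are updated dynamically
```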

88

Comments


pyfreak182 t1_j8vpx4e wrote

In case you are not familiar, there are also Time2Vec embeddings for Transformers. It would be interesting to see how that architecture compares as well.
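
For reference, a minimal sketch of the Time2Vec transform itself (Kazemi et al., 2019), with NumPy standing in for the learned layer; in the actual model `omega` and `phi` are learned parameters, not random draws:

```python
import numpy as np

def time2vec(tau, omega, phi):
    """Time2Vec of a scalar time tau: one linear component (trend)
    plus k-1 sinusoidal components (learned periodic patterns)."""
    linear = omega[0] * tau + phi[0]
    periodic = np.sin(omega[1:] * tau + phi[1:])
    return np.concatenate(([linear], periodic))

# k = 8; random omega/phi here only to make the sketch runnable.
rng = np.random.default_rng(0)
omega, phi = rng.normal(size=8), rng.normal(size=8)
print(time2vec(3.5, omega, phi))  # an 8-dim embedding of t = 3.5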

27

dancingnightly t1_j8y81v9 wrote

Do you know of any similar encoding where you vectorise relative time, as multiple proportions of completeness, if that makes sense?

Say, completeness within a paragraph, within a chapter, within a book? (Besides sinusoidal embeddings, which push up the number of examples you need.)

3

RAFisherman OP t1_j8x2wdw wrote

Didn’t think of that. Will take a look!

I do care about interpretability to some extent, which is why embeddings sound complex. But I'm now curious for sure.

2

RAFisherman OP t1_j8x39qj wrote

After skimming the paper, it seems like Time2Vec is kind of like a "seasonality" factor (kind of like what Prophet outputs). Is that true?

2

jimliu741523 t1_j8vrlj0 wrote

Unfortunately, "no single machine learning algorithm is universally the best-performing algorithm for all problems", per the no-free-lunch theorem. That is, you just quickly try each algo on your task and pick the best one on proper validation.

18

weeeeeewoooooo t1_j8wyqaj wrote

You should probably try all four. There are some simple ways to do the comparisons yourself. You can easily compare time-series models and the robustness of their training by using them to recursively predict the future, feeding their outputs back into themselves (regardless of whether they were trained in that fashion).

This will expose the properties of the eigenvalues of the model itself. If a time-series model fails to match the larger eigenvalues of a system, it is failing the fundamentals and cannot capture the most basic global properties of the system you are trying to fit.

You don't necessarily have to do any fancy calculations. If the model fails to maintain the same qualitative patterns apparent in the original data over long stretches of self-input, then it is failing to capture the underlying dynamics. Many models eventually explode, or decay to a trivial attractor (a fixed value or a simple cycle). This is a red flag that either the model is inadequate or training has failed you.

A simple dummy test for this would be training on something like a spin glass or a Lorenz attractor (any kind of chaotic system, really). Or just look along any interesting dimension of the data that you are using. A good model, when recursively applied to itself, will behave very similarly to the original signal regardless of phase; a sketch of this test is below.
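
A minimal sketch of this closed-loop test on the Lorenz system, using ridge regression as a deliberately weak one-step model (names are illustrative); a model that captured the dynamics would keep tracing the attractor under recursion, while this linear baseline will typically decay to a point or drift off it:

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.linear_model import Ridge

# Simulate the Lorenz attractor as the "true" chaotic system.
def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = s
    return [sigma * (y - x), x * (rho - y) - z, x * y - beta * z]

ts = np.linspace(0, 50, 5000)
traj = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0], t_eval=ts).y.T

# Fit a one-step-ahead predictor: state_t -> state_{t+1}.
model = Ridge(alpha=1e-6).fit(traj[:-1], traj[1:])

# Closed-loop rollout: feed each prediction back in as the next input.
state, rollout = traj[2500].copy(), []
for _ in range(1000):
    state = model.predict(state.reshape(1, -1))[0]
    rollout.append(state)
rollout = np.array(rollout)

# A model that captured the dynamics keeps wandering over the butterfly-
# shaped attractor; this linear baseline typically collapses or blows up.
print("rollout std per dim: ", rollout.std(axis=0))
print("original std per dim:", traj.std(axis=0))
```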

10

BenXavier t1_j8x2m9b wrote

Hey, this is quite interesting, but beyond my depth. I know that eigenvalues are derived from linear transformations; how do you expose the linear component of a given TS model by recursively using it?

Sorry for the basic question: tutorials, books and references are welcome

1

weeeeeewoooooo t1_j8xoy8u wrote

This is a great question. Steve Brunton has some great videos about dynamical systems and their properties that are very accessible. This one I think does a good job showing the behavioral relationship between the eigenvalues and the underlying system: https://youtu.be/XXjoh8L1HkE

Recursive application of a system (model) over a "long" period of time gets rid of transients, so the system falls onto its governing attractors, which are generally dictated by the eigenvalues of the system. The recursive application also isolates the system, so you observe the model running autonomously rather than being driven by external inputs. This helps you tease out how expressive your model actually is versus how dependent it is on being fed the target system's observations, which helps reduce overfitting and bias. A toy linear example is sketched below.
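
A toy illustration of the eigenvalue point, assuming the simplest case of a linear one-step model x_{t+1} = A x_t, where the moduli of A's eigenvalues fully determine the long-run behavior under recursion:

```python
import numpy as np

# Toy linear one-step model x_{t+1} = A x_t. The eigenvalue moduli of A
# dictate the autonomous behavior: >1 explodes, <1 decays to the origin,
# complex pairs with modulus near 1 give sustained oscillation.
A = np.array([[0.9, -0.4],
              [0.4,  0.9]])          # rotation plus slight contraction
print(np.abs(np.linalg.eigvals(A)))  # moduli ~0.985: a slowly decaying spiral

# Recursive application washes out transients and reveals the attractor.
x = np.array([1.0, 0.0])
for _ in range(500):
    x = A @ x
print(x)  # spiraled in toward the fixed point at the origin, as predicted
```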

6

hark_in_tranquillity t1_j8vzo0i wrote

The docs link you shared is not about Prophet handling exogenous variables; it's about handling holidays, which is a separate "feature".

Nevertheless, Prophet's explainability for exogenous influence is bad. Another problem with Prophet's regressor (exogenous features) functionality: say you have 10 exogenous variables. You'll have to go through every possible combination of the 10 to find the best subset, and that compute grows exponentially (2^10 candidate subsets). A sketch of the regressor API is below.
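
For reference, registering exogenous regressors in Prophet looks roughly like this; the file and column names are hypothetical, but `add_regressor` and `make_future_dataframe` are Prophet's own API:

```python
import pandas as pd
from prophet import Prophet

# Prophet expects columns ds (timestamp) and y (target); each exogenous
# variable is just another column you register one by one.
df = pd.read_csv("sales.csv", parse_dates=["ds"])  # hypothetical: ds, y, ad_spend, cpi

m = Prophet()
for col in ["ad_spend", "cpi"]:   # every extra regressor is registered by name
    m.add_regressor(col)
m.fit(df)

# At prediction time you must supply values for every regressor on every
# row, which is exactly the scaling pain with 10+ variables.
future = m.make_future_dataframe(periods=30)
future = future.merge(df[["ds", "ad_spend", "cpi"]], on="ds", how="left")
future["ad_spend"] = future["ad_spend"].fillna(0.0)  # assumed future spend
future["cpi"] = future["cpi"].ffill()                # naive carry-forward
forecast = m.predict(future)
```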

On the other hand, ML algorithms are nice for this reason: if you do the data pre-processing right and take care of multicollinearity and endogeneity to some extent, the influence of exogenous features is much more explainable.

As someone mentioned, check out the M5 competition; you'll find a lot of reasons why ML-based approaches that learn on panel data are SOTA right now. Don't skip experimentation, though.

7

tblume1992 t1_j8y9oti wrote

  1. MLForecast treats it more like a time series: it does differencing and moving averages as levels to encode the general level of each series, along with the AR lags. That's not entirely necessary; you can just scale at the time-series level (a standard scaler, or even Box-Cox), pass a series 'id' as a categorical variable to LightGBM, and outperform MLForecast, although it is pretty snappy with how they have it written (see the sketch after this list).
  2. I honestly just wouldn't use Prophet in general... But if you have 50 regressors, it (I believe) fits them with a normal prior, which is equivalent to ridge regression, so it shrinks the coefficients, but you are stuck with this 'average' effect.
  3. ARIMAX absolutely still has a place, but it really all comes down to your features. If you have good-quality predictive features, then it is usually better to do ML and 'featurize' the time pieces. You lose out on the time component but gain a lot from the features. There are other issues, like now you potentially have to forecast those features too. The alternative is having bad features; in that case you are usually stuck with standard time series methods. So it really is 100% dependent on your data and on whether there is value in learning across multiple time series.
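
A rough sketch of the LightGBM-on-panel-data approach from point 1, with hand-rolled lag features and the series id as a categorical; the file and column names are hypothetical:

```python
import pandas as pd
import lightgbm as lgb

# Hypothetical long-format panel: one row per (series_id, date) with target y.
df = pd.read_csv("panel.csv", parse_dates=["date"])
df = df.sort_values(["series_id", "date"])

# Hand-rolled lag and rolling features, roughly what MLForecast automates.
for lag in (1, 7, 28):
    df[f"lag_{lag}"] = df.groupby("series_id")["y"].shift(lag)
df["roll_mean_7"] = df.groupby("series_id")["y"].transform(
    lambda s: s.shift(1).rolling(7).mean())

df["series_id"] = df["series_id"].astype("category")  # the id trick: one global model
df["month"] = df["date"].dt.month

features = ["series_id", "month", "lag_1", "lag_7", "lag_28", "roll_mean_7"]
train = df.dropna(subset=features)
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(train[features], train["y"], categorical_feature=["series_id"])
```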

An alternative view is hierarchical forecasting which sometimes works well to take advantage of higher level seasonalities and trends that may be harder to see at the lower level and outperforms ML a good chunk in my experience unless you have good regressors.

As many are saying, SOTA is boosted trees with time features. If the features are bad, then it's TS stuff like ARIMAX. The best way to find out is to test each.

Edit: Regarding M5, there was a lot of 'trickery' done to maximize the cost function there, so it might not be all that useful, at least in my experience.

4

idly t1_j8vsaao wrote

Look up the M5 forecasting conference/competition; there are papers discussing the results that may be helpful.

3

___luigi t1_j8ymhhk wrote

Recently, we started evaluating Time Series Transformers. TSTs showed good performance in comparison to other TS DL methods.

2

2dayiownu t1_j8wpinb wrote

Temporal Fusion Transformer

1

emotionalfool123 t1_j8wvvda wrote

This lib has a lot of implementations, including the one you mentioned.

https://unit8co.github.io/darts/index.html

3

dj_ski_mask t1_j8x2m11 wrote

I am knee deep in this library at work right now.

Pros: they implement tons of algos and regularly update with the 'latest and greatest,' like NHITS. They can also scale with GPUs/TPUs for the algos that use a Torch backend. Depending on the algo you can add covariates, and the "global" models for multivariate time series are impressive in their performance.

Cons: my god, it's a finicky library that takes considerable time to pick up. Weird syntax/restrictions for scoring and evaluating. Differentiating between "past" and "future" covariates is not as cut-and-dried as the documentation makes it seem (a sketch of the distinction is below). Also, limited tutorials and examples.
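
For a flavor of that covariate distinction, a minimal darts sketch with hypothetical data; NHiTS only accepts past covariates (observed up to the forecast point), while models like TFT also take future ones (known ahead of time):

```python
import pandas as pd
from darts import TimeSeries
from darts.models import NHiTSModel

# Hypothetical frame with a datetime column plus target and a covariate.
df = pd.read_csv("demand.csv", parse_dates=["date"])
target = TimeSeries.from_dataframe(df, time_col="date", value_cols="y")

# "Past" covariate: only observed up to the forecast point (e.g. CPI).
past_cov = TimeSeries.from_dataframe(df, time_col="date", value_cols="cpi")

model = NHiTSModel(input_chunk_length=30, output_chunk_length=7)
model.fit(target, past_covariates=past_cov)
forecast = model.predict(n=7, past_covariates=past_cov)
```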

All in all I like it and am making a speed run to learning this library for my time series needs.

To OP I would suggest NHITS, but also, the tree-based methods STILL tend to win with the data I work with.

3

emotionalfool123 t1_j8x49h8 wrote

Then it seems this is equivalent to the confusion that R time series libraries cause.

2

clisztian t1_j8z3t1r wrote

I guarantee you a state space model will beat out any fancy-named transformer for most "forecastable" problems. Even MDFA (signal extraction + exp integration for forecasting) will beat out these big ML models. A minimal structural-model baseline is sketched below.
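
As a concrete baseline of the sort being described, a minimal structural state space model in statsmodels (hypothetical data file; local linear trend plus weekly seasonality, estimated via the Kalman filter):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical univariate series, e.g. daily demand.
y = np.loadtxt("series.txt")

# Structural model: local linear trend + period-7 seasonal component.
model = sm.tsa.UnobservedComponents(y, level="local linear trend", seasonal=7)
res = model.fit(disp=False)

print(res.summary())
forecast = res.forecast(steps=14)  # 14-step out-of-sample forecast
```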

1

dj_ski_mask t1_j9345mc wrote

I should mention that with some tuning I have been able to get NHITS to outperform Naive Seasonal, CatBoost with lags, and ES models, so it's not terrible.

2