> TL;DR: We paid $800 USD and spent 4 hours in the AWS Forecast console so you don't have to.

In this reproducible experiment, we compare Amazon Forecast and StatsForecast, an open-source Python library for statistical methods.

Since AWS Forecast specializes in demand forecasting, we selected the M5 competition dataset as a benchmark; the dataset contains 30,490 series of daily Walmart sales.

We found that Amazon Forecast is 60% less accurate and 669 times more expensive than running an open-source alternative on a simple cloud server.

We also provide a step-by-step guide to reproduce the results.

Results

Amazon Forecast:

achieved 1.617 in error (measured in wRMSSE, the official evaluation metric used in the competition),
took 4.1 hours to run,
and cost 803.53 USD.

An ensemble of statistical methods trained on a c5d.24xlarge EC2 instance:

achieved 0.669 in error (wRMSSE),
took 14.5 minutes to run,
and cost only 1.2 USD.
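
For readers unfamiliar with the metric, here is a minimal sketch of the idea behind wRMSSE, following the M5 definition as I understand it. Each series' squared forecast error is scaled by the mean squared one-step difference of its training history, and the per-series scores are combined with weights. The function names, toy numbers, and weights are illustrative assumptions, not code from the experiment. The official metric also skips leading zeros, weights series by dollar sales, and averages over the 12 aggregation levels, all of which this sketch omits.

```python
from math import sqrt

def rmsse(train, actual, forecast):
    """Root mean squared scaled error for a single series."""
    # Scale: mean squared one-step difference of the training history,
    # i.e. the in-sample error of a naive "repeat the last value" forecast.
    scale = sum((b - a) ** 2 for a, b in zip(train, train[1:])) / (len(train) - 1)
    mse = sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)
    return sqrt(mse / scale)

def weighted_rmsse(series, weights):
    """Weighted sum of per-series RMSSE; weights are assumed to sum to 1."""
    return sum(w * rmsse(*s) for s, w in zip(series, weights))

# Toy data (made up): (training history, actuals, forecasts) per series.
series = [
    ([1, 3, 2, 4], [3, 5], [3, 3]),
    ([0, 2, 0, 2], [2, 0], [1, 1]),
]
weights = [0.75, 0.25]  # hypothetical weights
score = weighted_rmsse(series, weights)
print(round(score, 4))
```

A score below 1 means the forecast beats the naive one-step benchmark on the scaled error; lower is better.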

For this dataset, we therefore show that:

Amazon Forecast is 60% less accurate and 669 times more expensive than running an open-source alternative on a simple cloud server.
Classical methods outperform Machine Learning methods in terms of speed, accuracy, and cost.
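
The headline ratios can be checked directly from the figures reported above. The short sketch below does the arithmetic. Reading the rounded "60%" as the relative reduction in wRMSSE (about 59%) is an assumption about how the figure was derived, so the relative increase is printed as well.

```python
# Figures reported in the post.
amazon = {"wrmsse": 1.617, "minutes": 4.1 * 60, "usd": 803.53}
stats = {"wrmsse": 0.669, "minutes": 14.5, "usd": 1.2}

cost_ratio = amazon["usd"] / stats["usd"]          # how many times more expensive
speedup = amazon["minutes"] / stats["minutes"]     # how many times slower
error_reduction = 1 - stats["wrmsse"] / amazon["wrmsse"]  # StatsForecast vs Amazon
error_increase = amazon["wrmsse"] / stats["wrmsse"] - 1   # Amazon vs StatsForecast

print(f"cost ratio:      {cost_ratio:.1f}x")
print(f"run time ratio:  {speedup:.1f}x")
print(f"error reduction: {error_reduction:.1%}")
print(f"error increase:  {error_increase:.1%}")
```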

Although using StatsForecast requires some basic knowledge of Python and cloud computing, the results are better for this dataset.

Table

https://preview.redd.it/vt9ru0149i5a1.png?width=1274&format=png&auto=webp&s=64e6d4519f5934d56d25d76d17a58e6d03d70512

Comments

new_name_who_dis_ t1_izzo58u wrote on December 13, 2022 at 12:45 AM

I totally buy this. However you said

> Classical methods outperform Machine Learning methods in terms of speed, accuracy, and cost

Those classical methods are also machine learning methods. Classic AI methods usually refers to non-statistical methods.

TaXxER t1_j00kcrd wrote on December 13, 2022 at 5:00 AM

When I hear “classical methods” I associate that with traditional statistical methods that often aren’t even considered ML.

Note that frequentist stats also go by the name of classical methods (as opposed to Bayesian methods).

Delta-tau t1_j01aysc wrote on December 13, 2022 at 10:35 AM

In statistics jargon, classical methods are all frequentist inference methods which rely on asymptotic theory and p-values. Some of them, like linear regression, logistic regression, or ARMA models are nowadays viewed as ML. I guess the "ML" label is a bit vague and changes over time.

TaXxER t1_j01xpkt wrote on December 13, 2022 at 2:21 PM

Yeah I’m aware that linear and logistic regression are classical methods and are in the weird spot where they sometimes are and sometimes are not regarded as ML.

My comment was mostly aimed to argue against this claim in the comment that I replied to:

> Classic AI methods usually refers to non-statistical methods

xgboostftw t1_j02ifkc wrote on December 13, 2022 at 4:50 PM

I think the terminology is more common in the forecasting niche, where (especially since the M4 and M5 competitions) they started to separate out tree and NN architectures as "ML", while all other methods used for the last 50 years are deemed "classical".

Delta-tau t1_j02rk0k wrote on December 13, 2022 at 5:47 PM

Yeah I guess "classical" can mean different things depending on the context.

new_name_who_dis_ t1_j021jgc wrote on December 13, 2022 at 2:48 PM

I have the same association as you if I hear classic (ML) methods. But not classic (AI) methods, those I associate with good old fashioned AI, which aren't statistical.

Maybe it's just me, idk. I studied AI in philosophy long before I took an ML class. And I took my first intro to ML class before they were teaching deep learning in intro to ML classes (though I missed this cut-off only by a year or two haha).

Quantum22 t1_j005mab wrote on December 13, 2022 at 2:56 AM

Classic AI methods being "non-statistical methods" refers to NNs or business logic?

new_name_who_dis_ t1_j006be0 wrote on December 13, 2022 at 3:01 AM

It refers to what’s called symbolic AI, which uses logic and deduction.

Idk what business logic is but maybe. Definitely not neural nets.

Quantum22 t1_j00ckvp wrote on December 13, 2022 at 3:51 AM

Ahhh yes thanks, I recall Symbolic AI and "GOFAI"

Business logic usually refers to the rules that are applied in a software system, could be if/then type statements https://en.wikipedia.org/wiki/Business_logic

So your answer is correct, but it seems like business logic may be similar, while much more basic and with a different scope.

Delta-tau t1_j01adhb wrote on December 13, 2022 at 10:26 AM

So what is an example of a non-statistical method?

new_name_who_dis_ t1_j020yfl wrote on December 13, 2022 at 2:44 PM

Search + hard-coded (expert provided) rules, for example. Deep Blue that beat Kasparov didn't have any statistics in it iirc.

Deductive reasoning (as opposed to inductive which is what statistical/ML methods are), so like reasoning from first principles that are hard coded into the system.

Zealousideal-Card637 t1_izy2kh1 wrote on December 12, 2022 at 6:23 PM

Interesting comparison. I looked at the full experiments, and Amazon performs slightly better at the bottom level, the actual time series you are forecasting.

SherbertTiny2366 t1_izy50ew wrote on December 12, 2022 at 6:38 PM

For hierarchical and sparse data, it is quite common to see models achieve good accuracy at the bottom levels but perform very badly at higher aggregation levels. This happens because the models are systematically under- or over-predicting.

mangotheblackcat89 t1_izzighp wrote on December 13, 2022 at 12:04 AM

IMO, this is an important consideration. Sure, the target level is SKU-store, but at what level are the purchase orders being made? The M5 Competition didn't say anything about this, but probably the SKU level is as important as the SKU-store, if not more.

For retail data in general, I think we need to see how well a method performs at different levels of the hierarchy. I've seen commercial and finance teams prefer a forecast that is more accurate at the top over another that is slightly more accurate at the bottom.

-Rizhiy- t1_j018jx5 wrote on December 13, 2022 at 10:01 AM

Do you by any chance have a resource that explains that a bit more?

I can't get my head around how a collection of accurate forecasts can produce an inaccurate aggregate.

Is it related to class imbalances or perhaps something like Simpson's paradox?

SherbertTiny2366 t1_j01t4du wrote on December 13, 2022 at 1:45 PM

Imagine this toy example. You have 5 series, which are very sparse, as is often the case in retail. For example, series 1 has sales on Mondays and 0's the rest of the days, series 2 on Tuesdays, series 3 on Wednesdays, and so on. For those individual series, a value close to 0 would be more or less accurate, however, when you add all the predictions up, the value will be way below the true value.
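
The toy example can be sketched in a few lines of Python. The sale size of 10 units, the two-week window, and the constant zero forecast are assumptions chosen only to make the effect visible.

```python
# Five sparse series over two weeks: series i sells 10 units on weekday i
# (0 = Monday, ..., 4 = Friday) and nothing on the other days.
days = 14
series = [[10 if d % 7 == i else 0 for d in range(days)] for i in range(5)]

# A forecast that looks fine per series: always predict 0,
# which is exactly right on 12 of every 14 days.
forecasts = [[0] * days for _ in series]

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

per_series_mae = [mae(a, f) for a, f in zip(series, forecasts)]

# Aggregate level: sum the series (and the forecasts) day by day.
total_actual = [sum(col) for col in zip(*series)]
total_forecast = [sum(col) for col in zip(*forecasts)]
aggregate_mae = mae(total_actual, total_forecast)

print("per-series MAE:", [round(m, 2) for m in per_series_mae])
print("aggregate MAE: ", round(aggregate_mae, 2))
print("units sold vs. forecast:", sum(total_actual), "vs.", sum(total_forecast))
```

Because every series is under-predicted in the same direction, the errors add up instead of cancelling when the series are summed. The aggregate forecast misses every unit sold.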

-Rizhiy- t1_j030en1 wrote on December 13, 2022 at 6:43 PM

Thank you, that makes sense.

xgboostftw t1_j02hrl4 wrote on December 13, 2022 at 4:46 PM

Where do you see the full experiment? I think only the results table from Amazon is published, no?

fedegarzar OP t1_j04jp9e wrote on December 14, 2022 at 12:40 AM

Here are the results: https://github.com/Nixtla/statsforecast/tree/main/experiments/amazon_forecast
Here is the step-by-step guide to reproduce results: https://nixtla.github.io/statsforecast/examples/aws/statsforecast.html
Here are the steps for Amazon Forecast: https://nixtla.github.io/statsforecast/examples/aws/amazonforecast.html

Here is the data:
Train set: https://m5-benchmarks.s3.amazonaws.com/data/train/target.parquet
Temporal exogenous variables (used by AmazonForecast): https://m5-benchmarks.s3.amazonaws.com/data/train/temporal.parquet
Static exogenous variables (used by AmazonForecast): https://m5-benchmarks.s3.amazonaws.com/data/train/static.parquet

dat_cosmo_cat t1_izyz5hj wrote on December 12, 2022 at 9:50 PM

Several of our internal teams have arrived at similar conclusions when comparing AWS models to pre-trained open source models. Specifically, zero-shot CLIP and a fine-tuned ResNet (ImageNet) outperformed Rekognition on various classification tasks (both on internal data sourced from 9 e-commerce catalogs and on Google Open Images V6). Zero-shot DETIC outperforms it on image tagging. We even collaborated with a technical team at AWS to ensure these comparisons were as favorable as possible (truncating some classes from our data, combining others, etc.).

CyberPun-K t1_izywnhq wrote on December 12, 2022 at 9:33 PM

There is a long way to go for AutoML solutions. Thanks for confirming I was not the only one.

Mark8472 t1_izyamjv wrote on December 12, 2022 at 7:13 PM

How long was development time and required human resources (e.g. number of FTE days)?

How well do both scale?

How easily are they maintained, and what do they cost in the long run?

fedegarzar OP t1_izycx10 wrote on December 12, 2022 at 7:28 PM

We did not run those experiments. But in our opinion, it's easier to maintain a Python pipeline than to use the AWS UI or CLI.
In terms of scalability, I think StatsForecast wins by far, given that it takes a lot less time to compute and supports integration with Spark and Ray.
The point of the whole experiment is to show that the AutoML solution is far more expensive in the long run.

Mark8472 t1_izygeei wrote on December 12, 2022 at 7:50 PM

I get that. But since it doesn’t show the full picture the conclusion is misleading.

marr75 t1_izywb2h wrote on December 12, 2022 at 9:31 PM

If they were using a custom Python pipeline for the statistical models, yeah, I could see this argument. But, as with many of the Nixtla tools, it's roughly:

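!conda install -c conda-forge statsforecast
from statsforecast import StatsForecast
from statsforecast.models import AutoETS
# train_df (assumed): long-format frame with unique_id, ds, and y columns
sf = StatsForecast(models=[AutoETS(season_length=7)], freq="D")
sf.fit(train_df)
forecasts = sf.predict(h=28)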

This is a pretty common "marketing" post format from Nixtla. I think they make good tools and good points, so I'm not at all mad about it. They're providing a ready-to-use tool (StatsForecast) and making a great point about its performance and cost vs. the AWS alternative. Asking for the total cost of developing and maintaining StatsForecast means you'd also have to account for the total cost and complexity of developing and maintaining Amazon Forecast...

Uptown-Dog t1_j01092x wrote on December 13, 2022 at 8:04 AM

Yeah, Amazon's ML offerings performed very poorly the last time I tried them out. Kendra returned miserable results, and AWS Comprehend had a crappy (very limited) API and multiple serious bugs (like wholesale truncation of input text segments in the response, not handling quotes consistently, etc.) that they took months to fix when we reported them, and it never inspired huge amounts of confidence.

In all honesty, I'm not too surprised; my understanding is that AWS has a habit of grabbing open-source projects that kinda/sorta do what they need and build off of that internally, so you're not typically going to be exposed to unparalleled brilliance with their offerings. Mostly it will "kind of" work. But not much more than that.

(I wouldn't say I hate AWS because they do a reasonable job on several points, but they're no silver bullet across the board.)

maxafrass t1_j00fw4z wrote on December 13, 2022 at 4:19 AM

Thank you for the post and the discussion. Gives me much to consider as I prepare to look at AutoML and Azure and GCP based systems next year.

chief167 t1_j027p7v wrote on December 13, 2022 at 3:34 PM

Don't waste your time. Check DataRobot (and H2O is the closest competition).

Everybody else plainly sucks at AutoML, sorry to put it so bluntly but it's true.

I am a happy customer of theirs, and it took a mountain of effort to convince our IT teams to move away from Microsoft, Databricks, etc., but the results were just in another ballpark, so we had a strong business case.

Delta-tau t1_j01b1f2 wrote on December 13, 2022 at 10:36 AM

Great post! Planning to publish this?

nickkon1 t1_j01yzso wrote on December 13, 2022 at 2:30 PM

While I believe your results, isn't the whole point of AutoML that non-ML people can easily create models (e.g. via drag and drop)? While you didn't do much here, you selected models and specified their seasonality, both of which the target audience of AutoML would not do. The alternative to AutoML is not necessarily "make a model yourself" but often "you will not have a model at all".

Living_Discipline244 t1_j021ja0 wrote on December 13, 2022 at 2:48 PM

When it comes to comparing Amazon's AutoML and open source statistical methods, open source methods come out on top. While AutoML may be easy to use and can quickly train models, it lacks the flexibility and control of open source tools. With open source methods, you can fine-tune your models to your specific needs and goals, and you have access to a wide range of algorithms and techniques to choose from. Additionally, the open source community is constantly developing new methods and techniques, so you can always stay on the cutting edge of statistical analysis.

Furthermore, open source methods are often more cost-effective than commercial solutions like AutoML. While AutoML may seem like a quick and easy way to build machine learning models, the costs can quickly add up, especially for large or complex projects. In contrast, open source tools are typically free to use and can be easily integrated into your existing workflow.

So if you want to take control of your statistical analysis and have access to the latest and greatest methods, open source tools are the way to go. Just remember, with great power comes great responsibility, so be sure to use your newfound statistical prowess wisely.

Living_Discipline244 t1_j020owa wrote on December 13, 2022 at 2:42 PM

So if you want to be a statistical powerhouse, you'd better hop on board the AutoML train. Just don't forget your space suit and your grim reaper scythe, because with great power comes great responsibility. And if you're not careful, you might just end up dooming humanity to a future ruled by sentient algorithms. But hey, at least you'll have impressive machine learning models, right?

geneman101 t1_j1ectfo wrote on December 23, 2022 at 6:04 PM

Agreed with OP through trial and error!

cajmorgans t1_j00x7cr wrote on December 13, 2022 at 7:23 AM

The cloud has always been a scam in one way or another

xgboostftw t1_j02butn wrote on December 13, 2022 at 4:07 PM

would be nice to disclose that the study was sponsored (and conducted?) by StatsForecast...

chief167 t1_j027csj wrote on December 13, 2022 at 3:32 PM

Honestly, if you want decent AutoML results, you should only consider DataRobot. Everything else is noticeably worse.

We are a customer of them and it's a game changer. Yes it's expensive and not aimed at hobbyists, and it's like super expensive. But it's good

If I find the time, I shall upload this dataset into our system and check the results. Remind me later if I forget

xgboostftw t1_j03y82b wrote on December 13, 2022 at 10:11 PM

Seems like a poorly planned attempt at promoting your own tool.
Looking briefly at the notebook, it seems like a lot of the M5 features were excluded and only item_id was kept: https://nixtla.github.io/statsforecast/examples/aws/statsforecast.html#read-data
M5 has additional features like department, category, store, state and of course the events table. These features are very helpful and would obviously be present in a real-life retail forecasting scenario (along with many others).
The code with parameters to train the Amazon Forecast models also seems to be missing from the "reproducible experiment" notebook 😂.
Not sure the study is worth taking seriously. Seems like a quick attempt at marketing rather than a study with any meaningful level of rigor. "My Corolla is faster and cheaper than a Porsche 911 when I use vegetable oil to fuel them and don't show you the Porsche".
Where does your result land on the Kaggle leaderboard?

fedegarzar OP t1_j048qe0 wrote on December 13, 2022 at 11:21 PM

Here is the step-by-step guide to reproducing Amazon Forecast: https://nixtla.github.io/statsforecast/examples/aws/amazonforecast.html

As you can see, all the exogenous variables of M5 are included in Amazon Forecast.

Concretely, if you read the same link you posted, we even provide links to the Static and temporal exogenous variables you mention.

From the ReadMe:

The data are ready for download at the following URLs:

Train set: https://m5-benchmarks.s3.amazonaws.com/data/train/target.parquet
Temporal exogenous variables (used by AmazonForecast): https://m5-benchmarks.s3.amazonaws.com/data/train/temporal.parquet
Static exogenous variables (used by AmazonForecast): https://m5-benchmarks.s3.amazonaws.com/data/train/static.parquet

jedi-son t1_izykk0l wrote on December 12, 2022 at 8:17 PM

Machine learning isn't as useful as basic statistical methods in 99% of real-world problems?! I'm shocked 😲