Validating Sibyl Forecasting

February 26, 2017 - John-Paul Clarke, Chief Science Officer

The purpose of this document is to discuss how the Sibyl forecasting tool performs against various natural competitors. With this goal in mind, we have undertaken a general set of experiments (i.e., a meta-study) to validate Sibyl and to compare it against existing popular methodologies according to a number of different criteria.

The competing methodologies are as follows:

  • Box-Jenkins time series (with low-order autoregressive and moving average terms).
  • Holt-Winters (exponential smoothing with a linear trend; also with a possible seasonal trend).
  • Regression (with linear and perhaps quadratic terms; seasonality is not explicitly modeled).
  • Prophet – Facebook’s forecasting tool. See https://research.fb.com/prophet-forecasting-at-scale
  • Sibyl – Pace’s proprietary engine.

According to the website, Prophet uses an amalgamation of techniques to incorporate trends and seasonality, e.g.

  • A piecewise linear or logistic growth curve trend. Prophet supposedly detects changes in trends by selecting changepoints from the data.
  • A yearly seasonal component modeled using Fourier series.
  • A weekly seasonal component using dummy variables. (This typically requires numerous data points, by the way.)
  • A user-provided list of important holidays.

Box-Jenkins, Holt-Winters, and Regression can be regarded as “tried and true” methodologies, whereas Prophet and Sibyl are “young and new”.
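As a point of reference, a minimal Prophet invocation looks roughly like the sketch below. This assumes the fbprophet Python package (its name circa early 2017) and uses Prophet’s documented ‘ds’/‘y’ column convention; the dates, values, and holiday list are purely illustrative.

    import pandas as pd
    from fbprophet import Prophet

    # 30 days of illustrative daily totals in Prophet's required 'ds'/'y' format.
    df = pd.DataFrame({
        "ds": pd.date_range("2017-01-01", periods=30, freq="D"),
        "y": [20 + (i % 7) for i in range(30)],
    })

    # Optional user-provided holidays (one illustrative date).
    holidays = pd.DataFrame({
        "holiday": ["presidents_day"],
        "ds": pd.to_datetime(["2017-02-20"]),
    })

    m = Prophet(holidays=holidays)                # trend + seasonality + holidays
    m.fit(df)
    future = m.make_future_dataframe(periods=30)  # extend 30 days past the data
    forecast = m.predict(future)                  # yhat, yhat_lower, yhat_upper columns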

In order to undertake our comparison study, we generated several super-collections of datasets designed to test the performance of Sibyl and its competitors. In particular, we have put together 6 super-collections, each consisting of 100 datasets, each running for 60 time units. For a particular dataset, the 60 units of data can informally be regarded as “the number of customers” (or revenue) observed for each of 60 days.

The 6 super-collections are characterized as follows:

  • Customers are generated for 60 days according to a Poisson process with a constant arrival rate. The Poisson arrival model is widely regarded as reasonable and ubiquitous.
  • Linearly increasing arrival rate. That is, the arrival rate goes up by a constant amount each day. This corresponds to what is known as a nonhomogeneous Poisson process.
  • Quadratically increasing arrival rate.
  • Sinusoidal (seasonal) arrival rate.
  • Composition of two rates, e.g., a constant rate plus occasional jumps.
  • Composition of multiple streams (in an attempt to mess up the other methods); a data-generation sketch follows this list.
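To make the setup concrete, here is a minimal sketch of how such super-collections can be generated. The rate functions and parameter values below are illustrative assumptions, not the exact settings used in the study.

    import numpy as np

    rng = np.random.default_rng(42)

    DAYS = 60          # length of each dataset
    N_DATASETS = 100   # datasets per super-collection

    # Illustrative daily arrival rates (t = day index 0..59); values are assumptions.
    RATE_FUNCTIONS = {
        "constant":   lambda t: 2.0,
        "linear":     lambda t: 1.0 + 0.05 * t,
        "quadratic":  lambda t: 1.0 + 0.002 * t ** 2,
        "sinusoidal": lambda t: 2.0 + 1.5 * np.sin(2 * np.pi * t / 30.0),
    }

    def generate_super_collection(rate_fn, n_datasets=N_DATASETS, days=DAYS):
        """Daily customer counts from a (possibly nonhomogeneous) Poisson process.

        Each day's count is drawn as Poisson with mean equal to that day's rate,
        which is the standard way to simulate daily totals of such a process.
        """
        rates = np.array([rate_fn(t) for t in range(days)], dtype=float)
        return rng.poisson(rates, size=(n_datasets, days))

    constant_collection = generate_super_collection(RATE_FUNCTIONS["constant"])
    print(constant_collection.shape)        # (100, 60)
    print(constant_collection.sum(axis=1))  # the 100 "true" Day-60 totals

The composition collections can be obtained by summing two or more such arrays (e.g., a constant-rate stream plus an occasional-jump stream).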

So how do the methods do compared to one another? The comparison can be made by considering several possible performance criteria, all of which are evaluated for each of the 6 super-collections (each consisting of 100 datasets). Moreover, we will form estimators of 60-day totals based on the first 15, 30, and 45 days of data. (One would expect that having more days of data will usually produce better estimates for the 60-day totals.)

For example, using data up to Day 30, we estimate the Day 60 total. We do this for each of the 100 datasets. We then have 100 estimates, obtained at Day 30, of the “true” Day 60 totals, along with the 100 corresponding errors (the differences between the estimated and true values).
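A sketch of this evaluation loop follows; forecast_day60_total is a hypothetical stand-in for whichever methodology (Sibyl, Prophet, Holt-Winters, etc.) is being evaluated, and the stand-in collection is generated on the fly.

    import numpy as np

    rng = np.random.default_rng(0)
    collection = rng.poisson(2.0, size=(100, 60))  # stand-in super-collection (constant rate)

    def forecast_day60_total(observed, horizon=60):
        """Placeholder forecaster: extrapolate the observed daily average.
        In the study this slot is filled by Sibyl, Prophet, Holt-Winters, etc."""
        return observed.mean() * horizon

    def errors_at_cutoff(collection, cutoff):
        """Errors (estimate minus true Day-60 total) for each of the 100 datasets,
        using only the first `cutoff` days of each dataset."""
        true_totals = collection.sum(axis=1)
        estimates = np.array([forecast_day60_total(row[:cutoff]) for row in collection])
        return estimates - true_totals

    errors_30 = errors_at_cutoff(collection, cutoff=30)  # 100 errors obtained at Day 30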

At this point, we can calculate the bias and variance of the estimates for the “true” Day 60 totals. Bias is defined as the expected difference between our estimate and the true value, i.e., the expected value of the error terms (which may be positive or negative). To obtain an estimate of the bias for a particular choice of estimator, # of days of data, and super-collection, we simply take the sample average of the 100 corresponding errors. A bias close to zero is good, indicating that our estimator is correct “on average”.

Variance is a standard measure of the “spread” of the error. To obtain an estimate of the variance for a particular choice of estimator, # of days of data, and super-collection, we simply take the sample variance of the 100 corresponding errors. That is, we calculate the square of the difference between each error and the average error, and then average those squared deviations; the formulas are written out below. A variance close to zero is usually good, indicating that our estimator is stable (we are “confident” about our estimate because it doesn’t exhibit much variability).
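Writing e_1, ..., e_100 for the 100 errors corresponding to a given estimator, number of days of data, and super-collection, the two estimates just described are

    \widehat{\mathrm{bias}} \;=\; \bar{e} \;=\; \frac{1}{100}\sum_{i=1}^{100} e_i,
    \qquad
    \widehat{\mathrm{var}} \;=\; \frac{1}{99}\sum_{i=1}^{100}\left(e_i - \bar{e}\right)^2 .

(Dividing by 99 rather than 100 is the usual sample-variance convention and makes little practical difference here.)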

Sometimes, a moderate bias plus a very low variance indicates that we are very confident about the wrong answer. Therefore, we usually combine bias and variance into a single standard measure, namely, the mean-squared error (MSE) of an estimator, which is defined as bias² + variance. A small MSE is good, indicating that we are right on average, and further that the estimator is “stable” (doesn’t vary much). In any case, to obtain an estimate of the MSE for a particular choice of estimator, # of days of data, and super-collection, we simply add the square of the estimated bias (i.e., the square of the sample average of the errors) to the sample variance of the errors.
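In symbols, using the same error notation as above,

    \widehat{\mathrm{MSE}} \;=\; \widehat{\mathrm{bias}}^{\,2} + \widehat{\mathrm{var}}
    \;=\; \left(\frac{1}{100}\sum_{i=1}^{100} e_i\right)^{2} + \frac{1}{99}\sum_{i=1}^{100}\left(e_i - \bar{e}\right)^{2}.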

We also calculate prediction intervals (PIs) for each of the 5 competing methodologies. A PI in our context states that we are, say, 90% or 95% confident that the 60-Day total will fall in the interval [A, B], where we calculate A and B.

In analogy with the MSE criterion described above, we also calculate (i) the PI coverage for each method (the percentage of the 100 PIs that actually contain the “true” 60-Day total) and (ii) the average length of the PIs. One wants coverage to be around 90% or 95%; if this holds, then smaller PIs are better.

If the proper coverage is not obtained, then smaller PIs are NOT necessarily better because this might indicate high confidence in the wrong answer.
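Once a method has produced its 100 intervals [A, B] for a given super-collection and cutoff, coverage and average length can be tabulated as in the following sketch; the interval endpoints themselves come from the individual methods and are not reproduced here (the endpoints in the usage example are made up).

    import numpy as np

    def pi_coverage_and_length(lowers, uppers, true_totals):
        """PI coverage (% of the intervals containing the true Day-60 total)
        and the average PI length."""
        lowers, uppers, true_totals = map(np.asarray, (lowers, uppers, true_totals))
        covered = (lowers <= true_totals) & (true_totals <= uppers)
        return 100.0 * covered.mean(), (uppers - lowers).mean()

    coverage, avg_length = pi_coverage_and_length(
        lowers=[100, 105, 98], uppers=[140, 150, 135], true_totals=[120, 118, 151])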

Example 1. Here we consider the constant-arrival-rate case (the first super-collection described above). In this example, all Day-60 estimates are based on 15 days’ worth of data (25% of the total sample). This example is not regarded as particularly difficult, since the arrival function has a constant rate.

We see that Sibyl does comparatively well with respect to all performance characteristics. The “true” expected number of sales over the 60-day period turned out to be 120, and Sibyl seems to exhibit low bias. Sibyl also does very well in terms of MSE, PI coverage, and PI length. Thus, Sibyl successfully predicts at a very early stage the Day-60 sales total. H-W, Quad Reg, and Prophet do poorly with respect to MSE, mainly due to a large variance component.

Example 2. Same set-up as Example 1 (constant arrival rate), except we now observe up to Day 30 (50% of the data) before making our forecasts.

Sibyl does even better than before, which makes sense since it now has access to 50% of the data. It still fares comparatively well against the other methodologies, though some of the other techniques are starting to “close the gap”.

Example 3. Here we consider the more challenging case of customers arriving according to a sinusoidal (seasonal) arrival rate. In this example, we base our forecasts on 30 days’ worth (50%) of the dataset.

Here, all methods exhibit about the same bias, except for Quad Reg, which produces ridiculous negative sales forecasts (the quadratic regression often extrapolates wildly). With the exception of Quad Reg, all methods have about the same order of magnitude of MSE, though it could be argued that H-W’s is still significantly higher than those of the remaining methods. All methods except for H-W have very poor PI coverage, though H-W’s coverage is purchased at the price of longer PIs. We note that the particular sinusoidal periodicity in this example works in favor of the seasonality component used by H-W. In any case, if we take H-W out of the equation, Sibyl performs at roughly the same quality level as the other methods.