TIME SERIES ANALYSIS & FORECASTING
Contents:
- Introduction
- Dataset and Visualizations
- ARIMA (Autoregressive Integrated Moving Average) Model and Forecasting
- Prophet (Fbprophet) Model and Forecasting
- Conclusion
Introduction
- What is a time series?
A set of ordered observations of a quantitative variable taken at successive points in time (usually evenly spaced, like once a day) is known as a time series. Time, in terms of years, months, days, or hours, is simply a device that enables one to relate all phenomena to a set of common, stable reference points.
For example, the number of airline tickets sold per day is a time series.
- Components of a Time Series:
- Trend: The trend shows the general direction of the time series data over a long period of time. A trend can be increasing (upward), decreasing (downward), or horizontal (stationary).
- Seasonality: The seasonality component exhibits a pattern that repeats with respect to timing, direction, and magnitude. Examples include an increase in water consumption in summer due to hot weather, or an increase in the number of airline passengers during holidays each year.
- Cyclical Component: These are trends with no set repetition over a particular period of time. A cycle refers to the period of ups and downs, booms and slumps of a time series, mostly observed in business cycles. These cycles do not exhibit seasonal variation but generally occur over a period of 3 to 12 years, depending on the nature of the time series.
- Irregular Variation: These are the fluctuations in the time series data that become evident when trend and cyclical variations are removed. These variations are unpredictable, erratic, and may or may not be random. The short sketch after this list assembles a synthetic series from these components.
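To make these components concrete, here is a minimal synthetic sketch (not drawn from the Superstore data; all numbers are illustrative) that builds a series from a trend, a yearly seasonality, and random noise:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative only: two years of daily data built from known components
idx = pd.date_range('2014-01-01', periods=730, freq='D')
trend = np.linspace(10, 30, len(idx))                         # upward trend
seasonality = 5 * np.sin(2 * np.pi * idx.dayofyear / 365.25)  # yearly cycle
noise = np.random.normal(0, 2, len(idx))                      # irregular variation
series = pd.Series(trend + seasonality + noise, index=idx)

series.plot(figsize=(12, 4), title='trend + seasonality + noise')
plt.show()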
- Importance of time series:
- By analyzing the causes and conditions prevailing during past changes, one can determine future policies and programs.
- Future trends can be estimated on the basis of an analysis of past trends.
- Trends of trade cycles can be studied and their effects reduced to a considerable extent.
- It enables comparative study with other time series.
- Forecasting:
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other important characteristics of the data set. Within time series analysis, a forecasting algorithm is an information process that seeks to predict future values based on past and present data, most commonly by analysis of trends. The historical data points are extracted and prepared in order to predict future values of a selected variable of the data set.
In this blog I am analyzing a univariate time series model.
- So, what is a univariate time series?
A univariate time series consists of single observations recorded sequentially over equal time increments. In this project there is only one variable, the sales count, recorded over a certain period of time.
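As a tiny illustrative example (the values below are made up for demonstration), such a series is just one value per time step:
import pandas as pd

# One observation (a sales count) per day: a univariate time series
sales = pd.Series(
    [120, 135, 128, 150],
    index=pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04']),
    name='Sales',
)
print(sales)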
Let’s get started!
Dataset and Visualizations
Here we are working with the Superstore Sales data set. Based on the given data, we have to predict the total sales for every product and store in the next month. The data set provides: 'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount' and 'Profit'.
Since the data set is large and stored on Google Drive, we can go through the following process to import it (in Google Colab):
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Plot styling
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

# Authenticate with Google Drive
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download the data
downloaded = drive.CreateFile({'id': '1v_ibPwoNak9iD6TDmVfUgC9or-mMpOsP'})
downloaded.GetContentFile('Sample - Superstore (2).xls')
df = pd.read_excel('Sample - Superstore (2).xls')
df.head()
Output:
Hurrah! The data set imported successfully.
Data Processing
This step includes removing columns we do not need, checking whether any missing values are present, aggregating sales by date, and so on.
#PREPROCESSING: drop the columns we do not need for forecasting
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name',
        'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID',
        'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']
df.drop(cols, axis=1, inplace=True)
df.head()
Output
Order Date Sales
0 2016-11-08 261.9600
1 2016-11-08 731.9400
2 2016-06-12 14.6200
3 2015-10-11 957.5775
4 2015-10-11 22.3680
# Sort chronologically and check for missing values
df = df.sort_values('Order Date')
df.isnull().sum()
Output
Order Date 0
Sales 0
dtype: int64
print("start date is:",df['Order Date'].min())
print("End date is:",df['Order Date'].max())
Output
start date is: 2014-01-03 00:00:00
End date is: 2017-12-30 00:00:00
Indexing with Time Series Data:
The data set is currently a plain DataFrame indexed by row number. To treat it as a time series, we set the order date as the index.
#Make the date an index
df = df.set_index('Order Date')
df.index
Output
DatetimeIndex(['2014-01-03', '2014-01-04', '2014-01-04', '2014-01-04',
'2014-01-05', '2014-01-06', '2014-01-06', '2014-01-06',
'2014-01-06', '2014-01-06',
...
'2017-12-29', '2017-12-29', '2017-12-29', '2017-12-30',
'2017-12-30', '2017-12-30', '2017-12-30', '2017-12-30',
'2017-12-30', '2017-12-30'],
dtype='datetime64[ns]', name='Order Date', length=9994, freq=None)
df.head()
Output
Sales
Order Date
2014-01-03 16.448
2014-01-04 11.784
2014-01-04 272.736
2014-01-04 3.540
2014-01-05 19.536
Now our dataset is in a suitable form for time series analysis.
Re-sampling
In the time series context, re-sampling means changing the frequency of the observations: the original records are grouped into regular time bins (here, calendar months) and the values in each bin are aggregated. This smooths out transaction-level noise and gives us a regular, evenly spaced series to model.
Re-sampling in pandas is very easy to use and requires little code: we specify the new frequency and an aggregation function.
# Down-sample the daily transactions to the mean sales of each month
# ('M' is pandas' month-end frequency alias)
Y = df['Sales'].resample('M').mean()
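For reference, a few variations on the same call (illustrative alternatives, not used later in this walkthrough): the rule string sets the new frequency, and the aggregation decides how the values in each bin are combined.
weekly_total = df['Sales'].resample('W').sum()     # total sales per week
monthly_total = df['Sales'].resample('M').sum()    # total sales per month
quarterly_mean = df['Sales'].resample('Q').mean()  # average sale per quarter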
Visualizing the Dataset
Y.plot(figsize=(15, 6))
plt.show()
Output
Here we have plotted the mean sales for each month. Some distinguishable patterns appear when we plot the data: sales are low at the beginning of the year and high at the end of the year, with the pattern reversing in the following year. Every two-year interval looks quite similar, and a couple of high months appear in the middle of the year.
ts = df.groupby(['Order Date'])['Sales'].mean()
ts.astype('float')
plt.figure(figsize=(16, 8))
plt.title('Mean Sales of the company')
plt.xlabel('Time')
plt.ylabel('Sales')
plt.plot(ts)
Output
Here we plot the mean sales for each order date. There are small values at the start and end of each year and a couple of high spikes around the middle of the year.
We can also visualize our data using a method called time series decomposition, which allows us to decompose the time series into three distinct components: trend, seasonality, and noise (the residual).
import statsmodels.api as sm
from pylab import rcParams

rcParams['figure.figsize'] = 18, 8
# Additive model: series = trend + seasonality + residual
decomposition = sm.tsa.seasonal_decompose(Y, model='additive')
fig = decomposition.plot()
plt.show()
Output
The plot above clearly shows that sales are unstable, along with their obvious seasonality.
ARIMA Model Fitting and Forecasting
One of the most common methods used in time series forecasting is the ARIMA model, which stands for AutoRegressive Integrated Moving Average. ARIMA is a model that can be fitted to time series data in order to better understand the series or to predict its future points.
There are three distinct integers (p, d, q) that parameterize ARIMA models, which is why they are denoted with the notation ARIMA(p, d, q). Together these three parameters account for seasonality, trend, and noise in a dataset:
- p is the auto-regressive part of the model. It allows us to incorporate the effect of past values into the model. Intuitively, this is similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.
- d is the integrated part of the model. This incorporates the amount of differencing (i.e. the number of past time points to subtract from the current value) to apply to the time series; see the small differencing sketch after this list. Intuitively, this is similar to stating that the temperature is likely to be the same tomorrow if the difference in temperature over the last three days has been very small.
- q is the moving average part of the model. This allows us to set the error of the model as a linear combination of the error values observed at previous time points.
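To make the d parameter concrete, here is a small sketch (our own illustration, not part of the original walkthrough) of first- and second-order differencing applied to the monthly series Y defined earlier:
# d = 1: subtract the previous value from each value (removes a linear trend)
first_diff = Y.diff().dropna()
# d = 2: difference the differenced series once more
second_diff = Y.diff().diff().dropna()
print(first_diff.head())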
Checking Stationarity of the Data
- What is a stationary time series?
In simple words, a stationary time series is one whose statistical properties, i.e. mean and variance, do not depend on time.
To check stationarity, we perform the ADF test.
- ADF Test: The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to check whether a given time series is stationary. It is one of the most commonly used tests for analyzing the stationarity of a series.
It is fundamentally a statistical significance test: a hypothesis test with a null and an alternative hypothesis, from which a test statistic is computed and a p-value reported. From the test statistic and the p-value, we can infer whether the given series is stationary.
## Checking for stationarity of the given series
from statsmodels.tsa.stattools import adfuller

def test_stationarity(series, mlag=365, lag=None):
    print('ADF Test Result')
    res = adfuller(series, maxlag=mlag, autolag=lag)
    output = pd.Series(res[0:4], index=['Test Statistic', 'p value', 'used lag',
                                        'Number of observations used'])
    # res[4] holds the critical values at the 1%, 5% and 10% levels
    for key, value in res[4].items():
        output['Critical Value ' + key] = value
    print(output)

test_stationarity(df['Sales'])
Output
ADF Test Result
Test Statistic -5.172130
p value 0.000010
used lag 365.000000
Number of observations used 9628.000000
Critical Value 1% -3.431029
Critical Value 5% -2.861840
Critical Value 10% -2.566930
dtype: float64
Based on the ADF test result (the test statistic of -5.17 is below the 1% critical value and the p-value is far below 0.05), we can conclude that the series is stationary and suitable for the ARIMA method.
Construct ARIMA Model
import itertools

# Candidate non-seasonal orders: p, d, q each in {0, 1}
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
# Corresponding seasonal orders (P, D, Q) with a 12-month season
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print("Example of parameter combinations for Seasonal Arima")
print("Sarimax:{}X{}".format(pdq[1], seasonal_pdq[1]))
print("Sarimax:{}X{}".format(pdq[2], seasonal_pdq[2]))
print("Sarimax:{}X{}".format(pdq[3], seasonal_pdq[3]))
print("Sarimax:{}X{}".format(pdq[4], seasonal_pdq[4]))
Output
Example of parameter combinations for Seasonal Arima
Sarimax:(0, 0, 1)X(0, 0, 1, 12)
Sarimax:(0, 1, 0)X(0, 1, 0, 12)
Sarimax:(0, 1, 1)X(0, 1, 1, 12)
Sarimax:(1, 0, 0)X(1, 0, 0, 12)
We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyper parameter optimization) for model selection.
When evaluating and comparing statistical models fitted with different parameters, each can be ranked against one another based on how well it fits the data or its ability to accurately predict future data points. We will use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.
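For reference, the quantity being minimized is
AIC = 2k − 2 ln(L̂)
where k is the number of estimated parameters and L̂ is the maximized likelihood of the model; the 2k term is what penalizes extra complexity.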
# Grid search: fit a SARIMAX model for every (p,d,q)x(P,D,Q,12) combination
# and report its AIC; combinations that fail to fit are skipped
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(Y,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
Output (some combinations also emitted ConvergenceWarning messages from statsmodels; these are omitted below for readability)
ARIMA(0, 0, 0)x(0, 0, 0, 12)12 - AIC:648.1839075951442
ARIMA(0, 0, 0)x(0, 0, 1, 12)12 - AIC:1573.9564723209578
ARIMA(0, 0, 0)x(0, 1, 0, 12)12 - AIC:387.257330070831
ARIMA(0, 0, 0)x(1, 0, 0, 12)12 - AIC:402.97856214610994
ARIMA(0, 0, 0)x(1, 0, 1, 12)12 - AIC:384.7995950669331
ARIMA(0, 0, 0)x(1, 1, 0, 12)12 - AIC:262.1644595966525
ARIMA(0, 0, 1)x(0, 0, 0, 12)12 - AIC:597.654605697939
ARIMA(0, 0, 1)x(0, 0, 1, 12)12 - AIC:nan
ARIMA(0, 0, 1)x(0, 1, 0, 12)12 - AIC:376.3251637580207
ARIMA(0, 0, 1)x(1, 0, 0, 12)12 - AIC:405.0472142226208
ARIMA(0, 0, 1)x(1, 0, 1, 12)12 - AIC:369.16844047362355
ARIMA(0, 0, 1)x(1, 1, 0, 12)12 - AIC:263.3626446818747
ARIMA(0, 1, 0)x(0, 0, 0, 12)12 - AIC:528.1869579691191
ARIMA(0, 1, 0)x(0, 0, 1, 12)12 - AIC:1454.5916794067966
ARIMA(0, 1, 0)x(0, 1, 0, 12)12 - AIC:402.41176085757166
ARIMA(0, 1, 0)x(1, 0, 0, 12)12 - AIC:387.20858105018107
ARIMA(0, 1, 0)x(1, 0, 1, 12)12 - AIC:1394.9326629156938
ARIMA(0, 1, 0)x(1, 1, 0, 12)12 - AIC:270.9438065103488
ARIMA(0, 1, 1)x(0, 0, 0, 12)12 - AIC:467.6414267046183
ARIMA(0, 1, 1)x(0, 0, 1, 12)12 - AIC:2609.961816454136
ARIMA(0, 1, 1)x(0, 1, 0, 12)12 - AIC:370.0243667754068
ARIMA(0, 1, 1)x(1, 0, 0, 12)12 - AIC:368.6117762611625
ARIMA(0, 1, 1)x(1, 0, 1, 12)12 - AIC:2849.340499469712
ARIMA(0, 1, 1)x(1, 1, 0, 12)12 - AIC:257.37284772726474
ARIMA(1, 0, 0)x(0, 0, 0, 12)12 - AIC:540.6772379709616
ARIMA(1, 0, 0)x(0, 0, 1, 12)12 - AIC:1536.2229591079033
ARIMA(1, 0, 0)x(0, 1, 0, 12)12 - AIC:389.2433793870417
ARIMA(1, 0, 0)x(1, 0, 0, 12)12 - AIC:387.316239028495
ARIMA(1, 0, 0)x(1, 0, 1, 12)12 - AIC:387.4871336951148
ARIMA(1, 0, 0)x(1, 1, 0, 12)12 - AIC:253.0789709391584
ARIMA(1, 0, 1)x(0, 0, 0, 12)12 - AIC:496.90504904914684
ARIMA(1, 0, 1)x(0, 0, 1, 12)12 - AIC:2841.1869647833105
ARIMA(1, 0, 1)x(0, 1, 0, 12)12 - AIC:374.4935542234095
ARIMA(1, 0, 1)x(1, 0, 0, 12)12 - AIC:371.8863361971654
ARIMA(1, 0, 1)x(1, 0, 1, 12)12 - AIC:359.29908312075713
ARIMA(1, 0, 1)x(1, 1, 0, 12)12 - AIC:254.99974976566307
ARIMA(1, 1, 0)x(0, 0, 0, 12)12 - AIC:508.22613161672086
ARIMA(1, 1, 0)x(0, 0, 1, 12)12 - AIC:1463.290184247775
ARIMA(1, 1, 0)x(0, 1, 0, 12)12 - AIC:397.02186857072024
ARIMA(1, 1, 0)x(1, 0, 0, 12)12 - AIC:362.307716753627
ARIMA(1, 1, 0)x(1, 0, 1, 12)12 - AIC:1470.1783831793764
ARIMA(1, 1, 0)x(1, 1, 0, 12)12 - AIC:245.87074565604823
ARIMA(1, 1, 1)x(0, 0, 0, 12)12 - AIC:468.8192870346768
ARIMA(1, 1, 1)x(0, 0, 1, 12)12 - AIC:3213.368937063234
ARIMA(1, 1, 1)x(0, 1, 0, 12)12 - AIC:372.0243404313211
ARIMA(1, 1, 1)x(1, 0, 0, 12)12 - AIC:357.38039336226905
ARIMA(1, 1, 1)x(1, 0, 1, 12)12 - AIC:1263.3417792136752
ARIMA(1, 1, 1)x(1, 1, 0, 12)12 - AIC:243.34456553221048
The output of our code suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the lowest AIC value of 243.34. We should therefore consider this the optimal option out of all the models we have considered.
Fitting ARIMA Model
Using grid search, we have identified the set of parameters that produces the best fitting model to our time series data. We can proceed to analyze this particular model in more depth.
We’ll start by plugging the optimal parameter values into a new SARIMAX model:
mod = sm.tsa.statespace.SARIMAX(Y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
Output
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.0754      0.251     -0.300      0.764      -0.567       0.417
ma.L1         -1.0000   1665.246     -0.001      1.000   -3264.823    3262.823
ar.S.L12      -0.5256      0.180     -2.919      0.004      -0.878      -0.173
sigma2      2305.2202   3.84e+06      0.001      1.000   -7.52e+06    7.53e+06
==============================================================================
The summary attribute of the SARIMAX output returns a significant amount of information, but we'll focus our attention on the table of coefficients. The coef column shows the weight (i.e. importance) of each feature and how each one impacts the time series. The P>|z| column informs us of the significance of each feature weight. Here the seasonal AR term (ar.S.L12) is clearly significant (p = 0.004), while ma.L1 and sigma2 have p-values near 1, so not every weight is individually significant; we nevertheless retain the specification selected by the grid search.
When fitting seasonal ARIMA models (and any other models, for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics method allows us to quickly generate model diagnostics and investigate any unusual behavior.
results.plot_diagnostics(figsize=(16, 8))
plt.show()
Output
Our primary concern is to ensure that the residuals of our model are uncorrelated and normally distributed with zero-mean. If the seasonal ARIMA model does not satisfy these properties, it is a good indication that it can be further improved.
In this case, our model diagnostics suggest that the model residuals are normally distributed, based on the following:
- In the top right plot, the red KDE line follows closely with the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation 1). This is a good indication that the residuals are normally distributed.
- The qq-plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of samples taken from a standard normal distribution N(0,1). Again, this is a strong indication that the residuals are normally distributed.
- The residuals over time (top left plot) don't display any obvious seasonality and appear to be white noise. This is confirmed by the autocorrelation (i.e. correlogram) plot on the bottom right, which shows that the time series residuals have low correlation with lagged versions of themselves.
Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values.
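As a complementary check (our own addition, assuming a reasonably recent statsmodels), the Ljung-Box test quantifies what the correlogram shows visually: its null hypothesis is that the residuals are uncorrelated, so large p-values support the fit.
from statsmodels.stats.diagnostic import acorr_ljungbox

# Test residual autocorrelation up to lag 12; large p-values are reassuring
lb_test = acorr_ljungbox(results.resid, lags=[12], return_df=True)
print(lb_test)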
Note: Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we might find better models if we widen the grid search.
Forecasting
- Validating Forecasts
To help us understand the accuracy of our forecasts, we compare the predicted sales to the real sales of the time series, setting the forecasts to start at the first month-end of 2017 (2017-01-31) and run to the end of the data.
# One-step-ahead forecasts: with dynamic=False, the full history of observed
# values up to each point is used to predict the next point
pred = results.get_prediction(start=pd.to_datetime('2017-01-31'), dynamic=False)
pred_ci = pred.conf_int()
ax = Y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
plt.legend()
plt.show()
Output
The line plot shows the observed values compared to the one-step-ahead forecasts. Overall, our forecasts align with the true values very well, capturing the downward trend that starts at the beginning of the year and the seasonality toward the end of the year.
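A tougher validation, sketched here as an optional variant (not part of the original run), is a dynamic forecast: from the start date onward the model must use its own prior forecasts instead of the observed values.
# Dynamic forecast: observed values after the start date are not used
pred_dynamic = results.get_prediction(start=pd.to_datetime('2017-01-31'), dynamic=True)
pred_dynamic_ci = pred_dynamic.conf_int()
print(pred_dynamic.predicted_mean.head())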
It is also useful to quantify the accuracy of our forecasts. We will use the MSE (Mean Squared Error), which summarizes the average error of our forecasts. For each predicted value, we compute its distance to the true value and square the result. The results need to be squared so that positive and negative differences do not cancel each other out when we compute the overall mean.
y_forecasted = pred.predicted_mean
y_truth = Y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
Output
The Mean Squared Error of our forecasts is 2240.32
import numpy as np

print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))
Output
The Root Mean Squared Error of our forecasts is 47.33
In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated values and what is estimated. The MSE is a measure of the quality of an estimator: it is always non-negative, and the smaller the MSE, the closer we are to finding the line of best fit.
The root mean squared error (RMSE) tells us that our model was able to forecast the average monthly sales in the test set to within 47.33 of the real sales. This is a pretty good model so far.
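For reference, the two metrics are
MSE = (1/n) Σ (yᵢ − ŷᵢ)²    and    RMSE = √MSE
where yᵢ are the observed values, ŷᵢ the forecasts, and n the number of forecast points.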
Producing and Visualizing Forecasts
In the final step, we describe how to leverage our seasonal ARIMA time series model to forecast future values. The get_forecast() method of our results object can compute forecasted values for a specified number of steps ahead.
# Forecast 12 months beyond the end of the observed series
pred_uc = results.get_forecast(steps=12)
pred_ci = pred_uc.conf_int()
ax = Y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
plt.legend()
plt.show()
Output
Both the forecasts and the associated confidence intervals that we have generated can now be used to further understand the time series and foresee what to expect. Our forecasts show that the time series is expected to continue increasing at a steady pace.
As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected by the confidence intervals generated by our model, which grow larger as we move further out into the future.
Prophet Model Fitting and Forecasting
Released by Facebook in 2017, the forecasting tool Prophet is designed for analyzing time series that display patterns on different time scales, such as yearly, weekly, and daily. It also has advanced capabilities for modeling the effects of holidays on a time series and implementing custom change points. Therefore, we use Prophet to get a model up and running.
“Prophet has been a key piece to improving Facebook’s ability to create a large number of trustworthy forecasts used for decision-making and even in product features.”
It is a decomposable time series model with three main model components: trend, seasonality, and holidays. They are combined in the following equation:
y(t) = g(t) + s(t) + h(t) + εt
- g(t): piecewise linear or logistic growth curve for modeling non-periodic changes in the time series
- s(t): periodic changes (e.g. weekly/yearly seasonality)
- h(t): effects of holidays (user provided) with irregular schedules
- εt: error term accounting for any unusual changes not accommodated by the model
Trend: The trend is the smooth, regular, long-term movement of a statistical series; frequent changes, either in absolute amount or in rates of increase or decrease, are quite inconsistent with the idea of a secular trend, i.e. there is no periodicity.
Seasonal Variation: A periodic movement is one which recurs with some degree of regularity within a definite time period.
Holiday: Holidays or events add predictable effects to a time series. For example, around New Year the sale of clothes remains quite high compared to regular days, and the same holds for Diwali (in India) and other occasions. A hedged sketch of supplying such events to Prophet follows.
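To illustrate the h(t) component, here is a sketch of how user-provided events can be passed to Prophet; the dates below are illustrative Diwali dates and the column names follow fbprophet's documented holidays format.
import pandas as pd
from fbprophet import Prophet

# Hypothetical event table: one row per occurrence of the event
diwali = pd.DataFrame({
    'holiday': 'diwali',
    'ds': pd.to_datetime(['2014-10-23', '2015-11-11', '2016-10-30', '2017-10-19']),
    'lower_window': 0,   # no effect before the day itself
    'upper_window': 1,   # effect extends one day after
})
model_with_holidays = Prophet(holidays=diwali)  # then fit as usual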
Installing Packages and Libraries
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
%matplotlib inline
from fbprophet import Prophet
from sklearn.metrics import mean_squared_error, mean_absolute_error
plt.rcParams['figure.figsize']=(15,10)
Preparing Dataset for Prophet
Prophet expects a DataFrame with two columns, ds (the date) and y (the value to forecast). Since 'Order Date' is currently the index, we first restore it as a column and then rename:
df = df.reset_index()
df = df.rename(columns={'Order Date': 'ds', 'Sales': 'y'})
df_model = Prophet(interval_width=0.95)
df_model.fit(df)
Output
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
<fbprophet.forecaster.Prophet at 0x7f0ebc07c940>
df.set_index('ds').y.plot()
Output
Here we plot the sales data against the order date, and we see low values at the start and end of each year with a couple of big spikes in the middle of the year.
Forecast using Prophet
model = Prophet()
model.fit(df)
Output
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
# Extend the frame 12 months beyond the observed data ('m' = month end)
future = model.make_future_dataframe(periods=12, freq='m')
future.tail()
Output
ds
10002 2018-07-31
10003 2018-08-31
10004 2018-09-30
10005 2018-10-31
10006 2018-11-30
forecast = model.predict(future)
forecast.tail()
Output
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Output
ds yhat yhat_lower yhat_upper
1244 2018-08-01 1108.296244 -1751.105682 3839.363077
1245 2018-09-01 3101.296688 461.439189 5855.126001
1246 2018-10-01 3369.973509 610.175122 6172.857257
1247 2018-11-01 3502.864480 679.938129 6185.642495
1248 2018-12-01 3674.377755 783.551508 6571.994512
Result of Forecast
model.plot(forecast)
Output
From the diagram we can see the forecast sales for the next year, and we can see that sales are increasing over the years.
Now we break the model into its different components:
Trend, Seasonal Variation, Holidays:
model.plot_components(forecast)
Output
It is good to see that sales have been increasing over time and will keep growing year after year. The worst day for sales is Wednesday (it could well be a holiday), while the best days are Monday and Friday. The worst month for sales is February; the best months are November and September.
# Join predictions with actuals on the date column to compute error metrics
metric_df = forecast.set_index('ds')[['yhat']].join(df.set_index('ds').y).reset_index()
metric_df.tail()
Output
ds yhat y
1244 2018-08-01 1108.296244 NaN
1245 2018-09-01 3101.296688 NaN
1246 2018-10-01 3369.973509 NaN
1247 2018-11-01 3502.864480 NaN
1248 2018-12-01 3674.377755 NaN
# Drop future dates, which have no actuals to compare against
metric_df.dropna(inplace=True)
metric_df.tail()
Output
ds yhat y
1232 2017-12-26 2636.565655 814.5940
1233 2017-12-27 1670.346485 177.6360
1234 2017-12-28 2764.198983 1657.3508
1235 2017-12-29 2922.492564 2915.5340
1236 2017-12-30 2416.579612 713.7900
mean_squared_error(metric_df.y, metric_df.yhat)
Output
4564396.647441651
mean_absolute_error(metric_df.y, metric_df.yhat)
Output
1403.109843097731
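Equivalently, in code (a small addition of ours):
import numpy as np

rmse = np.sqrt(mean_squared_error(metric_df.y, metric_df.yhat))
print(rmse)  # ≈ 2136.445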
Therefore, RMSE = √MSE ≈ 2136.445.
The root mean squared error (RMSE) tells us that this model forecasts daily sales in the test set to within 2136.445 of the real values. Note that this figure is computed on daily data, whereas the ARIMA RMSE above was computed on monthly means, so the two numbers are not directly comparable.
Conclusion
To conclude: in this study we focused on the application of different time series models and learned how to use them to forecast new sales values. There are many time series analyses we can explore from here, such as forecasting with uncertainty bounds, change-point and anomaly detection, and forecasting with external data sources. With respect to accuracy on this data set, ARIMA performed very well on the monthly series. Prophet is one of the newer forecasting approaches and may need more tuning to reach higher accuracy.