Enhancing Weather Predictions with SARIMA Models in Python

Time series forecasting is effectively carried out using ARMA models; however, SARIMA models extend this capability to non-stationary data with seasonal components.

Online resources detailing Python implementations of SARIMA models are scarce, particularly ones that evaluate prediction outcomes against alternative methods; this article aims to bridge that gap. It won't delve into the technical intricacies of SARIMA, concentrating instead on practical Python application, a comparison against a benchmark, and the model's limitations, using temperature data from Istanbul.

You might wonder if using a SARIMA model is appropriate for forecasting temperature. Indeed, it is. Autoregressive Models (AR) are beneficial when current data points are influenced by their previous values. This holds true for weather forecasting, where the temperature of the current month is often linked to that of the preceding month. Furthermore, the current month's average temperature is likely influenced by the error terms from prior timesteps, justifying the inclusion of a Moving Average (MA) component. Given the evident seasonal patterns, a Seasonal ARMA model should provide robust predictive capabilities for monthly weather data.

I anticipate that no significant trend will emerge from the weather data over the span of 32 years, barring effects such as climate change or pollution. Should any trend be identified, differencing can remove it, restoring the stationarity that ARMA models require. For that reason I opted for Seasonal ARIMA (which adds differencing) over Seasonal ARMA, and we will test these assumptions during the implementation.

You can find the full Colab Notebook for this project at: https://github.com/cnzdgr/Weather-Forecast/blob/main/Istanbul_Weather_Forecast_Using_SARIMA.ipynb. The dataset was acquired from “www.visualcrossing.com” and cannot be shared, but you can replicate the analysis using the Colab Notebook with alternative time series data.

Part 1: Data Preparation

We must first aggregate our weather data from daily to monthly figures, as sketched below.
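A minimal sketch of that aggregation with pandas, assuming the raw file has `datetime` and `temp` columns (the file name and column names are assumptions; adjust them to your data source):

import pandas as pd

# Load daily observations and parse dates (file and column names are assumed)
daily = pd.read_csv('istanbul_weather.csv', parse_dates=['datetime'])

# Resample daily average temperatures to monthly means ('MS' = month start)
weather = (daily.set_index('datetime')['temp']
                .resample('MS')
                .mean()
                .rename('temperature')
                .reset_index()
                .rename(columns={'datetime': 'date'}))

This produces the `weather` DataFrame, with 'date' and 'temperature' columns, used throughout the rest of the article.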

A common question arises: “Why not use daily data, since autocorrelation is stronger at the daily level and daily weather predictions are more practical?” The answer is a practical limitation of SARIMA models in Python: daily data requires setting the seasonality parameter to 365, which leads to excessive RAM requirements (over 100GB) and very long fitting times (over an hour per training run) on our 32-year dataset.

Many others face this same challenge, as highlighted in discussions such as: https://datascience.stackexchange.com/questions/90327/python-sarimax-model-fits-too-slow. This difficulty explains why educators often resort to monthly or quarterly data when teaching SARIMA, which limits the method's practical usefulness.

A plot of the monthly series shows clear seasonal patterns. Note that we will focus solely on average temperature, excluding humidity and wind speed, since ARMA-family models such as SARIMA are univariate.

Part 2: Benchmarking with a Basic Model

To assess the performance of our SARIMA model, we require a benchmark. There are two primary approaches: 1. Compare it with naive estimation errors, where the average temperature for the next month is predicted using the current month’s temperature. 2. Predict the mean temperature based on 'month of the year' data by fitting a linear model.
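For reference, the naive baseline (option 1) takes only a few lines; a minimal sketch, assuming the monthly `weather` DataFrame from Part 1:

from sklearn.metrics import mean_squared_error

# Naive forecast: next month's temperature equals this month's
naive_pred = weather['temperature'].shift(1)

# Score only the 2019-2022 test period
test_mask = weather['date'] >= '2019-01-01'
print("Naive MSE: %.3f" % mean_squared_error(
    weather.loc[test_mask, 'temperature'], naive_pred[test_mask]))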

I will adopt the second method. Initially, I will one-hot encode the month variable and then apply linear regression to establish this as our benchmark.

The full dataset covers January 1, 1991 through December 31, 2022; records from 1991 to 2018 form the training set, and 2019 to 2022 the testing set.

# Imports needed for the benchmark model
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Extracting month from the datetime object
weather_linear['month'] = pd.DatetimeIndex(weather_linear['date']).month

# One-hot encoding month data
weather_linear = pd.get_dummies(weather_linear, columns=['month'])

# Splitting: 1991-2018 for training, 2019-2022 for testing
train_data = weather_linear[weather_linear['date'] < '2019-01-01']
test_data = weather_linear[weather_linear['date'] >= '2019-01-01'].copy()

# Fitting a linear regression on the 12 one-hot month columns
regression = LinearRegression().fit(train_data.iloc[:, -12:], train_data['temperature'])
weather_pred = regression.predict(test_data.iloc[:, -12:])
test_data['prediction'] = weather_pred

print("Mean squared error: %.3f" % mean_squared_error(test_data['temperature'], weather_pred))
print("Coefficient of determination: %.3f" % r2_score(test_data['temperature'], weather_pred))

The mean squared error is calculated as 2.292, and the Coefficient of determination is reported at 0.947. This outcome is impressive; the straightforward linear model, relying solely on the month of the year, achieves high accuracy and explains 94.7% of the variance. The fit, as illustrated below, appears to be accurate, serving as our benchmark.

Part 3: Forecasting with SARIMA

While we can theoretically implement SARIMA directly, the best practice involves four key steps: 1. Assess the stationarity of the time series. 2. Analyze autocorrelation. 3. Evaluate partial autocorrelation. 4. Determine the seven parameters of SARIMA (p, d, q), (P, D, Q, m) based on the insights from the first three steps.

  1. For stationarity, the Augmented Dickey-Fuller test from statsmodels can be applied:

    from statsmodels.tsa.stattools import adfuller

    def ad_fuller(timeseries):
        print('Dickey-Fuller Test indicates:')
        df_test = adfuller(timeseries, regression='ct', autolag='AIC')
        output = pd.Series(df_test[0:4], index=['Test Statistic', 'p-value',
                                                '#Lags Used', 'Number of Observations Used'])
        print(output)

    ad_fuller(weather['temperature'])

The test yields a very low p-value, so we reject the null hypothesis of a unit root and treat the series as stationary.

  2. and 3. For the autocorrelation and partial autocorrelation analyses, we can utilize the statsmodels functions plot_acf and plot_pacf:

    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    import matplotlib.pyplot as plt

    plot_acf(weather['temperature'])
    plt.show()

    plot_pacf(weather['temperature'])
    plt.show()

Significant autocorrelation and partial autocorrelation are observed.

  4. Next, we must define the model parameters (p,d,q)x(P,D,Q,m), utilizing both the previous analyses and some trial and error.

Since the Dickey-Fuller test confirms stationarity, we can set d=0 in our model. For D (the seasonal counterpart of d), I will use D=1, as it yields a slightly better fit; a quick illustration of what seasonal differencing does follows below.
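For intuition, D=1 means the model works with the series differenced at seasonal lag 12; a minimal sketch of that transformation, assuming the `weather` DataFrame from Part 1:

from statsmodels.tsa.stattools import adfuller

# Seasonal differencing at lag 12: each value minus its value 12 months earlier
seasonal_diff = weather['temperature'].diff(12).dropna()

# The differenced series should also test as stationary
print('p-value after seasonal differencing: %.4f' % adfuller(seasonal_diff)[1])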

The MA(q) component predicts using a weighted average of past errors, and we refer to the autocorrelation chart to determine the q value. The chart suggests a value less than or equal to 3 due to high spikes, and testing 1, 2, and 3 reveals that q=1 produces the best results. The seasonal component, Q=1, also performs well.

To determine AR(p), we look for significant spikes in the partial autocorrelation chart, which indicates that the first 5 to 6 spikes are notably large. Testing values from 1 to 6 shows that p=2 yields the best results. Given that long-term autoregression is not particularly relevant, I will set P to 0 without further evaluation.

Since our data is monthly, the seasonal period is m=12.

It's important to note that insights from the initial three steps only set upper bounds for the p, d, q parameters; the exact values, along with their seasonal counterparts (P, D, Q), require some experimentation. Lower AIC (or BIC) values, or significant p-values confirming that each parameter meaningfully deviates from zero, can guide this search, as sketched below.
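One way to structure that experimentation is a small AIC grid search; a minimal sketch (the bounds come from the ACF/PACF reading above and are illustrative, not definitive; `train_data_SARIMA` is the monthly training slice used in the fitting code below):

import itertools
import statsmodels.api as sm

best_aic, best_cfg = float('inf'), None
# Search within the bounds suggested by the ACF/PACF charts
for p, q, P, Q in itertools.product(range(3), range(4), range(2), range(2)):
    try:
        fit = sm.tsa.statespace.SARIMAX(train_data_SARIMA['temperature'],
                                        order=(p, 0, q),
                                        seasonal_order=(P, 1, Q, 12)).fit(disp=False)
        if fit.aic < best_aic:
            best_aic, best_cfg = fit.aic, (p, q, P, Q)
    except Exception:
        continue  # some configurations may fail to converge

print('Best (p, q, P, Q):', best_cfg, 'AIC: %.1f' % best_aic)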

import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(train_data_SARIMA['temperature'],
                                  order=(2, 0, 1),
                                  seasonal_order=(0, 1, 1, 12))
result = model.fit()
print(result.summary())

Through a combination of analytical insights and trial and error, we achieve a satisfactory fit with SARIMA(2,0,1)x(0,1,1,12). Next, we will evaluate its accuracy.

Before proceeding, it’s prudent to inspect the residual distribution to confirm that the fit is satisfactory.
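statsmodels bundles the standard residual checks into a single call on the fitted result; a quick sketch:

import matplotlib.pyplot as plt

# Standardized residuals, histogram + KDE, Q-Q plot, and correlogram in one figure
result.plot_diagnostics(figsize=(10, 8))
plt.show()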

The bell-shaped distribution of the residuals suggests that our model fit is adequate.

Now, we can proceed with one-step-ahead predictions using our model. This process is computationally intensive as it requires retraining the model at each timestep using the latest data point.

# Retraining the model for each prediction, making one-step-ahead forecasts
one_step_predictions = []
for i in range(48):
    # The test set is the last 48 months; cut_point is the month being predicted
    cut_point = len(weather) - 48 + i
    model = sm.tsa.statespace.SARIMAX(weather['temperature'][:cut_point],
                                      order=(2, 0, 1),
                                      seasonal_order=(0, 1, 1, 12),
                                      enforce_stationarity=False,
                                      enforce_invertibility=False)
    result = model.fit(disp=False)
    one_step_predictions.append(result.predict(cut_point).values[0])

test_data_SARIMA['prediction'] = one_step_predictions
test_data_SARIMA.set_index('date', inplace=True)

# Plotting predictions against the actual test-period temperatures
test_data_SARIMA['temperature'].plot(label='Actual Average Temperature', color='orange')
test_data_SARIMA['prediction'].plot(label='SARIMA Model Prediction', color='blue')
plt.title("Prediction vs. Actual Temperature of Istanbul")
plt.xlabel('Date')
plt.ylabel('Temperature (degrees Celsius)')
plt.legend(loc='lower right')
plt.show()
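The coefficient of determination reported below can then be computed from the stored predictions; for example:

from sklearn.metrics import r2_score

# R-squared of the one-step-ahead SARIMA forecasts over the test period
print("Coefficient of determination: %.3f" %
      r2_score(test_data_SARIMA['temperature'], test_data_SARIMA['prediction']))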

Coefficient of determination: 0.952

Our benchmark R² was 0.947, and the SARIMA model achieved 0.952: an additional 0.48 percentage points of variance explained in the temperature data. This is a modest improvement, and not an entirely satisfying one.

Conclusion

The SARIMA model slightly surpassed the simple linear regression in predicting monthly weather. Although it performed better, the results were not entirely fulfilling, most likely because monthly average temperature depends so strongly on the month itself that a month-only model is hard to beat by a meaningful margin. This study highlights two key limitations of the SARIMA model: 1. Although SARIMA can in principle handle daily data, setting the seasonality parameter to 365 makes fitting prohibitively expensive in Python (R implementations appear to be more efficient), which may explain why most online examples use monthly or quarterly data. 2. SARIMA excels at one-step-ahead predictions, since the value at time t=10 relies on the value at t=9; producing a sequence of such predictions requires refitting the model at every timestep on all preceding data, which makes the process computationally burdensome.
