Abstract

Zero-inflated time series data, characterized by a high frequency of zero values, are common in fields such as meteorology, finance, and traffic analysis. This paper introduces an enhanced methodology for classifying event intensities by leveraging the stationary distribution of a fitted time series model, where applicable. By modeling the time series, obtaining the stationary distribution parameters, and using the quantiles of this distribution for classification, the method ensures that thresholds reflect the true underlying process. Applications to precipitation data, wind speed, and financial transactions demonstrate the effectiveness of this approach in capturing event intensities.

1. Time Series Stationarity and Zero-Inflation

Stationarity in Time Series Data

A stationary time series is one where the statistical properties, such as mean and variance, are constant over time. Stationarity is a crucial concept in time series analysis because it simplifies modeling and ensures that the time series behaves predictably. A stationary process allows us to describe the data with a time-invariant distribution, which can be exploited to define thresholds for classification.

However, zero-inflated data complicates traditional analysis due to an overabundance of zeros. Many time series, like precipitation or traffic flow data, have long stretches of zeros interrupted by bursts of non-zero values. If the series is stationary, the excess zeros and the non-zero observations can still be described using a suitable time series model that captures both components.

Objective

This paper aims to integrate the concept of stationarity into the threshold classification of zero-inflated time series data. By fitting a suitable time series model, we can derive the stationary distribution, use its quantiles for classification, and ensure that the thresholds reflect the true underlying process, providing a more meaningful categorization.

2. Challenges in Traditional Threshold Classification

Zero-Inflated Time Series Data

In zero-inflated time series data, zeros dominate the distribution, making it difficult to apply traditional thresholding techniques like percentiles or standard deviation-based intervals. This is especially problematic for classification tasks, where a meaningful distinction between event intensities (such as low, moderate, or severe precipitation) is needed.

Traditional methods like percentile-based thresholds or standard deviation intervals fail in zero-inflated contexts because they do not account for the high frequency of zero values. Furthermore, these methods typically assume the data follows a normal distribution, which is rarely the case in zero-inflated datasets.

Limitations of Existing Methods

Common methods such as percentile and standard deviation thresholds struggle in the face of zero inflation:

  • Percentile-Based Thresholds: The large number of zeros skews the distribution, leading to uninformative thresholds.
  • Standard Deviation Intervals: Mean and standard deviation metrics are heavily influenced by zeros, often resulting in unrepresentative or even negative thresholds.

These limitations motivate the need for an alternative methodology that considers the structure and stationarity of the time series.

3. Methodology: Fitting Time Series Models and Using Stationary Distributions

Step 1: Fit a Suitable Time Series Model

The first step involves fitting a time series model that accounts for the zero-inflated nature of the data. Depending on the characteristics of the time series, different models may be appropriate:

  • ARMA/ARIMA Models: Autoregressive Moving Average (ARMA) or Autoregressive Integrated Moving Average (ARIMA) models can be used for stationary or differenced stationary time series.
  • Zero-Inflated Time Series Models: For highly zero-inflated data, specialized models like Zero-Inflated ARMA (ZIARMA) or Hurdle Models can be used to capture both the zero and non-zero components.

The model fitting process yields parameters that describe the time series, such as autoregressive coefficients, moving average terms, and noise variance.

Step 2: Obtain the Stationary Distribution

Once the model is fitted, the next step is to derive the stationary distribution. The stationary distribution represents the long-term probabilistic behavior of the time series and includes both zero and non-zero components.

For a stationary ARMA or ARIMA model, the stationary distribution can often be described by a normal or Gaussian distribution, parameterized by the model’s mean and variance. In the case of zero-inflated models, the stationary distribution may combine a point mass at zero with a continuous distribution for non-zero values.

Step 3: Define Thresholds Based on Quantiles

With the stationary distribution in hand, thresholds for classification can be defined using its quantiles. This ensures that the thresholds are meaningful and reflect the underlying distribution of the time series, including the effect of zeros.

For example, thresholds can be set at the 25th, 50th, and 75th percentiles of the stationary distribution:

  1. None: Zero values, represented by the point mass at zero.
  2. Low: Values between the 25th percentile and the median (50th percentile).
  3. Moderate: Values between the median and the 75th percentile.
  4. Intense: Values above the 75th percentile.

These thresholds capture the full range of data, providing a more nuanced and accurate classification scheme.

Step 4: Adjust for Domain-Specific Needs

Thresholds can be further refined based on the specific characteristics of the time series or the domain requirements. For example, in the context of precipitation data, expert meteorological knowledge may suggest adjustments to ensure thresholds align with standard categories of rainfall intensity.

4. Application to Various Domains

Precipitation Data

Precipitation data is typically zero-inflated, with many hours showing no rainfall and occasional periods of light, moderate, or heavy precipitation. Fitting an ARMA model to a stationary precipitation series allows us to derive the stationary distribution, including the influence of zeros.

  • Model Fit: A Zero-Inflated ARMA (ZIARMA) model can be used to capture both zero and non-zero precipitation events.
  • Stationary Distribution: The stationary distribution combines a point mass at zero with a continuous distribution for positive precipitation values.
  • Thresholds: Quantile-based thresholds derived from this distribution accurately reflect rainfall intensity, providing a robust classification method.

Wind Speed Data

Wind speed data often exhibits periods of calm (zero wind speed) interspersed with gusts of varying intensities. An ARMA or Zero-Inflated ARMA model can be used to model wind speed over time, capturing both calm and gusty periods.

  • Model Fit: An ARMA model is fitted to the wind speed data, and the stationary distribution is obtained.
  • Thresholds: Quantiles of the stationary distribution are used to define wind speed categories, such as calm, breezy, and windy conditions.

Financial Transactions

In financial markets, trading volumes often fluctuate between periods of inactivity (zero transactions) and active trading. A Hurdle or Zero-Inflated ARMA model can model both components.

  • Model Fit: A Hurdle model can separate zero transaction periods from positive transactions.
  • Thresholds: Quantiles of the stationary distribution for non-zero transactions define thresholds for low, moderate, and high trading volumes.

5. Evaluation and Validation

Performance Metrics

The effectiveness of the stationary distribution approach is evaluated using several performance metrics:

  • Accuracy: The proportion of correctly classified observations.
  • Precision and Recall: These metrics are especially important when certain categories (e.g., severe precipitation) are rare but critical.
  • Model Fit Indicators: Metrics such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) evaluate the goodness of fit of the time series model.

Cross-Dataset Validation

The methodology is tested across multiple datasets to assess generalizability:

  • Precipitation Data: Datasets from different geographic regions and climates.
  • Wind Speed Data: Urban and rural wind patterns.
  • Financial Data: Transaction data from various market conditions.

Comparison with Baseline Methods

The stationary distribution approach is compared to traditional methods, such as percentile-based thresholds, showing improvements in classification accuracy and the interpretability of thresholds.

6. Discussion

Insights

The use of stationary distributions for threshold classification offers several advantages:

  • Incorporating Time Series Properties: By leveraging the fitted time series model, the approach captures temporal dependencies and the overall distribution more effectively than simple statistical measures.
  • Adapting to Zero Inflation: The method naturally handles zero-inflated data, providing more meaningful classifications.

Limitations

  • Model Complexity: The process of fitting time series models and deriving stationary distributions requires expertise and computational resources.
  • Assumption of Stationarity: If the time series is non-stationary, differencing or other transformations may be necessary before applying this approach.

7. Conclusion

This paper presents a novel approach to threshold classification in zero-inflated time series data by fitting time series models and using the resulting stationary distributions to define thresholds. This method improves upon traditional classification techniques by incorporating the time series structure, providing more accurate and meaningful event intensity classifications. Applications to domains such as meteorology, wind speed, and finance demonstrate the versatility of the methodology.