Maurice Kraus
AI & ML Group, TU Darmstadt&Felix Divo
AI & ML Group, TU Darmstadt
&Devendra Singh Dhami
Uncertainty in AI Group, TU Eindhoven
Hessian Center for AI (hessian.AI)&Kristian Kersting
AI & ML Group, TU Darmstadt
Centre for Cognitive Science, TU Darmstadt
Hessian Center for AI (hessian.AI)
German Research Center for AI (DFKI)Contact: maurice.kraus@cs.tu-darmstadt.de.
Abstract
Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models.Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions.We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting.Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks.They serve as key elements for modeling the complex dynamics of challenging time series data.xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast.Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods.A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness.This work contributes to the resurgence of recurrent models in time series forecasting.
1 Introduction
Time series are an essential data modality ubiquitous in many critical fields of application, such as medicine(Hosseini etal., 2021), manufacturing(Essien & Giannetti, 2020), logistics(Seyedan & Mafakheri, 2020), traffic management(Lippi etal., 2013), finance(Lin etal., 2012), audio processing(Latif etal., 2023), and weather modeling(Lam etal., 2023).While significant progress in time series forecasting has been made over the decades, the field is still far from being solved.The regular appearance of yet better models and improved combinations of existing approaches exemplifies this.Further increasing the forecast quality obtained from machine learning models promises a manifold of improvements, such as higher efficiency in manufacturing and transportation as well as more accurate medical treatments.
Historically, recurrent neural networks(RNNs) and their powerful successors were natural choices for deep learning-based time series forecasting(Hochreiter & Schmidhuber, 1997; Cho etal., 2014).Today, large Transformers(Vaswani etal., 2017) are applied extensively to time series tasks, including forecasting.Many improvements to the vanilla architecture have since been proposed, including patching(Nie etal., 2023), decompositions(Zeng etal., 2023), and tokenization inversions(Liu etal., 2023).However, some of their limitations are yet to be lifted.For instance, they typically require large datasets to train successfully, restricting their use to only a subset of conceivable applications.Furthermore, they are inefficient when applied to long sequences due to the cost of the attention mechanism being quadratic in the number of variates and time steps, depending on the specific choice of tokenization.Therefore, recurrent and state space models(SSMs) (Patro & Agneeswaran, 2024) are experiencing a resurgence of interest in overcoming such limitations. Specifically, Beck etal. (2024) revisited recurrent models by borrowing insights gained from Transformers applied to many domains, specifically to natural language processing. They propose Extended Long Short-Term Memory(xLSTM) models as a viable alternative to current sequence models.
We propose xLSTM-Mixer111https://github.com/mauricekraus/xLSTM-Mixer, a new state-of-the-art method for time series forecasting using recurrent deep learning methods.Specifically, we augment the highly expressive xLSTM architecture with carefully crafted time, variate, and multi-view mixing.These operations regularize the training and limit the model parameters by weight-sharing, effectively improving the learning of features necessary for accurate forecasting.xLSTM-Mixer initially computes a channel-independent linear forecast shared over the variates.It is then up-projected to a higher hidden dimension and subsequently refined by an xLSTM stack.It performs multi-view forecasting by producing a forecast from the original and reversed up-projected embedding.The powerful xLSTM cells thereby jointly mix time and variate information to capture complex patterns from the data.Both forecasts are eventually reconciled by a learned linear projection into the final prediction, again by mixing time.An overview of our method is shown in Figure1.
Overall, we make the following contributions:
- (i)
We investigate time and variate mixing in the context of recurrent models and propose a joint multistage approach that is highly effective for multivariate time series forecasting. We argue that marching over the variates instead of the temporal axis yields better results if suitably combined with temporal mixing.
- (ii)
We propose xLSTM-Mixer, a state-of-the-art method for time series forecasting using recurrent deep learning methods.
- (iii)
We extensively compare xLSTM-Mixer with existing methods for multivariate long-term time series forecasting and perform in-depth model analyses. The experiments demonstrate that xLSTM-Mixer consistently achieves state-of-the-art performance in a wide range of benchmarks.
The following work is structured as follows:In the upcoming Sec.2, we introduce preliminaries to then motivate and explain xLSTM-Mixer in Sec.3.We then present comprehensive experiments on its effectiveness and inner workings in Sec.4.We finally contextualize the findings within the related work in Sec.5 and close with a conclusion and outlook in Sec.6.
2 Background
After introducing the notation used throughout this work, we review xLSTM blocks and discuss leveraging channel mixing or their independence in time series models.
2.1 Notation
In multivariate time series forecasting, the model is presented with a time series consisting of time steps with variates each.Given this context, the forecaster shall predict the future values up to a horizon .A variate (also called a channel) can be any scalar measurement, such as the occupancy of a road or the oil temperature in a power plant.The measurements are assumed to be carried out jointly, such that the time steps reflect a regularly sampled multivariate signal.A time series dataset consists of such pairs divided into train, validation, and test portions.
2.2 Extended Long Short-Term Memory (xLSTM)
Beck etal. (2024) propose xLSTM architectures consisting of two building blocks, namely the sLSTM and mLSTM modules.To harness the full expressivity of xLSTMs within each step and across the computation sequence, we employ a stack of sLSTM blocks without any mLSTM blocks.The latter are less suited for joint mixing due to their independent treatment of the sequence elements, making it impossible to learn any relationships between them directly.We will continue by recalling the construction of sLSTM cells.
The standard LSTM architecture of Hochreiter & Schmidhuber (1997) involves updating the cell state through a combination of input, forget, and output gates, which regulate the flow of information across tokens.sLSTM blocks enhance this by incorporating exponential gating and memory mixing(Greff etal., 2017) to handle complex temporal and cross-variate dependencies more effectively.The sLSTM updates the cell and hidden state using three gates as follows:
cell state | (1) | ||||||
normalizer state | (2) | ||||||
hidden state | (3) | ||||||
cell input | (4) | ||||||
input gate | (5) | ||||||
forget gate | (6) | ||||||
output gate | (7) | ||||||
stabilizer state | (8) |
In this setup, the matrices and are input weights mapping the input token to the cell input , input gate, forget gate, and output gate, respectively.The states and serve as necessary normalization and training stabilization, respectively.
As Beck etal. have shown, it is beneficial to restrict the memory mixing performed by the recurrent weight matrices and to individual heads, inspired by the multi-head setup of Transformers(Zeng etal., 2023), yet more restricted and therefore more efficient to compute.In particular, each token gets broken up into separate pieces, where the input weights act across all of them, but the recurrence matrices are implemented as block-diagonals and therefore only act within each piece.This permits specialization of the individual heads to patterns specific to the respective section of the tokens and empirically does not sacrifice expressivity.
2.3 Channel Independence and Mixing in Time Series Models
Multiple works have investigated whether it is beneficial to learn representations of the time and variate dimensions jointly or separately.Intuitively, because joint mixing is strictly more expressive, one might think it should always be preferred.It is indeed used by many methods such as Temporal Convolutional Networks(TCN) (Lea etal., 2016), N-BEATS(Oreshkin etal., 2019), N-HiTS(Challu etal., 2023), and many Transformers(Vaswani etal., 2017), including Temporal Fusion Transformer(TFT) (Lim etal., 2021), Autoformer(Wu etal., 2021), and FEDFormer(Zhou etal., 2022).However, treating slices of the input data independently assumes an invariance to temporal or variate positions and serves as a strong regularization against overfitting, reminiscent of kernels in CNNs.Prominent models implementing some aspects of channel independence in multivariate time series forecasting are PatchTST(Nie etal., 2023) and iTransformer(Liu etal., 2023).TiDE(Das etal., 2023), on the other hand, contains a time-step shared feature projection and temporal decoder but treats variates jointly.As Tolstikhin etal. (2021) have shown with MLP-Mixer, interleaving mixing of all channels in each token and all tokens per channel does not empirically sacrifice any expressivity and instead improves performance.This idea has since been applied to time series, too, namely in architectures such as TimeMixer(Chen etal., 2023c) and TSMixer(Chen etal., 2023c), and is one of the key components of our method xLSTM-Mixer.
3 xLSTM-Mixer
Now we have everything at hand to introduce xLSTM-Mixer as depicted in Fig.1. It carefully integrates three key components: (1) an initial linear forecast with time mixing, (2) joint mixing using powerful sLSTM modules, and (3) an eventual combination of two views by a final fully connected layer.The transposing steps between the key components enable capturing complex temporal and intra-variate patterns while facilitating easy trainability and limiting parameter counts.The sLSTM block, in particular, can learn intricate non-linear relationships hidden within the data along both the time and variate dimensions.The xLSTM-Mixer architecture is furthermore equipped with normalization layers and skip connections to improve training stability and overall effectiveness.
3.1 Key Component 1: Normalization and Initial Linear Forecast
Normalization has become an essential ingredient of modern deep learning architectures(Huang etal., 2023).For time series in particular, reversible instance norm(RevIN) (Kim etal., 2022) is a general recipe for improving forecasting performance, where each time series instance is normalized by its mean and variance and furthermore scaled and offset by learnable scalars and :
We apply it as part of xLSTM-Mixer, and at the end of the entire pipeline, we invert the RevIN operation to obtain the final prediction.In the case of xLSTM-Mixer, the typical skip connections found in mixer acrchitectures(Tolstikhin etal., 2021; Chen etal., 2023c) are taken up by RevIN, the normalization in the NLinear forecast explained shortly, and the integral skip connections within each sLSTM block.
It has been shown previously that simple linear models equipped with appropriate normalization schemes are, already by themselves, decent long-term forecasters(Zeng etal., 2023; Li etal., 2023).Our observations confirm this finding.Therefore, we first process each variate separately by an NLinear model by computing:
where denotes a fully-connected linear layer with bias term.Sharing this model across variates limits parameter counts, and the weight-tying serves as a useful regularization.The quality of this initial forecast will be investigated in Sec.4.1 and 4.2.
3.2 Key Component 2: sLSTM Refinement
While the NLinear forecast captures the basic patterns between the historic and future time steps, its quality alone is insufficient for todayโs challenging time series datasets.We, therefore, refine it using powerful sLSTM blocks.As a first step, it is crucial to increase the embedding dimension of the data to provide enough latent dimensions for the sLSTM cells: .This pre-up-projection is similar to what is commonly performed in SSMs(Beck etal., 2024).We weight-share across variates to perform time-mixing similar to the initial forecast.Note that this step does not maintain the temporal ordering within the embedding token dimensions, as was the case up until this step, and instead embeds it into a higher latent dimension.
The stack of sLSTM blocks transforms as defined in Eq.1 to 8.The recurrent model strides over the data in variate order, i.e., where each token represents all time steps from a single variate as in the work of Liu etal. (2023).The sLSTM blocks learn intricate non-linear relationships hidden within the data along both the time and variate dimensions.The mixing of the hidden state is still limited to blocks of consecutive dimensions, aiding efficient learning and inference while allowing for effective cross-variate interaction during the recurrent processing.Striding over variates has the benefit of linear runtime scaling in the number of variates at a constant number of parameters.It, however, comes at the cost of possibly fixing a suboptimal order of variates.While this is empirically not a significant limitation, we leave investigations into how to find a suitable ordering for future work.In addition to a large embedding dim, we observed a high number of heads being crucial for effective forecasting.
The sLSTM cellsโ first hidden state must be initialized before each sequence of tokens can be processed.Extending the initial description of these blocks, we propose learning a single initial embedding token that gets prepended to each encoded time series .These initial embeddings draw from recent advances in Large Language Models, where learnable "soft prompt" tokens are used to condition models and improve their ability to generate coherent outputs(Lester etal., 2021; Li & Liang, 2021; Chen etal., 2023a; b).Recent research has extended the application of soft prompts to LLM-based time series forecasting (Cao etal., 2023; Sun etal., 2024), emphasizing their adaptability and effectiveness in improving model performance across modalities.These tokens enable greater flexibility and conditioning, allowing the model to adapt its initial memory representation to specific dataset characteristics and to dynamically interact with the time and variate data.Soft prompts can be readily optimized through back-propagation with very little overhead.
3.3 Key Component 3: Multi-View Mixing
To further regularize the training of the sLSTM as with the linear projections, we compute forecasts from the original embedding as well as the reversed embedding , where the order of the latent dimensions including the representation of is inverted.Learning forecasts and for both views while sharing weights helps learn better representations.Such multi-task learning settings are known to benefit training(Zhang & Yang, 2022).The final forecast is obtained by a linear projection of the two concatenated forecasts, again per-variate.Specifically, we compute:
The final forecast is obtained after de-normalizing the reconciled forecasts as .
4 Experimental Evaluation
Our intention here is to evaluate the forecasting capabilities of xLSTM-Mixer, aiming to provide comprehensive insights into its performance. To this end, we conducted a series of experiments with the primary focus on long-term forecasting, following the work of Das etal. (2023) and Chen etal. (2023c).An evaluation of xLSTM-Mixerโs competitiveness in short-term forecasting on the PEMS dataset is provided in Sec.A.2.Additionally, we perform an extensive model analysis consisting of an ablation study to identify the contributions of individual components of xLSTM-Mixer, followed by an inspection of the initial embedding tokens, a hyperparameter sensitivity analysis, and an investigation into its robustness.
Dataset | Source | Domain | Horizons | Sampling | #Variates |
---|---|---|---|---|---|
Weather | Zhou etal. (2021) | Weather | 96โ720 | 10 min | 21 |
Electricity | Zhou etal. (2021) | Power Usage | 96โ720 | 1 hour | 321 |
Traffic | Wu etal. (2021) | Traffic Load | 96โ720 | 1 hour | 862 |
ETT | Zhou etal. (2021) | Power Production | 96โ720 | 15&60 min | 7 |
Datasets.We generally follow the established benchmark procedure of Wu etal. (2021) and Zhou etal. (2021) for best backward and future comparability. The datasets we thus used are summarized in Tab.1.
Training.We follow standard practice in the forecasting literature by evaluating long-term forecasts using the mean squared error(MSE) and the mean absolute error(MAE). Based on our experiments, we used the MAE as the training loss function since it yielded the best results. The datasets were standardized for consistency across features. Further details on hyperparameter selection, metrics, and the implementation can be found in Sec.A.1.
Baseline Models.We compare xLSTM-Mixer to the recurrent models xLSTMTime(Alharthi & Mahmood, 2024) and LSTM(Hochreiter & Schmidhuber, 1997); multi-perceptron(MLP) based models TimeMixer(Wang etal., 2024a), TSMixer(Chen etal., 2023c), DLinear(Zeng etal., 2023), and TiDE(Das etal., 2023); the Transformers PatchTST(Nie etal., 2023), iTransformer(Liu etal., 2023), FEDFormer(Zhou etal., 2022), and Autoformer(Wu etal., 2021); and the convolutional architectures MICN(Wang etal., 2022) and TimesNet(Wu etal., 2022).
4.1 Long-Term Time Series Forecasting
xLSTM-Mixer xLSTMTime LSTM TimeMixer TSMixer DLinear TiDE PatchTST iTransformer FEDFormer Autoformer MICN TimesNet (Ours) 2024 1997a 2024a 2023c 2023 2023 2023 2023 2022 2021 2022 2022 MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE 0.219 0.250 0.444 0.454 0.262 0.225 0.264 0.246 0.300 0.236 0.282 0.241 0.264 0.258 0.278 0.309 0.360 0.338 0.382 0.242 0.299 0.259 0.287 0.153 0.245 0.157 0.250 0.559 0.549 0.160 0.256 0.166 0.264 0.159 0.257 0.159 0.253 0.178 0.270 0.214 0.321 0.227 0.338 0.186 0.295 0.192 0.295 0.392 0.253 0.391 1.011 0.541 0.262 0.408 0.284 0.434 0.295 0.356 0.391 0.264 0.428 0.282 0.609 0.376 0.628 0.379 0.541 0.315 0.620 0.336 0.397 0.420 0.428 1.198 0.821 0.411 0.412 0.428 0.423 0.437 0.419 0.430 0.413 0.434 0.454 0.448 0.440 0.460 0.496 0.487 0.558 0.535 0.458 0.450 0.340 0.346 0.386 3.095 1.352 0.316 0.384 0.355 0.401 0.431 0.447 0.345 0.394 0.381 0.383 0.407 0.433 0.447 0.453 0.462 0.588 0.525 0.414 0.427 0.339 0.366 1.142 0.782 0.348 0.375 0.375 0.357 0.379 0.355 0.378 0.353 0.382 0.407 0.410 0.448 0.452 0.588 0.517 0.392 0.413 0.400 0.406 0.248 0.307 0.254 2.395 1.177 0.256 0.315 0.267 0.322 0.267 0.332 0.312 0.256 0.317 0.288 0.332 0.304 0.349 0.324 0.368 0.328 0.382 0.291 0.333 Wins 18 23 1 2 3 1 2 1 Taken from Wu etal. (2022).Models Recurrent MLP Transformer Convolutional Dataset Weather 0.222 0.255 0.222 Electricity 0.156 0.246 Traffic 0.261 0.387 0.261 ETTh1 0.408 0.423 ETTh2 0.382 0.324 ETTm1 0.347 0.372 0.347 ETTm2 0.310 0.249 5 5
We present the performance of xLSTM-Mixer compared to prior models in Tab.2. As shown, xLSTM-Mixer consistently delivers highly accurate forecasts across a wide range of datasets. It achieves the best results in 18 out of 28 cases for MSE and 22 out of 28 cases for MAE, demonstrating its superior performance in long-term forecasting.In particular, xLSTM-Mixer exhibits exceptional forecasting accuracy, as evidenced particularly by its strong MAE performance across all datasets. Notably, on Weather, xLSTM-Mixer reduces the MAE by 2% compared to xLSTMTime and 4.6% compared to TimeMixer. Similarly, for ETTm1, xLSTM-Mixer outperforms TimeMixer by 2.4% in MAE and shows a strong competitive edge over xLSTMTime.Although xLSTM-Mixer performs slightly less well on the Traffic and ETTh2 datasets, where it encounters challenges with handling outliers, it remains highly competitive and outperforms the majority of baseline models. This suggests that despite these few cases, xLSTM-Mixer can consistently deliver state-of-the-art performance in long-term forecasting. A qualitative inspection of several baseline models, including the initial forecast extracted before the sLSTM refinement, is shown in Fig.2.In this comparison, the lookback window and forecasting horizon are both fixed at 96.
4.2 Model Analysis
Ablation Study.
To assess the contributions of each component in xLSTM-Mixer to its strong overall forecast performance, we conducted an extensive ablation study with the results listed in Tab.3. Each configuration represents a different combination of the four key components: mixing time with NLinear, using sLSTM blocks, learning an initial embedding token, and multi-view mixing. We evaluated the performance using the MSE and MAE across the prediction lengths .
#1 (full) #2 #3 #4 #5 #6 #7 #8 #9 #10 โ โ โ โ โ โ โ โ โ โ Variates Time Variates Variates Variates None Variates Variates Variates Variates โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ Horizon MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE 96 0.143 0.184 0.148 0.194 0.145 0.186 0.144 0.185 0.144 0.186 0.173 0.223 0.149 0.193 0.151 0.195 0.149 0.192 0.152 0.195 192 0.186 0.226 0.196 0.239 0.188 0.228 0.186 0.226 0.188 0.228 0.219 0.257 0.192 0.233 0.192 0.234 0.191 0.234 0.193 0.236 336 0.237 0.266 0.252 0.281 0.239 0.267 0.241 0.270 0.242 0.270 0.261 0.288 0.240 0.271 0.242 0.273 0.242 0.273 0.244 0.274 720 0.310 0.324 0.315 0.328 0.310 0.324 0.309 0.323 0.309 0.323 0.320 0.334 0.320 0.329 0.319 0.329 0.322 0.330 0.319 0.328 96 0.275 0.328 0.298 0.348 0.277 0.329 0.278 0.331 0.279 0.333 0.295 0.338 0.282 0.339 0.285 0.341 0.281 0.337 0.284 0.339 192 0.319 0.354 0.337 0.369 0.321 0.354 0.321 0.356 0.322 0.358 0.329 0.357 0.329 0.364 0.330 0.365 0.337 0.367 0.335 0.366 336 0.353 0.374 0.368 0.388 0.354 0.375 0.355 0.377 0.357 0.379 0.359 0.376 0.367 0.385 0.367 0.385 0.366 0.384 0.366 0.385 720 0.409 0.407 0.420 0.416 0.411 0.408 0.413 0.411 0.414 0.411 0.412 0.407 0.422 0.412 0.422 0.413 0.417 0.410 0.418 0.411Mix Time sLSTM Init. Token Mix View Weather ETTm1
The full version of xLSTM-Mixer (#1), which integrates all components, achieves the best performance overall. However, we also observe that some configurations of xLSTM-Mixer, which exclude specific components, remain competitive.For instance, #3, which excludes the initial embedding token, still performs reasonably well.This suggests that while it contributes positively to the overall performance, the model can sometimes still achieve competitive results without it.In general, removing any specific component leads to a performance drop. For example, removing the time mixing (#7) increases the MAE by 3.4% on ETTm1 at length 96 or 2.8% at length 192, highlighting its critical role in capturing intratemporal dependencies.When we now omit everything except for time mixing on Weather at 192, we suffer a 13.7% performance decrease.In summary, the ablation study confirms that all components of xLSTM-Mixer contribute to its effectiveness, with the full configuration yielding the best results. Furthermore, we identified the sLSTM blocks and time-mixing as critical components for ensuring high accuracy across datasets and prediction lengths.
Initial Token Embedding.
We qualitatively inspect decodings of the initial embedding tokens on multiple datasets to further understand and interpret the initializations learned by xLSTM-Mixer. are decoded to a forecast by transforming them through the sLSTM stack and applying multi-view mixing.The resulting output of can then be interpreted as the conditioning forecast used to initialize the sLSTM blocks.Fig.3 shows the dataset-specific patterns the initial embedding tokens have learned on Weather, ETTm1, and ETTh2 for various prediction horizons.With increasing prediction horizons, we observe longer spans of time, eventually revealing underlying seasonal patterns and respective dataset dynamics.
Sensitivity to xLSTM Hidden
Dimension.
In Fig.4, we visualize the performance of xLSTM-Mixer on the Electricity dataset with increasing sLSTM embedding (hidden) dimension realized by . The results indicate that larger hidden dimensions consistently enhance the modelโs performance, particularly for longer forecast horizons.This suggests that a larger embedding dimension enables xLSTM-Mixer to capture better the higher complexity of the time series data over extended horizons, leading to improved forecasting accuracy.
Robustness to Lookback Length.
Fig.5illustrates the performance of xLSTM-Mixer across varying lookback lengths and prediction horizons. We observe that xLSTM-Mixer can effectively utilize longer lookback windows than the baselines, especially when compared to transformer-based models.This advantage stems from xLSTM-Mixerโs avoidance of self-attention, allowing it to handle extended lookback lengths efficiently.Additionally, xLSTM-Mixer demonstrates stable and consistent performance with low variance.These results confirm that increasing the lookback length improves forecasting accuracy and enhances robustness, particularly for longer prediction horizons.
5 Related Work
Time Series Forecasting.
A long line of machine learning research led from early statistical methods like ARIMA(Box & Jenkins, 1976) to contemporary models based on deep learning, where four architectural families take center stage: The ones based on recurrence, convolutions, Multilayer Perceptrons(MLPs), and Transformers.While all of them are used by practitioners today, the research focus is gradually shifting over time.Initially, the naturally sequential recurrent models such as Long Short-Term Memory(LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units(GRUs) (Cho etal., 2014) were used for time series analysis.Their main benefits are the high inference efficiency and arbitrary input and output lengths due to their autoregressive nature.While their effectiveness has historically been constrained by a limited ability to capture long-range dependencies, active research continues to alleviate these limitations(Salinas etal., 2020), including the xLSTM architecture presented in Sec.2 (Beck etal., 2024; Alharthi & Mahmood, 2024).Similarly efficient as RNNs, yet more restricted in their output length, are the location-invariant CNNs(Li etal., 2022; Lara-Benรญtez etal., 2021), such as TCN (Lea etal., 2016), TimesNet(Wu etal., 2022), and MICN(Wang etal., 2022).Recently, some MLP-based architectures have also shown good success, including the simplistic DLinear and NLinear models(Zeng etal., 2023), the encoder-decoder architecture of TiDE(Das etal., 2023), the mixing architectures TimeMixer(Wang etal., 2024a) and TSMixer(Chen etal., 2023c), as well as the hierarchical N-BEATSOreshkin etal. (2019) and N-HiTS(Challu etal., 2023) models.Finally, a lot of models have been proposed based on Transformers(Vaswani etal., 2017), such as Autoformer(Wu etal., 2021), TFT(Lim etal., 2021), FEDFormer(Zhou etal., 2022), PatchTST(Nie etal., 2023), and iTransformer(Liu etal., 2023).
xLSTM Models for Time Series.
Some initial experiments of applying xLSTMs(Beck etal., 2024) to time series were already performed by Alharthi & Mahmood (2024) with their proposed xLSTMTime model.While it showed promising forecasting performance, these initial soundings did not surpass stronger recent models such as TimeMixer(Wang etal., 2024a) on multivariate benchmarks, and the reported performance is challenging to reproduce.We ensure that our method xLSTM-Mixer is well suited as a foundation for further research by providing extensive model analysis, including an ablation study with ten variants, and ensuring that results are readily reproducible.Our methodology draws from xLSTMTime yet improves on it by several key components.Most importantly, our novel multi-view mixing consistently enhances forecasting performance.Furthermore, we find the trend-seasonality decomposition to be redundant and a simple NLinear normalization scheme(Zeng etal., 2023) to suffice.
6 Conclusion
In this work, we introduced xLSTM-Mixer, a method that combines a linear forecast with further refinement using xLSTM blocks. Our architecture effectively integrates time, joint, and view mixing to capture complex dependencies. In long-term forecasting, xLSTM-Mixer consistently achieved state-of-the-art performance, outperforming previous methods in 41 out of 56 cases. Furthermore, our detailed model analysis provided valuable insights into the contribution of each component and demonstrated its robustness to varying hyperparameter settings.
While xLSTM-Mixer has shown extraordinary performance in long-term forecasting, it should be noted that due to the transpose of the input, i.e., processing the variates as sequence elements, the number of variates may limit the overall performance.To overcome this, we plan to explore how different variate orderings influence performance and whether incorporating more than two views could lead to further improvements.This study focused on long-term forecasting, yet extending xLSTM-Mixer to tasks such as short-term forecasting, time series classification, or imputation offers promising directions for future research.
Ethics Statement
Our research advances machine learning by enhancing the capabilities of long-term forecasting in time series models, significantly improving both accuracy and efficiency. By developing xLSTM-Mixer, we introduce a robust framework that can be applied across various industries, including finance, healthcare, energy, and logistics. The improved forecasting accuracy enables better decision-making in critical areas, such as optimizing resource allocation, predicting market trends, and managing risk.
However, we also recognize the potential risks associated with the misuse of these advanced models. Time series forecasting models could be leveraged for malicious purposes, especially when applied at scale. For example, in the financial sector, adversarial agents might manipulate forecasts to create market instability. In political or social contexts, these models could be exploited to predict and influence public opinion or destabilize economies. Additionally, the application of these models in sensitive domains like healthcare and security may lead to unintended consequences if not carefully regulated and ethically deployed.
Therefore, it is essential that the use of xLSTM-Mixer, like all machine learning technologies, is guided by responsible practices and ethical considerations. We encourage stakeholders to adopt rigorous evaluation processes to ensure fairness, transparency, and accountability in its deployment, and to remain vigilant to the broader societal implications of time series forecasting technologies.
Reproducibility Statement
All implementation details, including dataset descriptions, metric calculations, and experiment configurations, are provided in Sec.4 and Sec.A.1.We make sure to exclusively use openly available software and datasets and provide the source code for full reproducibility at
https://github.com/mauricekraus/xLSTM-Mixer.
Acknowledgments
This work received funding from the EU project EXPLAIN, under the Federal Ministry of Education and Research (BMBF) (grant 01โS22030D). Furthermore, it was funded by the ACATIS Investment KVG mbH project โTemporal Machine Learning for Long-Term Value Investingโ and the BMBF KompAKI project within the โThe Future of Value Creation โ Research on Production, Services and Workโ program (funding number 02L19C150), managed by the Project Management Agency Karlsruhe (PTKA). The author of Eindhoven University of Technology received support from their Department of Mathematics and Computer Science and the Eindhoven Artificial Intelligence Systems Institute. Furthermore, this work benefited from the HMWK project โThe Third Wave of Artificial Intelligence - 3AIโ.
References
- Akiba etal. (2019)Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama.Optuna: A Next-generation Hyperparameter Optimization Framework.In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2019.
- Alharthi & Mahmood (2024)Musleh Alharthi and Ausif Mahmood.xLSTMTime: Long-Term Time Series Forecasting with xLSTM.MDPI AI, 5(3):1482โ1495, 2024.
- Beck etal. (2024)Maximilian Beck, Korbinian Pรถppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Gรผnter Klambauer, Johannes Brandstetter, and Sepp Hochreiter.xLSTM: Extended Long Short-Term Memory.ArXiv:2405.04517, 2024.
- Box & Jenkins (1976)George E.P. Box and GwilymM. Jenkins.Time series analysis: forecasting and control.Holden-Day series in time series analysis and digital processing. Holden-Day, San Francisco, rev. ed. edition, 1976.ISBN 0-8162-1104-3.
- Cao etal. (2023)Defu Cao, Furong Jia, SercanO. Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu.TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting.In The Twelfth International Conference on Learning Representations, 2023.
- Challu etal. (2023)Cristian Challu, KinG. Olivares, Boris Oreshkin, Federico Ramirez, Max Canseco, and Artur Dubrawski.NHITS: Neural Hierarchical Interpolation for Time Series Forecasting.Proceedings of the AAAI Conference on Artificial Intelligence, 37:6989โ6997, 2023.
- Chen etal. (2023a)Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou.How many demonstrations do you need for in-context learning?Findings of the Association for Computational Linguistics, EMNLP 2023:11149โ11159, 2023a.
- Chen etal. (2023b)Lichang Chen, Heng Huang, and Minhao Cheng.Ptp: Boosting stability and performance of prompt tuning with perturbation-based regularizer.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13512โ13525. Association for Computational Linguistics, 2023b.
- Chen etal. (2023c)Si-An Chen, Chun-Liang Li, NathanaelC Yoder, Sercanร Arฤฑk, and Tomas Pfister.TSMixer: An All-MLP Architecture for Time Series Forecasting.Transactions on Machine Learning Research, 2023c.
- Cho etal. (2014)Kyunghyun Cho, Bart van Merriรซnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.Learning Phrase Representations using RNN EncoderโDecoder for Statistical Machine Translation.In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724โ1734, Doha, Qatar, 2014. Association for Computational Linguistics.
- Das etal. (2023)Abhimanyu Das, Weihao Kong, Andrew Leach, ShaanK. Mathur, Rajat Sen, and Rose Yu.Long-term Forecasting with TiDE: Time-series Dense Encoder.Transactions on Machine Learning Research, 2023.ISSN 2835-8856.
- Essien & Giannetti (2020)Aniekan Essien and Cinzia Giannetti.A Deep Learning Model for Smart Manufacturing Using Convolutional LSTM Neural Network Autoencoders.IEEE Transactions on Industrial Informatics, 16(9):6069โ6078, 2020.
- Greff etal. (2017)Klaus Greff, RupeshKumar Srivastava, Jan Koutnรญk, BasR. Steunebrink, and Jรผrgen Schmidhuber.LSTM: A Search Space Odyssey.IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222โ2232, 2017.
- Hochreiter & Schmidhuber (1997)Sepp Hochreiter and Jรผrgen Schmidhuber.Long Short-Term Memory.Neural Computation, 9(8):1735โ1780, 1997.
- Hosseini etal. (2021)Mohammad-Parsa Hosseini, Amin Hosseini, and Kiarash Ahi.A Review on Machine Learning for EEG Signal Processing in Bioengineering.IEEE Reviews in Biomedical Engineering, 14:204โ218, 2021.
- Huang etal. (2023)Lei Huang, Jie Qin, YiZhou, Fan Zhu, LiLiu, and Ling Shao.Normalization Techniques in Training DNNs: Methodology, Analysis and Application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10173 โ 10196, 2023.
- Kim etal. (2022)Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo.Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift.In International Conference on Learning Representations, 2022.
- Kingma & Ba (2017)DiederikP. Kingma and Jimmy Ba.Adam: A Method for Stochastic Optimization.ArXiv:1412.6980, 2017.
- Lam etal. (2023)Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia.Learning skillful medium-range global weather forecasting.Science, 382(6677):1416โ1421, 2023.
- Lara-Benรญtez etal. (2021)Pedro Lara-Benรญtez, Manuel Carranza-Garcรญa, and JosรฉC. Riquelme.An Experimental Review on Deep Learning Architectures for Time Series Forecasting.International Journal of Neural Systems, 31(03):2130001, 2021.
- Latif etal. (2023)Siddique Latif, Heriberto Cuayรกhuitl, Farrukh Pervez, Fahad Shamshad, HafizShehbaz Ali, and Erik Cambria.A survey on deep reinforcement learning for audio-based applications.Artificial Intelligence Review, 56(3):2193โ2240, 2023.
- Lea etal. (2016)Colin Lea, Renรฉ Vidal, Austin Reiter, and GregoryD. Hager.Temporal Convolutional Networks: A Unified Approach to Action Segmentation.In Gang Hua and Hervรฉ Jรฉgou (eds.), Computer Vision โ ECCV 2016 Workshops, volume 9915, pp. 47โ54, Cham, 2016.
- Lester etal. (2021)Brian Lester, Rami Al-Rfou, and Noah Constant.The Power of Scale for Parameter-Efficient Prompt Tuning.In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045โ3059. Association for Computational Linguistics, 2021.
- Li & Liang (2021)XiangLisa Li and Percy Liang.Prefix-tuning: Optimizing continuous prompts for generation.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, volume Volume 1: Long Papers, pp. 4582โ4597. Association for Computational Linguistics, 2021.
- Li etal. (2022)Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou.A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.IEEE Transactions on Neural Networks and Learning Systems, 33(12):6999โ7019, 2022.
- Li etal. (2023)Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu.Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping.ArXiv:2305.10721, 2023.
- Lim etal. (2021)Bryan Lim, Sercanร. Arฤฑk, Nicolas Loeff, and Tomas Pfister.Temporal Fusion Transformers for interpretable multi-horizon time series forecasting.International Journal of Forecasting, 37(4):1748โ1764, 2021.
- Lin etal. (2012)Wei-Yang Lin, Ya-Han Hu, and Chih-Fong Tsai.Machine Learning in Financial Crisis Prediction: A Survey.IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):421โ436, 2012.
- Lippi etal. (2013)Marco Lippi, Matteo Bertini, and Paolo Frasconi.Short-Term Traffic Flow Forecasting: An Experimental Comparison of Time-Series Analysis and Supervised Learning.IEEE Transactions on Intelligent Transportation Systems, 14(2):871โ882, 2013.
- Liu etal. (2023)Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long.iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.In The Twelfth International Conference on Learning Representations, 2023.
- Nie etal. (2023)Yuqi Nie, NamH. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam.A Time Series is Worth 64 Words: Long-term Forecasting with Transformers.In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
- Oreshkin etal. (2019)BorisN. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio.N-BEATS: Neural basis expansion analysis for interpretable time series forecasting.In International Conference on Learning Representations, 2019.
- Paszke etal. (2019)Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, LuFang, Junjie Bai, and Soumith Chintala.PyTorch: An Imperative Style, High-Performance Deep Learning Library.In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Patro & Agneeswaran (2024)BadriNarayana Patro and VijaySrinivas Agneeswaran.Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges.ArXiv:2404.16112, 2024.
- Salinas etal. (2020)David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski.DeepAR: Probabilistic forecasting with autoregressive recurrent networks.International Journal of Forecasting, 36(3):1181โ1191, 2020.
- Seyedan & Mafakheri (2020)Mahya Seyedan and Fereshteh Mafakheri.Predictive big data analytics for supply chain demand forecasting: methods, applications, and research opportunities.Journal of Big Data, 7(1):53, 2020.
- Sun etal. (2024)Chenxi Sun, Hongyan Li, Yaliang Li, and Shenda Hong.TEST: Text prototype aligned embedding to activate LLMโs ability for time series.In The twelfth international conference on learning representations, 2024.
- Tolstikhin etal. (2021)IlyaO Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy.MLP-Mixer: An all-MLP Architecture for Vision.In Advances in Neural Information Processing Systems, volume34, pp. 24261โ24272. Curran Associates, Inc., 2021.
- Vaswani etal. (2017)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Lukasz Kaiser, and Illia Polosukhin.Attention is All you Need.In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2017.
- Wang etal. (2022)Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao.MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting.In The Eleventh International Conference on Learning Representations, 2022.
- Wang etal. (2024a)Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, JamesY. Zhang, and Jun Zhou.TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting.In The Twelfth International Conference on Learning Representations, 2024a.
- Wang etal. (2024b)Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang.Deep Time Series Models: A Comprehensive Survey and Benchmark, July 2024b.
- Wu etal. (2021)Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long.Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting.In Advances in Neural Information Processing Systems, 2021.
- Wu etal. (2022)Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long.TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis.In The Eleventh International Conference on Learning Representations, 2022.
- Zeng etal. (2023)Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu.Are Transformers Effective for Time Series Forecasting?In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
- Zhang & Yang (2022)YuZhang and Qiang Yang.A Survey on Multi-Task Learning.IEEE Transactions on Knowledge and Data Engineering, 34(12):5586โ5609, 2022.
- Zhou etal. (2021)Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting.Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):11106โ11115, 2021.
- Zhou etal. (2022)Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin.FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting.In Proceedings of the 39th International Conference on Machine Learning, volume 162, 2022.
Appendix A Appendix
A.1 Implementation Details
Experimental Details
Our codebase is implemented in Python 3.11, leveraging PyTorch 2.4(Paszke etal., 2019) in combination with Lightning 2.4222https://lightning.ai/pytorch-lightning for model training and optimization.We used the custom CUDA implementation of Beck etal. (2024) for sLSTM333https://github.com/NX-AI/xlstm which relies on the NVIDIA Compute Capability 8.0 or higher. Thus, our experiments were conducted on a single NVIDIA A100 80GB GPU.The majority of our baseline implementations, along with data loading and preprocessing steps, are adapted from the Time-Series-Library444https://github.com/thuml/Time-Series-Library of Wang etal. (2024b).Additionally, for xLSTMTime we used code based on the official repository555https://github.com/muslehal/xLSTMTime of Alharthi & Mahmood (2024).
Training and Hyperparameters
We optimized xLSTM-Mixer for up to 60 epochs with a cosine-annealing scheduler with the Adam optimizer (Kingma & Ba, 2017), using and and no weight decay. Hyperparameter tuning was conducted using Optuna (Akiba etal., 2019) with the choices provided in Tab.4.We optimized for the L1 forecast error, also known as the Mean Absolute Error (MAE). To further stabilize the training process, gradient clipping with a maximum norm of was applied. All experiments were run with three different random seeds {2021, 2022, 2023}.
Hyperparameter | Choices |
---|---|
Batch size | {16, 32, 64, 128, 256, 512} |
Initial learning rate | {, , , , , } |
Scheduler warmup steps | {5, 10, 15} |
Lookback length | {96, 256, 512, 768, 1024, 2048} |
Embedding dimension | {32, 64, 128, 256, 512, 768, 1024} |
sLSTM conv. kernel width | {disabled, 2, 4} |
sLSTM dropout rate | {0.1, 0.25} |
# sLSTM blocks in | {1, 2, 3, 4} |
# sLSTM heads | {4, 8, 16, 32} |
Metrics
We follow common practice in the literature(Wu etal., 2021; Wang etal., 2024a) for maximum comparability and, therefore, evaluate long-term forecasting of all models on the mean absolute error(MAE), mean squared error(MSE), and for short-term forecasting, using the MAE, root mean squared error(RMSE), and mean absolute percentage error(MAPE).The metrics are averaged over all variates and computed as:
where are the targets, the predictions, and a small constant added for numerical stability.
A.2 Outlook: Short-Term Time Series Forecasting
Having shown superior long-term forecasting accuracies in Sec.4.1, we also provide an initial exploration of the effectiveness of xLSTM-Mixer to short-term forecasts.To this end, we compare it to applicable baselines on PEMS datasets with input lengths uniformly set to 96 and prediction lengths to 12.The results in Tab.5 show that the performance of xLSTM-Mixer is competitive with existing methods.We provide the MAE, MAPE, and RMSE as is common practice.
xLSTM-Mixer xLSTMTime LSTM TimeMixer DLinear PatchTST FEDFormer Autoformer MICN TimesNet (Ours) 2024 1997a 2024a 2023 2023 2022 2021 2022 2022 MAE 16.59 18.65 14.63 19.70 18.95 19.00 18.08 16.41 MAPE 15.31 17.39 14.54 18.35 17.29 18.57 18.75 15.67 15.17 RMSE 26.47 31.73 23.28 32.35 30.15 30.05 27.82 24.55 26.72 MAE 17.44 20.34 15.22 20.26 20.35 20.56 20.47 17.76 19.01 MAPE 10.58 13.05 9.67 12.09 13.15 12.41 12.27 10.76 11.83 RMSE 28.13 31.90 24.26 32.38 31.04 32.97 31.52 27.26 30.65 Configuration following Wu etal. (2021).Models Recurrent MLP Transformer Convolutional PEMS03 15.71 15.71 14.92 24.82 PEMS08 16.56 10.24 26.65
A.3 Full Results for Long-Term Forecasting
Tab.6 shows the full results for long-term forecasting on all four separate forecast horizons.
xLSTM-Mixer xLSTMTime LSTM TimeMixer TSMixer DLinear TiDE PatchTST iTransformer FEDFormer Autoformer MICN TimesNet (Ours) 2024 1997a 2024a 2023c 2023 2023 2023 2023 2022 2021 2022 2022 Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE 96 0.143 0.184 0.369 0.406 0.147 0.197 0.145 0.198 0.176 0.237 0.166 0.222 0.149 0.198 0.174 0.214 0.217 0.296 0.266 0.336 0.161 0.229 0.172 0.220 192 0.186 0.226 0.192 0.416 0.435 0.239 0.191 0.242 0.220 0.282 0.209 0.263 0.194 0.241 0.221 0.254 0.276 0.336 0.307 0.367 0.220 0.281 0.219 0.261 336 0.236 0.266 0.455 0.454 0.241 0.280 0.242 0.280 0.265 0.319 0.254 0.301 0.306 0.282 0.278 0.296 0.339 0.380 0.359 0.395 0.278 0.331 0.280 0.306 720 0.310 0.323 0.313 0.535 0.520 0.310 0.330 0.320 0.336 0.323 0.362 0.313 0.340 0.314 0.334 0.358 0.347 0.403 0.428 0.419 0.428 0.356 0.365 0.359 Avg 0.219 0.250 0.444 0.454 0.262 0.225 0.264 0.246 0.300 0.236 0.282 0.241 0.264 0.258 0.278 0.309 0.360 0.338 0.382 0.242 0.299 0.259 0.287 96 0.126 0.218 0.375 0.437 0.129 0.224 0.131 0.229 0.140 0.237 0.132 0.229 0.129 0.222 0.148 0.240 0.193 0.308 0.201 0.317 0.164 0.269 0.168 0.272 192 0.150 0.243 0.442 0.473 0.140 0.220 0.151 0.246 0.153 0.249 0.147 0.243 0.147 0.240 0.162 0.253 0.201 0.315 0.222 0.334 0.177 0.285 0.184 0.289 336 0.157 0.250 0.166 0.259 0.439 0.473 0.161 0.261 0.169 0.267 0.261 0.163 0.259 0.178 0.269 0.214 0.329 0.231 0.338 0.193 0.304 0.198 0.300 720 0.183 0.276 0.276 0.980 0.814 0.194 0.197 0.293 0.203 0.301 0.196 0.294 0.197 0.290 0.225 0.317 0.246 0.355 0.254 0.361 0.212 0.321 0.220 0.320 Avg 0.153 0.245 0.157 0.250 0.559 0.549 0.160 0.256 0.166 0.264 0.159 0.257 0.159 0.253 0.178 0.270 0.214 0.321 0.227 0.338 0.186 0.295 0.192 0.295 96 0.236 0.358 0.843 0.453 0.360 0.249 0.376 0.264 0.410 0.282 0.336 0.253 0.360 0.249 0.395 0.268 0.587 0.366 0.613 0.388 0.519 0.309 0.593 0.321 192 0.377 0.241 0.378 0.253 0.847 0.453 0.397 0.277 0.423 0.287 0.346 0.257 0.379 0.256 0.417 0.276 0.604 0.373 0.616 0.382 0.537 0.315 0.617 0.336 336 0.394 0.250 0.392 0.261 0.853 0.455 0.270 0.413 0.290 0.436 0.296 0.355 0.392 0.264 0.433 0.283 0.621 0.383 0.622 0.337 0.534 0.313 0.629 0.336 720 0.439 0.283 0.434 0.287 1.500 0.805 0.444 0.306 0.466 0.315 0.386 0.273 0.432 0.286 0.467 0.302 0.626 0.382 0.660 0.408 0.577 0.325 0.640 0.350 Avg 0.392 0.253 0.391 1.011 0.541 0.262 0.408 0.284 0.434 0.295 0.356 0.391 0.264 0.428 0.282 0.609 0.376 0.628 0.379 0.541 0.315 0.620 0.336 96 0.359 0.386 0.368 0.395 1.044 0.773 0.361 0.361 0.392 0.375 0.399 0.375 0.398 0.370 0.400 0.386 0.405 0.376 0.419 0.449 0.459 0.421 0.431 0.384 0.402 192 0.417 0.401 1.217 0.832 0.409 0.414 0.404 0.418 0.405 0.412 0.422 0.413 0.429 0.441 0.436 0.420 0.448 0.500 0.482 0.474 0.487 0.436 0.429 336 0.408 0.429 0.422 0.437 1.259 0.841 0.430 0.429 0.420 0.431 0.439 0.443 0.435 0.433 0.422 0.440 0.487 0.458 0.459 0.465 0.521 0.496 0.569 0.551 0.491 0.469 720 0.419 0.448 0.465 1.271 0.838 0.445 0.463 0.472 0.472 0.490 0.454 0.465 0.447 0.468 0.503 0.491 0.506 0.507 0.514 0.512 0.770 0.672 0.521 0.500 Avg 0.397 0.420 0.428 1.198 0.821 0.411 0.412 0.428 0.423 0.437 0.419 0.430 0.413 0.434 0.454 0.448 0.440 0.460 0.496 0.487 0.558 0.535 0.458 0.450 96 0.267 0.329 0.273 0.333 2.522 1.278 0.271 0.274 0.341 0.289 0.353 0.336 0.274 0.337 0.297 0.349 0.346 0.388 0.358 0.397 0.299 0.364 0.340 0.374 192 0.338 0.375 0.340 3.312 1.384 0.402 0.339 0.385 0.383 0.418 0.332 0.380 0.314 0.382 0.380 0.400 0.429 0.439 0.456 0.452 0.441 0.454 0.402 0.414 336 0.367 0.401 0.373 0.403 3.291 1.388 0.361 0.406 0.448 0.465 0.360 0.407 0.329 0.384 0.428 0.432 0.496 0.487 0.482 0.486 0.654 0.567 0.452 0.452 720 0.388 0.424 0.398 0.430 3.257 1.357 0.342 0.408 0.445 0.470 0.605 0.551 0.419 0.451 0.427 0.445 0.463 0.474 0.515 0.511 0.956 0.716 0.462 0.468 Avg 0.340 0.346 0.386 3.095 1.352 0.316 0.384 0.355 0.401 0.431 0.447 0.345 0.394 0.381 0.383 0.407 0.433 0.447 0.453 0.462 0.588 0.525 0.414 0.427 96 0.275 0.328 0.286 0.863 0.664 0.291 0.340 0.339 0.299 0.343 0.306 0.349 0.293 0.346 0.334 0.368 0.379 0.419 0.505 0.475 0.316 0.362 0.338 0.375 192 0.319 0.354 0.329 1.113 0.776 0.365 0.365 0.335 0.365 0.335 0.366 0.333 0.370 0.377 0.391 0.426 0.441 0.553 0.496 0.363 0.390 0.374 0.387 336 0.353 0.374 0.358 1.267 0.832 0.360 0.381 0.382 0.369 0.386 0.364 0.384 0.369 0.392 0.426 0.420 0.445 0.459 0.621 0.537 0.408 0.426 0.410 0.411 720 0.409 0.407 0.416 1.324 0.858 0.415 0.417 0.419 0.414 0.425 0.421 0.413 0.416 0.420 0.491 0.459 0.543 0.490 0.671 0.561 0.481 0.476 0.478 0.450 Avg 0.339 0.366 1.142 0.782 0.348 0.375 0.375 0.357 0.379 0.355 0.378 0.353 0.382 0.407 0.410 0.448 0.452 0.588 0.517 0.392 0.413 0.400 0.406 96 0.157 0.244 0.164 2.041 1.073 0.164 0.254 0.163 0.252 0.167 0.260 0.251 0.166 0.256 0.180 0.264 0.203 0.287 0.255 0.339 0.179 0.275 0.187 0.267 192 0.213 0.285 0.218 2.249 1.112 0.223 0.295 0.216 0.290 0.224 0.303 0.289 0.223 0.296 0.250 0.309 0.269 0.328 0.281 0.340 0.307 0.376 0.249 0.309 336 0.269 0.322 0.271 0.322 2.568 1.238 0.279 0.330 0.281 0.342 0.267 0.326 0.274 0.329 0.311 0.348 0.325 0.366 0.339 0.372 0.325 0.388 0.321 0.351 720 0.351 0.377 0.361 2.720 1.287 0.359 0.383 0.420 0.422 0.397 0.421 0.383 0.362 0.385 0.412 0.407 0.421 0.415 0.422 0.419 0.502 0.490 0.408 0.403 Avg 0.248 0.307 0.254 2.395 1.177 0.256 0.315 0.267 0.322 0.267 0.332 0.312 0.256 0.317 0.288 0.332 0.304 0.349 0.324 0.368 0.328 0.382 0.291 0.333 Wins 18 23 1 2 3 1 2 1 Taken from Wu etal. (2022).Models Recurrent MLP Transformer Convolutional Weather 0.144 0.187 0.236 0.189 0.237 0.272 0.326 0.311 0.222 0.255 0.222 Electricity 0.128 0.221 0.144 0.235 0.161 0.255 0.161 0.185 0.287 0.156 0.246 Traffic 0.357 0.242 0.375 0.250 0.385 0.260 0.430 0.281 0.261 0.387 0.261 ETTh1 0.390 0.402 0.416 0.416 0.441 0.460 0.408 0.423 ETTh2 0.330 0.270 0.378 0.317 0.332 0.396 0.379 0.422 0.382 0.324 ETTm1 0.335 0.285 0.361 0.327 0.327 0.379 0.356 0.411 0.413 0.347 0.372 0.347 ETTm2 0.250 0.161 0.288 0.215 0.268 0.324 0.380 0.352 0.310 0.249 5 5