Multivariate Time Series Forecasting by Mixing via Scalar Memories (2024)

Maurice Kraus
AI & ML Group, TU Darmstadt

Felix Divo
AI & ML Group, TU Darmstadt

Devendra Singh Dhami
Uncertainty in AI Group, TU Eindhoven
Hessian Center for AI (hessian.AI)

Kristian Kersting
AI & ML Group, TU Darmstadt
Centre for Cognitive Science, TU Darmstadt
Hessian Center for AI (hessian.AI)
German Research Center for AI (DFKI)

Contact: maurice.kraus@cs.tu-darmstadt.de

Abstract

Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions. We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting. Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks. These blocks serve as key elements for modeling the complex dynamics of challenging time series data. xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast. Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. This work contributes to the resurgence of recurrent models in time series forecasting.

1 Introduction

Time series are an essential data modality ubiquitous in many critical fields of application, such as medicine (Hosseini et al., 2021), manufacturing (Essien & Giannetti, 2020), logistics (Seyedan & Mafakheri, 2020), traffic management (Lippi et al., 2013), finance (Lin et al., 2012), audio processing (Latif et al., 2023), and weather modeling (Lam et al., 2023). While significant progress in time series forecasting has been made over the decades, the field is still far from being solved. The regular appearance of yet better models and improved combinations of existing approaches exemplifies this. Further increasing the forecast quality obtained from machine learning models promises manifold improvements, such as higher efficiency in manufacturing and transportation as well as more accurate medical treatments.

Historically, recurrent neural networks (RNNs) and their powerful successors were natural choices for deep learning-based time series forecasting (Hochreiter & Schmidhuber, 1997; Cho et al., 2014). Today, large Transformers (Vaswani et al., 2017) are applied extensively to time series tasks, including forecasting. Many improvements to the vanilla architecture have since been proposed, including patching (Nie et al., 2023), decompositions (Zeng et al., 2023), and tokenization inversions (Liu et al., 2023). However, some of their limitations are yet to be lifted. For instance, they typically require large datasets to train successfully, restricting their use to only a subset of conceivable applications. Furthermore, they are inefficient when applied to long sequences because the cost of the attention mechanism is quadratic in the number of variates or time steps, depending on the specific choice of tokenization. Therefore, recurrent and state space models (SSMs) (Patro & Agneeswaran, 2024) are experiencing a resurgence of interest as a way to overcome such limitations. Specifically, Beck et al. (2024) revisited recurrent models by borrowing insights gained from Transformers applied to many domains, in particular natural language processing. They propose Extended Long Short-Term Memory (xLSTM) models as a viable alternative to current sequence models.

We propose xLSTM-Mixer (https://github.com/mauricekraus/xLSTM-Mixer), a new state-of-the-art method for time series forecasting using recurrent deep learning methods. Specifically, we augment the highly expressive xLSTM architecture with carefully crafted time, variate, and multi-view mixing. These operations regularize the training and limit the model parameters by weight-sharing, effectively improving the learning of features necessary for accurate forecasting. xLSTM-Mixer initially computes a channel-independent linear forecast shared over the variates. This forecast is then up-projected to a higher hidden dimension and subsequently refined by an xLSTM stack. The model performs multi-view forecasting by producing a forecast from the original and from the reversed up-projected embedding. The powerful xLSTM cells thereby jointly mix time and variate information to capture complex patterns from the data. Both forecasts are eventually reconciled by a learned linear projection into the final prediction, again by mixing time. An overview of our method is shown in Figure 1.

[Figure 1: Overview of the xLSTM-Mixer method.]

Overall, we make the following contributions:

(i) We investigate time and variate mixing in the context of recurrent models and propose a joint multistage approach that is highly effective for multivariate time series forecasting. We argue that marching over the variates instead of the temporal axis yields better results if suitably combined with temporal mixing.

(ii) We propose xLSTM-Mixer, a state-of-the-art method for time series forecasting using recurrent deep learning methods.

(iii) We extensively compare xLSTM-Mixer with existing methods for multivariate long-term time series forecasting and perform in-depth model analyses. The experiments demonstrate that xLSTM-Mixer consistently achieves state-of-the-art performance in a wide range of benchmarks.

The remainder of this work is structured as follows: In Sec. 2, we introduce preliminaries to then motivate and explain xLSTM-Mixer in Sec. 3. We then present comprehensive experiments on its effectiveness and inner workings in Sec. 4. We finally contextualize the findings within the related work in Sec. 5 and close with a conclusion and outlook in Sec. 6.

2 Background

After introducing the notation used throughout this work, we review xLSTM blocks and discuss the trade-off between channel mixing and channel independence in time series models.

2.1 Notation

In multivariate time series forecasting, the model is presented with a time series $\bm{X} = (\bm{x}_1, \dots, \bm{x}_T) \in \mathbb{R}^{V \times T}$ consisting of $T$ time steps with $V$ variates each. Given this context, the forecaster shall predict the future values $\bm{Y} = (\bm{x}_{T+1}, \dots, \bm{x}_{T+H}) \in \mathbb{R}^{V \times H}$ up to a horizon $H$. A variate (also called a channel) can be any scalar measurement, such as the occupancy of a road or the oil temperature in a power plant. The measurements are assumed to be carried out jointly, such that the $T + H$ time steps reflect a regularly sampled multivariate signal. A time series dataset consists of $N$ such pairs $\{(\bm{X}^{(i)}, \bm{Y}^{(i)})\}_{i \in \{1, \dots, N\}}$ divided into train, validation, and test portions.
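To make the notation concrete, the following minimal sketch slices one long multivariate recording into context/target pairs. The helper name `make_windows`, the nested-list layout, and the stride of one time step are assumptions chosen for illustration; the paper does not prescribe them.

```python
def make_windows(series, T, H):
    """Split a multivariate series into (X, Y) pairs.

    `series` is a list of V variates, each a list of equal length.
    Every X has shape V x T (context), every Y has shape V x H (targets).
    """
    length = len(series[0])
    pairs = []
    for start in range(length - T - H + 1):
        X = [variate[start:start + T] for variate in series]
        Y = [variate[start + T:start + T + H] for variate in series]
        pairs.append((X, Y))
    return pairs


# Toy recording with V = 2 variates and 6 regularly sampled time steps.
series = [[0, 1, 2, 3, 4, 5],
          [10, 11, 12, 13, 14, 15]]
pairs = make_windows(series, T=3, H=2)
```

With these toy values, two pairs result, and the first context `[[0, 1, 2], [10, 11, 12]]` is followed by the targets `[[3, 4], [13, 14]]`.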

2.2 Extended Long Short-Term Memory (xLSTM)

Beck et al. (2024) propose xLSTM architectures consisting of two building blocks, namely the sLSTM and mLSTM modules. To harness the full expressivity of xLSTMs within each step and across the computation sequence, we employ a stack of sLSTM blocks without any mLSTM blocks. The latter are less suited for joint mixing due to their independent treatment of the sequence elements, making it impossible to learn any relationships between them directly. We will continue by recalling the construction of sLSTM cells.

The standard LSTM architecture of Hochreiter & Schmidhuber (1997) involves updating the cell state $\bm{c}_t$ through a combination of input, forget, and output gates, which regulate the flow of information across tokens. sLSTM blocks enhance this by incorporating exponential gating and memory mixing (Greff et al., 2017) to handle complex temporal and cross-variate dependencies more effectively. The sLSTM updates the cell state $\bm{c}_t$ and hidden state $\bm{h}_t$ using three gates as follows:

$$
\begin{aligned}
\bm{c}_t &= \bm{f}_t \odot \bm{c}_{t-1} + \bm{i}_t \odot \bm{z}_t && \text{cell state} && (1) \\
\bm{n}_t &= \bm{f}_t \cdot \bm{n}_{t-1} + \bm{i}_t && \text{normalizer state} && (2) \\
\bm{h}_t &= \bm{o}_t \odot \bm{c}_t \odot \bm{n}_t^{-1} && \text{hidden state} && (3) \\
\bm{z}_t &= \tanh\bigl(\bm{W}_z \bm{x}_t + \bm{R}_z \bm{h}_{t-1} + \bm{b}_z\bigr) && \text{cell input} && (4) \\
\bm{i}_t &= \exp\bigl(\bm{\tilde{i}}_t - \bm{m}_t\bigr), \quad \bm{\tilde{i}}_t = \bm{W}_i \bm{x}_t + \bm{R}_i \bm{h}_{t-1} + \bm{b}_i && \text{input gate} && (5) \\
\bm{f}_t &= \exp\bigl(\bm{\tilde{f}}_t + \bm{m}_{t-1} - \bm{m}_t\bigr), \quad \bm{\tilde{f}}_t = \bm{W}_f \bm{x}_t + \bm{R}_f \bm{h}_{t-1} + \bm{b}_f && \text{forget gate} && (6) \\
\bm{o}_t &= \sigma\bigl(\bm{W}_o \bm{x}_t + \bm{R}_o \bm{h}_{t-1} + \bm{b}_o\bigr) && \text{output gate} && (7) \\
\bm{m}_t &= \max\bigl(\bm{\tilde{f}}_t + \bm{m}_{t-1}, \bm{\tilde{i}}_t\bigr) && \text{stabilizer state} && (8)
\end{aligned}
$$

In this setup, the matrices $\bm{W}_z, \bm{W}_i, \bm{W}_f,$ and $\bm{W}_o$ are input weights mapping the input token $\bm{x}_t$ to the cell input $\bm{z}_t$, input gate, forget gate, and output gate, respectively. The states $\bm{n}_t$ and $\bm{m}_t$ provide the necessary normalization and training stabilization, respectively.
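The interplay of Eq. 1 to 8 can be traced with a single scalar memory cell. The sketch below is an illustration under stated assumptions rather than the reference implementation: all weights are scalars, the state starts at zero, and the names `slstm_step`, `sigmoid`, and `params` as well as the parameter values are made up for this example.

```python
import math


def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def slstm_step(x, state, p):
    """One sLSTM update (Eq. 1 to 8) for a single scalar memory cell.

    state = (c, n, h, m); p maps each of 'z', 'i', 'f', 'o' to (w, r, b).
    """
    c_prev, n_prev, h_prev, m_prev = state

    def pre(gate):
        w, r, b = p[gate]
        return w * x + r * h_prev + b

    z = math.tanh(pre("z"))                 # cell input, Eq. 4
    i_tilde, f_tilde = pre("i"), pre("f")
    o = sigmoid(pre("o"))                   # output gate, Eq. 7
    m = max(f_tilde + m_prev, i_tilde)      # stabilizer state, Eq. 8
    i = math.exp(i_tilde - m)               # input gate, Eq. 5
    f = math.exp(f_tilde + m_prev - m)      # forget gate, Eq. 6
    c = f * c_prev + i * z                  # cell state, Eq. 1
    n = f * n_prev + i                      # normalizer state, Eq. 2
    h = o * c / n                           # hidden state, Eq. 3
    return (c, n, h, m)


# Hypothetical parameters (w, r, b) per gate; the input 800.0 would overflow
# a plain exp() but is handled by the stabilizer state m.
params = {"z": (0.5, 0.1, 0.0), "i": (1.0, 0.0, 0.0),
          "f": (0.0, 0.0, 2.0), "o": (0.3, 0.2, 0.0)}
state = (0.0, 0.0, 0.0, 0.0)
for x in [1.0, 800.0, -2.0]:
    state = slstm_step(x, state, params)
```

Because $\bm{m}_t$ is subtracted inside both exponentials, their arguments never exceed zero, so the gates stay in $(0, 1]$ even for very large pre-activations.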

As Beck et al. have shown, it is beneficial to restrict the memory mixing performed by the recurrent weight matrices $\bm{R}_z, \bm{R}_i, \bm{R}_f,$ and $\bm{R}_o$ to individual heads, inspired by the multi-head setup of Transformers (Vaswani et al., 2017), yet more restricted and therefore more efficient to compute. In particular, each token gets broken up into separate pieces, where the input weights $\bm{W}_{z,i,f,o}$ act across all of them, but the recurrence matrices $\bm{R}_{z,i,f,o}$ are implemented as block-diagonal matrices and therefore only act within each piece. This permits specialization of the individual heads to patterns specific to the respective section of the tokens and empirically does not sacrifice expressivity.
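The effect of a block-diagonal recurrence matrix can be seen in a small dependency-free sketch. The helper names `block_diagonal` and `matvec` and all numeric values are illustrative assumptions; an actual implementation would use batched tensor operations.

```python
def block_diagonal(blocks):
    """Assemble square blocks into one block-diagonal matrix (nested lists)."""
    size = sum(len(b) for b in blocks)
    R = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, value in enumerate(row):
                R[offset + r][offset + c] = value
        offset += len(b)
    return R


def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Two heads of size 2 acting on a hidden state of dimension 4.
R = block_diagonal([[[1.0, 2.0], [3.0, 4.0]],
                    [[5.0, 6.0], [7.0, 8.0]]])
h_prev = [1.0, 1.0, 10.0, 10.0]
mixed = matvec(R, h_prev)
```

Here `mixed` equals `[3.0, 7.0, 110.0, 150.0]`: the first two entries depend only on the first head's slice of `h_prev`, the last two only on the second head's slice.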

2.3 Channel Independence and Mixing in Time Series Models

Multiple works have investigated whether it is beneficial to learn representations of the time and variate dimensions jointly or separately. Intuitively, because joint mixing is strictly more expressive, one might think it should always be preferred. It is indeed used by many methods such as Temporal Convolutional Networks (TCN) (Lea et al., 2016), N-BEATS (Oreshkin et al., 2019), N-HiTS (Challu et al., 2023), and many Transformers (Vaswani et al., 2017), including Temporal Fusion Transformer (TFT) (Lim et al., 2021), Autoformer (Wu et al., 2021), and FEDFormer (Zhou et al., 2022). However, treating slices of the input data independently assumes an invariance to temporal or variate positions and serves as a strong regularization against overfitting, reminiscent of kernels in CNNs. Prominent models implementing some aspects of channel independence in multivariate time series forecasting are PatchTST (Nie et al., 2023) and iTransformer (Liu et al., 2023). TiDE (Das et al., 2023), on the other hand, contains a time-step shared feature projection and temporal decoder but treats variates jointly. As Tolstikhin et al. (2021) have shown with MLP-Mixer, interleaving mixing of all channels in each token and all tokens per channel does not empirically sacrifice any expressivity and instead improves performance. This idea has since been applied to time series, too, namely in architectures such as TimeMixer (Wang et al., 2024a) and TSMixer (Chen et al., 2023c), and is one of the key components of our method xLSTM-Mixer.

3 xLSTM-Mixer

Now we have everything at hand to introduce xLSTM-Mixer as depicted in Fig. 1. It carefully integrates three key components: (1) an initial linear forecast with time mixing, (2) joint mixing using powerful sLSTM modules, and (3) an eventual combination of two views by a final fully connected layer. The transposing steps between the key components enable capturing complex temporal and intra-variate patterns while facilitating easy trainability and limiting parameter counts. The sLSTM block, in particular, can learn intricate non-linear relationships hidden within the data along both the time and variate dimensions. The xLSTM-Mixer architecture is furthermore equipped with normalization layers and skip connections to improve training stability and overall effectiveness.

3.1 Key Component 1: Normalization and Initial Linear Forecast

Normalization has become an essential ingredient of modern deep learning architectures (Huang et al., 2023). For time series in particular, reversible instance norm (RevIN) (Kim et al., 2022) is a general recipe for improving forecasting performance, where each time series instance is normalized by its mean and variance and furthermore scaled and offset by learnable scalars $\gamma$ and $\beta$:

$$
\bm{x}_t^{\text{norm}} = \operatorname{RevIN}(\bm{x}_t) = \gamma \left( \frac{\bm{x}_t - \mathbb{E}[\bm{x}]}{\sqrt{\mathrm{Var}[\bm{x}]} + \epsilon} \right) + \beta.
$$

We apply it as part of xLSTM-Mixer, and at the end of the entire pipeline, we invert the RevIN operation to obtain the final prediction. In the case of xLSTM-Mixer, the typical skip connections found in mixer architectures (Tolstikhin et al., 2021; Chen et al., 2023c) are taken up by RevIN, the normalization in the NLinear forecast explained shortly, and the integral skip connections within each sLSTM block.
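As a minimal sketch of the RevIN formula and its inversion, the following code normalizes one variate of one instance and maps values back to the original scale. The function names, the population variance, and the fixed values of `gamma`, `beta`, and `eps` are assumptions for illustration; in the model, $\gamma$ and $\beta$ are learned.

```python
import math


def revin(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one variate of one instance; return values and statistics."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var) + eps
    normed = [gamma * (v - mean) / scale + beta for v in x]
    return normed, (mean, scale)


def revin_inverse(y, stats, gamma=1.0, beta=0.0):
    """Undo the affine map and restore the stored instance statistics."""
    mean, scale = stats
    return [(v - beta) / gamma * scale + mean for v in y]


context = [10.0, 12.0, 14.0, 16.0]
normed, stats = revin(context, gamma=2.0, beta=0.5)
restored = revin_inverse(normed, stats, gamma=2.0, beta=0.5)
```

In the forecasting pipeline, the inverse would be applied to the normalized forecast rather than to the context; the round trip on `context` merely shows that the two functions are inverses of each other.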

It has been shown previously that simple linear models equipped with appropriate normalization schemes are, already by themselves, decent long-term forecasters (Zeng et al., 2023; Li et al., 2023). Our observations confirm this finding. Therefore, we first process each variate separately by an NLinear model by computing:

$$
\bm{x}^{\text{initial}} = \operatorname{NLinear}(\bm{x}^{\text{norm}}) = \operatorname{FC}\left(\bm{x}_{1:T}^{\text{norm}} - x_T^{\text{norm}}\right) + x_T^{\text{norm}},
$$

where $\operatorname{FC}(\cdot)$ denotes a fully-connected linear layer with bias term. Sharing this model across variates limits parameter counts, and the weight-tying serves as a useful regularization. The quality of this initial forecast will be investigated in Sec. 4.1 and 4.2.
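The NLinear computation can be sketched for a toy setting with $T = 3$ and $H = 2$. The function name `nlinear` and the hand-picked weights are assumptions for illustration; in the model, the weights of $\operatorname{FC}$ are learned and shared across all variates.

```python
def nlinear(x_norm, weight, bias):
    """NLinear for one variate: FC(x_{1:T} - x_T) + x_T.

    x_norm has length T; weight is H x T; bias has length H.
    """
    last = x_norm[-1]
    centered = [v - last for v in x_norm]
    return [sum(w * c for w, c in zip(row, centered)) + b + last
            for row, b in zip(weight, bias)]


# T = 3 context steps, H = 2 forecast steps; the same weights serve all variates.
weight = [[0.0, 0.0, 0.0],
          [-0.5, 0.0, 0.0]]
bias = [0.0, 0.0]
variates = [[1.0, 2.0, 3.0],
            [5.0, 5.0, 5.0]]
initial = [nlinear(v, weight, bias) for v in variates]
```

For the rising first variate this yields `[3.0, 4.0]`, while the constant second variate is simply continued as `[5.0, 5.0]`, since subtracting the last value leaves nothing for the linear layer to act on.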

3.2 Key Component 2: sLSTM Refinement

While the NLinear forecast $\bm{x}^{\text{initial}}$ captures the basic patterns between the historic and future time steps, its quality alone is insufficient for today's challenging time series datasets. We, therefore, refine it using powerful sLSTM blocks. As a first step, it is crucial to increase the embedding dimension of the data to provide enough latent dimensions $D$ for the sLSTM cells: $\bm{x}^{\text{up}} = \operatorname{FC}^{\text{up}}\left(\bm{x}^{\text{initial}}\right)$. This pre-up-projection is similar to what is commonly performed in SSMs (Beck et al., 2024). We weight-share $\operatorname{FC}^{\text{up}}$ across variates to perform time-mixing similar to the initial forecast. Note that this step does not maintain the temporal ordering within the embedding token dimensions, as was the case up until this step, and instead embeds it into a higher latent dimension.

The stack of $M$ sLSTM blocks $\mathcal{S}(\cdot)$ transforms $\bm{x}^{\text{up}}$ as defined in Eq. 1 to 8. The recurrent model strides over the data in variate order, i.e., each token represents all time steps from a single variate as in the work of Liu et al. (2023). The sLSTM blocks learn intricate non-linear relationships hidden within the data along both the time and variate dimensions. The mixing of the hidden state is still limited to blocks of consecutive dimensions, aiding efficient learning and inference while allowing for effective cross-variate interaction during the recurrent processing. Striding over variates has the benefit of linear runtime scaling in the number of variates at a constant number of parameters. It, however, comes at the cost of possibly fixing a suboptimal order of variates. While this is empirically not a significant limitation, we leave investigations into how to find a suitable ordering for future work. In addition to a large embedding dimension, we observed that a high number of heads is crucial for effective forecasting.
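The shared up-projection and the variate-wise recurrence can be sketched as follows. This is a toy illustration: `toy_step` is a made-up stand-in for the sLSTM stack, and the helper names, weights, and dimensions are assumptions rather than the authors' implementation.

```python
def up_project(x_initial, weight, bias):
    """Apply the shared FC^up (D x H weight) to every variate's forecast."""
    return [[sum(w * v for w, v in zip(row, forecast)) + b
             for row, b in zip(weight, bias)]
            for forecast in x_initial]


def stride_over_variates(tokens, step, h0):
    """Run a recurrent `step(token, h) -> h` once per variate token."""
    h, outputs = h0, []
    for token in tokens:          # one step per variate: V steps in total
        h = step(token, h)
        outputs.append(h)
    return outputs


def toy_step(token, h):
    # Stand-in for the sLSTM stack: average of token and previous state.
    return [0.5 * t + 0.5 * p for t, p in zip(token, h)]


x_initial = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # V = 3 variates, H = 2
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # D = 3 latent dimensions
x_up = up_project(x_initial, W_up, bias=[0.0, 0.0, 0.0])
hidden = stride_over_variates(x_up, toy_step, h0=[0.0, 0.0, 0.0])
```

The loop runs exactly $V$ times with the same parameters, which mirrors the linear runtime scaling in the number of variates. The output at each token depends on all earlier variates, which also shows why the variate order matters.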

The sLSTM cells' first hidden state $\bm{h}_{t-1}$ must be initialized before each sequence of tokens can be processed. Extending the initial description of these blocks, we propose learning a single initial embedding token $\bm{\eta} \in \mathbb{R}^{D}$ that gets prepended to each encoded time series $\bm{x}^{\text{up}}$. These initial embeddings draw from recent advances in Large Language Models, where learnable "soft prompt" tokens are used to condition models and improve their ability to generate coherent outputs (Lester et al., 2021; Li & Liang, 2021; Chen et al., 2023a; b). Recent research has extended the application of soft prompts to LLM-based time series forecasting (Cao et al., 2023; Sun et al., 2024), emphasizing their adaptability and effectiveness in improving model performance across modalities. These tokens enable greater flexibility and conditioning, allowing the model to adapt its initial memory representation to specific dataset characteristics and to dynamically interact with the time and variate data. Soft prompts can be readily optimized through back-propagation with very little overhead.

3.3 Key Component 3: Multi-View Mixing

To further regularize the training of the sLSTM as with the linear projections, we compute forecasts from the original embedding $\bm{x}^{\text{up}}$ as well as the reversed embedding $\bm{\widehat{x}}^{\text{up}}$, where the order of the latent dimensions including the representation of $\bm{\eta}$ is inverted. Learning forecasts $\bm{y}'$ and $\bm{y}''$ for both views while sharing weights helps learn better representations. Such multi-task learning settings are known to benefit training (Zhang & Yang, 2022). The final forecast is obtained by a linear projection $\operatorname{FC}^{\text{view}}$ of the two concatenated forecasts, again per-variate. Specifically, we compute:

$$
\bm{y}^{\text{norm}} = \operatorname{FC}^{\text{view}}\left(\bm{y}', \bm{y}''\right), \quad \text{where} \quad \bm{y}' = \mathcal{S}(\bm{x}^{\text{up}}) \quad \text{and} \quad \bm{y}'' = \mathcal{S}(\bm{\widehat{x}}^{\text{up}}).
$$

The final forecast is obtained after de-normalizing the reconciled forecasts as $\bm{y} = \operatorname{RevIN}^{-1}(\bm{y}^{\text{norm}})$.
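The multi-view step can be sketched with toy stand-ins. The sketch assumes that the initial token $\bm{\eta}$ is prepended before the shared stack and that its output position is discarded afterwards; `toy_stack` (a running sum) and `toy_fc_view` are made-up placeholders for $\mathcal{S}$ and $\operatorname{FC}^{\text{view}}$, not the authors' implementation.

```python
def multi_view_forecast(x_up, eta, stack, fc_view):
    """Combine forecasts from the original and the reversed embedding.

    x_up: V tokens of dimension D; eta: initial token of dimension D.
    stack: shared sequence model returning one output per input token.
    fc_view: per-variate map from the two concatenated views to a forecast.
    """
    tokens = [eta] + x_up                         # prepend the initial token
    reversed_tokens = [t[::-1] for t in tokens]   # invert the latent dimensions
    y1 = stack(tokens)[1:]                        # assumed: drop output at eta
    y2 = stack(reversed_tokens)[1:]
    return [fc_view(a + b) for a, b in zip(y1, y2)]


def toy_stack(tokens):
    # Stand-in for the shared sLSTM stack: running sum over the tokens.
    h, outputs = [0.0] * len(tokens[0]), []
    for t in tokens:
        h = [a + b for a, b in zip(h, t)]
        outputs.append(h)
    return outputs


def toy_fc_view(concat):
    # Stand-in for FC^view with H = 1: average the first entry of each view.
    half = len(concat) // 2
    return [0.5 * concat[0] + 0.5 * concat[half]]


eta = [1.0, 0.0]
x_up = [[1.0, 2.0], [3.0, 4.0]]                   # V = 2 tokens, D = 2
y_norm = multi_view_forecast(x_up, eta, toy_stack, toy_fc_view)
```

Both views pass through the same `stack`, which corresponds to the weight sharing described above; only the final projection sees the two views side by side. With these toy values, `y_norm` is `[[2.0], [5.5]]`.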

4 Experimental Evaluation

Our intention here is to evaluate the forecasting capabilities of xLSTM-Mixer, aiming to provide comprehensive insights into its performance. To this end, we conducted a series of experiments with the primary focus on long-term forecasting, following the work of Das et al. (2023) and Chen et al. (2023c). An evaluation of xLSTM-Mixer's competitiveness in short-term forecasting on the PEMS dataset is provided in Sec. A.2. Additionally, we perform an extensive model analysis consisting of an ablation study to identify the contributions of individual components of xLSTM-Mixer, followed by an inspection of the initial embedding tokens, a hyperparameter sensitivity analysis, and an investigation into its robustness.

| Dataset | Source | Domain | Horizons | Sampling | #Variates |
|---|---|---|---|---|---|
| Weather | Zhou et al. (2021) | Weather | 96–720 | 10 min | 21 |
| Electricity | Zhou et al. (2021) | Power Usage | 96–720 | 1 hour | 321 |
| Traffic | Wu et al. (2021) | Traffic Load | 96–720 | 1 hour | 862 |
| ETT | Zhou et al. (2021) | Power Production | 96–720 | 15 & 60 min | 7 |

Datasets. We generally follow the established benchmark procedure of Wu et al. (2021) and Zhou et al. (2021) for best backward and future comparability. The datasets we thus used are summarized in Tab. 1.

Training. We follow standard practice in the forecasting literature by evaluating long-term forecasts using the mean squared error (MSE) and the mean absolute error (MAE). Based on our experiments, we used the MAE as the training loss function since it yielded the best results. The datasets were standardized for consistency across features. Further details on hyperparameter selection, metrics, and the implementation can be found in Sec. A.1.
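For reference, the two metrics can be written down in a few lines. The sketch assumes forecasts and targets stored as nested lists of shape $V \times H$ and averages over all entries; the function names are chosen for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error over all variates and forecast steps."""
    errors = [(t - p) ** 2
              for row_t, row_p in zip(y_true, y_pred)
              for t, p in zip(row_t, row_p)]
    return sum(errors) / len(errors)


def mae(y_true, y_pred):
    """Mean absolute error over all variates and forecast steps."""
    errors = [abs(t - p)
              for row_t, row_p in zip(y_true, y_pred)
              for t, p in zip(row_t, row_p)]
    return sum(errors) / len(errors)


y_true = [[1.0, 2.0], [3.0, 4.0]]
y_pred = [[1.0, 4.0], [3.0, 2.0]]
example_mse = mse(y_true, y_pred)
example_mae = mae(y_true, y_pred)
```

Two of the four entries are off by 2, so the MSE is 2.0 while the MAE is 1.0. The squared error weighs large deviations more heavily than the absolute error does.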

Baseline Models. We compare xLSTM-Mixer to the recurrent models xLSTMTime (Alharthi & Mahmood, 2024) and LSTM (Hochreiter & Schmidhuber, 1997); the multilayer perceptron (MLP) based models TimeMixer (Wang et al., 2024a), TSMixer (Chen et al., 2023c), DLinear (Zeng et al., 2023), and TiDE (Das et al., 2023); the Transformers PatchTST (Nie et al., 2023), iTransformer (Liu et al., 2023), FEDFormer (Zhou et al., 2022), and Autoformer (Wu et al., 2021); and the convolutional architectures MICN (Wang et al., 2022) and TimesNet (Wu et al., 2022).

4.1 Long-Term Time Series Forecasting

Each cell reports MSE / MAE.

| Category | Model | Reference | Weather | Electricity | Traffic | ETTh1 | ETTh2 | ETTm1 | ETTm2 |
|---|---|---|---|---|---|---|---|---|---|
| Recurrent | xLSTM-Mixer | (Ours) | 0.219 / 0.250 | 0.153 / 0.245 | 0.392 / 0.253 | 0.397 / 0.420 | 0.340 / 0.382 | 0.339 / 0.366 | 0.248 / 0.307 |
| Recurrent | xLSTMTime | 2024 | 0.222 / 0.255 | 0.157 / 0.250 | 0.391 / 0.261 | 0.408 / 0.428 | 0.346 / 0.386 | 0.347 / 0.372 | 0.254 / 0.310 |
| Recurrent | LSTM | 1997 (a) | 0.444 / 0.454 | 0.559 / 0.549 | 1.011 / 0.541 | 1.198 / 0.821 | 3.095 / 1.352 | 1.142 / 0.782 | 2.395 / 1.177 |
| MLP | TimeMixer | 2024a | 0.222 / 0.262 | 0.156 / 0.246 | 0.387 / 0.262 | 0.411 / 0.423 | 0.316 / 0.384 | 0.348 / 0.375 | 0.256 / 0.315 |
| MLP | TSMixer | 2023c | 0.225 / 0.264 | 0.160 / 0.256 | 0.408 / 0.284 | 0.412 / 0.428 | 0.355 / 0.401 | 0.347 / 0.375 | 0.267 / 0.322 |
| MLP | DLinear | 2023 | 0.246 / 0.300 | 0.166 / 0.264 | 0.434 / 0.295 | 0.423 / 0.437 | 0.431 / 0.447 | 0.357 / 0.379 | 0.267 / 0.332 |
| MLP | TiDE | 2023 | 0.236 / 0.282 | 0.159 / 0.257 | 0.356 / 0.261 | 0.419 / 0.430 | 0.345 / 0.394 | 0.355 / 0.378 | 0.249 / 0.312 |
| Transformer | PatchTST | 2023 | 0.241 / 0.264 | 0.159 / 0.253 | 0.391 / 0.264 | 0.413 / 0.434 | 0.324 / 0.381 | 0.353 / 0.382 | 0.256 / 0.317 |
| Transformer | iTransformer | 2023 | 0.258 / 0.278 | 0.178 / 0.270 | 0.428 / 0.282 | 0.454 / 0.448 | 0.383 / 0.407 | 0.407 / 0.410 | 0.288 / 0.332 |
| Transformer | FEDFormer | 2022 | 0.309 / 0.360 | 0.214 / 0.321 | 0.609 / 0.376 | 0.440 / 0.460 | 0.433 / 0.447 | 0.448 / 0.452 | 0.304 / 0.349 |
| Transformer | Autoformer | 2021 | 0.338 / 0.382 | 0.227 / 0.338 | 0.628 / 0.379 | 0.496 / 0.487 | 0.453 / 0.462 | 0.588 / 0.517 | 0.324 / 0.368 |
| Convolutional | MICN | 2022 | 0.242 / 0.299 | 0.186 / 0.295 | 0.541 / 0.315 | 0.558 / 0.535 | 0.588 / 0.525 | 0.392 / 0.413 | 0.328 / 0.382 |
| Convolutional | TimesNet | 2022 | 0.259 / 0.287 | 0.192 / 0.295 | 0.620 / 0.336 | 0.458 / 0.450 | 0.414 / 0.427 | 0.400 / 0.406 | 0.291 / 0.333 |

Wins: xLSTM-Mixer 18 (MSE) and 23 (MAE). Remaining win counts as listed in the source, without recoverable column assignment: 1, 2, 3, 55, 1, 2, 1.

(a) Taken from Wu et al. (2022).


We present the performance of xLSTM-Mixer compared to prior models in Tab. 2. As shown, xLSTM-Mixer consistently delivers highly accurate forecasts across a wide range of datasets. It achieves the best results in 18 out of 28 cases for MSE and 23 out of 28 cases for MAE, demonstrating its superior performance in long-term forecasting. Its MAE is particularly strong across all datasets. Notably, on Weather, xLSTM-Mixer reduces the MAE by 2% compared to xLSTMTime and by 4.6% compared to TimeMixer. Similarly, on ETTm1, xLSTM-Mixer outperforms TimeMixer by 2.4% in MAE and shows a strong competitive edge over xLSTMTime. Although xLSTM-Mixer performs slightly less well on the Traffic and ETTh2 datasets, where it encounters challenges with handling outliers, it remains highly competitive and outperforms the majority of baseline models. This suggests that, despite these few cases, xLSTM-Mixer can consistently deliver state-of-the-art performance in long-term forecasting. A qualitative inspection of several baseline models, including the initial forecast extracted before the sLSTM refinement, is shown in Fig. 2. In this comparison, the lookback window and forecasting horizon are both fixed at 96.
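The relative MAE reductions quoted above can be recomputed from the MAE entries of Tab. 2. The short sketch below does so; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# MAE values as listed in Tab. 2.
weather_vs_xlstmtime = relative_reduction(0.255, 0.250)  # Weather, xLSTMTime
weather_vs_timemixer = relative_reduction(0.262, 0.250)  # Weather, TimeMixer
ettm1_vs_timemixer = relative_reduction(0.375, 0.366)    # ETTm1, TimeMixer

for name, value in [
    ("Weather vs. xLSTMTime", weather_vs_xlstmtime),
    ("Weather vs. TimeMixer", weather_vs_timemixer),
    ("ETTm1 vs. TimeMixer", ettm1_vs_timemixer),
]:
    print(f"{name}: {value:.1f}%")
```

Rounded to one decimal, the three values are 2.0%, 4.6%, and 2.4%, in line with the figures stated in the text.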

4.2 Model Analysis

Ablation Study.

To assess the contributions of each component in xLSTM-Mixer to its strong overall forecast performance, we conducted an extensive ablation study with the results listed in Tab. 3. Each configuration represents a different combination of the four key components: mixing time with NLinear, using sLSTM blocks, learning an initial embedding token, and multi-view mixing. We evaluated the performance using the MSE and MAE across the prediction lengths $\{96, 192, 336, 720\}$.

Configurations:

| Variant | Mix Time | sLSTM | Init. Token | Mix View |
|---|---|---|---|---|
| #1 (full) | ✓ | Variates | ✓ | ✓ |
| #2 | ✓ | Time | ✓ | ✓ |
| #3 | ✓ | Variates | ✗ | ✓ |
| #4 | ✓ | Variates | ✓ | ✗ |
| #5 | ✓ | Variates | ✗ | ✗ |
| #6 | ✓ | None | ✗ | ✗ |
| #7 | ✗ | Variates | ✓ | ✓ |
| #8 | ✗ | Variates | ✗ | ✓ |
| #9 | ✗ | Variates | ✓ | ✗ |
| #10 | ✗ | Variates | ✗ | ✗ |

Results (each entry lists MSE / MAE):

| Dataset | Horizon | #1 (full) | #2 | #3 | #4 | #5 | #6 | #7 | #8 | #9 | #10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Weather | 96 | 0.143 / 0.184 | 0.148 / 0.194 | 0.145 / 0.186 | 0.144 / 0.185 | 0.144 / 0.186 | 0.173 / 0.223 | 0.149 / 0.193 | 0.151 / 0.195 | 0.149 / 0.192 | 0.152 / 0.195 |
| Weather | 192 | 0.186 / 0.226 | 0.196 / 0.239 | 0.188 / 0.228 | 0.186 / 0.226 | 0.188 / 0.228 | 0.219 / 0.257 | 0.192 / 0.233 | 0.192 / 0.234 | 0.191 / 0.234 | 0.193 / 0.236 |
| Weather | 336 | 0.237 / 0.266 | 0.252 / 0.281 | 0.239 / 0.267 | 0.241 / 0.270 | 0.242 / 0.270 | 0.261 / 0.288 | 0.240 / 0.271 | 0.242 / 0.273 | 0.242 / 0.273 | 0.244 / 0.274 |
| Weather | 720 | 0.310 / 0.324 | 0.315 / 0.328 | 0.310 / 0.324 | 0.309 / 0.323 | 0.309 / 0.323 | 0.320 / 0.334 | 0.320 / 0.329 | 0.319 / 0.329 | 0.322 / 0.330 | 0.319 / 0.328 |
| ETTm1 | 96 | 0.275 / 0.328 | 0.298 / 0.348 | 0.277 / 0.329 | 0.278 / 0.331 | 0.279 / 0.333 | 0.295 / 0.338 | 0.282 / 0.339 | 0.285 / 0.341 | 0.281 / 0.337 | 0.284 / 0.339 |
| ETTm1 | 192 | 0.319 / 0.354 | 0.337 / 0.369 | 0.321 / 0.354 | 0.321 / 0.356 | 0.322 / 0.358 | 0.329 / 0.357 | 0.329 / 0.364 | 0.330 / 0.365 | 0.337 / 0.367 | 0.335 / 0.366 |
| ETTm1 | 336 | 0.353 / 0.374 | 0.368 / 0.388 | 0.354 / 0.375 | 0.355 / 0.377 | 0.357 / 0.379 | 0.359 / 0.376 | 0.367 / 0.385 | 0.367 / 0.385 | 0.366 / 0.384 | 0.366 / 0.385 |
| ETTm1 | 720 | 0.409 / 0.407 | 0.420 / 0.416 | 0.411 / 0.408 | 0.413 / 0.411 | 0.414 / 0.411 | 0.412 / 0.407 | 0.422 / 0.412 | 0.422 / 0.413 | 0.417 / 0.410 | 0.418 / 0.411 |

The full version of xLSTM-Mixer (#1), which integrates all components, achieves the best performance overall. However, we also observe that some configurations of xLSTM-Mixer that exclude specific components remain competitive. For instance, #3, which excludes the initial embedding token, still performs reasonably well. This suggests that while the token contributes positively to the overall performance, the model can sometimes achieve competitive results without it. In general, removing any specific component leads to a performance drop. For example, removing the time mixing (#7) increases the MAE by 3.4% on ETTm1 at length 96 and by 2.8% at length 192, highlighting its critical role in capturing intratemporal dependencies. When we omit everything except for time mixing (#6), the MAE on Weather at length 192 increases by 13.7%. In summary, the ablation study confirms that all components of xLSTM-Mixer contribute to its effectiveness, with the full configuration yielding the best results. Furthermore, we identified the sLSTM blocks and time mixing as critical components for ensuring high accuracy across datasets and prediction lengths.
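As a worked check of the percentages above, the following sketch recomputes them from the MAE entries of Tab. 3 and additionally averages the full model (#1) over the four horizons. The dictionary layout and the helper names are illustrative assumptions; only the numbers are taken from the table.

```python
HORIZONS = [96, 192, 336, 720]

# MAE per horizon (96, 192, 336, 720), copied from Tab. 3.
mae = {
    ("Weather", "#1"): [0.184, 0.226, 0.266, 0.324],
    ("Weather", "#6"): [0.223, 0.257, 0.288, 0.334],
    ("ETTm1", "#1"): [0.328, 0.354, 0.374, 0.407],
    ("ETTm1", "#7"): [0.339, 0.364, 0.385, 0.412],
}

def relative_increase(dataset, variant, horizon):
    """Percentage MAE increase of an ablated variant over the full model #1."""
    i = HORIZONS.index(horizon)
    full = mae[(dataset, "#1")][i]
    ablated = mae[(dataset, variant)][i]
    return 100.0 * (ablated - full) / full

def horizon_average(dataset, variant):
    """Mean MAE over the four prediction lengths."""
    values = mae[(dataset, variant)]
    return sum(values) / len(values)

no_time_mixing_96 = relative_increase("ETTm1", "#7", 96)
no_time_mixing_192 = relative_increase("ETTm1", "#7", 192)
time_mixing_only_192 = relative_increase("Weather", "#6", 192)
weather_full_avg = horizon_average("Weather", "#1")
ettm1_full_avg = horizon_average("ETTm1", "#1")
```

The three increases round to 3.4%, 2.8%, and 13.7%. The horizon averages of #1 come out at about 0.250 (Weather) and 0.366 (ETTm1), which agrees up to rounding with the MAE entries for xLSTM-Mixer in Tab. 2.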

Initial Token Embedding.

We qualitatively inspect decodings of the initial embedding tokens $\bm{\eta}$ on multiple datasets to further understand and interpret the initializations learned by xLSTM-Mixer. The tokens $\bm{\eta}$ are decoded to a forecast $\bm{y}$ by transforming them through the sLSTM stack $\mathcal{S}$ and applying multi-view mixing. The resulting output of $\operatorname{FC}^{\text{view}}$ can then be interpreted as the conditioning forecast used to initialize the sLSTM blocks. Fig. 3 shows the dataset-specific patterns the initial embedding tokens have learned on Weather, ETTm1, and ETTh2 for various prediction horizons. With increasing prediction horizons, we observe longer spans of time, eventually revealing underlying seasonal patterns and respective dataset dynamics.

Sensitivity to xLSTM Hidden Dimension.

In Fig. 4, we visualize the performance of xLSTM-Mixer on the Electricity dataset with increasing sLSTM embedding (hidden) dimension realized by $\operatorname{FC}^{\text{up}}$. The results indicate that larger hidden dimensions consistently enhance the model's performance, particularly for longer forecast horizons. This suggests that a larger embedding dimension enables xLSTM-Mixer to better capture the higher complexity of the time series data over extended horizons, leading to improved forecasting accuracy.

Robustness to Lookback Length.

Fig. 5 illustrates the performance of xLSTM-Mixer across varying lookback lengths and prediction horizons. We observe that xLSTM-Mixer can effectively utilize longer lookback windows than the baselines, especially when compared to transformer-based models. This advantage stems from xLSTM-Mixer's avoidance of self-attention, allowing it to handle extended lookback lengths efficiently. Additionally, xLSTM-Mixer demonstrates stable and consistent performance with low variance. These results confirm that increasing the lookback length improves forecasting accuracy and enhances robustness, particularly for longer prediction horizons.


5 Related Work

Time Series Forecasting.

A long line of machine learning research led from early statistical methods like ARIMA (Box & Jenkins, 1976) to contemporary models based on deep learning, where four architectural families take center stage: those based on recurrence, convolutions, multilayer perceptrons (MLPs), and Transformers. While all of them are used by practitioners today, the research focus is gradually shifting over time. Initially, the naturally sequential recurrent models such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Cho et al., 2014) were used for time series analysis. Their main benefits are the high inference efficiency and arbitrary input and output lengths due to their autoregressive nature. While their effectiveness has historically been constrained by a limited ability to capture long-range dependencies, active research continues to alleviate these limitations (Salinas et al., 2020), including the xLSTM architecture presented in Sec. 2 (Beck et al., 2024; Alharthi & Mahmood, 2024). Similarly efficient to RNNs, yet more restricted in their output length, are the location-invariant CNNs (Li et al., 2022; Lara-Benítez et al., 2021), such as TCN (Lea et al., 2016), TimesNet (Wu et al., 2022), and MICN (Wang et al., 2022). Recently, some MLP-based architectures have also shown good success, including the simplistic DLinear and NLinear models (Zeng et al., 2023), the encoder-decoder architecture of TiDE (Das et al., 2023), the mixing architectures TimeMixer (Wang et al., 2024a) and TSMixer (Chen et al., 2023c), as well as the hierarchical N-BEATS (Oreshkin et al., 2019) and N-HiTS (Challu et al., 2023) models. Finally, many models have been proposed based on Transformers (Vaswani et al., 2017), such as Autoformer (Wu et al., 2021), TFT (Lim et al., 2021), FEDFormer (Zhou et al., 2022), PatchTST (Nie et al., 2023), and iTransformer (Liu et al., 2023).

xLSTM Models for Time Series.

Some initial experiments of applying xLSTMs (Beck et al., 2024) to time series were already performed by Alharthi & Mahmood (2024) with their proposed xLSTMTime model. While it showed promising forecasting performance, these initial soundings did not surpass stronger recent models such as TimeMixer (Wang et al., 2024a) on multivariate benchmarks, and the reported performance is challenging to reproduce. We ensure that our method xLSTM-Mixer is well suited as a foundation for further research by providing extensive model analysis, including an ablation study with ten variants, and ensuring that results are readily reproducible. Our methodology draws from xLSTMTime yet improves on it by several key components. Most importantly, our novel multi-view mixing consistently enhances forecasting performance. Furthermore, we find the trend-seasonality decomposition to be redundant and a simple NLinear normalization scheme (Zeng et al., 2023) to suffice.
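To make the NLinear normalization concrete, here is a minimal sketch for a single variate, following the scheme described by Zeng et al. (2023): subtract the last observed value, apply a linear map over time, and add the value back. The function name, the list-based weight layout, and the toy numbers are assumptions for illustration; how xLSTM-Mixer connects this forecast to the subsequent blocks is not shown here.

```python
def nlinear_forecast(lookback, weights, bias):
    """NLinear-style forecast for one variate.

    lookback: list of T past values
    weights:  H x T matrix as a list of rows (can be shared across variates)
    bias:     list of H offsets
    """
    last = lookback[-1]
    centered = [x - last for x in lookback]      # remove the current level
    linear = [
        sum(w * x for w, x in zip(row, centered)) + b
        for row, b in zip(weights, bias)
    ]
    return [y + last for y in linear]            # restore the level

# Toy weights for T = 3 past steps and H = 2 future steps (hypothetical values).
toy_weights = [[0.5, -0.5, 0.0], [0.0, 0.0, 0.0]]
toy_bias = [0.0, 0.0]

example = nlinear_forecast([1.0, 2.0, 4.0], toy_weights, toy_bias)
shifted = nlinear_forecast([11.0, 12.0, 14.0], toy_weights, toy_bias)
```

Because the level is removed before and restored after the linear map, shifting the whole input by a constant shifts the forecast by the same constant, which makes the linear layer insensitive to level shifts between the training and test data.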

6 Conclusion

In this work, we introduced xLSTM-Mixer, a method that combines a linear forecast with further refinement using xLSTM blocks. Our architecture effectively integrates time, joint, and view mixing to capture complex dependencies. In long-term forecasting, xLSTM-Mixer consistently achieved state-of-the-art performance, outperforming previous methods in 41 out of 56 cases. Furthermore, our detailed model analysis provided valuable insights into the contribution of each component and demonstrated its robustness to varying hyperparameter settings.

While xLSTM-Mixer has shown extraordinary performance in long-term forecasting, it should be noted that due to the transpose of the input, i.e., processing the variates as sequence elements, the number of variates may limit the overall performance. To overcome this, we plan to explore how different variate orderings influence performance and whether incorporating more than two views could lead to further improvements. This study focused on long-term forecasting, yet extending xLSTM-Mixer to tasks such as short-term forecasting, time series classification, or imputation offers promising directions for future research.

Ethics Statement

Our research advances machine learning by enhancing the capabilities of long-term forecasting in time series models, significantly improving both accuracy and efficiency. By developing xLSTM-Mixer, we introduce a robust framework that can be applied across various industries, including finance, healthcare, energy, and logistics. The improved forecasting accuracy enables better decision-making in critical areas, such as optimizing resource allocation, predicting market trends, and managing risk.

However, we also recognize the potential risks associated with the misuse of these advanced models. Time series forecasting models could be leveraged for malicious purposes, especially when applied at scale. For example, in the financial sector, adversarial agents might manipulate forecasts to create market instability. In political or social contexts, these models could be exploited to predict and influence public opinion or destabilize economies. Additionally, the application of these models in sensitive domains like healthcare and security may lead to unintended consequences if not carefully regulated and ethically deployed.

Therefore, it is essential that the use of xLSTM-Mixer, like all machine learning technologies, is guided by responsible practices and ethical considerations. We encourage stakeholders to adopt rigorous evaluation processes to ensure fairness, transparency, and accountability in its deployment, and to remain vigilant to the broader societal implications of time series forecasting technologies.

Reproducibility Statement

All implementation details, including dataset descriptions, metric calculations, and experiment configurations, are provided in Sec. 4 and Sec. A.1. We make sure to exclusively use openly available software and datasets and provide the source code for full reproducibility at https://github.com/mauricekraus/xLSTM-Mixer.

Acknowledgments

This work received funding from the EU project EXPLAIN, under the Federal Ministry of Education and Research (BMBF) (grant 01IS22030D). Furthermore, it was funded by the ACATIS Investment KVG mbH project “Temporal Machine Learning for Long-Term Value Investing” and the BMBF KompAKI project within the “The Future of Value Creation – Research on Production, Services and Work” program (funding number 02L19C150), managed by the Project Management Agency Karlsruhe (PTKA). The author of Eindhoven University of Technology received support from their Department of Mathematics and Computer Science and the Eindhoven Artificial Intelligence Systems Institute. Furthermore, this work benefited from the HMWK project “The Third Wave of Artificial Intelligence - 3AI”.

References

  • Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2019.
  • Alharthi & Mahmood (2024) Musleh Alharthi and Ausif Mahmood. xLSTMTime: Long-Term Time Series Forecasting with xLSTM. MDPI AI, 5(3):1482–1495, 2024.
  • Beck et al. (2024) Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended Long Short-Term Memory. ArXiv:2405.04517, 2024.
  • Box & Jenkins (1976) George E. P. Box and Gwilym M. Jenkins. Time series analysis: forecasting and control. Holden-Day series in time series analysis and digital processing. Holden-Day, San Francisco, rev. ed., 1976. ISBN 0-8162-1104-3.
  • Cao et al. (2023) Defu Cao, Furong Jia, Sercan O. Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu. TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting. In The Twelfth International Conference on Learning Representations, 2023.
  • Challu et al. (2023) Cristian Challu, Kin G. Olivares, Boris Oreshkin, Federico Ramirez, Max Canseco, and Artur Dubrawski. NHITS: Neural Hierarchical Interpolation for Time Series Forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 37:6989–6997, 2023.
  • Chen et al. (2023a) Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. How many demonstrations do you need for in-context learning? Findings of the Association for Computational Linguistics, EMNLP 2023:11149–11159, 2023a.
  • Chen et al. (2023b) Lichang Chen, Heng Huang, and Minhao Cheng. Ptp: Boosting stability and performance of prompt tuning with perturbation-based regularizer. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13512–13525. Association for Computational Linguistics, 2023b.
  • Chen et al. (2023c) Si-An Chen, Chun-Liang Li, Nathanael C Yoder, Sercan Ö Arık, and Tomas Pfister. TSMixer: An All-MLP Architecture for Time Series Forecasting. Transactions on Machine Learning Research, 2023c.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, 2014. Association for Computational Linguistics.
  • Das et al. (2023) Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K. Mathur, Rajat Sen, and Rose Yu. Long-term Forecasting with TiDE: Time-series Dense Encoder. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  • Essien & Giannetti (2020) Aniekan Essien and Cinzia Giannetti. A Deep Learning Model for Smart Manufacturing Using Convolutional LSTM Neural Network Autoencoders. IEEE Transactions on Industrial Informatics, 16(9):6069–6078, 2020.
  • Greff et al. (2017) Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hosseini et al. (2021) Mohammad-Parsa Hosseini, Amin Hosseini, and Kiarash Ahi. A Review on Machine Learning for EEG Signal Processing in Bioengineering. IEEE Reviews in Biomedical Engineering, 14:204–218, 2021.
  • Huang et al. (2023) Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Normalization Techniques in Training DNNs: Methodology, Analysis and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10173–10196, 2023.
  • Kim et al. (2022) Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift. In International Conference on Learning Representations, 2022.
  • Kingma & Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ArXiv:1412.6980, 2017.
  • Lam et al. (2023) Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.
  • Lara-Benítez et al. (2021) Pedro Lara-Benítez, Manuel Carranza-García, and José C. Riquelme. An Experimental Review on Deep Learning Architectures for Time Series Forecasting. International Journal of Neural Systems, 31(03):2130001, 2021.
  • Latif et al. (2023) Siddique Latif, Heriberto Cuayáhuitl, Farrukh Pervez, Fahad Shamshad, Hafiz Shehbaz Ali, and Erik Cambria. A survey on deep reinforcement learning for audio-based applications. Artificial Intelligence Review, 56(3):2193–2240, 2023.
  • Lea et al. (2016) Colin Lea, René Vidal, Austin Reiter, and Gregory D. Hager. Temporal Convolutional Networks: A Unified Approach to Action Segmentation. In Gang Hua and Hervé Jégou (eds.), Computer Vision – ECCV 2016 Workshops, volume 9915, pp. 47–54, Cham, 2016.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059. Association for Computational Linguistics, 2021.
  • Li & Liang (2021) Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, pp. 4582–4597. Association for Computational Linguistics, 2021.
  • Li et al. (2022) Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun Zhou. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12):6999–7019, 2022.
  • Li et al. (2023) Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping. ArXiv:2305.10721, 2023.
  • Lim et al. (2021) Bryan Lim, Sercan Ö. Arık, Nicolas Loeff, and Tomas Pfister. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, 2021.
  • Lin et al. (2012) Wei-Yang Lin, Ya-Han Hu, and Chih-Fong Tsai. Machine Learning in Financial Crisis Prediction: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):421–436, 2012.
  • Lippi et al. (2013) Marco Lippi, Matteo Bertini, and Paolo Frasconi. Short-Term Traffic Flow Forecasting: An Experimental Comparison of Time-Series Analysis and Supervised Learning. IEEE Transactions on Intelligent Transportation Systems, 14(2):871–882, 2013.
  • Liu et al. (2023) Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In The Twelfth International Conference on Learning Representations, 2023.
  • Nie et al. (2023) Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the International Conference on Learning Representations (ICLR), 2023.
  • Oreshkin et al. (2019) Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2019.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Patro & Agneeswaran (2024) Badri Narayana Patro and Vijay Srinivas Agneeswaran. Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges. ArXiv:2404.16112, 2024.
  • Salinas et al. (2020) David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • Seyedan & Mafakheri (2020) Mahya Seyedan and Fereshteh Mafakheri. Predictive big data analytics for supply chain demand forecasting: methods, applications, and research opportunities. Journal of Big Data, 7(1):53, 2020.
  • Sun et al. (2024) Chenxi Sun, Hongyan Li, Yaliang Li, and Shenda Hong. TEST: Text prototype aligned embedding to activate LLM's ability for time series. In The Twelfth International Conference on Learning Representations, 2024.
  • Tolstikhin et al. (2021) Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. MLP-Mixer: An all-MLP Architecture for Vision. In Advances in Neural Information Processing Systems, volume 34, pp. 24261–24272. Curran Associates, Inc., 2021.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2017.
  • Wang et al. (2022) Huiqiang Wang, Jian Peng, Feihu Huang, Jince Wang, Junhui Chen, and Yifei Xiao. MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting. In The Eleventh International Conference on Learning Representations, 2022.
  • Wang et al. (2024a) Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y. Zhang, and Jun Zhou. TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In The Twelfth International Conference on Learning Representations, 2024a.
  • Wang et al. (2024b) Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep Time Series Models: A Comprehensive Survey and Benchmark, July 2024b.
  • Wu et al. (2021) Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Advances in Neural Information Processing Systems, 2021.
  • Wu et al. (2022) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In The Eleventh International Conference on Learning Representations, 2022.
  • Zeng et al. (2023) Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are Transformers Effective for Time Series Forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
  • Zhang & Yang (2022) Yu Zhang and Qiang Yang. A Survey on Multi-Task Learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2022.
  • Zhou et al. (2021) Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):11106–11115, 2021.
  • Zhou et al. (2022) Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning, volume 162, 2022.

Appendix A Appendix

A.1 Implementation Details

Experimental Details

Our codebase is implemented in Python 3.11, leveraging PyTorch 2.4 (Paszke et al., 2019) in combination with Lightning 2.4 (https://lightning.ai/pytorch-lightning) for model training and optimization. We used the custom CUDA implementation of Beck et al. (2024) for sLSTM (https://github.com/NX-AI/xlstm), which relies on NVIDIA Compute Capability 8.0 or higher. Thus, our experiments were conducted on a single NVIDIA A100 80GB GPU. The majority of our baseline implementations, along with data loading and preprocessing steps, are adapted from the Time-Series-Library (https://github.com/thuml/Time-Series-Library) of Wang et al. (2024b). Additionally, for xLSTMTime we used code based on the official repository (https://github.com/muslehal/xLSTMTime) of Alharthi & Mahmood (2024).

Training and Hyperparameters

We optimized xLSTM-Mixer for up to 60 epochs using the Adam optimizer (Kingma & Ba, 2017) with a cosine-annealing scheduler, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and no weight decay. Hyperparameter tuning was conducted using Optuna (Akiba et al., 2019) with the choices provided in Tab. 4. We optimized for the L1 forecast error, also known as the Mean Absolute Error (MAE). To further stabilize the training process, gradient clipping with a maximum norm of $1.0$ was applied. All experiments were run with three different random seeds {2021, 2022, 2023}.
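The learning-rate schedule and the clipping rule can be illustrated in plain Python. This is a sketch under stated assumptions rather than the training code: it assumes a linear warmup, annealing down to zero, and one scheduler step per epoch, none of which is specified above, and the function names `lr_at` and `clip_by_global_norm` are hypothetical.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup followed by cosine annealing to zero (assumed shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so that their joint L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]

# 60 steps, one per epoch (assumption), with 5 warmup steps and a peak of 1e-3.
schedule = [lr_at(s, total_steps=60, base_lr=1e-3, warmup_steps=5) for s in range(60)]

# A gradient of norm 5 is rescaled to a norm of about 1.
clipped = clip_by_global_norm([3.0, 4.0])
```

The schedule rises to the peak rate during warmup and then decays smoothly, while clipping only rescales gradients whose norm exceeds the threshold and leaves smaller ones unchanged.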

| Hyperparameter | Choices |
|---|---|
| Batch size | {16, 32, 64, 128, 256, 512} |
| Initial learning rate | {$1 \cdot 10^{-2}$, $3 \cdot 10^{-3}$, $1 \cdot 10^{-3}$, $5 \cdot 10^{-4}$, $2 \cdot 10^{-4}$, $1 \cdot 10^{-4}$} |
| Scheduler warmup steps | {5, 10, 15} |
| Lookback length $T$ | {96, 256, 512, 768, 1024, 2048} |
| Embedding dimension $D$ | {32, 64, 128, 256, 512, 768, 1024} |
| sLSTM conv. kernel width | {disabled, 2, 4} |
| sLSTM dropout rate | {0.1, 0.25} |
| # sLSTM blocks in $\mathcal{S}$ | {1, 2, 3, 4} |
| # sLSTM heads | {4, 8, 16, 32} |
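To give a sense of the size of this search space, the sketch below encodes the choices of Tab. 4 as a plain dictionary and counts the combinations. Plain uniform random sampling stands in for Optuna here so that the example needs only the standard library; the key names and the encoding of the disabled kernel as `None` are assumptions for illustration.

```python
import math
import random

# Choices as listed in Tab. 4 (a disabled conv. kernel is encoded as None).
search_space = {
    "batch_size": [16, 32, 64, 128, 256, 512],
    "learning_rate": [1e-2, 3e-3, 1e-3, 5e-4, 2e-4, 1e-4],
    "warmup_steps": [5, 10, 15],
    "lookback_length": [96, 256, 512, 768, 1024, 2048],
    "embedding_dim": [32, 64, 128, 256, 512, 768, 1024],
    "conv_kernel_width": [None, 2, 4],
    "dropout": [0.1, 0.25],
    "num_blocks": [1, 2, 3, 4],
    "num_heads": [4, 8, 16, 32],
}

# Number of distinct configurations in the full grid.
num_configurations = math.prod(len(choices) for choices in search_space.values())

def sample_configuration(rng):
    """Draw one configuration uniformly at random (stand-in for an Optuna trial)."""
    return {name: rng.choice(choices) for name, choices in search_space.items()}

config = sample_configuration(random.Random(2021))
print(num_configurations)
```

The full grid contains 435,456 configurations per dataset and horizon, which makes exhaustive search impractical and motivates a sample-efficient tuner.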
Metrics

We follow common practice in the literature (Wu et al., 2021; Wang et al., 2024a) for maximum comparability and, therefore, evaluate long-term forecasting of all models on the mean absolute error (MAE) and mean squared error (MSE), and short-term forecasting on the MAE, root mean squared error (RMSE), and mean absolute percentage error (MAPE). The metrics are averaged over all variates and computed as:

$$
\begin{aligned}
\operatorname{MAE}(\bm{y}, \bm{\hat{y}}) &= \sum_{i=1}^{H} \left| y_i - \hat{y}_i \right| &
\operatorname{MSE}(\bm{y}, \bm{\hat{y}}) &= \sum_{i=1}^{H} \left( y_i - \hat{y}_i \right)^2 \\
\operatorname{RMSE}(\bm{y}, \bm{\hat{y}}) &= \sqrt{\operatorname{MSE}(\bm{y}, \bm{\hat{y}})} &
\operatorname{MAPE}(\bm{y}, \bm{\hat{y}}) &= \frac{100}{H} \sum_{i=1}^{H} \frac{\left| y_i - \hat{y}_i \right|}{\left| y_i \right| + \epsilon},
\end{aligned}
$$

where ๐’š๐’š\bm{y}bold_italic_y are the targets, ๐’š^bold-^๐’š\bm{\hat{y}}overbold_^ start_ARG bold_italic_y end_ARG the predictions, and ฯตitalic-ฯต\epsilonitalic_ฯต a small constant added for numerical stability.
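The four metrics can be sketched in NumPy as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the default `eps=1e-8`, and the `average` flag are assumptions made here. MAE and MSE follow the formulas as written (sums over the horizon $H$); setting `average=True` divides by $H$ instead, which is the more common convention.

```python
import numpy as np


def _as_arrays(y, y_hat):
    """Convert targets and predictions to flat float arrays of equal length."""
    y = np.asarray(y, dtype=float).ravel()
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    if y.shape != y_hat.shape:
        raise ValueError("targets and predictions must have the same size")
    return y, y_hat


def mae(y, y_hat, average=False):
    """Absolute error summed over the horizon (or averaged if requested)."""
    y, y_hat = _as_arrays(y, y_hat)
    err = np.abs(y - y_hat)
    return float(err.mean() if average else err.sum())


def mse(y, y_hat, average=False):
    """Squared error summed over the horizon (or averaged if requested)."""
    y, y_hat = _as_arrays(y, y_hat)
    err = (y - y_hat) ** 2
    return float(err.mean() if average else err.sum())


def rmse(y, y_hat, average=False):
    """Square root of the MSE defined above."""
    return float(np.sqrt(mse(y, y_hat, average=average)))


def mape(y, y_hat, eps=1e-8):
    """Percentage error; eps (assumed default) guards against division by zero."""
    y, y_hat = _as_arrays(y, y_hat)
    return float(100.0 / y.size * np.sum(np.abs(y - y_hat) / (np.abs(y) + eps)))


targets = [1.0, 2.0, 4.0]
predictions = [1.0, 3.0, 2.0]
print(mae(targets, predictions), mse(targets, predictions))
print(rmse(targets, predictions), mape(targets, predictions))
```

For the toy inputs, the absolute errors are 0, 1, and 2, so the summed MAE is 3 and the summed MSE is 5.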

A.2 Outlook: Short-Term Time Series Forecasting

Having shown superior long-term forecasting accuracy in Sec. 4.1, we also provide an initial exploration of the effectiveness of xLSTM-Mixer for short-term forecasting. To this end, we compare it to applicable baselines on PEMS datasets, with input lengths uniformly set to 96 and prediction lengths to 12. The results in Tab. 5 show that the performance of xLSTM-Mixer is competitive with existing methods. We provide the MAE, MAPE, and RMSE, as is common practice.
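To make the short-term setup concrete, the following sketch cuts a multivariate series into pairs of 96 input steps and 12 target steps. It is only an illustration under stated assumptions: the helper `make_windows`, the stride of one step, and the synthetic array are hypothetical and not taken from the paper's data pipeline.

```python
import numpy as np


def make_windows(series, input_len=96, pred_len=12):
    """Slice a (T, V) series into sliding input/target windows.

    Returns inputs of shape (N, input_len, V) and targets of shape
    (N, pred_len, V), using a stride of one time step (an assumption).
    """
    series = np.asarray(series)
    n_windows = series.shape[0] - input_len - pred_len + 1
    if n_windows <= 0:
        raise ValueError("series is too short for the requested window sizes")
    inputs = np.stack([series[i:i + input_len] for i in range(n_windows)])
    targets = np.stack(
        [series[i + input_len:i + input_len + pred_len] for i in range(n_windows)]
    )
    return inputs, targets


# Synthetic stand-in for a dataset with 120 time steps and 3 variates.
demo = np.arange(120 * 3, dtype=float).reshape(120, 3)
x, y = make_windows(demo)
print(x.shape, y.shape)
```

Each target window starts directly after its input window, so the first target step of the first pair is time step 96 of the series.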

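| Dataset | Metric | xLSTM-Mixer | xLSTMTime | LSTM | TimeMixer | DLinear | PatchTST | FEDFormer | Autoformer | MICN | TimesNet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Category | | Recurrent | Recurrent | Recurrent | MLP | MLP | Transformer | Transformer | Transformer | Convolutional | Convolutional |
| Year | | (Ours) | 2024 | 1997a | 2024a | 2023 | 2023 | 2022 | 2021 | 2022 | 2022 |
| PEMS03 | MAE | 15.71 | 16.59 | 18.65 | 14.63 | 19.70 | 18.95 | 19.00 | 18.08 | 15.71 | 16.41 |
| PEMS03 | MAPE | 14.92 | 15.31 | 17.39 | 14.54 | 18.35 | 17.29 | 18.57 | 18.75 | 15.67 | 15.17 |
| PEMS03 | RMSE | 24.82 | 26.47 | 31.73 | 23.28 | 32.35 | 30.15 | 30.05 | 27.82 | 24.55 | 26.72 |
| PEMS08 | MAE | 16.56 | 17.44 | 20.34 | 15.22 | 20.26 | 20.35 | 20.56 | 20.47 | 17.76 | 19.01 |
| PEMS08 | MAPE | 10.24 | 10.58 | 13.05 | 9.67 | 12.09 | 13.15 | 12.41 | 12.27 | 10.76 | 11.83 |
| PEMS08 | RMSE | 26.65 | 28.13 | 31.90 | 24.26 | 32.38 | 31.04 | 32.97 | 31.52 | 27.26 | 30.65 |

a: Configuration following Wu et al. (2021).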

A.3 Full Results for Long-Term Forecasting

Tab. 6 shows the full results for long-term forecasting on each of the four forecast horizons.

| Dataset | Horizon | xLSTM-Mixer | xLSTMTime | LSTM | TimeMixer | TSMixer | DLinear | TiDE | PatchTST | iTransformer | FEDFormer | Autoformer | MICN | TimesNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Category | | Recurrent | Recurrent | Recurrent | MLP | MLP | MLP | MLP | Transformer | Transformer | Transformer | Transformer | Convolutional | Convolutional |
| Year | | (Ours) | 2024 | 1997a | 2024a | 2023c | 2023 | 2023 | 2023 | 2023 | 2022 | 2021 | 2022 | 2022 |
| Metric | | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE | MSE / MAE |
| Weather | 96 | 0.143 / 0.184 | 0.144 / 0.187 | 0.369 / 0.406 | 0.147 / 0.197 | 0.145 / 0.198 | 0.176 / 0.237 | 0.166 / 0.222 | 0.149 / 0.198 | 0.174 / 0.214 | 0.217 / 0.296 | 0.266 / 0.336 | 0.161 / 0.229 | 0.172 / 0.220 |
| Weather | 192 | 0.186 / 0.226 | 0.192 / 0.236 | 0.416 / 0.435 | 0.189 / 0.239 | 0.191 / 0.242 | 0.220 / 0.282 | 0.209 / 0.263 | 0.194 / 0.241 | 0.221 / 0.254 | 0.276 / 0.336 | 0.307 / 0.367 | 0.220 / 0.281 | 0.219 / 0.261 |
| Weather | 336 | 0.236 / 0.266 | 0.237 / 0.272 | 0.455 / 0.454 | 0.241 / 0.280 | 0.242 / 0.280 | 0.265 / 0.319 | 0.254 / 0.301 | 0.306 / 0.282 | 0.278 / 0.296 | 0.339 / 0.380 | 0.359 / 0.395 | 0.278 / 0.331 | 0.280 / 0.306 |
| Weather | 720 | 0.310 / 0.323 | 0.313 / 0.326 | 0.535 / 0.520 | 0.310 / 0.330 | 0.320 / 0.336 | 0.323 / 0.362 | 0.313 / 0.340 | 0.314 / 0.334 | 0.358 / 0.347 | 0.403 / 0.428 | 0.419 / 0.428 | 0.311 / 0.356 | 0.365 / 0.359 |
| Weather | Avg | 0.219 / 0.250 | 0.222 / 0.255 | 0.444 / 0.454 | 0.222 / 0.262 | 0.225 / 0.264 | 0.246 / 0.300 | 0.236 / 0.282 | 0.241 / 0.264 | 0.258 / 0.278 | 0.309 / 0.360 | 0.338 / 0.382 | 0.242 / 0.299 | 0.259 / 0.287 |
| Electricity | 96 | 0.126 / 0.218 | 0.128 / 0.221 | 0.375 / 0.437 | 0.129 / 0.224 | 0.131 / 0.229 | 0.140 / 0.237 | 0.132 / 0.229 | 0.129 / 0.222 | 0.148 / 0.240 | 0.193 / 0.308 | 0.201 / 0.317 | 0.164 / 0.269 | 0.168 / 0.272 |
| Electricity | 192 | 0.144 / 0.235 | 0.150 / 0.243 | 0.442 / 0.473 | 0.140 / 0.220 | 0.151 / 0.246 | 0.153 / 0.249 | 0.147 / 0.243 | 0.147 / 0.240 | 0.162 / 0.253 | 0.201 / 0.315 | 0.222 / 0.334 | 0.177 / 0.285 | 0.184 / 0.289 |
| Electricity | 336 | 0.157 / 0.250 | 0.166 / 0.259 | 0.439 / 0.473 | 0.161 / 0.255 | 0.161 / 0.261 | 0.169 / 0.267 | 0.161 / 0.261 | 0.163 / 0.259 | 0.178 / 0.269 | 0.214 / 0.329 | 0.231 / 0.338 | 0.193 / 0.304 | 0.198 / 0.300 |
| Electricity | 720 | 0.183 / 0.276 | 0.185 / 0.276 | 0.980 / 0.814 | 0.194 / 0.287 | 0.197 / 0.293 | 0.203 / 0.301 | 0.196 / 0.294 | 0.197 / 0.290 | 0.225 / 0.317 | 0.246 / 0.355 | 0.254 / 0.361 | 0.212 / 0.321 | 0.220 / 0.320 |
| Electricity | Avg | 0.153 / 0.245 | 0.157 / 0.250 | 0.559 / 0.549 | 0.156 / 0.246 | 0.160 / 0.256 | 0.166 / 0.264 | 0.159 / 0.257 | 0.159 / 0.253 | 0.178 / 0.270 | 0.214 / 0.321 | 0.227 / 0.338 | 0.186 / 0.295 | 0.192 / 0.295 |
| Traffic | 96 | 0.357 / 0.236 | 0.358 / 0.242 | 0.843 / 0.453 | 0.360 / 0.249 | 0.376 / 0.264 | 0.410 / 0.282 | 0.336 / 0.253 | 0.360 / 0.249 | 0.395 / 0.268 | 0.587 / 0.366 | 0.613 / 0.388 | 0.519 / 0.309 | 0.593 / 0.321 |
| Traffic | 192 | 0.377 / 0.241 | 0.378 / 0.253 | 0.847 / 0.453 | 0.375 / 0.250 | 0.397 / 0.277 | 0.423 / 0.287 | 0.346 / 0.257 | 0.379 / 0.256 | 0.417 / 0.276 | 0.604 / 0.373 | 0.616 / 0.382 | 0.537 / 0.315 | 0.617 / 0.336 |
| Traffic | 336 | 0.394 / 0.250 | 0.392 / 0.261 | 0.853 / 0.455 | 0.385 / 0.270 | 0.413 / 0.290 | 0.436 / 0.296 | 0.355 / 0.260 | 0.392 / 0.264 | 0.433 / 0.283 | 0.621 / 0.383 | 0.622 / 0.337 | 0.534 / 0.313 | 0.629 / 0.336 |
| Traffic | 720 | 0.439 / 0.283 | 0.434 / 0.287 | 1.500 / 0.805 | 0.430 / 0.281 | 0.444 / 0.306 | 0.466 / 0.315 | 0.386 / 0.273 | 0.432 / 0.286 | 0.467 / 0.302 | 0.626 / 0.382 | 0.660 / 0.408 | 0.577 / 0.325 | 0.640 / 0.350 |
| Traffic | Avg | 0.392 / 0.253 | 0.391 / 0.261 | 1.011 / 0.541 | 0.387 / 0.262 | 0.408 / 0.284 | 0.434 / 0.295 | 0.356 / 0.261 | 0.391 / 0.264 | 0.428 / 0.282 | 0.609 / 0.376 | 0.628 / 0.379 | 0.541 / 0.315 | 0.620 / 0.336 |
| ETTh1 | 96 | 0.359 / 0.386 | 0.368 / 0.395 | 1.044 / 0.773 | 0.361 / 0.390 | 0.361 / 0.392 | 0.375 / 0.399 | 0.375 / 0.398 | 0.370 / 0.400 | 0.386 / 0.405 | 0.376 / 0.419 | 0.449 / 0.459 | 0.421 / 0.431 | 0.384 / 0.402 |
| ETTh1 | 192 | 0.402 / 0.417 | 0.401 / 0.416 | 1.217 / 0.832 | 0.409 / 0.414 | 0.404 / 0.418 | 0.405 / 0.416 | 0.412 / 0.422 | 0.413 / 0.429 | 0.441 / 0.436 | 0.420 / 0.448 | 0.500 / 0.482 | 0.474 / 0.487 | 0.436 / 0.429 |
| ETTh1 | 336 | 0.408 / 0.429 | 0.422 / 0.437 | 1.259 / 0.841 | 0.430 / 0.429 | 0.420 / 0.431 | 0.439 / 0.443 | 0.435 / 0.433 | 0.422 / 0.440 | 0.487 / 0.458 | 0.459 / 0.465 | 0.521 / 0.496 | 0.569 / 0.551 | 0.491 / 0.469 |
| ETTh1 | 720 | 0.419 / 0.448 | 0.441 / 0.465 | 1.271 / 0.838 | 0.445 / 0.460 | 0.463 / 0.472 | 0.472 / 0.490 | 0.454 / 0.465 | 0.447 / 0.468 | 0.503 / 0.491 | 0.506 / 0.507 | 0.514 / 0.512 | 0.770 / 0.672 | 0.521 / 0.500 |
| ETTh1 | Avg | 0.397 / 0.420 | 0.408 / 0.428 | 1.198 / 0.821 | 0.411 / 0.423 | 0.412 / 0.428 | 0.423 / 0.437 | 0.419 / 0.430 | 0.413 / 0.434 | 0.454 / 0.448 | 0.440 / 0.460 | 0.496 / 0.487 | 0.558 / 0.535 | 0.458 / 0.450 |
| ETTh2 | 96 | 0.267 / 0.329 | 0.273 / 0.333 | 2.522 / 1.278 | 0.271 / 0.330 | 0.274 / 0.341 | 0.289 / 0.353 | 0.270 / 0.336 | 0.274 / 0.337 | 0.297 / 0.349 | 0.346 / 0.388 | 0.358 / 0.397 | 0.299 / 0.364 | 0.340 / 0.374 |
| ETTh2 | 192 | 0.338 / 0.375 | 0.340 / 0.378 | 3.312 / 1.384 | 0.317 / 0.402 | 0.339 / 0.385 | 0.383 / 0.418 | 0.332 / 0.380 | 0.314 / 0.382 | 0.380 / 0.400 | 0.429 / 0.439 | 0.456 / 0.452 | 0.441 / 0.454 | 0.402 / 0.414 |
| ETTh2 | 336 | 0.367 / 0.401 | 0.373 / 0.403 | 3.291 / 1.388 | 0.332 / 0.396 | 0.361 / 0.406 | 0.448 / 0.465 | 0.360 / 0.407 | 0.329 / 0.384 | 0.428 / 0.432 | 0.496 / 0.487 | 0.482 / 0.486 | 0.654 / 0.567 | 0.452 / 0.452 |
| ETTh2 | 720 | 0.388 / 0.424 | 0.398 / 0.430 | 3.257 / 1.357 | 0.342 / 0.408 | 0.445 / 0.470 | 0.605 / 0.551 | 0.419 / 0.451 | 0.379 / 0.422 | 0.427 / 0.445 | 0.463 / 0.474 | 0.515 / 0.511 | 0.956 / 0.716 | 0.462 / 0.468 |
| ETTh2 | Avg | 0.340 / 0.382 | 0.346 / 0.386 | 3.095 / 1.352 | 0.316 / 0.384 | 0.355 / 0.401 | 0.431 / 0.447 | 0.345 / 0.394 | 0.324 / 0.381 | 0.383 / 0.407 | 0.433 / 0.447 | 0.453 / 0.462 | 0.588 / 0.525 | 0.414 / 0.427 |
| ETTm1 | 96 | 0.275 / 0.328 | 0.286 / 0.335 | 0.863 / 0.664 | 0.291 / 0.340 | 0.285 / 0.339 | 0.299 / 0.343 | 0.306 / 0.349 | 0.293 / 0.346 | 0.334 / 0.368 | 0.379 / 0.419 | 0.505 / 0.475 | 0.316 / 0.362 | 0.338 / 0.375 |
| ETTm1 | 192 | 0.319 / 0.354 | 0.329 / 0.361 | 1.113 / 0.776 | 0.327 / 0.365 | 0.327 / 0.365 | 0.335 / 0.365 | 0.335 / 0.366 | 0.333 / 0.370 | 0.377 / 0.391 | 0.426 / 0.441 | 0.553 / 0.496 | 0.363 / 0.390 | 0.374 / 0.387 |
| ETTm1 | 336 | 0.353 / 0.374 | 0.358 / 0.379 | 1.267 / 0.832 | 0.360 / 0.381 | 0.356 / 0.382 | 0.369 / 0.386 | 0.364 / 0.384 | 0.369 / 0.392 | 0.426 / 0.420 | 0.445 / 0.459 | 0.621 / 0.537 | 0.408 / 0.426 | 0.410 / 0.411 |
| ETTm1 | 720 | 0.409 / 0.407 | 0.416 / 0.411 | 1.324 / 0.858 | 0.415 / 0.417 | 0.419 / 0.414 | 0.425 / 0.421 | 0.413 / 0.413 | 0.416 / 0.420 | 0.491 / 0.459 | 0.543 / 0.490 | 0.671 / 0.561 | 0.481 / 0.476 | 0.478 / 0.450 |
| ETTm1 | Avg | 0.339 / 0.366 | 0.347 / 0.372 | 1.142 / 0.782 | 0.348 / 0.375 | 0.347 / 0.375 | 0.357 / 0.379 | 0.355 / 0.378 | 0.353 / 0.382 | 0.407 / 0.410 | 0.448 / 0.452 | 0.588 / 0.517 | 0.392 / 0.413 | 0.400 / 0.406 |
| ETTm2 | 96 | 0.157 / 0.244 | 0.164 / 0.250 | 2.041 / 1.073 | 0.164 / 0.254 | 0.163 / 0.252 | 0.167 / 0.260 | 0.161 / 0.251 | 0.166 / 0.256 | 0.180 / 0.264 | 0.203 / 0.287 | 0.255 / 0.339 | 0.179 / 0.275 | 0.187 / 0.267 |
| ETTm2 | 192 | 0.213 / 0.285 | 0.218 / 0.288 | 2.249 / 1.112 | 0.223 / 0.295 | 0.216 / 0.290 | 0.224 / 0.303 | 0.215 / 0.289 | 0.223 / 0.296 | 0.250 / 0.309 | 0.269 / 0.328 | 0.281 / 0.340 | 0.307 / 0.376 | 0.249 / 0.309 |
| ETTm2 | 336 | 0.269 / 0.322 | 0.271 / 0.322 | 2.568 / 1.238 | 0.279 / 0.330 | 0.268 / 0.324 | 0.281 / 0.342 | 0.267 / 0.326 | 0.274 / 0.329 | 0.311 / 0.348 | 0.325 / 0.366 | 0.339 / 0.372 | 0.325 / 0.388 | 0.321 / 0.351 |
| ETTm2 | 720 | 0.351 / 0.377 | 0.361 / 0.380 | 2.720 / 1.287 | 0.359 / 0.383 | 0.420 / 0.422 | 0.397 / 0.421 | 0.352 / 0.383 | 0.362 / 0.385 | 0.412 / 0.407 | 0.421 / 0.415 | 0.422 / 0.419 | 0.502 / 0.490 | 0.408 / 0.403 |
| ETTm2 | Avg | 0.248 / 0.307 | 0.254 / 0.310 | 2.395 / 1.177 | 0.256 / 0.315 | 0.267 / 0.322 | 0.267 / 0.332 | 0.249 / 0.312 | 0.256 / 0.317 | 0.288 / 0.332 | 0.304 / 0.349 | 0.324 / 0.368 | 0.328 / 0.382 | 0.291 / 0.333 |

Wins: 18, 23, 1, 2, 3, 55, 1, 2, 1

a: Taken from Wu et al. (2022).
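The Avg rows appear to be the arithmetic mean over the four forecast horizons, rounded to three decimals. The sketch below checks this for the xLSTM-Mixer Weather entries. The helper name `horizon_average` is hypothetical, and the averaging rule is an assumption inferred from the numbers rather than stated in the text.

```python
# xLSTM-Mixer results on Weather as listed in the table: horizon -> (MSE, MAE).
weather_xlstm_mixer = {
    96: (0.143, 0.184),
    192: (0.186, 0.226),
    336: (0.236, 0.266),
    720: (0.310, 0.323),
}


def horizon_average(results):
    """Average MSE and MAE across horizons, rounded to three decimals."""
    n = len(results)
    mean_mse = sum(mse for mse, _ in results.values()) / n
    mean_mae = sum(mae for _, mae in results.values()) / n
    return round(mean_mse, 3), round(mean_mae, 3)


avg_mse, avg_mae = horizon_average(weather_xlstm_mixer)
print(avg_mse, avg_mae)
```

The computed values agree with the reported Weather averages of 0.219 (MSE) and 0.250 (MAE) for xLSTM-Mixer.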

