While Transformers have become the dominant architecture for visual generation, linear-attention models such as state-space models (SSMs) are increasingly recognized for their efficiency in processing long visual sequences. However, their efficiency stems from maintaining a limited recurrent state and enforcing causality among tokens, which can lead to inconsistent modeling of N-dimensional visual data and leaves open questions about their capacity to generate long non-causal sequences. In this paper, we explore the boundaries of SSMs for image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters), based on the sub-quadratic bidirectional Hydra and self-attention, and generate images up to 2K resolution and 8-second 360p videos (16 FPS). Our experiments demonstrate that the model produces results faithful to complex text prompts and temporally consistent videos with high dynamics, suggesting the great potential of SSMs for visual generation tasks.
This figure illustrates our diffusion Hydra-Transformer Hybrid (HTH) model for image and video generation. The architecture consists of N stacked blocks, each comprising a cross-attention layer, a token mixer, and a feed-forward network. (a) The token mixer can be implemented as either the Hydra state space model or self-attention. (b) For image data, we use horizontal and vertical bidirectional raster scans on tokens, and for video data, an additional bidirectional temporal scan is applied.
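To make the block structure concrete, below is a minimal PyTorch-style sketch of one such block, assuming pre-norm residual connections and a (batch, length, dim) token layout. `SelfAttentionMixer` stands in for the token mixer; the bidirectional Hydra layer could be dropped in behind the same interface. All module and parameter names here are illustrative, not the paper's implementation.

```python
# Minimal sketch of one HTH block: cross-attention -> token mixer -> FFN.
# Assumptions: pre-norm residuals, (B, L, D) tensors, hidden size `dim`.
import torch
import torch.nn as nn


class SelfAttentionMixer(nn.Module):
    """Token mixer variant (a): plain bidirectional self-attention.
    A bidirectional Hydra SSM layer would expose the same (B, L, D) interface."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        out, _ = self.attn(x, x, x)
        return out


class HTHBlock(nn.Module):
    """One of the N stacked blocks."""
    def __init__(self, dim: int, mixer: nn.Module, num_heads: int = 8):
        super().__init__()
        self.norm_cross = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mix = nn.LayerNorm(dim)
        self.mixer = mixer  # Hydra SSM or self-attention
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Cross-attention conditions visual tokens on text tokens.
        h = self.norm_cross(x)
        x = x + self.cross_attn(h, text, text)[0]
        # Token mixing over the flattened (scanned) visual sequence.
        x = x + self.mixer(self.norm_mix(x))
        # Position-wise feed-forward network.
        x = x + self.ffn(self.norm_ffn(x))
        return x


# Example usage with illustrative shapes:
# block = HTHBlock(dim=1024, mixer=SelfAttentionMixer(1024))
# x = torch.randn(2, 256, 1024)    # flattened image/video tokens
# text = torch.randn(2, 77, 1024)  # text-encoder tokens
# y = block(x, text)               # (2, 256, 1024)
```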
When adapting an image-pretrained HTH model to video data, our empirical results indicate that neither keeping the spatial-major scanning (as in stage 1) nor adding new temporal-major Hydra blocks is effective. Through extensive experiments, we found a simple but surprisingly effective method: directly converting the scanning pattern of certain spatial-major blocks to temporal-major scanning, as sketched below.
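As a simplified illustration of what this change amounts to, the sketch below contrasts a spatial-major flattening with a temporal-major one for video tokens stored as a (B, T, H, W, D) tensor. The function names and tensor layout are assumptions for illustration, not the paper's code.

```python
# Sketch of spatial-major vs. temporal-major scan orders for video tokens.
# Assumption: tokens are kept as (B, T, H, W, D) and flattened to (B, L, D)
# before entering a Hydra block.
import torch


def spatial_major(x: torch.Tensor) -> torch.Tensor:
    """Flatten frame by frame: spatial positions vary fastest (stage-1 order)."""
    B, T, H, W, D = x.shape
    return x.reshape(B, T * H * W, D)


def temporal_major(x: torch.Tensor) -> torch.Tensor:
    """Flatten so that, at each spatial location, time varies fastest."""
    B, T, H, W, D = x.shape
    return x.permute(0, 2, 3, 1, 4).reshape(B, H * W * T, D)


def temporal_major_inverse(x_flat: torch.Tensor, T: int, H: int, W: int) -> torch.Tensor:
    """Undo the temporal-major flattening, restoring (B, T, H, W, D)."""
    B, _, D = x_flat.shape
    return x_flat.reshape(B, H, W, T, D).permute(0, 3, 1, 2, 4)
```

Under this view, converting a block from spatial-major to temporal-major scanning only changes how the token sequence is ordered before and after the block; the block's pretrained weights are reused unchanged.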
The strong locality of state-space models poses significant challenges for modeling long-range dependencies. However, we found that this same locality also grants the model a degree of zero-shot high-resolution image generation capability.
In this work, we trained HTH on a mixed set of 1K-resolution images with extreme height-to-width ratios of 512:2048 and 2048:512, and found that the model can generate images approximately 3.5x larger in a zero-shot manner, producing outputs at resolutions up to 1920x1920, 1440x2560, and 2560x1440. As shown in the figure above, Transformer-based models with absolute positional embeddings (APE) struggle to generalize to higher resolutions, producing severe checkerboard artifacts. Models using relative positional encodings such as RoPE achieve better results but still exhibit noticeable inconsistencies and noise. In contrast, our SSM-major HTH model generalizes well even with APE.
We express our deepest gratitude to our great colleagues at Adobe Research (Hao Tan, Kai Zhang, Jianming Zhang, Aseem Agarwala, Feng Liu, Long Mai, Zhifei Zhang, Zhan Xu, Aniruddha Mahapatra, Difan Liu, Yang Zhou) for their invaluable advice and support in training infrastructure, data processing, architecture design, VAEs, inference and evaluation, computational resources, and project management.
@misc{hong2025hth,
      title={Pushing the Boundaries of State Space Models for Image and Video Generation},
      author={Yicong Hong and Long Mai and Yuan Yao and Feng Liu},
      year={2025},
      eprint={2502.00972},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.00972},
}