Pushing the Boundaries of State Space Models
for Image and Video Generation

Yicong Hong†, Long Mai, Yuan Yao, Feng Liu
Adobe Research
†Project Lead, Intern, University of Rochester

A Hydra (State Space Model)-Transformer hybrid model for text-to-image and text-to-video generation.


Text-to-Video Generation (8 seconds, 360p, 16 FPS)

Realistic shot of a fluffy koala bear surfs. It has a grey and white coat and a round nose. The surfboard is yellow. The koala bear is holding onto the surfboard with its paws. The koala bear’s facial expression is focused. The sun is shining.
The text "Hydra" is dust over a desert oasis. Thin and elegant typography.
Starting from a ground-level view of a road leading towards a graffiti covered tunnel, the camera tracks smoothly along the road into a short dark tunnel. As it emerges on the other side, the camera rapidly ascends, revealing the road continuing through a huge field of multicoloured wildflowers surrounded by snow capped mountains.
A slow cinematic push in on an ostrich standing in a 1980s kitchen.
An astronaut running through an alley in Rio de Janeiro.
A small boat floating on a body of water, such as a lake or a sea, during a sunset. The sky is filled with birds, adding a sense of serenity and beauty to the scene. The sun is setting, casting a warm glow over the water and the surrounding area. The presence of the boat and the birds in the sky creates a peaceful and picturesque atmosphere, making it an ideal setting for relaxation and enjoying the natural beauty of the environment.
Drone view of waves crashing against the rugged cliffs along Big Sur's garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea.
A long haired, muscular, handsome, tattooed tribal warrior standing at the edge of the cliff, gazing at the fantastical landscape below. The rich colors and intricate details capture the thrilling moment of his leap of faith. Anime illustration style. Cinematic still.
Origami dancers in white paper, 3D render, ultra-detailed, on white background, studio shot, dancing modern dance.
This realistic shot shows a surrealistic scene of an astronaut floating in a calm, reflective swimming pool of water. Half of his body is above the water, and the astronaut is dressed in a thick white spacesuit, his face covered by a large, reflective helmet that conceals the identity of the person inside. In the background, on either side of the pool, are a row of tall, slender palm trees, their fronds reaching upward, creating a tropical atmosphere. The trees are even and symmetrical, their trunks curved slightly outward. Beyond the trees, the horizon is blurred, high key, hyper-detailed, masterpiece, award-winning.
Panda bear wearing gold-plated shoes strutting with a sassy demeanor through a haute couture runway.
The robot in a cyberpunk city.
Entering a Martian cave to reveal an alien colony hidden within. Cinematic FPV.
Photorealistic shot of a young man in his 20s sitting on a piece of cloud in the sky, reading a book.
An astronaut riding a horse on a beautiful grassland.
A slow-motion shot of a snowboarder carving through fresh powder on a mountain peak, with the sun setting behind jagged peaks. The camera follows the boarder as they perform a series of graceful actions, with the snow sparkling like diamonds in the fading light.
A squirrel sitting on a tree and nibbling on an acorn.
A low angle hyper realistic shot of a thick purple goo flowing quickly down a white marble staircase. Cinematic, highly detailed, film grade.
A woman DJ spins records on a rooftop in LA. She is wearing a pink jacket and giant headphones. There is a cheetah next to the woman. The background is a cityscape.
A red-faced monkey with white fur is bathing in a natural hot spring. The monkey is playing in the water with a miniature sail ship in front of it, made of wood with a white sail and a small rudder. The hot spring is surrounded by lush greenery, with rocks and trees.
A stop motion animation of a flower growing out of the windowsill of a house.
A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.
Tour of an art gallery with many beautiful works of art in different styles.
A robot is dancing in Times Square.

Abstract

While Transformers have become the dominant architecture for visual generation, linear attention models, such as state-space models (SSMs), are increasingly recognized for their efficiency in processing long visual sequences. However, the efficiency of these models stems from compressing context into a limited recurrent state and enforcing causality among tokens, which makes them prone to inconsistent modeling of N-dimensional visual data and raises questions about their capacity to generate long non-causal sequences. In this paper, we explore the boundary of SSMs on image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters), based on the sub-quadratic bidirectional Hydra and self-attention, and generate images up to 2K resolution and 360p, 8-second (16 FPS) videos. Our experiments demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics, suggesting the great potential of using SSMs for visual generation tasks.


[Figure: h2_arch_model_figure]

This figure illustrates our diffusion Hydra-Transformer Hybrid (HTH) model for image and video generation. The architecture consists of N stacked blocks, each comprising a cross-attention layer, a token mixer, and a feed-forward network. (a) The token mixer can be implemented as either the Hydra state space model or self-attention. (b) For image data, we use horizontal and vertical bidirectional raster scans on tokens, and for video data, an additional bidirectional temporal scan is applied.
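The block layout described above can be sketched in a few lines. This is an illustrative skeleton only: the paper does not state how many self-attention blocks are interleaved with Hydra blocks, so the 1-in-4 placement and all names (`BlockSpec`, `build_hth_stack`, `attn_every`) are assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class BlockSpec:
    """One HTH block: cross-attention -> token mixer -> feed-forward.
    Only the token-mixer choice varies between blocks."""
    mixer: str  # "hydra" (bidirectional SSM) or "self_attention"

def build_hth_stack(n_blocks: int, attn_every: int = 4) -> list[BlockSpec]:
    """Sketch of an N-block HTH stack: mostly Hydra token mixers with
    periodic self-attention blocks. The 1-in-`attn_every` ratio is an
    illustrative assumption, not the paper's actual configuration."""
    return [
        BlockSpec(mixer="self_attention" if (i + 1) % attn_every == 0 else "hydra")
        for i in range(n_blocks)
    ]

stack = build_hth_stack(8)
print([b.mixer for b in stack])
```

Here the stack alternates three Hydra blocks for every self-attention block; in practice the ratio trades off the SSM's sub-quadratic cost against attention's global token mixing.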


[Figure: hybrid_formula_model_figure]

To adapt an image pre-trained HTH model to video data, our empirical results indicate that neither keeping the spatial-major scanning (as in stage 1) nor adding new temporal-major scanning Hydra blocks is effective. Through extensive experiments, we found a simple but surprisingly effective method: directly revising the scanning pattern of certain spatial-major scanning blocks to temporal-major scanning.
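The difference between the two scan orders can be made concrete on a toy token grid. A minimal sketch, assuming a (T, H, W) latent video flattened frame-by-frame in memory; the variable names are illustrative:

```python
T, H, W = 2, 2, 3  # toy latent video: 2 frames of 2x3 tokens

# Token id at position (t, h, w), laid out frame-by-frame in memory.
def tid(t, h, w):
    return t * H * W + h * W + w

# Spatial-major scan: raster through each frame, then move to the next frame.
spatial_major = [tid(t, h, w) for t in range(T) for h in range(H) for w in range(W)]

# Temporal-major scan: at each spatial position, walk through all frames first,
# so the recurrent state links the same location across time.
temporal_major = [tid(t, h, w) for h in range(H) for w in range(W) for t in range(T)]

# Hydra-style bidirectional processing pairs each scan with its reverse.
spatial_major_rev = spatial_major[::-1]

print(spatial_major)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(temporal_major)  # [0, 6, 1, 7, 2, 8, 3, 9, 4, 10, 5, 11]
```

Converting a block from spatial-major to temporal-major scanning is thus just a permutation of the token sequence it consumes, which is why the revision can reuse the image-pretrained weights directly.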


The strong locality of state space models presents significant challenges for modeling long-range dependencies. However, we found that this same locality enables zero-shot high-resolution image generation.


[Figure: zero_shot_model_figure]

In this work, we trained HTH on a mixed set of 1K-resolution images with extreme height-to-width ratios of 512:2048 and 2048:512, and found that the model can generate images approximately 3.5x larger in area in a zero-shot manner, producing outputs at resolutions up to 1920x1920, 1440x2560, and 2560x1440. As shown in the figure above, Transformer-based models with absolute positional embeddings (APE) struggle to generalize to higher resolutions, leading to severe checkerboard artifacts. While models using relative positional encoding, such as RoPE, achieve better results, they still exhibit noticeable inconsistencies and noise. In contrast, our SSM-major HTH model using APE generalizes to these higher resolutions without such artifacts.
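The "approximately 3.5x larger" figure follows directly from the pixel areas involved; a quick check, assuming the training resolutions above (512x2048 and 2048x512 both have the same area as a 1024x1024 image):

```python
# Training resolutions (512x2048 and 2048x512) share one pixel area.
train_area = 512 * 2048          # == 1024 ** 2 == 1_048_576 pixels

# Zero-shot generation resolutions reported above.
zero_shot = [(1920, 1920), (1440, 2560), (2560, 1440)]

for h, w in zero_shot:
    scale = h * w / train_area
    print(f"{h}x{w}: {scale:.2f}x training area")  # each is ~3.52x
```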


Acknowledgement

We express our deepest gratitude to our great colleagues at Adobe Research: Hao Tan, Kai Zhang, Jianming Zhang, Aseem Agarwala, Feng Liu, Long Mai, Zhifei Zhang, Zhan Xu, Aniruddha Mahapatra, Difan Liu, and Yang Zhou, for their invaluable advice and support in training infrastructure, data processing, architecture design, VAEs, inference and evaluation, computational resources, and project management.

BibTeX


@misc{hong2025hth,
  title={Pushing the Boundaries of State Space Models for Image and Video Generation},
  author={Yicong Hong and Long Mai and Yuan Yao and Feng Liu},
  year={2025},
  eprint={2502.00972},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.00972},
}