Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Video Generation

Demo

Side-by-side comparison of Self-Forcing baseline vs. Sparse Forcing (Ours).

Playback Speed:

"In super slow motion, a friendly panda bear sits at a cozy café table in Paris. The panda is wearing a small, stylish beret and is seated comfortably in a chair. It holds a steaming cup of coffee delicately with both paws, sipping from a straw inserted into the cup. The panda's black eyes are focused on the cup with a curious yet relaxed expression. The café background showcases elegant Parisian decor, including vintage posters and soft lighting, with other patrons subtly visible in the periphery. The scene captures the panda's gentle movements and the delicate steam rising from the coffee in a close-up shot."

Self-Forcing (Baseline)

Sparse Forcing (Ours)

"A joyful, playful Corgi running and frolicking in a vibrant park during sunset. The Corgi has a cheerful expression with its tail wagging excitedly as it jumps over small obstacles and chases after a ball. The dog has short legs, a sturdy build, and a fluffy coat. The background showcases a beautiful orange and pink sky with tall grass swaying gently in the breeze. The scene transitions from a wide shot of the park to a close-up of the Corgi, emphasizing its lively actions and the warm, serene atmosphere."

Self-Forcing (Baseline)

Sparse Forcing (Ours)

"A person is cycling through a scenic park trail. The rider is wearing a helmet, casual clothes, and sunglasses, pedaling steadily. They are mid-action, leaning slightly forward, with one hand on the handlebars and the other hanging loosely. The environment around them includes lush green trees, blooming flowers, and a winding dirt path. The sun is shining brightly, casting dappled shadows through the leaves. The scene captures a close-up of the rider from a side angle, focusing on their determined expression and the motion of the bicycle wheels."

Self-Forcing (Baseline)

Sparse Forcing (Ours)

"A close-up of a person styling their hair with a handheld hair dryer. The person, with a focused expression, holds the hair dryer in one hand and uses a brush in the other to smooth their hair. They are standing in front of a bathroom mirror, which reflects their determined face and the steam from the hair dryer. The background includes a typical bathroom setup with a towel rack and a sink. The person is mid-action, with natural motion captured in a medium shot that emphasizes the interaction between the person and the hair dryer."

Self-Forcing (Baseline)

Sparse Forcing (Ours)

Persistent Block Sparse Attention Kernel

The customized, trainable CUDA kernel implements efficient local block-sparse attention with persistent blocks that maintain globally important context, while supporting composite sparsity patterns and flexible KV-cache management.

High Performance

Optimized CUDA implementation with Tensor Core utilization for maximum throughput.

Memory Efficient

Reduced memory footprint through sparse attention patterns and efficient KV caching.

Flexible Sparsity

Support for various sparsity patterns: local, sink, and learned top-k selection.

Backward Compatible

Drop-in replacement for standard attention with full gradient support for training.

Performance Benchmarks

Python API

from sparse_forcing import SparseForcingAttention

attn = SparseForcingAttention(
    total_cache_size=18720,
    local_cache_size=9360,
    persistent_cache_size=9360,
    kv_cache_mode="sparse_forcing",
    sparse_kv_block_size=[1, 8, 8],
    max_seq_len=32760,
    use_local_block_sparse_attn=True,
    local_block_sparse_topk_ratio=0.125
)

output = attn(query, key_persistent, value_persistent, key_local, value_local)

Evaluation

5-Second Video Generation

Comparison with relevant baselines. We compare Sparse Forcing with representative open-source video generation models of comparable scale and resolution. Best results are in bold and second-best results are underlined. ◆: with pretraining; ◇: without pretraining; ♠: [3,4,4] block size; ♣: [1,8,8] block size.

Model	#Params	Resolution	Throughput (FPS) ↑	Latency (s) ↓	Total	Quality	Semantic
Diffusion models
LTX-Video (HaCohen et al., 2024)	1.9B	768×512	8.98	13.5	80.00	82.30	70.79
Wan2.1 (Wan et al., 2025)	1.3B	832×480	0.78	103	84.26	85.30	80.09
Chunk-wise autoregressive models
SkyReels-V2 (Chen et al., 2025a)	1.3B	960×540	0.49	112	82.67	84.70	74.53
MAGI-1 (Teng et al., 2025)	4.5B	832×480	0.19	282	79.18	82.04	67.74
CausVid (Yin et al., 2025)	1.3B	896×512	17.0	0.69	82.69	83.73	78.49
Self Forcing (Huang et al., 2025)	1.3B	896×512	17.0	0.69	83.88	84.60	81.01
Sparse Forcing^◇♠	1.3B	896×512	19.9	0.59	83.99	84.65	81.36
Sparse Forcing^◇♣	1.3B	896×512	18.8	0.63	83.91	84.58	81.24
Sparse Forcing^◆♠	1.3B	896×512	19.9	0.59	84.14	84.84	81.39

Long-Horizon Video Generation

Comparison with baselines on long-horizon generation. ◆: with pretraining; ♠: [3,4,4] block size; ♣: [1,8,8] block size.

Model	FPS ↑	Latency/s ↓	VBench ↑ (T/Q/S)
20-second length video
Self Forcing	14.4	0.83	82.09 / 82.48 / 80.51
Sparse Forcing^◆♠	18.3	0.65	82.68 / 83.13 / 80.87
Sparse Forcing^◆♣	17.9	0.67	82.31 / 82.64 / 81.01
1-minute length video
Self Forcing	13.9	0.87	78.93 / 79.48 / 76.70
Sparse Forcing^◆♠	18.0	0.66	81.96 / 82.25 / 80.82
Sparse Forcing^◆♣	17.6	0.67	81.67 / 82.17 / 79.67

Checkpoints

Coming Soon

Pre-trained model checkpoints will be released.

Code

Coming Soon

Source code and training scripts will be released.

Citation

BibTeX

@article{xu2026sparseforcing,
  title={Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Video Generation},
  author={Xu, Boxun and Du, Yuming and Liu, Zichang and Yang, Siyu and Jiang, Ziyang and Yan, Siqi and Saha, Rajasi and Pumarola, Albert and Wang, Wenchen and Li, Peng},
  journal={pending},
  year={2026}
}

Acknowledgements

This work builds upon the excellent open-source implementations of CausVid, Wan2.1, Self-Forcing, and FastVideo. We thank the authors for making their code publicly available.