Forge4D

Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse Videos


1 HKUST     2 Tongyi Lab, Alibaba Group     3 NUS     4 FDU

* Equal Contribution     † Corresponding Author

🔥   Feed-forward 3D human reconstruction and streaming with superior novel-view image quality and strong generalization ability.
🔥   Efficient free-time 3D Gaussian interpolation with minimal temporal redundancy.
🔥   Accurate human motion and metric scale prediction.



Abstract

Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel-view and novel-time synthesis. Our model simplifies the 4D reconstruction and interpolation problem into a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens that enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. To overcome the lack of ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to keep the predicted motion consistent with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets.
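The dense-matching formulation above can be illustrated with a minimal sketch: each Gaussian in one frame attends over the features of the next frame, and its matched 3D position is a softmax-weighted average of the other frame's points. This is a hypothetical toy version of the idea, not the paper's actual architecture (`soft_match` and its arguments are our own illustrative names).

```python
import numpy as np

def soft_match(feat_a, feat_b, pts_b, tau=0.1):
    """Toy dense point matching via soft attention.

    feat_a: (Na, D) per-Gaussian features in frame A.
    feat_b: (Nb, D) per-Gaussian features in frame B.
    pts_b:  (Nb, 3) 3D positions in frame B.
    Returns (Na, 3) matched positions: a softmax-weighted sum of frame-B points.
    """
    sim = feat_a @ feat_b.T / tau                      # (Na, Nb) similarity logits
    w = np.exp(sim - sim.max(axis=1, keepdims=True))   # numerically stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ pts_b                                   # soft correspondences
```

With a self-supervised retargeting objective, such matched positions can be supervised by warping one frame's Gaussians to the other timestamp and comparing renderings, without any motion ground truth.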



Pipeline


The overall pipeline of Forge4D. It is trained in three stages: (1) a static feed-forward 3D Gaussian reconstruction stage; (2) a streaming stage that enforces temporal alignment via state tokens; and (3) a feed-forward 4D reconstruction stage that predicts dense motion for each 3D Gaussian and interpolates free-time 3D Gaussians using an occlusion-aware fusion process.

Feed-forward streaming 4D content creation

3D Reconstruction

Using Forge4D, you can reconstruct high-quality 3D Gaussian assets at metric scale from uncalibrated sparse-view images in a single feed-forward pass.

Streaming

Forge4D can take streaming videos as input and generate temporally aligned 3D Gaussians for each incoming frame.
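The streaming behavior described above boils down to carrying shared state tokens across timestamps: each new frame updates the state, and the updated state conditions that frame's reconstruction. A minimal sketch (the `update_state` and `reconstruct` callables stand in for the model's attention-based state update and Gaussian head, which are not specified here):

```python
import numpy as np

def stream_reconstruct(frames, update_state, reconstruct, state_dim=8):
    """Toy streaming loop with shared state tokens across timestamps.

    frames:       iterable of per-timestamp inputs.
    update_state: callable (state, frame) -> new state; stands in for the
                  model's interactive state-token update.
    reconstruct:  callable (frame, state) -> per-frame 3D Gaussians.
    """
    state = np.zeros(state_dim)  # state tokens shared across time
    gaussians = []
    for frame in frames:
        state = update_state(state, frame)       # fold in new information
        gaussians.append(reconstruct(frame, state))
    return gaussians
```

Because only the fixed-size state is carried forward, memory cost stays constant with respect to video length, which matches the "memory-friendly" claim in the abstract.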


Free-time interpolation and Free-viewpoint rendering

Forge4D can interpolate 3D Gaussians from keyframes to any intermediate timestamp, utilizing an accurate human motion prediction procedure and an occlusion-aware Gaussian fusion process. Motion prediction is modeled as a 3D Gaussian matching process and trained in a self-supervised manner.
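A minimal sketch of the free-time interpolation idea: warp matched Gaussian centers from both keyframes to an intermediate time under a linear motion model, then fuse the two hypotheses with weights that favor the closer-in-time and more visible keyframe. The function name, the linear motion model, and the visibility-weighted fusion rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_free_time(mu0, mu1, t, vis0, vis1):
    """Toy occlusion-aware interpolation of matched Gaussian centers.

    mu0, mu1:   (N, 3) matched Gaussian centers at the two keyframes.
    t:          interpolation fraction in [0, 1].
    vis0, vis1: (N,) hypothetical per-Gaussian visibility scores (1 = visible).
    """
    motion = mu1 - mu0
    warped0 = mu0 + t * motion            # keyframe 0 warped forward to time t
    warped1 = mu1 - (1.0 - t) * motion    # keyframe 1 warped backward to time t
    # fusion weight: trust the keyframe that is closer in time and less occluded
    w0 = (1.0 - t) * vis0
    w1 = t * vis1
    w = (w0 / np.clip(w0 + w1, 1e-8, None))[:, None]
    return w * warped0 + (1.0 - w) * warped1
```

Under a perfectly linear motion the two warps coincide; the occlusion-aware weights matter when the forward and backward hypotheses disagree, e.g. for Gaussians occluded in one of the keyframes.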


More Results

BibTeX


@misc{hu2025forge4dfeedforward4dhuman,
      title={Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos}, 
      author={Yingdong Hu and Yisheng He and Jinnan Chen and Weihao Yuan and Kejie Qiu and Zehong Lin and Siyu Zhu and Zilong Dong and Jun Zhang},
      year={2025},
      eprint={2509.24209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24209}, 
}