🔥   Feed-forward 3D human reconstruction and streaming with superior novel-view image quality and strong generalization ability.
🔥   Efficient free-time 3D Gaussian interpolation with minimal temporal redundancy.
🔥   Accurate human motion and metric scale prediction.
Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by slow reconstruction speed or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel-view and novel-time synthesis. Our model simplifies the 4D reconstruction and interpolation problem into a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. To overcome the lack of ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss ensures motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets.
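The dense point matching formulation above can be illustrated with a minimal sketch: each Gaussian at one timestamp is softly matched to Gaussians at the next timestamp via feature similarity, and its predicted motion is the similarity-weighted average of the matched positions. The function name, shapes, and temperature value here are illustrative assumptions, not the released implementation.

```python
import numpy as np

def soft_match_motion(feat_t0, feat_t1, pos_t1, temperature=0.07):
    """Predict dense motion by soft feature matching (illustrative sketch).

    feat_t0: (N, C) per-Gaussian features at frame t0
    feat_t1: (M, C) per-Gaussian features at frame t1
    pos_t1:  (M, 3) Gaussian centers at frame t1
    Returns (N, 3): predicted positions of the t0 Gaussians at frame t1.
    """
    # Cosine similarity between every pair of Gaussians across the two frames.
    a = feat_t0 / np.linalg.norm(feat_t0, axis=1, keepdims=True)
    b = feat_t1 / np.linalg.norm(feat_t1, axis=1, keepdims=True)
    sim = a @ b.T                                   # (N, M)
    # Soft correspondence weights via a temperature-scaled softmax.
    w = np.exp(sim / temperature)
    w /= w.sum(axis=1, keepdims=True)
    # Each t0 Gaussian moves to the weighted average of its matched centers.
    return w @ pos_t1
```

With sharply discriminative features the softmax approaches a hard one-to-one assignment, so the prediction degenerates to classic nearest-feature matching; the soft version keeps the operation differentiable for self-supervised training.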
Using Forge4D, you can reconstruct high-quality, metric-scale 3D Gaussian assets from uncalibrated sparse-view images in a single feed-forward pass.
Forge4D can take streaming videos as input and generate temporally aligned 3D Gaussians for each incoming frame as output.
Forge4D can interpolate 3D Gaussians between key frames to any intermediate timestamp, utilizing an accurate human motion prediction procedure and an occlusion-aware Gaussian fusion process. The motion prediction is modeled as a 3D Gaussian matching process and trained in a self-supervised manner.
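Once dense motion between two key frames is predicted, novel-time Gaussians can be obtained by blending each Gaussian's center along its motion path. The sketch below shows the simplest such scheme, linear interpolation of matched centers; the function name and linear blend are assumptions for illustration, not necessarily the exact interpolation used by Forge4D.

```python
import numpy as np

def interpolate_gaussians(pos_t0, pos_t1_matched, t):
    """Interpolate Gaussian centers between two key frames (illustrative).

    pos_t0:         (N, 3) centers at the earlier key frame
    pos_t1_matched: (N, 3) the same Gaussians' predicted centers at the
                    later key frame (output of the motion prediction step)
    t:              scalar in [0, 1], normalized intermediate timestamp
    Returns (N, 3): interpolated centers at timestamp t.
    """
    return (1.0 - t) * pos_t0 + t * pos_t1_matched
```

Per-Gaussian attributes such as opacity and color could be blended the same way, with occlusion-aware weights deciding which key frame dominates where one view cannot observe the surface.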

@misc{hu2025forge4dfeedforward4dhuman,
      title={Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos},
      author={Yingdong Hu and Yisheng He and Jinnan Chen and Weihao Yuan and Kejie Qiu and Zehong Lin and Siyu Zhu and Zilong Dong and Jun Zhang},
      year={2025},
      eprint={2509.24209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24209},
}