🔥   Feed-forward 3D human reconstruction and streaming with superior novel-view image quality and strong generalization ability.
🔥   Efficient free-time 3D Gaussian interpolation with minimal temporal redundancy.
🔥   Accurate human motion and metric scale prediction.
Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel-view and novel-time synthesis. Our model simplifies the 4D reconstruction and interpolation problem into a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. To overcome the lack of ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss ensures motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets.
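The abstract formulates dense motion prediction as dense point matching trained with a self-supervised retargeting loss. Below is a minimal, dependency-free sketch of that idea: each source point is softly matched to target points via a softmax over negative distances, and a cycle-consistency ("retarget forward, then back") error supervises the matcher without ground-truth motion. The function names, the temperature `tau`, and the cycle formulation are illustrative assumptions, not the paper's exact losses.

```python
import math

def soft_match(src, tgt, tau=0.1):
    # Soft nearest-neighbour matching: for each source point, a softmax
    # over negative squared distances to every target point.
    weights = []
    for p in src:
        logits = [-sum((a - b) ** 2 for a, b in zip(p, q)) / tau for q in tgt]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        weights.append([e / s for e in exps])
    return weights

def warp(src, tgt, tau=0.1):
    # Move each source point to the weighted average of target points,
    # i.e. the dense motion implied by the soft matching.
    W = soft_match(src, tgt, tau)
    return [
        tuple(sum(w * q[d] for w, q in zip(row, tgt)) for d in range(3))
        for row in W
    ]

def retargeting_loss(src, tgt, tau=0.1):
    # Self-supervised cycle consistency: warp src into tgt space and back;
    # a good matcher returns each point near where it started, so no
    # ground-truth correspondences are needed (hypothetical formulation).
    fwd = warp(src, tgt, tau)
    back = warp(fwd, src, tau)
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(src, back)
    ) / len(src)
```

With identical source and target point sets the cycle error collapses toward zero, which is the sanity check one would run before training on real motion.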
Using Forge4D, you can reconstruct high-quality 3D Gaussian assets in metric scale from uncalibrated sparse-view images in one feed-forward inference.
Forge4D can handle streaming videos as input and generate temporally aligned 3D Gaussians for each incoming frame as output.
Video under reconstruction...
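The streaming design described above keeps a fixed-size set of learnable state tokens that exchange information with each new frame, so memory does not grow with video length. The sketch below illustrates one plausible reading of that mechanism: state tokens attend to the current frame's features and are updated residually. The attention form and residual update are assumptions for illustration, not the paper's exact architecture.

```python
import math

def attend(queries, keys, values, scale):
    # Single-head dot-product attention: each query reads a softmax-weighted
    # mixture of the value vectors.
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        w = [e / s for e in exps]
        out.append([sum(wi * v[d] for wi, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out

def stream_update(state_tokens, frame_feats):
    # One streaming step: state tokens gather information from the current
    # frame, then carry it to the next timestamp. Memory cost is fixed by
    # the number of state tokens, not by the video length.
    scale = 1.0 / math.sqrt(len(state_tokens[0]))
    pooled = attend(state_tokens, frame_feats, frame_feats, scale)
    # Residual update preserves earlier history (a plausible choice,
    # not necessarily the paper's exact update rule).
    return [[s + p for s, p in zip(tok, upd)]
            for tok, upd in zip(state_tokens, pooled)]
```

Calling `stream_update` once per frame keeps the token count constant while the tokens accumulate cross-frame context.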
Forge4D can interpolate 3D Gaussians from key frames to any intermediate timestamp, utilizing an accurate human motion prediction procedure and an occlusion-aware Gaussian fusion process. The motion prediction is modeled as a 3D Gaussian matching process and trained in a self-supervised manner.
Video under reconstruction...
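Given matched Gaussians at two key frames, synthesizing a middle timestamp amounts to blending each Gaussian's attributes along its predicted motion, with occlusion-aware weights favoring the key frame where the Gaussian is actually visible. The sketch below shows this for the Gaussian centers only; the visibility-reweighted blend is an illustrative assumption, not the paper's exact fusion rule.

```python
def interpolate_gaussians(mu0, mu1, t, vis0=None, vis1=None):
    # Interpolate matched 3D Gaussian centers between two key frames.
    # mu0/mu1: per-Gaussian centers at t=0 and t=1; t in [0, 1].
    # vis0/vis1: optional per-Gaussian visibility weights (hypothetical)
    # used to favour the key frame where the Gaussian is seen.
    n = len(mu0)
    vis0 = vis0 or [1.0] * n
    vis1 = vis1 or [1.0] * n
    out = []
    for p, q, a, b in zip(mu0, mu1, vis0, vis1):
        # Time-dependent blend weight, reweighted by visibility; with
        # equal visibility this reduces to plain linear interpolation.
        w1 = t * b / ((1 - t) * a + t * b)
        out.append(tuple((1 - w1) * pi + w1 * qi for pi, qi in zip(p, q)))
    return out
```

At `t = 0.5` with equal visibility this places each center at the midpoint of its matched pair, which is the expected degenerate case.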
@misc{hu2025forge4dfeedforward4dhuman,
  title={Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos},
  author={Yingdong Hu and Yisheng He and Jinnan Chen and Weihao Yuan and Kejie Qiu and Zehong Lin and Siyu Zhu and Zilong Dong and Jun Zhang},
  year={2025},
  eprint={2509.24209},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.24209},
}