PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation
NeurIPS 2023


PrimDiffusion performs the diffusion and denoising process on a set of volumetric primitives that compactly represent 3D humans. This generative formulation enables explicit control over pose, view, and shape, and can model off-body topology with well-defined depth. Moreover, our method generalizes to novel poses without post-processing and supports downstream human-centric tasks such as 3D texture transfer.


We represent a 3D human as K volumetric primitives learned from multi-view images. Each primitive Vk has independent kinematic parameters {Tk, Rk, sk} (translation, rotation, and per-axis scale, respectively) and radiance parameters {ck, σk} (color and density). At each time step t, we diffuse the clean primitives V0 with noise ϵ sampled according to a fixed noise schedule. The resulting noisy primitives Vt are fed to the denoiser gΦ(·), which learns to predict the denoised volumetric primitives.
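The forward diffusion step described above can be sketched as a standard DDPM-style corruption applied to flattened per-primitive parameters. This is a minimal illustration, not the paper's implementation: the parameter layout (13 values per primitive), the linear beta schedule, and the function name are all assumptions.

```python
import numpy as np

def diffuse_primitives(V0, t, alpha_bar, rng=None):
    """One forward diffusion step: Vt = sqrt(ab_t) * V0 + sqrt(1 - ab_t) * eps.

    V0:        (K, D) array of per-primitive parameters (translation, rotation,
               scale, color, density), flattened per primitive (hypothetical layout).
    t:         integer timestep indexing the fixed noise schedule.
    alpha_bar: (T,) cumulative products of (1 - beta_t).
    Returns the noised primitives Vt and the sampled noise eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(V0.shape)
    Vt = np.sqrt(alpha_bar[t]) * V0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return Vt, eps

# Toy example: K=1024 primitives, a linear schedule over T=1000 steps (assumed values)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = np.cumprod(1.0 - betas)
V0 = np.random.default_rng(0).standard_normal((1024, 13))
Vt, eps = diffuse_primitives(V0, t=500, alpha_bar=alpha_bar)
```

A denoiser gΦ would then be trained to recover V0 (or ϵ) from Vt and t.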


To avoid per-subject optimization, we propose an encoder-only network that learns primitives from multi-view images across identities. The encoder consists of a motion branch and an appearance branch, which are fused by the proposed cross-modal attention layer to produce the kinematic and radiance parameters of the primitives.
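The fusion step can be sketched as scaled dot-product cross-attention between the two branches. This is a hedged illustration only: which branch supplies the queries, the feature shapes, and the function name are assumptions, not details from the paper.

```python
import numpy as np

def cross_modal_attention(motion_feat, app_feat):
    """Minimal scaled dot-product cross-attention.

    motion_feat: (K, D) motion-branch features, used as queries (assumed role).
    app_feat:    (M, D) appearance-branch features, used as keys/values (assumed role).
    Returns (K, D) fused features: each motion token attends over appearance tokens.
    """
    d_k = motion_feat.shape[-1]
    scores = motion_feat @ app_feat.T / np.sqrt(d_k)      # (K, M) attention logits
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ app_feat                             # (K, D) fused output

# Toy example with random features (K=8 query tokens, M=32 appearance tokens, D=16)
rng = np.random.default_rng(0)
motion = rng.standard_normal((8, 16))
appearance = rng.standard_normal((32, 16))
fused = cross_modal_attention(motion, appearance)
```

In a full model the output would be projected and decoded into the per-primitive kinematic and radiance parameters.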


Visualization of the Denoising Process

We visualize the denoising process of primitives and corresponding 360-degree novel views.




PrimDiffusion is implemented on top of the DVA and Latent-Diffusion codebases. The training data are rendered via the XRFeitoria toolchain.
The website template is borrowed from Mip-NeRF.