Samples of images generated from a reference identity by our Diff-ID. Our method can generate new human images conditioned on various poses while maintaining the appearance of the original identity; these images are used to augment the training data for the person re-identification (Re-ID) task.

Overview


Observing the highly biased viewpoint and human pose distributions in the training dataset, we augment the dataset by manipulating SMPL body shapes and feeding the rendered shapes into a generative model to fill in sparsely distributed poses and viewpoints. With this augmented dataset, we can train a Re-ID model that is robust to viewpoint and human pose biases.
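
As a rough illustration of how underrepresented viewpoints could be targeted, the sketch below fits a histogram over per-image camera yaw angles and samples new rendering viewpoints preferentially from low-density bins. The yaw-angle summary, bin count, and inverse-density weighting are our own assumptions for illustration; the actual Diff-ID sampling strategy follows the distributions described in the paper.

```python
import numpy as np

def sample_sparse_viewpoints(yaw_angles, n_samples=500, n_bins=36, rng=None):
    """Sample new camera yaw angles, favoring bins that are rare in the dataset.

    yaw_angles: per-image camera yaw (radians), e.g. estimated from SMPL fits.
    Returns an array of yaw angles to render augmentations for.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts, edges = np.histogram(yaw_angles, bins=n_bins, range=(-np.pi, np.pi))

    # Inverse-density weights: rare viewpoints get sampled more often.
    weights = 1.0 / (counts + 1.0)
    weights /= weights.sum()

    bins = rng.choice(n_bins, size=n_samples, p=weights)
    # Jitter uniformly inside each chosen bin.
    low, high = edges[bins], edges[bins + 1]
    return rng.uniform(low, high)

# Example: a dataset heavily biased toward frontal views.
observed = np.random.default_rng(0).normal(0.0, 0.3, size=10_000)
new_yaws = sample_sparse_viewpoints(observed)
```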

Abstract

Person re-identification (Re-ID) often faces challenges due to variations in human pose and camera viewpoint, which significantly affect the appearance of individuals across images. Existing datasets often lack diversity and scale in human pose and camera viewpoint, hindering the generalization of Re-ID models to new camera networks. Previous methods have attempted to address these issues with data augmentation, but they rely on poses already present in the dataset and thus fail to effectively reduce its pose bias. In this paper, we propose Diff-ID, a novel approach that augments training data with sparse and limited poses that are underrepresented in the original distribution. By leveraging the knowledge of pre-trained large-scale generative models such as Stable Diffusion, we successfully generate realistic images with diverse human poses and camera viewpoints. Specifically, our objective is to create a training dataset that enables existing Re-ID models to learn features that are debiased with respect to pose variations. Qualitative results demonstrate the effectiveness of our method in addressing pose bias and enhancing the generalizability of Re-ID models compared to other approaches. The performance gains achieved by training Re-ID models on our offline augmented dataset highlight the potential of our proposed framework to improve the scalability and generalizability of person Re-ID models.



Main Architecture


Given the viewpoint and pose distributions, we first render a body shape sampled from the distribution using SMPL, generating the corresponding skeleton, depth map, and normal maps. These conditions, along with a reference image for identity preservation, are then fed into Diff-ID. Diff-ID consists of two branches: the reference U-Net processes the identity information from the reference image, while the denoising U-Net, given the input conditions, generates a person with the same identity by iterating through the denoising process.
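
The following is a minimal, self-contained sketch of the two-branch design described above, with stub networks standing in for the actual reference U-Net and denoising U-Net. Module names, channel sizes, and the simplified 50-step denoising loop are illustrative assumptions, not the released Diff-ID implementation.

```python
import torch
import torch.nn as nn

class RefEncoder(nn.Module):
    """Stand-in for the reference U-Net: extracts identity features from the reference image."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, ref_img):
        return self.net(ref_img)

class DenoisingNet(nn.Module):
    """Stand-in for the denoising U-Net: predicts noise given the noisy image,
    the rendered SMPL conditions (skeleton + depth + normals), and identity features."""
    def __init__(self, cond_ch=7, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3 + cond_ch + ch, ch, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, x_t, cond, id_feat):
        return self.net(torch.cat([x_t, cond, id_feat], dim=1))

@torch.no_grad()
def generate(ref_img, skeleton, depth, normals, steps=50):
    ref_net, denoiser = RefEncoder(), DenoisingNet()
    id_feat = ref_net(ref_img)                       # identity branch runs once
    cond = torch.cat([skeleton, depth, normals], 1)  # pose/viewpoint conditions
    x = torch.randn_like(ref_img)                    # start from pure noise
    for t in reversed(range(steps)):                 # iterative denoising
        eps = denoiser(x, cond, id_feat)
        x = x - eps / steps                          # crude update; real samplers differ
    return x

# Shapes only; real inputs come from SMPL rendering of the sampled pose and viewpoint.
out = generate(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64),
               torch.randn(1, 1, 64, 64), torch.randn(1, 3, 64, 64))
```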


The Effect of Viewpoint and Human Pose Augmentation



Visualization of camera viewpoint and human pose distributions for the Market-1501 and DukeMTMC-reID datasets. The left figures (i) display the camera viewpoint distribution derived from SMPL, while the right figures (ii) illustrate t-SNE visualizations of the human pose distributions. These visualizations demonstrate that our pose augmentation successfully diversifies both viewpoint and human pose distributions.
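
For reference, pose distributions like those in figure (ii) can be visualized with a t-SNE projection of flattened pose parameters. The sketch below uses scikit-learn on random placeholder vectors; the actual figures are computed from SMPL pose estimates of the Market-1501 and DukeMTMC-reID images.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder: each row stands in for a flattened SMPL body-pose vector
# (23 joints x 3 axis-angle parameters = 69 dims).
rng = np.random.default_rng(0)
original_poses = rng.normal(size=(1000, 69))
augmented_poses = rng.normal(scale=1.5, size=(1000, 69))

poses = np.vstack([original_poses, augmented_poses])
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(poses)

plt.scatter(emb[:1000, 0], emb[:1000, 1], s=3, label="original")
plt.scatter(emb[1000:, 0], emb[1000:, 1], s=3, label="augmented")
plt.legend()
plt.title("t-SNE of human pose distribution")
plt.show()
```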


Qualitative Results



Given a reference image, we sample five images generated by our method. These outputs demonstrate the model’s capability to produce diverse and realistic variations. Ref. denotes the reference image.

GAN-based Models


Qualitative comparison with the generated outputs of FD-GAN and XingGAN. Our method demonstrates significantly better fidelity while faithfully capturing the identity of the person in the reference image and accurately following the target pose.


Quantitative comparison on standard Re-ID benchmarks. Note that the Re-ID Experts in the first row group are not directly comparable, as our primary focus is on dataset generation. For augmentation-based methods, we train the same Re-ID model on the datasets generated by each method to ensure a fair comparison.
*: The authors did not provide a pre-trained model.


Citation

@misc{kim2024posediversified,
    title={Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification},
    author={Inès Hyeonsu Kim and JoungBin Lee and Soowon Son and Woojeong Jin and Kyusun Cho and Junyoung Seo and Min-Seop Kwak and Seokju Cho and JeongYeol Baek and Byeongwon Lee and Seungryong Kim},
    year={2024},
    eprint={2406.16042},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgements

The website template was borrowed from Michaël Gharbi.