[Paper Review] MegaPortraits: One-shot Megapixel Neural Head Avatars



Justin4AI 2024. 7. 11. 14:18

✨MegaPortraits: 3D representation-aided cross-driving reenactment synthesis, endorsed by Microsoft✨

Drobyshev, Nikita, et al. "Megaportraits: One-shot megapixel neural head avatars." Proceedings of the 30th ACM International Conference on Multimedia. 2022. [paper]

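🚨 VASA, which recently set the high-water mark in talking head generation, builds on this model, so I consider it a must-read. Another reenactment model, face-vid2vid, is where this model's core idea originates; it is also used in VASA and in E4S, a face swapping model, so it is worth reading alongside. I wish I had written up my notes right after reading all of them.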

Abstract & Introduction

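One-shot talking head generation relies on models pretrained on very large datasets in order to exploit generic knowledge about human appearance. The output resolution is therefore capped by the resolution of the training dataset, and collecting another dataset at a higher resolution is not easy.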

MegaPortraits overcomes this limitation with the following three contributions:

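1. It proposes a new way to represent the avatar's appearance as a latent 3D volume and combine it with latent motion representations. This includes a novel contrastive loss that ensures strong disentanglement between the latent motion and appearance representations.
2. It pushes the challenging task of cross-driving synthesis up to megapixel resolution, and proposes a training scheme that combines medium- and high-resolution image data so that the model stays high-quality and generalizes to novel views and motion.
3. With practical use in mind, it also introduces a distillation method that runs in real time for the identities of a few predefined source images.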

Related Work

Work on 4D head avatars treats modeling an avatar's appearance and motion as non-rigid reconstruction of a training video. These methods are costly and person-dependent, so they represent unseen motions poorly.

As Hallo also points out, as long as disentanglement is guaranteed, parameterizing motion in a latent space is more expressive than using explicit motion representations. MegaPortraits therefore proposes a new way to disentangle motion from appearance.

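Even a model trained at 512 x 512 can reach megapixel output by applying Single Image Super-Resolution to each frame. However, the output of one-shot talking head generation varies a lot with the imposed motion, so the SR results are also poor, and since all we know about the subject is a single photo, the SR model cannot be taught anything about new motion either. MegaPortraits solves this with a new scheme that combines supervised and unsupervised training.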

Method

Two random frames are sampled from the dataset: a source frame $x_s$ whose identity is kept, and a driver frame $x_d$ that provides the motion. MegaPortraits generates an image $\hat{x}_{s→d}$ in which the driver frame's motion is imposed on the source frame's appearance.

When the source and driver share the same identity, identity preservation drops out of the problem, and the model only has to learn to transfer the driver frame's motion onto the source frame.

Base model

While reading, keep track of how each module's output features connect to the others. Pay particular attention to how the 3D representation is used.

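This is the structure of the main network where reenactment happens. It looks very complicated, but it is not hard once you sort out the roles of the modules outlined in purple and the output features outlined in blue.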

🚨 What is reenactment? Preserving the identity of $S$ while injecting only the motion of $D$ can be thought of as removing just the motion information from $S$ and filling it in with that of $D$. Let's follow this flow!

First, the modules:

$E_{app}$ - Appearance encoder: extracts the local volumetric features $v_s$ (a 4D tensor) and the global descriptor $e_s$.

$E_{mtn}$ - Motion encoder: extracts the head rotations $R_{s/d}$, the translations $t_{s/d}$, and the latent expression descriptors $z_{s/d}$.

$W_{s→}$: takes the source tuple [ $R_s$, $t_s$, $z_s$, $e_s$ ] and produces a 3D warping field $w_{s→}$ that removes the motion information from the volumetric features $v_s$.

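$W_{→d}$: takes the driver tuple [ $R_d$, $t_d$, $z_d$, $e_s$ ] and produces the warping field $w_{→d}$ that imposes the driver's motion. Note that the appearance descriptor is still the source's $e_s$; only the pose and expression come from the driver.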

So, we take the 4D volumetric representation of $x_s$ (it is called 4D because the channel dimension is counted; thinking of it as 3D in the usual sense is easier), apply the $w_{s→}$ warp that removes its motion data, and then apply the $w_{→d}$ warp that imposes the driver's motion:

$$v_{s→d} = w_{→d} ◦ G_{3D}(w_{s→} ◦ v_s)$$

Here ◦ is the 3D warping operation and $G_{3D}$ is a 3D convolutional network. Removing the motion data means turning the head to a frontal viewpoint and removing the facial expression motion encoded by $z_s$. The result is processed by the 3D CNN, and $w_{→d}$ then imposes the driver's head rotation and motion.

The idea of combining volumetric feature encoding with an explicit head pose comes from face-vid2vid, but expressions are represented by the latent descriptors $z_{s/d}$ instead of keypoints. To keep the results from collapsing when scaling up to megapixel resolution, a cycle consistency loss is used.

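The driven volumetric features $v_{s→d}$ are then projected into the camera frame by $\mathcal{P}$, in the same way as in face-vid2vid, and passed through a 2D CNN, written $G_{2D}$ here, to obtain the final $\hat{x}_{s→d}$:

$$\hat{x}_{s→d} = G_{2D}(\mathcal{P}(v_{s→d}))$$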

Writing this whole process as $G_{base}$ gives

$$\hat{x}_{s→d} = G_{base}(x_s, x_d)$$

which is why it is called the base model.
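To keep the order of operations straight, here is a minimal sketch of the data flow only. The dictionary keys, the `warp` and `project` helpers, and the string-producing stub modules are all hypothetical names I made up for illustration; real modules would be neural networks operating on tensors. I also assume that the driver-side warping generator receives the source descriptor $e_s$.

```python
def g_base(x_s, x_d, m):
    """Data flow of the base model. `m` maps module names to callables
    (hypothetical stand-ins, not the paper's code)."""
    v_s, e_s = m["E_app"](x_s)            # volumetric features, global descriptor
    R_s, t_s, z_s = m["E_mtn"](x_s)       # source pose and expression
    R_d, t_d, z_d = m["E_mtn"](x_d)       # driver pose and expression
    w_s = m["W_s"](R_s, t_s, z_s, e_s)    # field that removes the source motion
    # Assumption: appearance descriptor stays e_s on the driver side.
    w_d = m["W_d"](R_d, t_d, z_d, e_s)    # field that imposes the driver motion
    canonical = m["G_3D"](m["warp"](w_s, v_s))
    v_sd = m["warp"](w_d, canonical)
    return m["G_2D"](m["project"](v_sd))  # project to the camera frame, then 2D CNN


# Stub modules that build a readable trace string instead of tensors.
trace_modules = {
    "E_app": lambda x: ("v_" + x, "e_" + x),
    "E_mtn": lambda x: ("R_" + x, "t_" + x, "z_" + x),
    "W_s": lambda R, t, z, e: "w[" + ",".join((R, t, z, e)) + "]",
    "W_d": lambda R, t, z, e: "w[" + ",".join((R, t, z, e)) + "]",
    "warp": lambda w, v: w + " o " + v,
    "G_3D": lambda v: "G3D(" + v + ")",
    "project": lambda v: "P(" + v + ")",
    "G_2D": lambda f: "G2D(" + f + ")",
}

print(g_base("s", "d", trace_modules))
```

Printing the trace makes it easy to see that the source warp is applied first, then the 3D CNN, then the driver warp, and only then the projection and the 2D CNN.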

Loss

Perceptual loss

The $\mathcal{L}_{per}$ loss is a sum of three terms, which check the match in general content, facial appearance, and eye gaze, respectively.

Adversarial loss

On top of the usual adversarial loss, the feature-matching loss that Pix2pixHD introduced to stabilize GAN training is also used.
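As a reminder of what feature matching does, here is a minimal pure-Python sketch. It assumes the discriminator's intermediate features have already been extracted and flattened into one list of floats per layer; the function name and input layout are my own choices, and a real implementation would work on tensors and usually over several discriminator scales.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Sum over layers of the mean L1 distance between the discriminator
    features of the real image and of the generated image.

    real_feats, fake_feats: one flat list of floats per discriminator layer
    (hypothetical layout chosen for this sketch).
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("both inputs need the same number of layers")
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        if len(real) != len(fake):
            raise ValueError("feature sizes differ within a layer")
        # Normalize by the number of elements in this layer.
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
    return total


real = [[1.0, 2.0], [0.0, 0.0, 3.0]]
fake = [[1.0, 4.0], [0.0, 0.0, 0.0]]
print(feature_matching_loss(real, fake))
```

The generator is pushed to reproduce the discriminator's internal statistics of the real image, which gives it a smoother training signal than the adversarial term alone.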

Cycle consistency loss

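Only the motion information should be extracted from the driver image. But because motion is not extracted explicitly and is instead learned in a latent space, appearance information can easily leak into it. This loss is there to prevent that.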

A pair $x_{s^*}$ and $x_{d^*}$ is sampled from a new video, i.e. from a different source than $x_s$ and $x_d$. We compute $\hat{x}_{s^*→d}$, and the motion descriptor $z_{d^*} = E_{mtn}(x_{d^*})$. $z_{s^*→d}$ and $z_{s→d}$ have already been computed in the respective forward passes of the base model.

Using the contrastive loss from CosFace, these descriptors are grouped into positive and negative pairs, which has the effect of training the motion encoder separately. The final loss is defined over these pairs.
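Here is a rough sketch of how such a loss could look, assuming a CosFace-style large-margin cosine form. The pairing below (descriptors of images driven by $x_d$ are positives with $z_d$ and negatives with $z_{d^*}$) is my reading of the paper, and the function names and the values of the scale `s` and margin `m` are illustrative assumptions.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def motion_contrastive_loss(z_sd, z_s2d, z_d, z_d2, s=5.0, m=0.2):
    """Large-margin cosine contrastive loss over motion descriptors.

    z_sd  : descriptor of the image generated from (x_s,  x_d)
    z_s2d : descriptor of the image generated from (x_s*, x_d)
    z_d   : descriptor of the driver x_d   (positive for both anchors)
    z_d2  : descriptor of the driver x_d*  (negative for both anchors)
    s, m  : scale and margin; the defaults are illustrative, not tuned.
    """
    loss = 0.0
    for anchor in (z_sd, z_s2d):
        pos = math.exp(s * (cos_sim(anchor, z_d) - m))   # margin on the positive
        neg = math.exp(s * cos_sim(anchor, z_d2))        # no margin on the negative
        loss -= math.log(pos / (pos + neg))
    return loss


z_d, z_d2 = [1.0, 0.0], [0.0, 1.0]
aligned = motion_contrastive_loss([1.0, 0.0], [1.0, 0.0], z_d, z_d2)
leaked = motion_contrastive_loss([0.0, 1.0], [0.0, 1.0], z_d, z_d2)
print(aligned < leaked)
```

The loss is small when the generated images carry the motion of $x_d$ regardless of which source they came from, and large when their descriptors drift toward an unrelated driver.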

High-resolution model

Think of this as taking the SR task that upgrades 512 x 512 to 1024 x 1024 and building it into the framework instead of running it as post-processing. The model that maps 512 → 1024 is denoted $G_{enh}$.

The losses fall into two groups. The first uses an MAE loss and a GAN loss, like standard SR objectives.

The second group is unsupervised and is there to guarantee performance in the cross-driving setting. Given a high-resolution dataset of images $x^{HR}$ with diverse identities, we sample another image $x^{HR}_c$ from it. The initial reconstruction is $\hat{x}_c = G_{base}(x^{LR}, x_c^{LR})$, and applying SR to it gives $\hat{x}_c^{HR} = G_{enh}(\hat{x}_c)$. There is no ground truth for this output, so the best we can do is use a patch discriminator to match its distribution to that of the ground-truth images.

In addition, a cycle-consistency loss can be applied at the lower resolution.
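One plausible form of such a low-resolution consistency term is sketched below: downsample the enhanced output back to the base resolution and compare it with the base model's output using MAE. The 2x average pooling, the grayscale nested-list image format, and the function names are all assumptions for this sketch, not the paper's exact operator.

```python
def downsample_2x(img):
    """2x average pooling on a grayscale image stored as a list of rows."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


def mae(a, b):
    """Mean absolute error between two images of the same size."""
    diffs = [abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def low_res_cycle_loss(x_hat_hr, x_hat_lr):
    """Compare the downsampled G_enh output with the G_base output."""
    return mae(downsample_2x(x_hat_hr), x_hat_lr)


x_hat_c = [[1.0, 2.0], [3.0, 4.0]]           # stands in for G_base output
x_hat_c_hr = [[1.0, 1.0, 2.0, 2.0],          # stands in for G_enh output
              [1.0, 1.0, 2.0, 2.0],
              [3.0, 3.0, 4.0, 4.0],
              [3.0, 3.0, 4.0, 4.0]]
print(low_res_cycle_loss(x_hat_c_hr, x_hat_c))
```

The point of a term like this is that $G_{enh}$ may add high-frequency detail but should not change the content that $G_{base}$ already produced.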

The final loss combines the two groups above with this cycle-consistency term.

Student model

The student model is trained only in cross-driving mode, using pseudo-ground truth generated by the teacher model. The student network supports only a limited number of avatars, so it is conditioned on an index $i$ into the $N$ appearances.

So a driving frame $x_d$ and an index $i$ are sampled, and the student is trained with a perceptual loss and an adversarial loss so that its output matches the teacher's output for the same pair.
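In symbols, the two images being matched can be written roughly as follows. Here $x_i$ denotes the $i$-th predefined source image, the superscripts $T$ and $S$ mark teacher and student, and $G_{DT}$ is the label I use for the student network; treat these names as assumptions rather than the paper's exact notation.

$$\hat{x}^{T}_{i→d} = G_{enh}(G_{base}(x_i, x_d)), \qquad \hat{x}^{S}_{i→d} = G_{DT}(x_d, i)$$

The teacher is the full base plus high-resolution pipeline, while the student only sees the driver frame and the avatar index, which is what makes real-time inference feasible.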

