Creating a Better Virtual Human

Final result: https://www.instagram.com/geni.baek.art/
It is extremely common to create virtual humans these days, and there are multiple ways to do it. You may have heard of virtual influencers such as Rozy, Miquela, and Imma. They are not 100% generated, in the sense that their faces are rigged onto real human bodies.
This article discusses how to create a good virtual human with end-to-end generation using diffusion or other AI models. Roughly, there are three ways to do this: 1) face swapping after image generation, 2) IP-Adapter FaceID (or similar adapter approaches), and 3) training an existing model for adaptation (DreamBooth or LoRA).
Here, I analyze their pros and cons in terms of cost, text alignment, and the balance between diversity and consistency.
1. Face Swapping after Generation
Face swapping uses models like inswapper, and it is readily available in WebUI or ComfyUI via ReActor. It can be applied to any generated image. (If you apply it to pictures of real people, that is called a deepfake.)

Fig. Face swap pipeline
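
As a concrete illustration of the pipeline above, here is a minimal sketch using the insightface library. It assumes the inswapper_128.onnx file has been downloaded locally; the image file names are placeholders.

```python
# Minimal face-swap sketch with insightface
# (assumes inswapper_128.onnx has been downloaded locally).
import cv2
import insightface
from insightface.app import FaceAnalysis

# Face detector / embedding model
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

# The swapper itself (note: it operates on a small, fixed-size face crop)
swapper = insightface.model_zoo.get_model("inswapper_128.onnx")

source = cv2.imread("reference_face.png")   # your virtual human's face
target = cv2.imread("generated_image.png")  # any generated image

source_face = app.get(source)[0]
for face in app.get(target):
    # Paste the swapped face crop back into the full image
    target = swapper.get(target, face, source_face, paste_back=True)

cv2.imwrite("swapped.png", target)
```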
Pros
- You do not depend on a specific generation model; swapping can be applied to images generated in any style.
- No training is required.
Cons
- Most swapping models are limited to low-resolution face crops.
- Lighting and fine details are lost.
- It performs poorly when the face pose changes.
- It is very difficult to get makeup variations such as face painting.

Fig. Lighting and resolution degraded.

Fig. Only the frontal pose works well.
How to Overcome Face Resolution Drop?
Despite its simplicity, the biggest problem with this approach is the resolution drop in the face region. GAN-based upscalers (e.g., GFPGAN) do not work very well here. I would recommend lightly inpainting the facial region at a low denoising strength, going only as high as you can while still keeping the original face identity.
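
For example, here is a minimal sketch of that low-strength inpainting pass with diffusers, assuming an SDXL inpainting checkpoint and a pre-made mask over the face region; the file names and prompt are placeholders.

```python
# Low-strength inpainting sketch over the swapped face region (diffusers).
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("swapped.png")    # face-swapped result
mask = load_image("face_mask.png")   # white over the face region, black elsewhere

result = pipe(
    prompt="photo of a woman, detailed skin, natural lighting",
    image=image,
    mask_image=mask,
    strength=0.3,        # low strength: recover texture, keep identity
    guidance_scale=7.0,
).images[0]
result.save("refined.png")
```

Raising the strength recovers more skin texture but drifts further from the swapped identity, so it is worth sweeping a few values.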

2. IP-Adapters (FaceID)
An IP-Adapter is essentially an image prompt. The prompt is extracted from a given image with a CLIP vision encoder. FaceID is a specialized IP-Adapter that extracts features from face images. Using the FaceID IP-Adapter basically adds an extra prompt: "resembles this face".
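To make this concrete, here is a condensed sketch of the FaceID usage pattern, assuming insightface and the ip_adapter package from the official IP-Adapter repository are installed; the base model, checkpoint path, and file names are placeholders.

```python
# FaceID IP-Adapter sketch: a face embedding acts as an extra image prompt.
import cv2
import torch
from insightface.app import FaceAnalysis
from diffusers import StableDiffusionPipeline
from ip_adapter.ip_adapter_faceid import IPAdapterFaceID

# 1) Extract the face embedding (this is the "image prompt") with insightface.
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))
face = app.get(cv2.imread("reference_face.png"))[0]
faceid_embeds = torch.from_numpy(face.normed_embedding).unsqueeze(0)

# 2) Wrap an SD 1.5 pipeline with the FaceID adapter checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD 1.5 checkpoint
    torch_dtype=torch.float16,
)
ip_model = IPAdapterFaceID(pipe, "ip-adapter-faceid_sd15.bin", "cuda")

# 3) The embedding is injected alongside the text prompt ("resembles this face").
images = ip_model.generate(
    prompt="a woman hiking in the mountains, golden hour",
    faceid_embeds=faceid_embeds,
    num_samples=2,
    width=512,
    height=768,
    num_inference_steps=30,
)
images[0].save("faceid_result.png")
```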
Pros
- Supports high-resolution 1024 x 1024 images without a resolution drop on the face, if you use FaceID SDXL (here).
- It can output varied poses, lighting, and so on.
Cons
- It decreases text alignment, since the image prompt can strongly disrupt your textual prompt. This makes long descriptions very difficult to use.
- The adapter works well only with the models it was trained with. In general, those are base models such as SD 1.5 and SDXL, which produce poor realistic photos (that is why these models became popular among users).
- It is well known that the CLIP vision encoder captures global semantics, not detailed information.
Personally, I think using an IP-Adapter to preserve face identity is an over-use of the CLIP vision encoder, and I did not get satisfactory results in either identity preservation or text alignment.

3. Training Customized Models (DreamBooth and LoRA)
In the end, I arrived here. This is the best method if you have enough GPU power.
The basic concept is to feed a foundation model images of your character.
I selected the 10 best-quality photos, produced by face swapping on images generated by flux.dev.
Then, I trained the flux.dev model with the DreamBooth objective, teaching the model to generate my character when given `sks woman`. `sks` means nothing; it is simply a token that does not appear in the dictionary. Personally, I used the DreamBooth training script after some fixes (here is my PR to huggingface).
I selected Flux as the foundation, as it performs best on hands and fingers and has insanely good text alignment.
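Once training finishes, generating the character is just a prompt away. Here is a minimal inference sketch with diffusers, assuming the trained pipeline was saved to ./flux-dreambooth-output (a hypothetical path).

```python
# Inference sketch: load the DreamBooth-trained Flux pipeline and prompt with the rare token.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "./flux-dreambooth-output",   # hypothetical output dir of the training run
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a photo of sks woman reading a book in a cafe, soft window light",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("sks_woman_cafe.png")
```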
Pros
- No face resolution drop, and little gap between the target images and the model outputs (if correctly trained).
- Keeps face and lighting details while allowing makeup variation.

Cons
- Requires a lot of trial and error to pick a training checkpoint that balances identity consistency and diversity. If you train too long or without prior preservation, your model will collapse into reproducing only the photos you provided. In contrast, training too briefly will give you an unsatisfactory identity.
- Requires decent GPUs.