Video face segmentation: use cases and open datasets

Posted June 12, 2024

#FaceSegmentation #VideoSegmentation #OpenDatasets #Datasets #SegmentationDatasets #VideoFace

Videoconferencing, livestreaming, and virtual reality applications are gaining widespread adoption. This raises an urgent and serious question: how can personal privacy be protected?

Against this backdrop, real-time background replacement technology has developed rapidly.

(Daniel Dvorský on Unsplash)

What is video face segmentation?

Video human segmentation, also known as portrait video segmentation, can be seen as a special semantic segmentation task.

Compared with general semantic segmentation, portrait segmentation is relatively simple, with only two categories: portrait and background.

The goal of portrait segmentation is to separate the portrait from the background.

However, video portrait segmentation differs from single-frame image segmentation: it relies on the continuity between multiple frames to achieve smooth and accurate segmentation results.

Let’s make this clearer: the movement of a person in a video is continuous.

So the portrait mask is normally similar across several consecutive frames, and we can refer to the mask already computed for frame x-1 when segmenting frame x, as in the sketch below.
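As a minimal illustration of this idea (a simplified sketch, not a specific published method), the per-frame soft masks can be stabilized with an exponential moving average, so each frame’s mask is blended with the smoothed mask from the previous frame:

```python
import numpy as np

def smooth_masks(masks, alpha=0.7):
    """Temporally smooth per-frame soft masks (values in [0, 1]).

    masks: iterable of HxW float arrays, one per video frame.
    alpha: weight of the current frame; (1 - alpha) carries over the
           smoothed mask from frame x-1 when processing frame x.
    """
    prev = None
    for mask in masks:
        if prev is None:
            prev = mask.astype(np.float32)
        else:
            # Blend frame x's raw mask with frame x-1's smoothed mask.
            prev = alpha * mask.astype(np.float32) + (1.0 - alpha) * prev
        yield prev
```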

(engin akyurt on Unsplash)

How does human video segmentation work?

A representative network structure is the FCN (Fully Convolutional Network). Its model structure is simple: a backbone such as VGG extracts image features, the final fully connected layers are removed, and transposed convolutions upsample the repeatedly downsampled feature maps back to the size of the original image, so that a class label can be generated for each pixel.

Generally, a CNN places fully connected layers after the convolutional layers, mapping the feature maps produced by the convolutions into a fixed-length feature vector.

An FCN, however, must classify individual pixels and accept input images of any size. It uses deconvolution (transposed convolution) to upsample the feature map of the last convolutional layer back to the size of the input image, generating a prediction for each pixel while preserving the spatial information of the original input.

Finally, pixel-by-pixel classification is performed on the upsampled feature map.
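A minimal PyTorch sketch of this architecture (an illustrative simplification of the idea described above, not the original FCN code): a VGG16 backbone with its fully connected layers dropped, a 1×1 convolution scoring the two classes, and a transposed convolution restoring the input resolution:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FCNPortrait(nn.Module):
    """Minimal FCN: VGG16 features + 1x1 classifier + 32x upsampling."""

    def __init__(self, num_classes=2):  # portrait vs. background
        super().__init__()
        # VGG16 convolutional backbone; the FC layers are dropped.
        self.backbone = vgg16(weights=None).features  # downsamples 32x
        # 1x1 conv maps 512 feature channels to per-class scores.
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)
        # Transposed convolution upsamples 32x, back to the input size.
        self.upsample = nn.ConvTranspose2d(
            num_classes, num_classes, kernel_size=64, stride=32, padding=16
        )

    def forward(self, x):
        feats = self.backbone(x)         # (N, 512, H/32, W/32)
        scores = self.classifier(feats)  # (N, 2, H/32, W/32)
        return self.upsample(scores)     # (N, 2, H, W) per-pixel logits

# Usage: per-pixel logits for a 512x512 frame; argmax yields the mask.
logits = FCNPortrait()(torch.randn(1, 3, 512, 512))
mask = logits.argmax(dim=1)  # 0 = background, 1 = portrait
```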

 

Human video segmentation in different scenarios

Virtual backgrounds are based on portrait segmentation technology: the portrait is segmented out of the image and the background is replaced, as in the compositing sketch below.
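The replacement step itself is simple alpha compositing: each output pixel is a mask-weighted blend of the original frame and the new background. A minimal sketch (the array shapes are assumptions for illustration):

```python
import numpy as np

def replace_background(frame, mask, new_bg):
    """Composite the portrait onto a new background.

    frame, new_bg: HxWx3 uint8 images of the same size.
    mask: HxW float array in [0, 1]; 1 where the portrait is.
    """
    alpha = mask.astype(np.float32)[..., None]  # HxWx1, broadcasts over RGB
    out = alpha * frame + (1.0 - alpha) * new_bg
    return out.astype(np.uint8)
```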

Depending on the application scenario, these uses can be broadly divided into the following categories.

1. Live broadcasting scenarios

Video portrait segmentation is used to create atmosphere, for example in live education broadcasts and online annual meetings.

 (Caspar Camille Rubin on Unsplash)

2. Real-time communication scenarios

Video portrait segmentation is used to protect user privacy, e.g. in video conferencing.

Background replacement techniques are particularly important in video conferencing, because a participant’s background may not be suitable for sharing on camera. The portrait is first separated from the original background and then composited onto a new, fixed background.

3. Interactive entertainment scenarios

Video portrait segmentation is used to add playful elements in scenarios such as film and TV editing.

For instance, in some shots we highlight the main characters’ portraits and blur the background, which yields better visual effects.

Background blur first requires separating the portrait from the scene: portrait segmentation extracts the person, the background is blurred, and the sharp portrait is composited back over the blurred background, as sketched below.
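A small OpenCV sketch of this pipeline, assuming a segmentation model has already produced the soft portrait mask:

```python
import cv2
import numpy as np

def blur_background(frame, mask, ksize=31):
    """Blur everything outside the portrait mask.

    frame: HxWx3 uint8 image; mask: HxW float in [0, 1].
    ksize: Gaussian kernel size (odd); larger means stronger blur.
    """
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    alpha = mask.astype(np.float32)[..., None]
    # Keep the sharp portrait; substitute the blurred background.
    out = alpha * frame + (1.0 - alpha) * blurred
    return out.astype(np.uint8)
```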

Background replacement also has many applications in film production. For example, for certain science fiction scenes, filmmakers composite the segmented portrait onto a modeled background image.

(Ryan Garry on Unsplash)

Related free and commercial datasets

Open datasets

1. VideoMatte240K

The dataset contains 384 green screen videos at 4K resolution and 100 at HD resolution. From these videos, 240,709 frames were generated for portrait segmentation. The dataset covers a wide variety of clothing types, so a model trained on it can achieve better robustness.

More info: https://grail.cs.washington.edu/projects/background-matting-v2/#/datasets

2. PhotoMatte13K/85

The dataset contains 13,665 high-quality green screen images, with the subjects’ poses kept within a reasonable range. The resolution of most images is 2000×2500.

More info: https://github.com/PeterL1n/BackgroundMattingV2

3. Adobe Image Matting

The training set consists of 49,300 images built from 493 unique foreground objects; the test set consists of 1,000 images. Most images have a resolution of about 1000×1000.

More info: https://sites.google.com/view/deepimagematting

Commercial datasets

1. MD-Image-003 

MD-Image-003 is a single-person portrait segmentation dataset with a total of about 50,000 images. The dataset, collected from the Internet, includes a variety of poses, hairstyles and landscapes. The image resolution is greater than 1080×1080.

More info: https://maadaa.ai/dataset/single-person-portrait-matting-dataset/

2. MD-Image-004

MD-Image-004 is a segmentation dataset of East Asian portraits, with a total of about 50,000 images. Backgrounds include indoor, outdoor, street, and sports scenes.

More info: https://maadaa.ai/dataset/eastern-asia-single-person-portrait-segmentation-dataset/


For any further information, please contact us.
