Demystifying Key Techniques for Facial Feature Point Detection

The facial feature point localization task automatically locates the key feature points of a face from an input face image, such as the eyes, the nose tip, the corners of the mouth, the eyebrows, and the contour points of the face, as shown in the figure below.


Figure 1 Key feature points of the face

This technology is widely used, for example in automatic face recognition, expression recognition, and face animation. Due to factors such as pose, expression, lighting, and occlusion, accurately locating each key feature point can be difficult. A brief analysis shows that this task can actually be split into three sub-problems:

1. How to model the facial appearance (the input);

2. How to model the face shape (the output);

3. How to establish the association between the facial appearance model and the face shape model.

Past research work has likewise revolved around these three aspects. Typical methods for face shape modeling include the Deformable Template, the Point Distribution Model (Active Shape Model), and graph models.

Facial appearance modeling can be divided into global appearance modeling and local appearance modeling. Global appearance modeling models the appearance information of the entire face; typical methods include the Active Appearance Model and the Boosted Appearance Model. Correspondingly, local appearance modeling models the appearance information of local regions, including color models, projection models, profile models, and so on.

Recently, cascaded shape regression models have made major breakthroughs on the feature point localization task. These methods use regression models to directly learn the mapping from the face appearance to the face shape (or to the parameters of a face shape model), thereby establishing the correspondence from appearance to shape. They require no complex shape or appearance modeling, are simple and efficient, and achieve good localization in both controllable scenes (faces collected under laboratory conditions) and uncontrollable scenes (face images from the web, etc.). In addition, facial feature point localization based on deep features has also achieved remarkable results. Combining deep learning with the shape regression framework can further improve the accuracy of the localization model, and has become one of the mainstream approaches to feature point localization. Below I introduce the research progress of these two families of methods: cascaded shape regression and deep learning.

Cascaded shape regression models

The facial feature point localization problem can be regarded as learning a regression function F that takes the image I as input and outputs the feature point positions (the face shape) θ: θ = F(I). Simply put, cascaded regression models can be unified under the following framework: learn multiple regression functions {f1, ..., fn-1, fn} that jointly approximate the function F:

θ = F(I) = fn(fn-1(⋯f1(θ0, I), I), I)

θi = fi(θi-1, I), i = 1, ..., n

The term "cascade" means that the input of the current function fi depends on the output θi-1 of the previous function fi-1, and the learning target of each fi is to approach the true feature point positions θ; θ0 is the initial shape. Normally, fi does not regress the true position θ directly, but instead regresses the difference between the current shape θi-1 and the true position: Δθi = θ − θi-1.
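The cascade update above can be sketched in a few lines of code. This is a toy illustration: the stage functions here are placeholders, not the learned ferns or linear regressors discussed in this article.

```python
import numpy as np

def cascaded_regression(image, theta0, regressors):
    """Run the cascade: each stage f_i predicts the increment
    delta_theta_i = theta - theta_{i-1}, and the shape is updated as
    theta_i = theta_{i-1} + delta_theta_i."""
    theta = theta0.copy()
    for f in regressors:
        theta = theta + f(theta, image)
    return theta

# Toy usage: each hypothetical stage moves halfway toward a fixed target shape.
target = np.array([10.0, 20.0])
halfway = lambda theta, image: 0.5 * (target - theta)
theta_hat = cascaded_regression(None, np.zeros(2), [halfway] * 5)
```

After five such stages the residual error shrinks by a factor of 2^5, which mirrors how each real stage only needs to remove part of the remaining error.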

Next, I will introduce a few typical shape regression methods in detail. Their fundamental differences lie in the design of the functions fi and in the input features.

Piotr Dollár, then a postdoctoral researcher at the California Institute of Technology, first proposed the Cascaded Pose Regression (CPR) model in 2010 to predict the shape of objects; the work was published at the International Conference on Computer Vision and Pattern Recognition (CVPR). As shown in the following figure, given the initial shape θ0 (usually the mean shape), features (differences between pairs of pixel values) are extracted relative to θ0 as the input of the function f1. Each function fi is modeled as a random fern regressor that predicts the difference Δθi between the current shape θi-1 and the target shape θ; the current shape is then updated as θi = θi-1 + Δθi and used as the input of the next function fi+1.
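A random fern can be sketched as follows. This toy version (binary tests of feature values against thresholds, with per-bin mean increments) is only meant to convey the structure of a fern stage, not CPR's actual implementation:

```python
import numpy as np

class RandomFern:
    """Toy random fern regressor, a hedged sketch of one CPR stage f_i.
    S binary tests (feature value vs. threshold) yield an S-bit index into
    2**S bins; each bin stores the running mean of the shape increments
    delta_theta of the training samples that fell into it."""
    def __init__(self, S, dim, rng):
        self.S = S
        self.thresholds = rng.uniform(-1.0, 1.0, size=S)
        self.bins = np.zeros((2 ** S, dim))
        self.counts = np.zeros(2 ** S)

    def _index(self, features):
        bits = features[: self.S] > self.thresholds
        return int(np.dot(bits, 2 ** np.arange(self.S)))

    def fit(self, feature_rows, delta_rows):
        for f, d in zip(feature_rows, delta_rows):
            i = self._index(f)
            self.counts[i] += 1
            self.bins[i] += (d - self.bins[i]) / self.counts[i]  # running mean

    def predict(self, features):
        return self.bins[self._index(features)]

# Toy usage: if all training increments are equal, every visited bin's mean
# is exactly that increment.
rng = np.random.default_rng(0)
fern = RandomFern(S=3, dim=2, rng=rng)
feats = rng.uniform(-1.0, 1.0, size=(50, 3))
fern.fit(feats, np.ones((50, 2)))
pred = fern.predict(feats[0])
```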

This method achieves good experimental results on three data sets (faces, mice, and fish), and the general algorithm framework can also be used for other shape estimation tasks, such as human pose estimation. Its shortcoming is sensitivity to the initial shape θ0. Running multiple tests from multiple initializations and merging the predictions can alleviate the impact of initialization to a certain extent, but cannot completely solve the problem, and the repeated tests bring additional computational overhead. Performance also deteriorates when the target object is occluded.


Figure 2 Key technologies for facial feature point detection

Xavier P. Burgos-Artizzu, from the same research group as the previous work, further proposed the Robust Cascaded Pose Regression (RCPR) method to address the shortcomings of CPR, publishing it at the 2013 International Conference on Computer Vision (ICCV). To handle occlusion, the method simultaneously predicts the face shape and whether each feature point is occluded; that is, the output of fi contains both Δθi and an occlusion state pi for each feature point:

{Δθi, pi} = fi(θi-1, I), i = 1, ..., n

When some feature points are occluded, features from the occluded regions are not selected as input, avoiding interference from the occlusion. In addition, the authors propose a smart restart technique to address sensitivity to the shape initialization: randomly initialize a set of shapes, run the first 10% of the cascade {f1, ..., fn-1, fn}, and compute the variance of the predicted shapes. If the variance is below a certain threshold, the initializations agree well, and the remaining 90% of the cascade is run to obtain the final prediction; if the variance exceeds the threshold, the initialization is not ideal, and a new set of shapes is drawn. The strategy is straightforward, but the results are very good.
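The smart restart logic can be sketched as follows. This is a toy illustration with placeholder stage functions; RCPR's actual stages are fern regressors over occlusion-aware features:

```python
import numpy as np

def smart_restart(image, regressors, sample_inits, var_threshold, max_tries=10):
    """Sketch of RCPR's smart restart: run the first 10% of the cascade on a
    set of random initial shapes; if the predictions agree (low variance),
    keep them and finish the cascade, otherwise redraw the initializations."""
    n_early = max(1, len(regressors) // 10)
    for _ in range(max_tries):
        shapes = sample_inits()
        for f in regressors[:n_early]:                 # first 10% of the stages
            shapes = [s + f(s, image) for s in shapes]
        if np.stack(shapes).var(axis=0).mean() < var_threshold:
            break                                      # initializations agree
    for f in regressors[n_early:]:                     # remaining 90%
        shapes = [s + f(s, image) for s in shapes]
    return np.stack(shapes).mean(axis=0)               # merge the predictions

# Toy usage: stages move halfway toward a fixed target shape.
target = np.array([4.0, 4.0])
stages = [lambda th, img: 0.5 * (target - th)] * 10
inits = lambda: [np.zeros(2), np.ones(2)]
theta_hat = smart_restart(None, stages, inits, var_threshold=0.1)
```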

Another interesting work, the Supervised Descent Method (SDM), considers the problem from a different perspective: how to solve a nonlinear least squares problem with supervised gradient descent, successfully applying this idea to facial feature point localization. It is not difficult to see that the method's final algorithmic framework is also a cascaded regression model.

The differences from CPR and RCPR are that fi is modeled as a linear regression model, and that the input of fi is SIFT features indexed by the current face shape. Extracting these features is also very simple: a 128-dimensional SIFT descriptor is extracted at each feature point of the current shape θi-1, and all the descriptors are concatenated as the input of fi.
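Training one such linear stage reduces to least squares. Here is a minimal sketch, with generic feature vectors standing in for the concatenated shape-indexed SIFT descriptors:

```python
import numpy as np

def train_sdm_stage(phi, deltas):
    """Fit one cascade stage by least squares: delta_theta ≈ phi @ W + b.
    phi holds one feature vector per training face (in SDM, the 128-d SIFT
    descriptors at the current landmarks, concatenated); deltas holds the
    target increments theta - theta_{i-1}."""
    X = np.hstack([phi, np.ones((len(phi), 1))])   # append a bias column
    Wb, *_ = np.linalg.lstsq(X, deltas, rcond=None)
    return Wb[:-1], Wb[-1]                          # weights W, bias b

# Toy usage: recover a known linear map from synthetic data.
rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 5))
W_true, b_true = rng.normal(size=(5, 2)), rng.normal(size=2)
W, b = train_sdm_stage(phi, phi @ W_true + b_true)
```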

This method achieves good localization results on the LFPW and LFW-A&C data sets. Another contemporaneous work, DRMF, uses support vector regression (SVR) to model the regression function fi and shape-indexed HOG features (extracted similarly to the shape-indexed SIFT) as its input, cascading the prediction of the face shape. The biggest difference from SDM is that DRMF models the face shape parametrically, so the goal of fi becomes predicting the shape parameters instead of the face shape directly. Both works were published at CVPR 2013. Since a parametric face shape model can hardly describe all shape variations perfectly, SDM measures better than DRMF in practice.

The team of Jian Sun, a researcher at Microsoft Research Asia, presented the more efficient Regressing Local Binary Features (LBF) at CVPR 2014. Similar to SDM, fi is modeled as a linear regression model; the difference is that SDM uses SIFT features directly, while LBF learns sparse binarized features in local regions using random forest regressors. Learning sparse binarized features greatly reduces the computational overhead, making LBF faster than CPR, RCPR, SDM, DRMF, and so on (LBF can run at 300 FPS on a mobile phone), while outperforming SDM and RCPR on the IBUG public evaluation set.
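The idea of local binary features can be conveyed with depth-1 "trees" (decision stumps): each tree maps a sample to a one-hot leaf indicator, and the concatenated indicators form the sparse binary vector fed to the global linear regressor. This is only a structural sketch, not LBF's learned forests:

```python
import numpy as np

def local_binary_features(features, stumps):
    """Map a feature vector to a sparse binary code: each stump (j, thr)
    plays the role of a tiny tree whose leaf index is features[j] > thr;
    the one-hot leaf indicators of all trees are concatenated."""
    code = np.zeros(2 * len(stumps))
    for t, (j, thr) in enumerate(stumps):
        leaf = int(features[j] > thr)   # which of the two leaves fires
        code[2 * t + leaf] = 1.0        # one-hot indicator for that leaf
    return code

# Toy usage: two stumps over a 2-d feature vector.
code = local_binary_features(np.array([1.0, -1.0]), [(0, 0.0), (1, 0.0)])
```

Because exactly one indicator per tree is nonzero, the subsequent linear regression over this code amounts to summing one learned increment per tree, which is what makes the method so fast.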


Figure 3 Learning of local binary features

The keys to the success of cascaded shape regression models are:

1. Shape-indexed features are used; that is, the input of the function fi is closely related to the current face shape θi-1;

2. The target of the function fi is also related to the current face shape θi-1; that is, the optimization target of fi is the difference Δθi between the current shape θi-1 and the true position θ.

Such methods achieve good localization performance in both controllable and uncontrollable scenarios, and run in real time.
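The shape-indexed feature in point 1 can be illustrated with a toy sampler: pixel values are read at fixed offsets relative to the current landmark estimates, so the features track the face as the shape is updated. This is a sketch only; CPR uses pixel differences and SDM uses SIFT descriptors at these locations.

```python
import numpy as np

def shape_indexed_features(image, shape, offsets):
    """Sample intensities at offsets relative to the *current* landmarks
    (x, y), clamped to the image; the features move with the shape estimate."""
    h, w = image.shape
    feats = []
    for x, y in shape:
        for dx, dy in offsets:
            px = int(np.clip(round(x + dx), 0, w - 1))
            py = int(np.clip(round(y + dy), 0, h - 1))
            feats.append(image[py, px])
    return np.array(feats)

# Toy usage on a 4x4 ramp image with one landmark at (1, 1).
image = np.arange(16, dtype=float).reshape(4, 4)
feats = shape_indexed_features(image, [(1.0, 1.0)], [(0, 0), (1, 0)])
```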

Deep models

The cascaded shape regression methods described above use shallow models (linear regression, random ferns, etc.). Deep network models such as Convolutional Neural Networks (CNNs), Deep Autoencoders (DAEs), and Restricted Boltzmann Machines (RBMs) are widely used in computer vision problems such as scene classification, object tracking, and image segmentation, and of course also in feature point localization. The specific methods fall into two categories: using deep models to model the shape and appearance of the face, and using deep networks to learn the nonlinear mapping from face appearance to face shape.

The Active Shape Model (ASM) and Active Appearance Model (AAM) use principal component analysis (PCA) to model variations in face shape. Due to factors such as pose and expression, linear PCA models struggle to depict face shape changes across different expressions and poses perfectly. The research group of Professor Qiang Ji at Rensselaer Polytechnic Institute proposed at CVPR 2013 to use a Deep Belief Network (DBN) to characterize the complex nonlinear variations of the face shape under different expressions. In addition, to handle feature point localization under different poses, a 3-way RBM network is further used to model the change of the face shape from frontal to non-frontal views. The method achieves better localization results than the linear AAM on the expression database CK+. On the ISL database, which contains multiple poses and multiple expressions simultaneously, it also achieves better localization results, though it remains unsatisfactory for extreme poses and exaggerated expression changes.

The following figure is a schematic diagram of using a Deep Belief Network (DBN) to model changes in face shape under different expressions.


Figure 4 Modeling changes in face shape under different expressions

The research group of Professor Xiaoou Tang at the Chinese University of Hong Kong proposed a three-level cascaded convolutional neural network, DCNN, for facial feature point localization at CVPR 2013. This method can also be unified under the cascaded shape regression framework; unlike CPR, RCPR, SDM, and LBF, DCNN implements fi with a deep model, the convolutional neural network. The first level f1 takes three different regions of the face image (the whole face, the eyes-and-nose region, and the nose-and-lips region) as input and trains three convolutional networks to predict the feature point positions. Each network contains four convolutional layers, three pooling layers, and two fully connected layers, and the predictions of the three networks are merged to obtain a more stable localization result.

The latter two levels f2 and f3 extract features around each feature point and train a separate convolutional neural network (two convolutional layers, two pooling layers, and one fully connected layer) for each feature point to refine the localization result. This method achieved the best localization results at the time on the LFPW data set.


Figure 5 DCNN achieved the best localization results at the time on the LFPW data set

I also take this opportunity to introduce one of my own works, presented at the European Conference on Computer Vision (ECCV) 2014: a Coarse-to-Fine Auto-encoder Network (CFAN) that describes the complex nonlinear mapping from face appearance to face shape. The method cascades multiple stacked autoencoder networks fi, each of which depicts part of the nonlinear mapping from face appearance to face shape.

Specifically, given a low-resolution face image I, the first stacked autoencoder network f1 quickly estimates an approximate face shape, and is referred to as the global-feature stacked autoencoder network. f1 contains three hidden layers with 1600, 900, and 400 hidden nodes respectively. The resolution of the face image is then increased, joint local features are extracted around the initial shape θ1 produced by f1 and fed into the next stacked autoencoder network f2, which jointly refines the positions of all feature points; these networks are referred to as local-feature stacked autoencoder networks. The method cascades three local stacked autoencoder networks {f2, f3, f4} until convergence on the training set; each contains three hidden layers with 1296, 784, and 400 hidden nodes respectively. Thanks to the powerful nonlinear representation capability of deep models, this method achieves better results than DRMF and SDM on the XM2VTS, LFPW, and HELEN data sets. In addition, CFAN runs facial feature point localization in real time (down to 23 milliseconds per image on an i7 desktop), faster than DCNN (120 milliseconds per image).
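At inference time, one stacked autoencoder stage of this kind is simply an MLP mapping features to a shape. Here is a minimal forward-pass sketch with tiny random weights standing in for the trained 1600/900/400-node networks:

```python
import numpy as np

def stage_forward(x, layers):
    """Forward pass of one cascade stage f_i: nonlinear hidden layers
    followed by a linear output predicting the shape (2 coords/landmark)."""
    h = x
    for W, b in layers[:-1]:
        h = np.tanh(h @ W + b)          # hidden layers
    W, b = layers[-1]
    return h @ W + b                    # linear shape output

# Toy usage: a 3-hidden-layer stage mapping 64-d features to 5 landmarks.
rng = np.random.default_rng(0)
sizes = [64, 16, 9, 4, 10]              # tiny stand-ins for the real layer sizes
layers = [(0.1 * rng.normal(size=(a, b)), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]
shape_pred = stage_forward(rng.normal(size=64), layers)
```

The coarse-to-fine cascade then feeds higher-resolution, shape-indexed features into each successive stage of this form.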

The following figure is a schematic diagram of CFAN, a real-time facial feature point localization method based on a coarse-to-fine autoencoder network.


Figure 6 Schematic diagram of a real-time facial feature point localization method based on a coarse-to-fine autoencoder network

The methods above, based on cascaded shape regression and deep learning, obtain good localization results under large poses (yaw rotations of -60° to +60°) and various expression changes, with fast processing speeds and good prospects for product applications. Handling full profile views (±90°), partial occlusion, and the joint estimation of face detection and feature localization remain active research topics.
