Advanced real-time facial tracking ready to leave the labs
24 07 2008
By Christian Laforte and Joshua Koopferstock
The goal of facial tracking is to recognize elements of a face in an image, and to follow them in a series of images. It may sound simple, but in reality it’s so complex that the human brain evolved a special area just for this task. Some unfortunate folks are born without that area and can’t even recognize themselves in the mirror.

Is this me? Welcome to Prosopagnosia
(original image)
AAM (Active Appearance Model) is the best family of facial tracking algorithms out there right now. The technique was first published by Cootes, Edwards, and Taylor in 1998 (the widely cited journal version followed in 2001), then heavily optimized and extended by a group of CMU researchers, primarily Iain Matthews, Simon Baker and Ralph Gross.
Here’s why this algorithm rocks:
- It’s fast: 300 Hz on a regular desktop PC.
- It’s robust. It can deal with occlusions, e.g. sunglasses.
- It’s relatively straightforward to implement.
- It requires no special calibration for the user.
First, statistical model…
To do its magic, AAM must be taught what faces look like in various conditions. To achieve this, hundreds of images of faces must be annotated by human operators. These faces should cover a wide range of conditions: different ethnicities, expressions, lighting, and so on. For each image, someone must manually mark special points, e.g. the tip of the nose, to build a mesh.

This training data is converted into a statistical model of face shapes and appearances. This is a tedious process, but once it works for a few faces, the rest of the algorithm can be used to “bootstrap” other faces, so adding new examples becomes faster with time.
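To give a feel for what "statistical model" means here, below is a minimal sketch of the shape half only. It assumes each annotated face is a flat list of landmark coordinates that have already been aligned to each other, and it extracts the mean shape plus the single strongest direction of variation (the first PCA axis) using power iteration. The function names (`mean_shape`, `main_mode`, `synthesize`) and the three-landmark toy data are made up for illustration; a real AAM keeps many modes and builds a similar model for the appearance (pixel values) as well.

```python
import math

def mean_shape(shapes):
    """Average every coordinate over all annotated shapes."""
    count = len(shapes)
    return [sum(shape[i] for shape in shapes) / count
            for i in range(len(shapes[0]))]

def main_mode(shapes, mean, iterations=100):
    """Dominant direction of variation (first PCA axis), by power iteration."""
    centred = [[x - m for x, m in zip(shape, mean)] for shape in shapes]
    # Arbitrary starting vector; it must not be orthogonal to the answer.
    mode = [1.0] * len(mean)
    for _ in range(iterations):
        # Multiply by the covariance matrix without building it:
        # C v is proportional to sum_k (d_k . v) d_k
        product = [0.0] * len(mean)
        for d in centred:
            dot = sum(a * b for a, b in zip(d, mode))
            for i, value in enumerate(d):
                product[i] += dot * value
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # no variation at all in the training set
        mode = [x / norm for x in product]
    return mode

def synthesize(mean, mode, weight):
    """Generate a new shape: mean plus a weighted amount of the mode."""
    return [m + weight * p for m, p in zip(mean, mode)]

# Toy "annotations" as (x, y) pairs: left eye, right eye, mouth.
# Only the mouth height varies between the three faces.
shapes = [
    [0.0, 0.0, 2.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 0.0, 1.0, 4.0],
]
mean = mean_shape(shapes)
mode = main_mode(shapes, mean)
new_face = synthesize(mean, mode, 0.5)  # a face the model never saw
```

The point of the exercise: once the model is built, any plausible face shape is described by a handful of weights instead of dozens of raw coordinates, which is what makes the fitting step tractable.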
… then track the face.
Once we have our statistical model, tracking can be performed in real-time by fitting the model to an image, such as a frame from a real-time video. Technically speaking, this is a non-linear optimization problem that consists of minimizing the error between the image and the model. Because the problem is non-linear, we need a good first estimate and a robust fitting algorithm. Otherwise the tracking gets stuck in the wrong part of the image, and a face gets detected in some guy’s ear.
Having a good first estimate used to be the hardest part. Basically, we need to tell AAM roughly where to look. Five years ago this would have required a special face detection algorithm, making the system twice as complicated to implement. Now that AAM is super fast, it’s probably simpler to just run it from random starting positions in the image until we catch a face.
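Here is a toy version of that idea, shrunk down to one dimension so it fits in a few lines. It is not AAM: the "image" is a 1-D signal with a single bump standing in for the face, the "model" has one parameter (the bump position), and the fit is a plain Gauss-Newton loop. All names (`bump`, `fit`, `fit_with_restarts`) and constants are assumptions made for this sketch. It does show both effects described above: a bad first estimate goes nowhere, and cheap random restarts that keep the lowest-error result work around that.

```python
import math
import random

WIDTH = 4.0  # width of the toy "face" bump, in pixels

def bump(x, centre):
    """The model: a Gaussian bump centred at `centre`."""
    return math.exp(-((x - centre) ** 2) / (2.0 * WIDTH ** 2))

def make_image(face_at, size=100):
    return [bump(x, face_at) for x in range(size)]

def error(image, p):
    """Sum of squared differences between the image and the model at p."""
    return sum((pixel - bump(x, p)) ** 2 for x, pixel in enumerate(image))

def fit(image, p, iterations=30):
    """Gauss-Newton refinement of the single parameter p."""
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for x, pixel in enumerate(image):
            model = bump(x, p)
            jac = model * (x - p) / WIDTH ** 2  # d(model)/dp
            num += jac * (pixel - model)
            den += jac * jac
        if den < 1e-12:
            break  # model has left the image: no gradient to follow
        step = num / den
        p += step
        if abs(step) < 1e-9:
            break  # converged, or stuck on a flat part of the error
    return p

def fit_with_restarts(image, starts, rng):
    """Fit from random starting positions and keep the lowest error."""
    best_p, best_err = None, float("inf")
    for _ in range(starts):
        p = fit(image, rng.uniform(0, len(image) - 1))
        err = error(image, p)
        if err < best_err:
            best_p, best_err = p, err
    return best_p, best_err

image = make_image(face_at=70.0)
near = fit(image, 66.0)  # good first estimate: slides onto the face
far = fit(image, 30.0)   # bad first estimate: barely moves
best, best_err = fit_with_restarts(image, 100, random.Random(0))
```

The restarts only make sense because each individual fit is cheap, which is exactly the trade-off described above: when fitting runs at hundreds of Hz, throwing away most of the attempts is affordable.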
AAM improved by CMU researchers
Prior to the CMU papers, AAM was promising but not robust or fast enough for practical applications. The CMU researchers invented a fitting technique called the inverse compositional image alignment algorithm. Basically, they inverted one key step of the original algorithm (comparing the image against the model), which allowed them to perform some expensive calculations much less frequently. The end result was a much faster and more robust algorithm, capable of running hundreds of times per second.
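To make the "compute expensive things less often" trick concrete, here is a minimal sketch of inverse compositional alignment in the simplest setting we could think of: a 1-D signal and a pure translation. The names (`precompute`, `align`, `sample`) and the toy data are assumptions for illustration, and a real AAM fit has many more parameters. The structure is the interesting part: the template gradient and the Hessian depend only on the template, so they are computed once, and each iteration is reduced to warping the image, taking a difference, and doing one dot product.

```python
import math

def bump(x, centre, width=4.0):
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))

def sample(signal, x):
    """Linear interpolation, clamped at the borders."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]

def precompute(template):
    """Done ONCE: template gradient and the (here 1x1) Hessian."""
    n = len(template)
    # Central differences, with indices clamped at the two ends.
    grad = [(template[min(i + 1, n - 1)] - template[max(i - 1, 0)]) / 2.0
            for i in range(n)]
    hessian = sum(g * g for g in grad)
    return grad, hessian

def align(image, template, grad, hessian, p=0.0, iterations=50):
    """Find the shift p such that image[x + p] matches template[x]."""
    for _ in range(iterations):
        # Per-iteration work: warp the image, subtract, one dot product.
        num = 0.0
        for x, value in enumerate(template):
            num += grad[x] * (sample(image, x + p) - value)
        dp = num / hessian
        # Compose with the INVERSE of the increment; for a pure
        # translation this is simply a subtraction.
        p -= dp
        if abs(dp) < 1e-9:
            break
    return p

template = [bump(x, 20.0) for x in range(40)]  # the "model"
image = [bump(x, 53.0) for x in range(100)]    # same bump, shifted by 33
grad, hessian = precompute(template)
shift = align(image, template, grad, hessian, p=30.0)
```

In the classic formulation the gradient would be taken on the warped image and the Hessian rebuilt at every iteration; moving that work onto the fixed template is where the speed-up comes from.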
The CMU researchers then further improved AAM to deal with occlusions (i.e. partially hidden faces) and to track a 3D face instead of a 2D approximation.
Regular AAM on the left, improved AAM to deal with occlusions on the right (video)
Surprisingly, when the CMU researchers extended AAM to 3D, using one or more cameras, the extension not only produced precise 3D results, it also ran faster and more robustly!
3D AAM in action (video)
From lab to the real world
Tons of cool applications could leverage this algorithm. So why hasn’t it happened yet? I think the primary reason is that most people just don’t know it’s possible.
The time has come for AAM to leave the lab and make the real world a more technologically advanced place. If you have a good application idea and funding to bring this to market, give us a call and we’ll be happy to help and pitch in!





‘300 Hz on a regular desktop PC’ - is the implementation purely CPU-based? Because the problem seems to be parallelizable, at least partially.
Hi Bersi! IIRC the implementation used standard OpenGL rasterization in parts of the algorithm. Thanks for the link BTW!