Feeling Software Going to SIGGRAPH

25 07 2008

By Joshua Koopferstock

SIGGRAPH

Many of us from Feeling Software, including Christian and me, will be attending SIGGRAPH in Los Angeles in a couple of weeks. If any of you would like to sit down and brainstorm some computer vision or 3D-related project ideas, we will be happy to schedule some time during our trip to meet up with you. Send an e-mail to enlighten3d@feelingsoftware.com, which both of us will receive.

Also, if you are exhibiting at SIGGRAPH and read our blog regularly, let us know and we’ll try to come by and say hello!




Advanced real-time facial tracking ready to leave the labs

24 07 2008

By Christian Laforte and Joshua Koopferstock

The goal of facial tracking is to recognize elements of a face in an image, and to follow them in a series of images. It may sound simple, but in reality it’s so complex that the human brain evolved a special area just for this task. Some unfortunate folks are born without that area and can’t even recognize themselves in the mirror.

Is this me? Welcome to Prosopagnosia
(original image)

Active Appearance Models (AAM) are the best family of facial tracking algorithms out there right now. The technique was first published by Cootes, Edwards, and Taylor in 1998 (the journal version followed in 2001), then heavily optimized and extended by a group of CMU researchers, primarily Iain Matthews, Simon Baker and Ralph Gross.

Here’s why this algorithm rocks:

  • It’s fast: 300Hz on a regular desktop PC.
  • It’s robust. It can deal with occlusions, e.g. sunglasses.
  • It’s relatively straightforward to implement.
  • It requires no special calibration for the user.

First, statistical model…

To do its magic, AAM must be taught what faces look like in various conditions. To achieve this, hundreds of images of faces must be annotated by human operators. These faces display a wide range of conditions including different races, different expressions, illumination, etc. For each image, someone must manually mark special points, e.g. tip of the nose, to build a mesh:

This training data is converted into a statistical model of face shapes and appearances. This is a tedious process, but once it works for a few faces, the rest of the algorithm can be used to “bootstrap” other faces, so adding new examples becomes faster with time.

… then track the face.

Once we have our statistical model, tracking can be performed in real time by fitting the model to an image, such as a frame from a live video. Technically speaking, this is a non-linear optimization problem that consists of minimizing the error between the image and the model. Because the problem is non-linear, we need a good first estimate and a robust fitting algorithm; otherwise the tracking gets stuck in the wrong part of the image, and a face will be detected in some guy’s ear.
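To make the role of the first estimate concrete, here is a toy sketch in Python. It is not AAM: the "model" is a single one-dimensional bump with one parameter (its position), and the image, the bump width and the two starting points are all made-up values. It only illustrates the fitting idea, minimizing the squared error between image and model with Gauss-Newton steps, and how a poor starting point locks onto the wrong feature.

```python
import math

SIGMA = 3.0  # width of the toy "face" bump (assumed value)

def bump(x, center):
    """Toy appearance model: a Gaussian bump centered at `center`."""
    return math.exp(-((x - center) ** 2) / (2 * SIGMA ** 2))

# Toy 1D "image": the real face at x=30 and a weaker look-alike (the ear) at x=70.
image = [bump(x, 30.0) + 0.6 * bump(x, 70.0) for x in range(100)]

def fit_error(p):
    """Sum of squared differences between the image and the model placed at p."""
    return sum((image[x] - bump(x, p)) ** 2 for x in range(100))

def fit(p, iterations=50):
    """Gauss-Newton refinement of the model position p from a first estimate."""
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for x in range(100):
            m = bump(x, p)
            jac = m * (x - p) / SIGMA ** 2   # derivative of the model w.r.t. p
            num += jac * (image[x] - m)      # residual projected on the Jacobian
            den += jac * jac
        p += num / den
    return p

good = fit(27.0)   # first estimate close to the real face
bad = fit(68.0)    # first estimate close to the look-alike
print(round(good, 2), round(fit_error(good), 3))
print(round(bad, 2), round(fit_error(bad), 3))
```

Started near the real face, the fit settles on it. Started near the look-alike, it settles there instead and ends with a larger residual error, which is exactly the "face in some guy’s ear" failure.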

Having a good first estimate used to be the hardest part. Basically, we need to tell AAM roughly where to look. Five years ago this would have required a special face detection algorithm, making the system twice as complicated to implement. Now that AAM is super fast, it’s probably simpler to just run it randomly in the image until we catch a face.

AAM improved by CMU researchers

Prior to the CMU papers, AAM was promising but not robust or fast enough for practical applications. The CMU researchers invented a fitting technique called the inverse compositional image alignment algorithm. Basically, they inverted one key step of the original algorithm (comparing the image against the model) which allowed them to perform some expensive calculations much less frequently. The end result was a much faster and more robust algorithm, capable of running hundreds of times per second.
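Here is a rough Python sketch of why inverting that step saves work. It is a deliberately tiny stand-in, not the CMU implementation: the only parameter is a one-dimensional shift, and the pattern, its size and the function names are made up for illustration. The point is that the gradient and the (here scalar) Hessian are computed once on the template, outside the loop, whereas a forward formulation would recompute them on the warped image at every iteration.

```python
import math

N = 100

def pattern(x):
    """Smooth toy 1D 'texture' (made-up shape, for illustration only)."""
    return math.exp(-((x - 50.0) ** 2) / (2 * 8.0 ** 2))

template = [pattern(x) for x in range(N)]       # what the model expects to see
image = [pattern(x - 5.0) for x in range(N)]    # same pattern, shifted by 5 pixels

def sample(values, x):
    """Read values at a fractional position using linear interpolation."""
    x = min(max(x, 0.0), len(values) - 1.0)
    i = min(int(x), len(values) - 2)
    t = x - i
    return (1 - t) * values[i] + t * values[i + 1]

# Expensive part, computed ONCE on the template instead of at every iteration.
gradient = [0.0] + [(template[x + 1] - template[x - 1]) / 2
                    for x in range(1, N - 1)] + [0.0]
hessian = sum(g * g for g in gradient)

def align(shift=0.0, iterations=20):
    """Estimate the shift p such that image(x + p) matches template(x)."""
    for _ in range(iterations):
        # Only the cheap part runs in the loop: warp the image, compare, update.
        error = [sample(image, x + shift) - template[x] for x in range(N)]
        delta = sum(g * e for g, e in zip(gradient, error)) / hessian
        shift -= delta   # inverse compositional update for a pure translation
    return shift

estimated_shift = align()
print(round(estimated_shift, 3))
```

In a real AAM the warp has many parameters and the precomputed quantities are matrices rather than a single number, so skipping their recomputation matters far more than in this toy.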

The CMU researchers then further improved AAM to deal with occlusions (i.e. partially hidden faces) and to track a 3D face instead of a 2D approximation.

Regular AAM on the left,
improved AAM to deal with occlusions on the right
(video)

Surprisingly, when the CMU researchers extended AAM in 3D, using one or more cameras, the extension not only produced precise 3D results, it also ran faster and more robustly!

3D AAM in action (video)

From lab to the real world

Tons of cool applications could leverage this algorithm. So why hasn’t it happened yet? I think the primary reason is that most people just don’t know it’s possible.

The time has come for AAM to leave the lab and make the real world a more technologically advanced place. If you have a good application idea and funding to bring this to market, give us a call and we’ll be happy to help and pitch in!




Semantic texton forests for image categorization and segmentation

24 07 2008

By Christian Laforte

Warning: this post is pretty technical.

I forgot an important newcomer in my earlier post on segmentation algorithms published at CVPR 2008:

Semantic Texton Forests for Image Categorization and Segmentation (PDF, extra results, video)
Jamie Shotton, Matthew Johnson, Roberto Cipolla

Semantic texton forests (STFs) are not a typical segmentation algorithm. Unlike traditional segmentation algorithms that rely primarily on edge information or other low-level image processing, STFs use an ensemble of decision trees built from training examples, e.g. manually segmented images. This allows STFs to understand not only that pixels belong together, but also that they represent, say, a sheep, or some grass. STFs and the related algorithms described in this paper therefore solve two problems: segmentation and categorization.

Other categorization techniques typically rely on feature descriptors or manually tuned filter banks. In contrast, STFs operate directly on pixels, resulting in a very fast, relatively simple-to-implement algorithm, especially compared with state-of-the-art classification systems like Marszałek’s.

Once an STF is built, it can be used to identify that a given green pixel in an image is likely to be grass, by examining neighboring pixels in, say, the 21×21 pixels surrounding it. Likewise, a different pixel could be identified as part of a sheep.

By computing histograms of the semantic textons in a region or a full image, we end up with a higher-level representation, the Bag of Semantic Textons (BoST). We can then perform semantic segmentation, e.g. capture the notion that sheep often stand on grass. This greatly increases the recognition and segmentation accuracy. The authors explore several optional implementation details and optimizations, and report whether each of them improves the segmentation quality.
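To give a feel for the two steps just described (a tree that looks at raw neighboring pixels, then a histogram over a region), here is a toy sketch in Python. Almost everything in it is an assumption made for illustration: a real STF is a forest of many randomized trees learned from labelled images, while this uses a single hand-written tree, a 4×4 grayscale grid, and hypothetical names such as TREE, texton and bag_of_textons.

```python
from collections import Counter

# Hand-written toy tree. A split node is
#   (offset_a, offset_b, threshold, left, right)
# and compares the difference of two pixels near the current one.
# A leaf is simply an integer texton id.
TREE = ((0, 0), (0, 1), 50,                 # root: compare with right neighbor
        ((0, 0), (1, 0), 50, 0, 1),         # then: compare with neighbor below
        2)

def pixel(image, y, x):
    """Read a pixel, clamping coordinates to the image borders."""
    y = min(max(y, 0), len(image) - 1)
    x = min(max(x, 0), len(image[0]) - 1)
    return image[y][x]

def texton(image, y, x, node=TREE):
    """Walk the tree from the root; return the leaf id reached by pixel (y, x)."""
    while not isinstance(node, int):
        (ay, ax), (by, bx), threshold, left, right = node
        diff = pixel(image, y + ay, x + ax) - pixel(image, y + by, x + bx)
        node = left if diff < threshold else right
    return node

def bag_of_textons(image, top, left, height, width):
    """Normalized histogram of leaf ids over a rectangular region."""
    counts = Counter(texton(image, y, x)
                     for y in range(top, top + height)
                     for x in range(left, left + width))
    total = height * width
    return {leaf: n / total for leaf, n in sorted(counts.items())}

# Toy image: bright left half, dark right half.
image = [[200, 200, 0, 0] for _ in range(4)]

whole_image = bag_of_textons(image, 0, 0, 4, 4)
left_half = bag_of_textons(image, 0, 0, 4, 2)
print(whole_image)
print(left_half)
```

The histogram for a region is the kind of fixed-length summary a classifier can consume. Regions with different mixes of textons (here, "flat" versus "vertical edge") produce different histograms.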

Shotton and his colleagues report that their 8 fps implementation achieves segmentation accuracy of 66.9% on the medium-difficulty MSRC segmentation data set:

Original images (top) from the MSRC data set
and the final categorized segmentation (bottom)

The performance drops significantly when the algorithm is confronted with the much harder VOC 2007 data set: 24% by itself, or 42% when combined with TKK, a state-of-the-art detector.

Original image (left) from the VOC2007 data set
and the final categorized segmentation (right)

You’ll notice that the segmentation is rough and not pixel-perfect. The authors mention that a much cleaner segmentation could be performed using a Markov or conditional random field to precisely follow image edges.




Trying on clothes in 3D: We have a long way to go.

23 07 2008

By Joshua Koopferstock & Christian Laforte

For you technology lovers who are still kids at heart, Disneyland has recently opened up the Innoventions Dream Home, showcasing cool high tech integration in a futuristic home. What caught my attention was one invention called the Magic Mirror. Getting its name from the mirror in Snow White, this Magic Mirror does not tell you “who is the fairest of them all” (for that you’ll still need HotOrNot.com). What it does do is allow you to virtually try on clothes in your wardrobe. In fact, the Magic Mirror is not a mirror at all; it is a large display monitor with a video camera next to it.

Magic Mirror

Trying on a dress in the Magic Mirror. Photo: cepro.com

While the concept is neat and would probably be even more useful in the department store dressing room than the bedroom, by the looks of it in the video below, the concept is still far from the realism necessary for a technology like this to take off. A few years back, it was thought that by today, virtual clothes shopping would be mainstream, and companies like My Virtual Model had signed contracts with major apparel retailers to integrate their technology into online stores.

It turns out that the technology wasn’t ready, and by the looks of this Magic Mirror, it still has a long way to go.

Here’s how I think they do it:

The dress moves roughly according to the orientation of the head, so they are most likely using a simple real-time head tracker and applying the pose of the head to the top of the dress. The bottom seems to be animated randomly, or maybe through secondary animation.

Later this week I’ll post on a face tracking algorithm that could make this easily possible.

How could we do it better?

One imperfection is very noticeable: the dress doesn’t follow the shoulders and hips properly. Part of it may be anatomical (this is a guy after all), but I think this problem should be easily solved by tracking the silhouette (using background subtraction) and identifying the shoulders and hips using simple heuristics, e.g. areas of low curvature and roughly horizontal or vertical slopes. This would immediately improve the realism of this solution.
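The silhouette idea can be sketched in a few lines of Python. Everything below is a toy: tiny hand-made grayscale grids instead of camera frames, an arbitrary threshold, and a swapped-in heuristic (the first row where the silhouette suddenly widens) in place of the curvature and slope tests mentioned above. The function names are hypothetical.

```python
def silhouette(background, frame, threshold=30):
    """Mark pixels that differ from the empty-room background by more than threshold."""
    return [[abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

def row_widths(mask):
    """Number of silhouette pixels on each row, from top to bottom."""
    return [sum(row) for row in mask]

def shoulder_row(mask, min_jump=2):
    """First row where the silhouette widens sharply (head to shoulders), or None."""
    widths = row_widths(mask)
    for y in range(1, len(widths)):
        if widths[y - 1] > 0 and widths[y] - widths[y - 1] >= min_jump:
            return y
    return None

# Toy scene: 6 rows by 7 columns, uniform background, bright "person".
background = [[10] * 7 for _ in range(6)]
frame = [row[:] for row in background]
for y in (1, 2):
    frame[y][3] = 200              # narrow head
for y in (3, 4, 5):
    for x in range(1, 6):
        frame[y][x] = 200          # shoulders and torso

mask = silhouette(background, frame)
print(row_widths(mask))
print(shoulder_row(mask))
```

With real video, the background would need to be modeled more carefully (lighting changes, shadows, camera noise), but the structure stays the same: subtract, threshold, then read body landmarks off the outline.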

Another improvement would be to track features on the user’s T-shirt, so we can have a better estimate of the body pose, its size and maybe even the person’s sex. I’d start my search with Automatic Non-Rigid 3D Modeling from Video (Torresani and Hertzmann, 2004), since I remember being impressed with the results way back then: it handles occlusions and variations in illumination very nicely. In the picture below, you can see one of the researchers moving his hands in front of his T-shirt… the algorithm can still capture a 3D representation of the deforming T-shirt. Doing this in real-time may be challenging, but fast GPUs and multi-core systems should make it possible.

Still a Long Way to Go

With the method that we have suggested, there is one major sticking point that we have not addressed: content creation.  Assuming you have the ability to accurately track a person and render the image in real time, you still need a way to create the clothes in 3D.  This is not a simple task, as clothes can be highly variable in elasticity, reflectiveness, etc., which would make automation of the modeling process complex.

Before we see these Magic Mirrors in department stores like Sears or Macy’s that have hundreds of thousands of different apparel items each year, a method for automatic creation of clothing content will have to be developed.  And while automatic 3D content creation is going to take great strides in the next couple of years, the quality of 3D reconstruction needed for clothing is still a long way off.




Google Earth in the browser… is it really useful?

17 07 2008

By Christian Laforte

It’s been over a month since Google launched the browser plug-in for Google Earth, combining the best of Google Maps (fast, easy, in the browser, extendable through JavaScript) with the capability to navigate in a 3D terrain.

So does the 3D capability really add a lot over the regular Google Maps? It’s still too early to tell, but a few examples show the potential applications in education, entertainment and planning.

The most visually compelling example I’ve seen so far comes from Bjorn Sandvik’s ThematicMapping blog:

At a glance, you can see in which countries infant mortality is a critical problem. A color legend alone doesn’t fire the imagination the same way.

Another example, the Google Monster Milktruck, is kind of fun:

This mini-game allows you to drive the milk truck around. Unfortunately, many limitations of the Google Earth engine become apparent, especially the lack of collision detection with walls.

A third example, from GolfNation’s blog, allows you to see a golf course in 3D:

This example demonstrates the main problem with Google Earth right now: for the 3D capability to be worthwhile, we need more 3D content. Trees, cars and buildings look like they are painted on the ground, because we don’t have a 3D representation. We’re basically just looking at 2D data (satellite imagery) from a different perspective.

Google and Microsoft are apparently working hard on this problem of reconstructing buildings and landmarks. Feeling Software is also investing in 3D reconstruction from images… but we’re taking a different strategy that will hopefully put us one step ahead of these giants in one promising niche. Incidentally, our Feeling 3D Engine also supports KML and KMZ, along with geo-referencing and geographic measurements.

There are also other 3D GIS (geographic information systems) that work on the web. One interesting example comes from Korea, according to this informative ZDNet article:

If you know of other compelling examples of 3D use in GIS, by all means, reply to this post!




Follow up: Daihatsu also investing in 3D dash displays

16 07 2008

By Joshua Koopferstock

I posted a little while back about car-maker Renault teaming up with Holografika to create 3D displays in their vehicles, but based on the information provided, we were only left to guess what they might be doing with this technology.

Today I came across a very similar story from “way back” in March about Daihatsu (majority-owned by Toyota) working with ProVision, another 3D holographic display developer and vendor.  Happily, Daihatsu is providing us with a bit more information as to what exactly they will use a 3D display for.  According to the article from Reuters:

“The traditional 2D flat screen acts as a traditional dashboard, displaying gauges such as the speedometer and tachometer.  The 3D holographic screen presents warnings and vehicle information as easy-to-understand 3D images to facilitate driver awareness and recognition. These features will be integrated as part of Daihatsu’s OPCS (Omni-directional Pre-Crash Safety Support System) to create the next generation of digital dashboards.”

Unfortunately it looks like we’re going to have to wait until 2012 before we actually see these things in anything besides concept cars, but it’s still interesting to have a preview of what the future of 3D displays might be.




Crystal Ball Feature: Human Tracking and Recognition

15 07 2008

By Joshua Koopferstock

“I’ve fallen and I can’t get up!”  Who could forget this classic tag line for a wearable safety button that gives you a direct line to an emergency response operator, in case, well, you’ve “fallen and can’t get up!”

Modern technology, however, is working toward a more elegant and unobtrusive solution to this problem.  In this first Crystal Ball Feature, I will be exploring future applications of human crowd tracking and action recognition technologies we have recently covered in our posts.

“I’ve fallen and I can’t get up!”

Let’s begin with applications for safety.  Attach a computer that recognizes action to a simple video camera surveillance system, say, in an assisted living home, and you have built yourself an automatic “I’ve fallen and can’t get up” response system, no button pressing required.  Or, install a similar system at pools and beaches to aid overworked lifeguards by providing them a technology that literally can watch hundreds of people at once.

Move over to the entertainment side of things.  What casino manager wouldn’t be interested in computers that can automatically detect cheaters?  Even a system that only worked a fraction of the time may serve as a deterrent, and security companies have already created facial recognition technologies that compare players’ faces to those of known cheaters.

Sliding from the business of entertainment to your own personal entertainment, recognition of human actions will eventually make video as searchable as text is now.  Major players in the high tech industry are working on this problem as we speak.  Want to zoom to that specific dance in your wedding video?  Just tell your computer to find “red dress tango” and it will search through the tags it has automatically created for each frame or scene.  And since we’re on the topic of entertainment, it would be unfair to completely ignore one of the most common uses of the web, adult entertainment.  You don’t need me to elaborate on how video search could be valuable to this industry.

Busy beach
I would not want to be a lifeguard here.
Photo by snappybex.

Lastly, one considerable application for retail will be in-store consumer recognition and tracking.  Retailers want to know everything that consumers see and do in their stores.  They already have ways to track how many people come in and out, what products they buy, and even test what they look at on the shelves with eye tracking.  In the future, technology will allow retailers to unobtrusively track a consumer’s every behavior from the time they enter the store to the time they leave and automatically understand every action the consumer has taken, as seen by the in-store video system.  With all this happening automatically, the retailer will be able to analyze and dig through this data to discover consumer trends that we cannot currently understand due to the time-cost of tracking and tagging methods.

That’s as far as I look into my crystal ball for today.  If you know of technologies that already exist that do what I’ve talked about here, leave a comment and impress us all!  Or reach into the back of your desk, pull out your own crystal ball, and tell us what you think the future of these technologies will bring.




My Computer Knows how to French Kiss

10 07 2008

By Christian Laforte and Joshua Koopferstock

You’re watching a feel-good romantic comedy. It comes to that scene: Mr. Hollywood Hunk and Ms. Beverly Hills Perfect10 are staring into each other’s eyes. They lean in. She blinks, slowly. You know what’s going to happen. As they come together for that trite, big-romantic-scene-of-the-movie kiss, you think to yourself, “All these Hollywood rom-coms are exactly the same!”

Your computer watches on silently, intently. It’s trying to learn how to French kiss.

Teaching computers how to recognize human actions is one of the biggest ongoing challenges in the field of computer vision. A new paper shows a first step in the right direction: recognizing specific human actions, such as kissing or answering the phone.

Learning realistic human actions from movies (PDF, video mpg, abstract)
Ivan Laptev,
Marcin Marszalek, Cordelia Schmid, Benjamin Rozenfeld
Published at CVPR 2008

How does it work? First, the authors needed examples of human actions, like actors kissing, answering the phone or getting out of a car.

Examples of actions recognized by Laptev et al:
Kiss, Answer the phone, Get out of a car

They took clips of Hollywood movies and annotated them, e.g. a long kiss starts at 1m53s. Doing this manually for dozens of movies would have been tedious, so the authors developed a technique that combines subtitles and movie scripts (e.g. http://www.dailyscript.com) automatically, a hard problem considering the variety of expressions and the ambiguities in the text. The authors shared their results at http://www.irisa.fr/vista/actions, along with detailed technical explanations.

Automated machine learning and computer vision techniques were then used to recognize what a specific action, like kissing, looks like in successive images. The computer can therefore notice that in most kisses, two regions of the image (e.g. eyes, lips and nose of the actor) are slowly approaching and touching each other.

We will cover interesting potential applications in a future post. In the meantime, here are a few additional technical details.

Technical details

Although the automatic generation of the data sets is itself interesting, I paid particular attention to the video classification problem for action recognition. First, sparse space-time features are detected with a space-time extension of the Harris operator, applied at multiple temporal and spatial scales to further improve accuracy. To characterize the motion and appearance of local features, histograms are computed in the space-time volume surrounding a point feature, somewhat like the way SIFT encodes 2D point features.

A spatio-temporal bag of features (BoF) is built from the features, arranged along several spatio-temporal grids shown empirically to produce good results. A non-linear support vector machine is then used to classify actions amongst the 8 possibilities. Basically, this allows the system to automatically learn which important visual features appear in a given sequence. For example, for the handshake action, we would expect some features to move up and down in time.
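As a rough illustration of the bag-of-features step, the Python sketch below quantizes each local descriptor to its nearest "visual word" and counts the words separately in each cell of a very simple temporal grid. It is a sketch under loud assumptions: the two-word codebook is hand-picked rather than learned, the descriptors are made-up 2D vectors, the grid only splits time in two, and the classifier itself is left out. The function names are hypothetical.

```python
def nearest(codebook, descriptor):
    """Index of the codeword closest (squared Euclidean distance) to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - d) ** 2
                                 for c, d in zip(codebook[k], descriptor)))

def bag_of_features(features, codebook, duration, time_cells=2):
    """Concatenate one word histogram per temporal cell, normalized to sum to 1.

    features: list of (time, descriptor) pairs detected in a clip.
    """
    hist = [[0] * len(codebook) for _ in range(time_cells)]
    for t, descriptor in features:
        cell = min(int(t / duration * time_cells), time_cells - 1)
        hist[cell][nearest(codebook, descriptor)] += 1
    flat = [count for cell in hist for count in cell]
    total = sum(flat) or 1
    return [count / total for count in flat]

# Hypothetical two-word vocabulary: "moving up" and "moving down".
codebook = [(0.0, 1.0), (1.0, 0.0)]

# Made-up clip: features move up in the first half, down in the second half.
features = [(0.1, (0.1, 0.9)), (0.3, (0.0, 1.1)),
            (0.6, (0.9, 0.2)), (0.9, (1.2, 0.1))]

histogram = bag_of_features(features, codebook, duration=1.0)
print(histogram)
```

The resulting vector keeps some timing information ("up, then down") that a single global histogram would lose, which is the reason for arranging features along a grid before handing them to the classifier.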

Examples of spatio-temporal grids

This new technique outperforms previous ones in simple scenes, and works for natural movies with cluttered backgrounds. For the simpler KTH action dataset, the Laptev approach achieves an average classification accuracy of 91.8%, higher than any other published technique.

Actions in the KTH data sets

Action recognition in real-world video is much harder, so unsurprisingly, the accuracy is much lower, varying between 18% and 53% depending on the type of action. Still, this type of approach is promising and it’s reasonable to expect, in a few years, that improved versions will achieve near-human action recognition. What new applications will this technology make possible? We’ll explore the exciting possibilities in a future post.




New Research: Face Recognition from any Angle

3 07 2008

By Christian Laforte

Look directly at the camera, or the computer can’t see you. Or at least, it can’t recognize that it is you. The major drawback of all facial recognition systems in commercial use today is that they require people to face the camera directly, like a passport photo. A newly published paper takes aim at this problem, helping computers better recognize a face in more natural conditions.

Through Tied Factor Analysis, computers can recognize a face seen from the side by comparing the side image with the passport picture, achieving an accuracy of 92%, a major technological leap when compared with 60% in the previous state-of-the-art technique. The algorithm learns how to extrapolate other views by analyzing thousands of pictures of varied people, taken from many angles. The approach is relatively simple to implement and reportedly much faster than other state-of-the-art techniques.

Side photos (bottom) automatically generated from “Passport” photos (top)

This new model assumes very little about the structure of a face, geometry or lighting, so it could easily be adapted to applications such as recognizing vehicles or animals in a semi-controlled environment.

Technical details

The tied factor analysis (TFA) technique uses machine learning techniques such as Expectation Maximization to automatically learn the relationship between frontal and nonfrontal faces, e.g. pictures taken from the side or at an angle.

Along the way, the system automatically learns an identity space to represent a face in a few hundred parameters. This identity space doesn’t vary significantly with pose, angle or lighting, so in theory, all images of an individual would map to the identity position in that space.
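For readers who prefer a formula, the idea can be written roughly as the linear model below. This is my reading of the approach, in notation chosen for this post, so treat the symbols as assumptions rather than the paper's exact formulation.

```latex
x_{ijk} = W_k \, h_i + \mu_k + \epsilon_{ijk},
\qquad \epsilon_{ijk} \sim \mathcal{N}(0, \Sigma_k)
```

Here x_{ijk} is the j-th image of individual i seen in pose k, h_i is that individual's position in the identity space, W_k and \mu_k are a linear map and a mean that depend only on the pose, and \epsilon_{ijk} is noise. The factors are "tied" because the same h_i is shared by every pose of the same person, which is why a frontal and a side image of one individual should map to the same identity position.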

To perform this feat, the researchers started from a large number of pictures, like the 320 individuals from the FERET database taken with multiple poses and angles. These faces were manually altered and annotated to increase accuracy. After running TFA on these pictures, the system can extrapolate from a known frontal face (e.g. passport photo) to an unknown non-frontal face (e.g. side photo), as shown in the image above.

The authors then significantly increased the accuracy by combining 21 TFAs applied around manually-specified positions, identifying standard facial features like the left eye corner in the following figure:

Prince and his colleagues also integrated a relatively simple face part detector inspired by Viola and Jones. This made the system more automatic but reduced the precision in the worst case by 6%. Still, they are confident that this gap could be filled with a more sophisticated detector.

Conclusion

I find these results exciting and promising, but we are still a long way from human-like recognition. Here are the most significant limitations with this approach, along with potential solutions:

Discretized poses: this paper shows how to support many poses, but it doesn’t address how to support arbitrary poses efficiently and accurately, like seeing a face slightly from above or from the bottom. This severely limits the use in non-intrusive monitoring. Building a full 3D model may be better suited to this.

Resistance to occlusions: The paper assumes that the user is cooperative and won’t get partially occluded, e.g. put his hand in front of his face. There are possible solutions that we’ll explore in future posts.

Automatic head pose detection: To make the system completely automatic and unobtrusive, the faces would need to be detected and registered (identified and located) automatically.

Tied factor analysis for face recognition across large pose differences
Simon J.D. Prince, James H. Elder, Jonathan Warrell, Fatima M. Felisberti

IEEE Transactions on pattern analysis and machine intelligence, Vol. 30, No. 6, June 2008
(http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4459336)
You need an IEEE Xplore membership to access the paper.

An older, less complete version of the paper is publicly available: http://www.macs.hw.ac.uk/bmvc2006/papers/292.pdf




Yoowalk.com: Community Web Browsing in 3D

3 07 2008

By Joshua Koopferstock

By turning the web into a 3D environment, Yoowalk.com attempts to make surfing the web a community activity.  Using Flash3D, Yoowalk presents the web as a sort of virtual shopping mall, with each site being a virtual store.

Yoowalk 3D websurfing screenshot

When you go to yoowalk.com,  you are immediately dropped into a virtual world of the internet.  You can choose a geographic region to visit, and then walk around through different areas by subject (news, music, shopping, etc.).  As you walk along, you will see the avatars of others who are walking through the same places that you are.

If going to the shopping mall can be an activity to do with friends, why not websurfing?  The idea of social surfing isn’t a new one.  Another application called Me.dium already acts as a kind of buddy list for your browser.  However, I have yet to see another site bringing together a 3D environment and community surfing the way that Yoowalk has.

So will it work?  At first glance, it looks like the concept needs significant refinement before it can become a mainstream way to browse the web.  For one, when you actually enter a site/building, the way the site appears is completely unfamiliar and confusing.

Yoowalk Google Store

This is Google?

I am personally a big advocate of keeping things familiar; Yoowalk is forcing people to learn a new way to create and see a web page, and I think this is too big of a leap for the company to be taking now.  It would be enough for them to create a way to browse through the web, but then entering the sites should bring up the familiar 2D website.

Furthermore, if the site does get larger, they are going to have millions or billions of websites, which is one awfully big shopping mall.  Effectively organizing something like that must be a huge question mark for them.  The concept needs to get much more focused than it is right now since surfing the web is such a varied activity, even for one individual at different times of the day.  Yoowalk should look to emulate the success of other fun-browsing or community-browsing applications: something like StumbleUpon3D could be an interesting approach without having to teach everyone how to surf the web all over again.

I’m always excited and happy to discover new 3D-based apps on the web, but Yoowalk, like too many others, seems to not be placing enough focus on the specific added value (and the time-costs related to gaining this value) that they are providing to their users through their service.  Then again, this is just their beta phase, so I won’t be making a final judgement on them just yet.
