Feeling Software Going to SIGGRAPH

25 07 2008

By Joshua Koopferstock

SIGGRAPH

Many of us from Feeling Software, including Christian and me, will be attending SIGGRAPH in Los Angeles in a couple of weeks. If any of you would like to sit down and brainstorm some computer vision or 3D-related project ideas, we will be happy to schedule some time during our trip to meet up with you. Send an e-mail to enlighten3d@feelingsoftware.com, which both of us will receive.

Also, if you are exhibiting at SIGGRAPH and read our blog regularly, let us know and we’ll try to come by and say hello!




Semantic texton forests for image categorization and segmentation

24 07 2008

By Christian Laforte

Warning: this post is pretty technical.

I forgot an important newcomer in my earlier post on segmentation algorithms published at CVPR 2008:

Semantic Texton Forests for Image Categorization and Segmentation (PDF, extra results, video)
Jamie Shotton, Matthew Johnson, Roberto Cipolla

Semantic texton forests (STFs) are not a typical segmentation algorithm. Unlike traditional segmentation algorithms that rely primarily on edge information or other low-level image processing, STFs use an ensemble of decision trees built from training examples, e.g. manually segmented images. This allows STFs to understand not only that pixels belong together, but also that they represent, say, a sheep, or some grass. STFs and the related algorithms described in this paper therefore solve two problems: segmentation and categorization.

Other categorization techniques typically rely on feature descriptors or manually tuned filter banks. In contrast, STFs operate directly on pixels, resulting in a very fast algorithm that is relatively simple to implement, especially compared with state-of-the-art classification systems like Marszałek’s.

Once an STF is built, it can be used to identify that a given green pixel in an image is likely to be grass, by examining neighboring pixels in, say, the 21×21 pixels surrounding it. Likewise, a different pixel could be identified as part of a sheep.
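As a rough illustration of the idea, here is a minimal sketch of per-pixel classification with a tiny forest. Everything in it is an assumption made for the example: the trees are written by hand rather than learned from training images, the split test (a difference between two raw pixel values near the pixel) is only meant to be in the spirit of the paper, and the names `probe`, `Split`, `Leaf` and `forest_predict` are hypothetical.

```python
# Toy sketch: classify one pixel with a small forest of decision trees that
# test raw pixel values. All thresholds, offsets and labels are made up.

def probe(image, x, y, dx, dy, channel):
    """Read one colour channel at an offset from (x, y), clamped to the image."""
    height, width = len(image), len(image[0])
    px = min(max(x + dx, 0), width - 1)
    py = min(max(y + dy, 0), height - 1)
    return image[py][px][channel]


class Leaf:
    def __init__(self, class_probs):
        self.class_probs = class_probs  # e.g. {"grass": 0.9, "sheep": 0.1}


class Split:
    def __init__(self, probe_a, probe_b, threshold, low, high):
        self.probe_a = probe_a      # (dx, dy, channel)
        self.probe_b = probe_b      # (dx, dy, channel)
        self.threshold = threshold
        self.low = low              # subtree taken when feature < threshold
        self.high = high            # subtree taken when feature >= threshold


def classify_pixel(tree, image, x, y):
    """Walk one tree from the root to a leaf and return its class distribution."""
    node = tree
    while isinstance(node, Split):
        feature = probe(image, x, y, *node.probe_a) - probe(image, x, y, *node.probe_b)
        node = node.high if feature >= node.threshold else node.low
    return node.class_probs


def forest_predict(forest, image, x, y):
    """Average the leaf distributions of all trees and return (best label, averages)."""
    totals = {}
    for tree in forest:
        for label, p in classify_pixel(tree, image, x, y).items():
            totals[label] = totals.get(label, 0.0) + p / len(forest)
    return max(totals, key=totals.get), totals


# Hypothetical 6x4 RGB image: grass on the left half, sheep on the right half.
GRASS, SHEEP = (30, 160, 40), (230, 230, 225)
IMAGE = [[GRASS] * 3 + [SHEEP] * 3 for _ in range(4)]

# Two hand-written trees: green minus red at the pixel itself, and
# green minus blue at the pixel one step to the right.
FOREST = [
    Split((0, 0, 1), (0, 0, 0), 40,
          Leaf({"grass": 0.1, "sheep": 0.9}), Leaf({"grass": 0.9, "sheep": 0.1})),
    Split((1, 0, 1), (1, 0, 2), 40,
          Leaf({"grass": 0.2, "sheep": 0.8}), Leaf({"grass": 0.8, "sheep": 0.2})),
]

if __name__ == "__main__":
    print(forest_predict(FOREST, IMAGE, 0, 0)[0])
    print(forest_predict(FOREST, IMAGE, 5, 0)[0])
```

A real forest would have many deeper trees whose tests are chosen automatically during training; the sketch only shows why evaluation is cheap: each tree needs a handful of pixel reads and comparisons.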

By computing histograms of semantic textons over a region or a full image, we end up with a higher-level representation called a Bag of Semantic Textons (BoST). We can then perform semantic segmentation that captures context, e.g. the notion that sheep often stand on grass. This greatly increases recognition and segmentation accuracy. The authors explore several optional implementation details and optimizations, and report how much each one does or does not improve segmentation quality.
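The histogram step can be sketched in a few lines. This is a simplified assumption of how such a bag could be computed: it counts only the leaf index reached at each pixel and normalizes the counts, whereas the paper's version is richer. The function name and the half-open `(x0, y0, x1, y1)` region convention are hypothetical.

```python
from collections import Counter


def bag_of_semantic_textons(leaf_ids, region):
    """Normalized histogram of leaf indices inside a rectangular region.

    leaf_ids: 2D list where leaf_ids[y][x] is the leaf reached by pixel (x, y).
    region:   (x0, y0, x1, y1), with x1 and y1 excluded.
    """
    x0, y0, x1, y1 = region
    counts = Counter(
        leaf_ids[y][x] for y in range(y0, y1) for x in range(x0, x1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # empty region: nothing to describe
    return {leaf: n / total for leaf, n in counts.items()}


if __name__ == "__main__":
    leaves = [[0, 0, 1],
              [0, 2, 1]]
    print(bag_of_semantic_textons(leaves, (0, 0, 3, 2)))
```

Two regions with similar histograms contain a similar mix of textons, which is what lets a classifier working on these bags reason about a whole region instead of a single pixel.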

Shotton and his colleagues report that their 8 fps implementation achieves segmentation accuracy of 66.9% on the medium-difficulty MSRC segmentation data set:

Original images (top) from the MSRC data set
and the final categorized segmentation (bottom)

Performance drops significantly when the algorithm is confronted with the much harder VOC 2007 data set: it achieves 24% by itself, or 42% when combined with TKK, a state-of-the-art detector.

Original image (left) from the VOC2007 data set
and the final categorized segmentation (right)

You’ll notice that the segmentation is rough and not pixel-perfect. The authors mention that a much cleaner segmentation could be performed using a Markov or conditional random field to precisely follow image edges.




Trying on clothes in 3D: We have a long way to go.

23 07 2008

By Joshua Koopferstock & Christian Laforte

For you technology lovers who are still kids at heart, Disneyland has recently opened up the Innoventions Dream Home, showcasing cool high tech integration in a futuristic home. What caught my attention was one invention called the Magic Mirror. Getting its name from the mirror in Snow White, this Magic Mirror does not tell you “who is the fairest of them all” (for that you’ll still need HotOrNot.com). What it does do is allow you to virtually try on clothes in your wardrobe. In fact, the Magic Mirror is not a mirror at all; it is a large display monitor with a video camera next to it.

Magic Mirror

Trying on a dress in the Magic Mirror. Photo: cepro.com

While the concept is neat and would probably be even more useful in the department store dressing room than the bedroom, by the looks of it in the video below, the concept is still far from the realism necessary for a technology like this to take off. A few years back, it was thought that by today, virtual clothes shopping would be mainstream, and companies like My Virtual Model had signed contracts with major apparel retailers to integrate their technology into online stores.

It turns out that the technology wasn’t ready, and by the looks of this Magic Mirror, it still has a long way to go.

Here’s how I think they do it:

The dress moves roughly according to the orientation of the head, so they are most likely using a simple real-time head tracker and applying the pose of the head to the top of the dress. The bottom seems to be animated randomly, or maybe through secondary animation.

Later this week I’ll post on a face tracking algorithm that could make this easily possible.

How could we do it better?

One imperfection is very noticeable: the dress doesn’t follow the shoulders and hips properly. Part of it may be anatomical (this is a guy after all), but I think this problem should be easily solved by tracking the silhouette (using background subtraction) and identifying the shoulders and hips using simple heuristics, e.g. areas of low curvature and roughly horizontal or vertical slopes. This would immediately improve the realism of this solution.
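To make the silhouette idea concrete, here is a minimal sketch under strong assumptions: a static camera, a stored grey-level image of the empty scene, and a person standing upright. It swaps in an even cruder heuristic than the curvature and slope cues mentioned above: it looks for the first row where the silhouette suddenly widens, which is roughly where the neck ends and the shoulders begin. The function names, the threshold and the `jump_ratio` parameter are all hypothetical.

```python
def silhouette_mask(frame, background, threshold=30):
    """Mark pixels whose grey level differs enough from the empty-scene background."""
    return [
        [abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def find_shoulder_row(mask, jump_ratio=1.5):
    """Return the first row, scanning top-down, where the silhouette suddenly widens."""
    previous_width = None
    for y, row in enumerate(mask):
        width = sum(row)  # each True counts as 1
        if width == 0:
            continue  # no silhouette pixels on this row
        if previous_width is not None and width >= jump_ratio * previous_width:
            return y
        previous_width = width
    return None


# Hypothetical 8x6 test scene: a 2-pixel-wide "head" above a 6-pixel-wide "body".
BACKGROUND = [[10] * 8 for _ in range(6)]
FRAME = [row[:] for row in BACKGROUND]
for y in (1, 2):
    for x in (3, 4):
        FRAME[y][x] = 200
for y in (3, 4, 5):
    for x in range(1, 7):
        FRAME[y][x] = 200

if __name__ == "__main__":
    print(find_shoulder_row(silhouette_mask(FRAME, BACKGROUND)))
```

On real video this would need noise filtering and a more robust test, since a rounded head or raised arms would easily fool a simple width jump.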

Another improvement would be to track features on the user’s T-shirt, so we can have a better estimate of the body pose, its size and maybe even the person’s sex. I’d start my search with Automatic Non-Rigid 3D Modeling from Video (Torresani and Hertzmann, 2004), since I remember being impressed with the results way back then: it handles occlusions and variations in illumination very nicely. In the picture below, you can see one of the researchers moving his hands in front of his T-shirt… the algorithm can still capture a 3D representation of the deforming T-shirt. Doing this in real-time may be challenging, but fast GPUs and multi-core systems should make it possible.

Still a Long Way to Go

With the method that we have suggested, there is one major sticking point that we have not addressed: content creation.  Assuming you have the ability to accurately track a person and render the image in real time, you still need a way to create the clothes in 3D.  This is not a simple task, as clothes can be highly variable in elasticity, reflectiveness, etc., which would make automation of the modeling process complex.

Before we see these Magic Mirrors in department stores like Sears or Macy’s that have hundreds of thousands of different apparel items each year, a method for automatic creation of clothing content will have to be developed.  And while automatic 3D content creation is going to take great strides in the next couple of years, the quality of 3D reconstruction needed for clothing is still a long way off.




Google Earth in the browser… is it really useful?

17 07 2008

By Christian Laforte

It’s been over a month since Google launched the browser plug-in for Google Earth, combining the best of Google Maps (fast, easy, in the browser, extendable through JavaScript) with the ability to navigate 3D terrain.

So does the 3D capability really add a lot over the regular Google Maps? It’s still too early to tell, but a few examples show the potential applications in education, entertainment and planning.

The most visually compelling example I’ve seen so far comes from Bjorn Sandvik’s ThematicMapping blog:

At a glance, you can see in which countries infant mortality is a critical problem. A color legend alone doesn’t fire the imagination the same way.

Another example, the Google Monster Milktruck, is kind of fun:

This mini-game allows you to drive the milk truck around. Unfortunately, many limitations of the Google Earth engine become apparent, especially the lack of collision detection with walls.

A third example, from GolfNation’s blog, allows you to see a golf course in 3D:

This example demonstrates the main problem with Google Earth right now: for the 3D capability to be worthwhile, we need more 3D content. Trees, cars and buildings look like they are painted on the ground, because we don’t have a 3D representation. We’re basically just looking at 2D data (satellite imagery) from a different perspective.

Google and Microsoft are apparently working hard on this problem of reconstructing buildings and landmarks. Feeling Software is also investing in 3D reconstruction from images… but we’re taking a different strategy that we hope will put us one step ahead of these giants in one promising niche. Incidentally, our Feeling 3D Engine also supports KML and KMZ, along with geo-referencing and geographic measurements.

There are also other 3D GIS (geographic information systems) that work on the web. One interesting example comes from Korea, according to this informative ZDNet article:

If you know of other compelling examples of 3D use in GIS, by all means, reply to this post!




Follow up: Daihatsu also investing in 3D dash displays

16 07 2008

By Joshua Koopferstock

I posted a little while back about car-maker Renault teaming up with Holografika to create 3D displays in their vehicles, but based on the information provided, we were only left to guess what they might be doing with this technology.

Today I came across a very similar story from “way back” in March about Daihatsu (majority-owned by Toyota) working with ProVision, another 3D holographic display developer and vendor.  Happily, Daihatsu is providing us with a bit more information as to what exactly they will use a 3D display for.  According to the article from Reuters:

“The traditional 2D flat screen acts as a traditional dashboard, displaying gauges such as the speedometer and tachometer.  The 3D holographic screen presents warnings and vehicle information as easy-to-understand 3D images to facilitate driver awareness and recognition. These features will be integrated as part of Daihatsu’s OPCS (Omni-directional Pre-Crash Safety Support System) to create the next generation of digital dashboards.”

Unfortunately it looks like we’re going to have to wait until 2012 before we actually see these things in anything besides concept cars, but it’s still interesting to have a preview of what the future of 3D displays might be.




Crystal Ball Feature: Human Tracking and Recognition

15 07 2008

By Joshua Koopferstock

“I’ve fallen and I can’t get up!”  Who could forget this classic tag line for a wearable safety button that gives you a direct line to an emergency response operator, in case, well, you’ve “fallen and can’t get up!”

Modern technology, however, is working toward a more elegant and unobtrusive solution to this problem.  In this first Crystal Ball Feature, I will be exploring future applications of human crowd tracking and action recognition technologies we have recently covered in our posts.

“I’ve fallen and I can’t get up!”

Let’s begin with applications for safety.  Attach a computer that recognizes action to a simple video camera surveillance system, say, in an assisted living home, and you have built yourself an automatic “I’ve fallen and can’t get up” response system, no button pressing required.  Or, install a similar system at pools and beaches to aid overworked lifeguards by providing them a technology that literally can watch hundreds of people at once.

Move over to the entertainment side of things.  What casino manager wouldn’t be interested in computers that can automatically detect cheaters?  Even a system that only worked a fraction of the time might serve as a deterrent, and security companies have already created facial recognition technologies that compare players’ faces to those of known cheaters.

Sliding from the business of entertainment to your own personal entertainment, recognition of human actions will eventually make video as searchable as text is now.  Major players in the high tech industry are working on this problem as we speak.  Want to zoom to that specific dance in your wedding video?  Just tell your computer to find “red dress tango” and it will search through the tags it has automatically created for each frame or scene.  And since we’re on the topic of entertainment, it would be unfair to completely ignore one of the most common uses of the web, adult entertainment.  You don’t need me to elaborate on how video search could be valuable to this industry.

Busy beach
I would not want to be a lifeguard here.
Photo by snappybex.

Lastly, one considerable application for retail will be in-store consumer recognition and tracking.  Retailers want to know everything that consumers see and do in their stores.  They already have ways to track how many people come in and out, what products they buy, and even test what they look at on the shelves with eye tracking.  In the future, technology will allow retailers to unobtrusively track a consumer’s every behavior from the time they enter the store to the time they leave and automatically understand every action the consumer has taken, as seen by the in-store video system.  With all this happening automatically, the retailer will be able to analyze and dig through this data to discover consumer trends that we cannot currently understand due to the time-cost of tracking and tagging methods.

That’s as far as I look into my crystal ball for today.  If you know of technologies that already exist that do what I’ve talked about here, leave a comment and impress us all!  Or reach into the back of your desk, pull out your own crystal ball, and tell us what you think the future of these technologies will bring.




My Computer Knows how to French Kiss

10 07 2008

By Christian Laforte and Joshua Koopferstock

You’re watching a feel-good romantic comedy. It comes to that scene: Mr. Hollywood Hunk and Ms. Beverly Hills Perfect10 are staring into each other’s eyes. They lean in. She blinks, slowly. You know what’s going to happen. As they come together for that trite, big-romantic-scene-of-the-movie kiss, you think to yourself, “All these Hollywood rom-coms are exactly the same!”

Your computer watches on silently, intently. It’s trying to learn how to French kiss.

Teaching computers how to recognize human actions is one of the biggest ongoing challenges in the field of computer vision. A new paper shows a first step in the right direction: recognizing specific human actions, such as kissing or answering the phone.

Learning realistic human actions from movies (PDF, video mpg, abstract)
Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld
Published at CVPR 2008

How does it work? First, the authors needed examples of human actions, like actors kissing, answering the phone or getting out of a car.

Examples of actions recognized by Laptev et al:
Kiss, Answer the phone, Get out of a car

They took clips of Hollywood movies and annotated them, e.g. a long kiss starts at 1m53s. Doing this manually for dozens of movies would have been tedious, so the authors developed a technique that combines subtitles and movie scripts (e.g. http://www.dailyscript.com) automatically, a hard problem considering the variety of expressions and the ambiguities in the text. The authors shared their results at http://www.irisa.fr/vista/actions, along with detailed technical explanations.
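A minimal sketch of the general idea follows, and it is not the authors' procedure: subtitles carry timestamps but no action descriptions, scripts carry action descriptions but no timestamps, so matching the dialogue shared by both lets a scripted action borrow a time window from its neighbouring lines. The fuzzy matching uses Python's standard `difflib`; the data layout, the function names and the `min_ratio` threshold are assumptions made for illustration.

```python
import difflib


def best_subtitle(dialog, subtitles, min_ratio):
    """Return (start, end) of the subtitle most similar to a script dialogue line."""
    best, best_ratio = None, min_ratio
    for start, end, text in subtitles:
        ratio = difflib.SequenceMatcher(None, dialog.lower(), text.lower()).ratio()
        if ratio >= best_ratio:
            best, best_ratio = (start, end), ratio
    return best  # None when nothing is similar enough


def locate_actions(script, subtitles, min_ratio=0.6):
    """Give each scripted action the time gap between its neighbouring dialogue lines.

    script:    list of (kind, text) with kind in {"dialog", "action"}, in script order.
    subtitles: list of (start_seconds, end_seconds, text).
    Returns a list of (action_text, earliest_start, latest_end); a bound is None
    when no matched dialogue exists on that side.
    """
    times = [
        best_subtitle(text, subtitles, min_ratio) if kind == "dialog" else None
        for kind, text in script
    ]
    results = []
    for i, (kind, text) in enumerate(script):
        if kind != "action":
            continue
        before = next((times[j][1] for j in range(i - 1, -1, -1) if times[j]), None)
        after = next((times[j][0] for j in range(i + 1, len(script)) if times[j]), None)
        results.append((text, before, after))
    return results


if __name__ == "__main__":
    script = [
        ("dialog", "I have missed you so much."),
        ("action", "They kiss."),
        ("dialog", "Let's go home."),
    ]
    subtitles = [
        (110.0, 112.5, "I've missed you so much."),
        (118.0, 119.5, "Let's go home."),
    ]
    print(locate_actions(script, subtitles))
```

The hard part the sketch ignores is exactly what the post mentions: scripts and final cuts diverge, wording varies, and a sentence like "they kiss" is only one of many ways a script can describe the same action.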

Automated machine learning and computer vision techniques were then used to recognize what a specific action, like kissing, looks like in successive images. The computer can therefore notice that in most kisses, two regions of the image (e.g. eyes, lips and nose of the actor) are slowly approaching and touching each other.

We will cover interesting potential applications in a future post. In the meantime, here are a few additional technical details.

Technical details

Although the automatic generation of the data sets is itself interesting, I paid particular attention to the video classification problem for action recognition. First, sparse space-time features are detected with a space-time extension of the Harris operator, applied at multiple temporal and spatial scales to further improve accuracy. To characterize the motion and appearance of local features, histograms are computed in the space-time volume surrounding each point feature, somewhat like the way SIFT encodes 2D point features.

A spatio-temporal bag of features (BoF) is built from the features, arranged along several spatio-temporal grids shown empirically to produce good results. A non-linear support vector machine is then used to classify actions among the 8 possibilities. Basically, this allows the system to automatically learn which important visual features appear in a given sequence. For example, for the handshake action, we would expect some features to move up and down over time.
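The bag-of-features step can be sketched as follows, assuming the local features and a visual vocabulary already exist. Each feature is assigned to its nearest vocabulary entry, and one histogram is accumulated per cell of a grid laid over the normalized video volume. The function names, the `(x, y, t, descriptor)` layout and the grid convention are hypothetical, and the SVM that would consume the resulting vector is left out.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[k])),
    )


def grid_bof(features, vocabulary, grid=(1, 1, 2)):
    """Concatenate one visual-word histogram per spatio-temporal grid cell.

    features: list of (x, y, t, descriptor), with x, y, t normalized to [0, 1].
    grid:     (nx, ny, nt), the number of cells along x, y and time.
    """
    nx, ny, nt = grid
    k = len(vocabulary)
    hist = [0.0] * (nx * ny * nt * k)
    for x, y, t, descriptor in features:
        cx = min(int(x * nx), nx - 1)  # clamp so that a coordinate of 1.0 stays in range
        cy = min(int(y * ny), ny - 1)
        ct = min(int(t * nt), nt - 1)
        cell = (ct * ny + cy) * nx + cx
        hist[cell * k + nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist


if __name__ == "__main__":
    vocabulary = [[0.0, 0.0], [1.0, 1.0]]       # two made-up visual words
    features = [
        (0.5, 0.5, 0.1, [0.1, 0.0]),            # early in the clip, near word 0
        (0.5, 0.5, 0.9, [0.9, 1.0]),            # late in the clip, near word 1
        (0.2, 0.2, 0.8, [1.0, 0.8]),            # late in the clip, near word 1
    ]
    print(grid_bof(features, vocabulary, grid=(1, 1, 2)))
```

With a `(1, 1, 2)` grid the vector keeps the first and second halves of the clip apart, so "word 1 appears late" becomes something a classifier can pick up on; a `(1, 1, 1)` grid reduces to a plain, orderless bag of features.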

Examples of spatio-temporal grids

This new technique outperforms previous ones in simple scenes, and works for natural movies with cluttered backgrounds. For the simpler KTH action dataset, the Laptev approach achieves an average classification accuracy of 91.8%, higher than any other published technique.

Actions in the KTH data sets

Action recognition in real-world video is much harder, so unsurprisingly, the accuracy is much lower, varying between 18% and 53% depending on the type of action. Still, this type of approach is promising, and it’s reasonable to expect that improved versions will achieve near-human action recognition in a few years. What new applications will this technology make possible? We’ll explore the exciting possibilities in a future post.
