By Christian Laforte and Joshua Koopferstock
You’re watching a feel-good romantic comedy. It comes to that scene: Mr. Hollywood Hunk and Ms. Beverly Hills Perfect10 are staring into each other’s eyes. They lean in. She blinks, slowly. You know what’s going to happen. As they come together for that trite, big-romantic-scene-of-the-movie kiss, you think to yourself, “All these Hollywood rom-coms are exactly the same!”
Your computer watches on silently, intently. It’s trying to learn how to French kiss.
Teaching computers how to recognize human actions is one of the biggest ongoing challenges in the field of computer vision. A new paper shows a first step in the right direction: recognizing specific human actions, such as kissing or answering the phone.
Learning realistic human actions from movies
Ivan Laptev, Marcin Marszalek, Cordelia Schmid, Benjamin Rozenfeld
Published at CVPR 2008
How does it work? First, the authors needed examples of human actions, like actors kissing, answering the phone or getting out of a car.

Examples of actions recognized by Laptev et al.:
Kiss, Answer the phone, Get out of a car
They took clips of Hollywood movies and annotated them, e.g. a long kiss starts at 1m53s. Doing this manually for dozens of movies would have been tedious, so the authors developed a technique that automatically aligns subtitles with movie scripts (e.g. from http://www.dailyscript.com). Subtitles carry timestamps but only dialogue, while scripts describe the action but carry no timing, so matching the dialogue in both lets the described actions be located in time. This is a hard problem considering the variety of expressions and the ambiguities in the text. The authors shared their results at http://www.irisa.fr/vista/actions, along with detailed technical explanations.
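To make the alignment idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the script has already been split into speech lines and scene descriptions, matches speech lines to subtitles by word overlap, and gives each scene description the time gap between its neighbouring matched speeches. The function names, the input layout and the 0.6 threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def similarity(a, b):
    """Word-level similarity in [0, 1] between two lines of dialogue."""
    words_a = re.findall(r"[a-z']+", a.lower())
    words_b = re.findall(r"[a-z']+", b.lower())
    return SequenceMatcher(None, words_a, words_b).ratio()


def align_script(script, subtitles, min_ratio=0.6):
    """Attach time intervals to the scene descriptions of a script.

    script:    list of ("speech" | "scene", text), in script order.
    subtitles: list of (start_sec, end_sec, text), in time order.
    Returns a list of (start_sec, end_sec, scene_text).
    """
    # Step 1: match each speech line to its most similar subtitle,
    # only ever moving forward in the subtitle list.
    times = [None] * len(script)
    next_sub = 0
    for i, (kind, text) in enumerate(script):
        if kind != "speech":
            continue
        best, best_j = 0.0, None
        for j in range(next_sub, len(subtitles)):
            ratio = similarity(text, subtitles[j][2])
            if ratio > best:
                best, best_j = ratio, j
        if best_j is not None and best >= min_ratio:
            times[i] = subtitles[best_j][:2]
            next_sub = best_j + 1

    # Step 2: a scene description falls between the end of the previous
    # matched speech and the start of the next matched speech.
    located = []
    for i, (kind, text) in enumerate(script):
        if kind != "scene":
            continue
        before = next((times[k] for k in range(i - 1, -1, -1) if times[k]), None)
        after = next((times[k] for k in range(i + 1, len(script)) if times[k]), None)
        if before and after:
            located.append((before[1], after[0], text))
    return located
```

A scene description such as "They lean in and kiss." sandwiched between two matched lines of dialogue would come back with the interval between those two subtitles. Real scripts drift from the final cut, which is why a similarity threshold is used rather than exact matching.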
Machine learning and computer vision techniques were then used to learn what a specific action, like kissing, looks like across successive images. The computer can therefore notice that in most kisses, two regions of the image (e.g. the eyes, lips and nose of each actor) slowly approach and touch each other.
We will cover interesting potential applications in a future post. In the meantime, here are a few additional technical details.
Technical details
Although the automatic generation of the data sets is interesting in itself, I paid particular attention to the video classification problem for action recognition. First, sparse space-time features are detected with a space-time extension of the Harris operator, applied at multiple spatial and temporal scales to further improve accuracy. To characterize the motion and appearance of local features, histograms are computed in the space-time volume surrounding each point feature, much as SIFT encodes 2D point features.
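For the curious, the core of a space-time Harris detector can be sketched in a few lines. The sketch below assumes the form usually given for this operator: a 3x3 second-moment matrix μ built from the image gradients (Lx, Ly, Lt) around a point, and a response H = det(μ) − k·trace(μ)³ that is high where the video changes in all three directions at once. The plain averaging (instead of Gaussian weighting), the value k = 0.005 and the function names are assumptions made for illustration, not details taken from the post.

```python
def second_moment_matrix(gradients):
    """Average the outer products of (Lx, Ly, Lt) gradient samples.

    A real detector would weight samples with a Gaussian window;
    a plain mean is used here for brevity (an assumption).
    """
    n = len(gradients)
    mu = [[0.0] * 3 for _ in range(3)]
    for g in gradients:
        for i in range(3):
            for j in range(3):
                mu[i][j] += g[i] * g[j] / n
    return mu


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def harris3d_response(mu, k=0.005):
    """Space-time corner response: det(mu) - k * trace(mu)^3."""
    trace = mu[0][0] + mu[1][1] + mu[2][2]
    return det3(mu) - k * trace ** 3
```

Gradients that point in only one direction (a static edge, say) give a determinant of zero and therefore a negative response, while gradients spread across x, y and t give a positive one. Interest points would be taken at local maxima of this response.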
A spatio-temporal bag of features (BoF) is then built from these features, arranged along several spatio-temporal grids that were shown empirically to produce good results. A non-linear support vector machine then assigns each clip to one of the 8 action classes. Basically, this allows the system to learn automatically which visual features matter for a given action. For example, for the handshake action, we would expect some features to move up and down over time.

Examples of spatio-temporal grids
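A spatio-temporal grid is easier to picture with a little code. The sketch below assumes each detected feature has already been assigned to a "visual word" (an index into a vocabulary, typically obtained by clustering descriptors) and has coordinates normalised to [0, 1). It counts the words separately in each grid cell and concatenates the counts into one vector. The function name, the input layout and the normalisation are illustrative assumptions, not the paper's exact recipe.

```python
def grid_bof(features, vocab_size, nx=1, ny=1, nt=1):
    """Concatenated visual-word histograms, one per grid cell.

    features: list of (x, y, t, word), with x, y, t normalised to [0, 1)
              and word an integer index into the visual vocabulary.
    nx, ny, nt: number of cells along the x, y and time axes.
    """
    hist = [0.0] * (nx * ny * nt * vocab_size)
    for x, y, t, word in features:
        # Find the cell containing this feature (clamped at the border).
        cx = min(int(x * nx), nx - 1)
        cy = min(int(y * ny), ny - 1)
        ct = min(int(t * nt), nt - 1)
        cell = (ct * ny + cy) * nx + cx
        hist[cell * vocab_size + word] += 1.0
    # Normalise so that clips with different feature counts are comparable.
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

With nx = ny = nt = 1 this is an ordinary bag of features that ignores where and when each feature occurred. Setting nt = 2 splits the clip into a first and second half, so the classifier can tell "hand goes up, then down" from the reverse.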
This new technique outperforms previous ones in simple scenes, and also works for natural movies with cluttered backgrounds. On the simpler KTH action dataset, the Laptev et al. approach achieves an average classification accuracy of 91.8%, higher than any previously published technique.

Actions in the KTH data sets
Action recognition in real-world video is much harder, so unsurprisingly the accuracy is much lower, varying between 18% and 53% depending on the type of action. Still, this type of approach is promising, and it’s reasonable to expect that improved versions will achieve near-human action recognition within a few years. What new applications will this technology make possible? We’ll explore the exciting possibilities in a future post.