Create 3D Models from Photos

12 09 2008

By Joshua Koopferstock

If creating 3D models were as easy as taking photos, it is safe to say that 3D would be far more widespread than it is today.  From e-commerce to virtual tourism to casual games, reducing the cost and complexity of creating 3D models would have a broad effect on multiple industries.

Feeling Software is making that possible.  Over the last 2 years, we have worked to develop a technology that allows anyone to create 3D models with little effort and no training.  Our goal: simplicity.  You take a bunch of photos with a regular camera from any angle you please, and we automatically create a 3D model.  The demo video below discusses our project in detail.


Feeling Software Demo from joshk on Vimeo.

We have thought of a variety of ways that this technology can be applied to solve problems for consumers.  For our readers, imagine that you could take photos of an object or scene, press a button, and instantly have a high-quality 3D model of that object or scene.  If this technology were available today, how would you use it?




DARPA Binoculars Use Haze to See Farther

28 08 2008

By Joshua Koopferstock

Being able to see through haze would be neat enough in its own right, but DARPA-funded scientists are going one step beyond that and using the haze to actually see further than they would be able to see if it wasn’t there.  More specifically, these researchers believe that the shimmering of heat “waves” can in fact be used as a lens, with the right image recognition technology.

Photo by Keirn

The goal? 90% accurate facial recognition at 1km with a 6cm lens.  As the system combines the data from multiple images, you will not be able to see 1km away in real time; the aim is 1 frame per second.

What I like about this approach is that it takes something that is impeding the goal — haze blocks your ability to see far — and turns it into an improvement — haze helps you see farther!  More problems should be solved this way.

View the technical presentation

Source: New Scientist




Amazing Inspiration for Computer Vision

27 08 2008

By Joshua Koopferstock

These product concept designs by Mac Funamizu, a Japanese graphic designer, are among the most amazing applications of computer vision technology that I have come across. What I found especially inspiring is that some of the technology that we are working on at Feeling Software will be key to making a concept such as this a reality. And without too much effort, I believe that just about everyone reading this blog can see how their own work in 3D or computer vision will be a necessary building block to make this possible.  I hope Mr. Funamizu’s concept work fires the imagination of many people regarding the possibilities and usefulness of augmented reality.

“Future of Internet Search: Mobile Version” Product Concept


This Photosynth-esque approach (above) shows you other photos of the same scene you are looking at, from the same angle.  Here, the designer demonstrates looking at the scene in front of you over time through historical photographs.

Image recognition + mobile internet + Wikipedia?

Text recognition + mobile internet + babel fish?  I wish I had that when I was trying to decipher menus in Slovakia!

See more possibilities for this concept at Mac Funamizu’s blog, petitinvention.




A computer vision system arrested my wife

3 08 2008

By Christian Laforte

Panic and surprise

Three days ago I received a panicked call from my wife. She had been arrested while driving on the highway near the office in Montreal, Canada with our 10-month-old daughter. I ran to the scene and was told by the policeman that my wife had been driving safely, but we had neglected to renew our license plate on time. We had to accompany him to the police station and pay $600 in fines and towing charges.

Not the actual scene, but you get the idea.

How could this happen? We always pay our bills right away. We notified the government of our new address before moving apartment last year. But more interesting to the readers of this blog, how did the police identify my wife’s car out of the dozens that pass every minute on the highway?

The policeman — let’s call him Joe — gave me a lift to the traffic authorities, and explained how this all works. A real-time license plate scanner is installed on a patrol car on the side of the highway. Using an active light source and high-speed cameras, it tracks every license plate that passes and compares it against an on-board database, updated once a week through a USB key. The device costs $25,000.

“Isn’t that expensive?”, I asked Joe.

“Listen to the radio… They just arrested a guy who already lost his permit. He was driving a car with an expired license plate and he was wanted for petty crimes and unpaid parking tickets. He’s looking at a fine of at least $900, plus the old parking tickets. We would have never caught the guy otherwise. No wonder the big boss wants to equip at least 100 cars with the device by the end of year.”

Frustration instantly switched into interest (and a bit of envy)

That’s a market of $2.5M for a small city like Montreal. A great market for a computer vision technology, with a lot of potential growth in years to come.

Until now, I’ve always been an optimistic proponent of computer vision technologies. I wasn’t too worried about privacy. Being arrested certainly gave me a fresh perspective, especially because, as it turns out, the government admitted having a bug in their address change software, which explained why we never got the license plate renewal notice.

Anyway, I still love computer vision and this is a cool technology, so let’s explore how it works and how it could be improved.

Description of the system

I haven’t seen the system, but on the spot I asked Joe a lot of questions to get a better idea. The device is bolted on the roof of another patrol car, stopped on the highway. It has two cameras and one intense red light source, like those used in barcode scanners. The cameras and the light source are tuned to focus on highly reflective surfaces, like a clean license plate. It can be fooled if the license plate is dirty or if there are other highly reflective surfaces in the field of view, e.g. a policeman’s badge or, I assume, the sun reflecting toward the camera. Otherwise the system appears quite robust: it works night and day, it can deal with partial occlusions of the license plate, and it can read multiple license plates in the same image.

Limitations of the system

- The database is only updated once a week. People can get arrested more than once even though they paid the fine.

- The device only scans plates. It cannot recognize a stolen car with a valid plate. As Joe explained, organized criminals are smart: they wouldn’t risk getting arrested with a false or expired plate.

- The device, I presume, can be fooled easily by adding a filter (e.g. transparent film or grease) on the plate to absorb the red wavelength, or by adding a mirror next to it to distract the cameras. To the human eye, the plate would look fine, but it would no longer be detected by the device.

- Joe explained that, if the driver were to speed away, he probably couldn’t do anything. The police no longer engage in speed chases since it’s too dangerous for the police and the general public. They have a hard time tracking dangerous drivers that speed away.
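The first limitation in the list above is easy to reproduce in a toy model. The sketch below assumes the scanner keeps a local set of flagged plates that only changes at sync time; the class name, the fields and the plate values are all hypothetical and are not taken from the actual device.

```python
from datetime import date


def normalize(plate):
    """Uppercase the plate and drop spaces or dashes so lookups are consistent."""
    return "".join(ch for ch in plate.upper() if ch.isalnum())


class OnBoardSnapshot:
    """Hypothetical stand-in for the scanner's on-board database."""

    def __init__(self, flagged_plates, synced_on):
        self.flagged = {normalize(p) for p in flagged_plates}
        self.synced_on = synced_on

    def is_flagged(self, plate):
        return normalize(plate) in self.flagged

    def age_in_days(self, today):
        return (today - self.synced_on).days


# Hypothetical central records: the owner paid after the last weekly sync,
# so the plate is no longer flagged there.
central_flagged = set()
snapshot = OnBoardSnapshot(["ABC 123"], synced_on=date(2008, 7, 28))

today = date(2008, 8, 3)
# The car is stopped again because the local copy is out of date.
stale_hit = snapshot.is_flagged("abc-123") and normalize("abc-123") not in central_flagged
print(stale_hit, snapshot.age_in_days(today))
```

Any lookup against a snapshot has this window of staleness; shortening it means syncing more often or querying the central records directly.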

Clearly, recognizing a license plate is too simplistic. Pretty soon, criminals will know how to fool the system and only honest people like my wife will be apprehended.

A better solution

For a device like this to be truly useful, it would first need to be connected to the central station database. Just plug it into a cellular network, e.g. using an iPhone or Android. With a fast enough connection, the video stream could be uploaded, recorded and processed in a central server farm. This could vastly reduce the size and cost of the device and increase the recognition capability of the overall system. The cheaper system could be installed on every patrol car or traffic light. A dangerous driver speeding away could be tracked across the city and apprehended when he finally stops.

Using high-resolution cameras, it would be pretty easy to recognize a car’s color, brand and year from the video stream: all you need is a database of logos and a good feature detector. Getting this to run in real time would be challenging, but I’m confident this can be achieved given a year or two of development. Looking at the car as a whole would help identify stolen vehicles.

Pushing this further, cars could be tracked across an entire city, e.g. London with its network of surveillance cameras. Criminals could be followed to their lair hours after a crime is reported. Hopefully the people in charge will rethink the overall process so honest people aren’t harassed or tracked without a good reason.

(Note: this is a draft of the post. I haven’t had the time to research the solution, but I’m posting it early anyway since the Washington Post and Slashdot just featured a similar story.)




Advanced real-time facial tracking ready to leave the labs

24 07 2008

By Christian Laforte and Joshua Koopferstock

The goal of facial tracking is to recognize elements of a face in an image, and to follow them in a series of images. It may sound simple, but in reality it’s so complex that the human brain evolved a special area just for this task. Some unfortunate folks are born without that area and can’t even recognize themselves in the mirror.

Is this me? Welcome to Prosopagnosia
(original image)

AAM (Active Appearance Model) is the best family of facial tracking algorithms out there right now. The technique was first published by Cootes, Edwards, and Taylor in 2001, then heavily optimized and extended by a group of CMU researchers, primarily Iain Matthews, Simon Baker and Ralph Gross.

Here’s why this algorithm rocks:

  • It’s fast: 300Hz on a regular desktop PC.
  • It’s robust. It can deal with occlusions, e.g. sunglasses.
  • It’s relatively straightforward to implement.
  • It requires no special calibration for the user.

First, statistical model…

To do its magic, AAM must be taught what faces look like in various conditions. To achieve this, hundreds of images of faces must be annotated by human operators. These faces display a wide range of conditions including different races, different expressions, illumination, etc. For each image, someone must manually mark special points, e.g. tip of the nose, to build a mesh:

This training data is converted into a statistical model of face shapes and appearances. This is a tedious process, but once it works for a few faces, the rest of the algorithm can be used to “bootstrap” other faces, so adding new examples becomes faster with time.

… then track the face.

Once we have our statistical model, tracking can be performed in real time by fitting the model to an image, such as a frame from a live video. Technically speaking, this is a non-linear optimization problem: minimize the error between the image and the model. Because the problem is non-linear, we need a good first estimate and a robust fitting algorithm; otherwise the tracking gets stuck in the wrong part of the image, and a face will be detected in some guy’s ear.

Having a good first estimate used to be the hardest part. Basically, we need to tell AAM roughly where to look. Five years ago this would have required a special face detection algorithm, making the system twice as complicated to implement. Now that AAM is super fast, it’s probably simpler to just run it randomly in the image until we catch a face.
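To make the "good first estimate" point concrete, here is a deliberately tiny sketch. It is not AAM: it slides a 1D template along a 1D signal and greedily walks downhill on the squared error. The signal, the template and all function names are invented for illustration. A start near the weak lookalike gets stuck there, while trying several starts and keeping the best fit finds the true match.

```python
def fit_error(signal, template, shift):
    """Sum of squared differences between the template and the signal at a shift."""
    return sum((signal[shift + i] - t) ** 2 for i, t in enumerate(template))


def local_fit(signal, template, start):
    """Step to a neighbouring shift while that lowers the error; stop at a local minimum."""
    last_shift = len(signal) - len(template)
    shift = start
    while True:
        neighbours = [s for s in (shift - 1, shift + 1) if 0 <= s <= last_shift]
        best = min(neighbours, key=lambda s: fit_error(signal, template, s))
        if fit_error(signal, template, best) >= fit_error(signal, template, shift):
            return shift
        shift = best


def multi_start_fit(signal, template, starts):
    """Run the local fit from several starting shifts and keep the lowest-error result."""
    fits = [local_fit(signal, template, s) for s in starts]
    return min(fits, key=lambda s: fit_error(signal, template, s))


template = [0, 2, 5, 2, 0]
signal = [0] * 30
signal[5:8] = [1, 3, 1]    # weak lookalike (the "ear")
signal[19:22] = [2, 5, 2]  # true match (the "face")

print(local_fit(signal, template, start=3))               # stuck on the lookalike
print(local_fit(signal, template, start=16))              # reaches the true match
print(multi_start_fit(signal, template, range(0, 26, 3)))
```

The multi-start version only pays off when a single fit is cheap, which is the argument made above for simply running a fast AAM from many places in the image.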

AAM improved by CMU researchers

Prior to the CMU papers, AAM was promising but not robust or fast enough for practical applications. The CMU researchers invented a fitting technique called the inverse compositional image alignment algorithm. Basically, they inverted one key step of the original algorithm (comparing the image against the model), which allowed them to perform some expensive calculations much less frequently. The end result was a much faster and more robust algorithm, capable of running hundreds of times per second.

The CMU researchers then further improved AAM to deal with occlusions (i.e. partially hidden face) and to track a 3D face instead of a 2D approximation.

Regular AAM on the left,
improved AAM to deal with occlusions on the right
(video)

Surprisingly, when the CMU researchers extended AAM in 3D, using one or more cameras, the extension not only produced precise 3D results, it also ran faster and more robustly!

3D AAM in action (video)

From lab to the real world

Tons of cool applications could leverage this algorithm. So why hasn’t it happened yet? I think the primary reason is that most people just don’t know it’s possible.

The time has come for AAM to leave the lab and make the real world a more technologically advanced place. If you have a good application idea and funding to bring this to market, give us a call and we’ll be happy to help and pitch in!




Semantic texton forests for image categorization and segmentation

24 07 2008

By Christian Laforte

Warning: this post is pretty technical.

I forgot an important newcomer in my earlier post on segmentation algorithms published at CVPR 2008:

Semantic Texton Forests for Image Categorization and Segmentation (PDF, extra results, video)
Jamie Shotton, Matthew Johnson, Roberto Cipolla

Semantic texton forests (STFs) are not a typical segmentation algorithm. Unlike traditional segmentation algorithms that rely primarily on edge information or other low-level image processing, STFs use an ensemble of decision trees built from training examples, e.g. manually segmented images. This allows STFs to understand not only that pixels belong together, but also that they represent, say, a sheep, or some grass. STFs and the related algorithms described in this paper therefore solve two problems: segmentation and categorization.

Other categorization techniques typically rely on feature descriptors or manually tuned filter banks. In contrast, STFs operate directly on pixels, resulting in a very fast algorithm that is relatively simple to implement, especially compared with state-of-the-art classification systems like Marszałek’s.

Once an STF is built, it can be used to identify that a given green pixel in an image is likely to be grass, by examining neighboring pixels in, say, the 21×21 pixels surrounding it. Likewise, a different pixel could be identified as part of a sheep.

By computing histograms of STFs in a region or a full image, we end up with higher-level Bags of Semantic Textons (BoSTs). We can then perform semantic segmentation, e.g. capture the notion that sheep often stand on grass. This greatly increases the recognition and segmentation accuracy. The authors explore several optional implementation details and optimizations, and report how much each of them does or does not improve the segmentation quality.
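As a rough illustration of the two ideas above (trees that test raw pixels around a location, and histograms gathered over a region), here is a toy sketch. The two trees are written by hand rather than learned, the image is synthetic, the histogram counts leaves only, and every name is hypothetical; it shows the data flow, not the paper’s implementation.

```python
from collections import Counter


class Leaf:
    def __init__(self, leaf_id, class_probs):
        self.leaf_id = leaf_id
        self.class_probs = class_probs


class Split:
    """Tests one colour channel of one pixel at a fixed offset from the centre pixel."""

    def __init__(self, dy, dx, channel, threshold, below, above):
        self.dy, self.dx, self.channel = dy, dx, channel
        self.threshold, self.below, self.above = threshold, below, above


def reach_leaf(tree, image, y, x):
    """Walk one tree from the root to a leaf for the pixel at (y, x)."""
    node = tree
    while isinstance(node, Split):
        py = min(max(y + node.dy, 0), len(image) - 1)  # clamp at the image border
        px = min(max(x + node.dx, 0), len(image[0]) - 1)
        value = image[py][px][node.channel]
        node = node.above if value > node.threshold else node.below
    return node


def classify_pixel(forest, image, y, x):
    """Sum the class distributions of the leaves reached in every tree."""
    totals = Counter()
    for tree in forest:
        for label, p in reach_leaf(tree, image, y, x).class_probs.items():
            totals[label] += p
    return totals.most_common(1)[0][0]


def bag_of_textons(forest, image, region):
    """Histogram of leaves reached by all pixels in region (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = region
    return Counter(
        reach_leaf(tree, image, y, x).leaf_id
        for tree in forest
        for y in range(y0, y1)
        for x in range(x0, x1)
    )


WHITE, GREEN = (230, 230, 230), (40, 160, 40)
image = [[WHITE] * 8 for _ in range(4)] + [[GREEN] * 8 for _ in range(4)]

# Offsets stay well inside the 21x21 window (at most 10 pixels from the centre).
forest = [
    Split(0, 0, 0, 128,
          Leaf(0, {"grass": 0.9, "sheep": 0.1}), Leaf(1, {"grass": 0.1, "sheep": 0.9})),
    Split(-2, 0, 2, 128,
          Leaf(2, {"grass": 0.8, "sheep": 0.2}), Leaf(3, {"grass": 0.3, "sheep": 0.7})),
]

print(classify_pixel(forest, image, 6, 3), classify_pixel(forest, image, 1, 3))
print(bag_of_textons(forest, image, (4, 0, 8, 8)))
```

The region histogram is what a second-stage classifier would consume; here it simply records that the lower half of the image mostly lands in "grass-like" leaves.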

Shotton and his colleagues report that their 8 fps implementation achieves segmentation accuracy of 66.9% on the medium-difficulty MSRC segmentation data set:

Original images (top) from the MSRC data set
and the final categorized segmentation (bottom)

The performance drops significantly when the algorithm is confronted with the much harder VOC 2007 data set, performing at 24% by itself, or 42% when combined with TKK, a state-of-the-art detector.

Original image (left) from the VOC2007 data set
and the final categorized segmentation (right)

You’ll notice that the segmentation is rough and not pixel-perfect. The authors mention that a much cleaner segmentation could be performed using a Markov or conditional random field to precisely follow image edges.




Trying on clothes in 3D: We have a long way to go.

23 07 2008

By Joshua Koopferstock & Christian Laforte

For you technology lovers who are still kids at heart, Disneyland has recently opened up the Innoventions Dream Home, showcasing cool high tech integration in a futuristic home. What caught my attention was one invention called the Magic Mirror. Getting its name from the mirror in Snow White, this Magic Mirror does not tell you “who is the fairest of them all” (for that you’ll still need HotOrNot.com). What it does do is allow you to virtually try on clothes in your wardrobe. In fact, the Magic Mirror is not a mirror at all; it is a large display monitor with a video camera next to it.

Magic Mirror

Trying on a dress in the Magic Mirror. Photo: cepro.com

While the concept is neat and would probably be even more useful in the department store dressing room than the bedroom, by the looks of it in the video below, the concept is still far from the realism necessary for a technology like this to take off. A few years back, it was thought that by today, virtual clothes shopping would be mainstream, and companies like My Virtual Model had signed contracts with major apparel retailers to integrate their technology into online stores.

It turns out that the technology wasn’t ready, and by the looks of this Magic Mirror, it still has a long way to go.

Here’s how I think they do it:

The dress moves roughly according to the orientation of the head, so they are most likely using a simple real-time head tracker and applying the pose of the head to the top of the dress. The bottom seems to be animated randomly, or maybe through secondary animation.

Later this week I’ll post on a face tracking algorithm that could make this easily possible.

How could we do it better?

One imperfection is very noticeable: the dress doesn’t follow the shoulders and hips properly. Part of it may be anatomical (this is a guy after all), but I think this problem should be easily solved by tracking the silhouette (using background subtraction) and identifying the shoulders and hips using simple heuristics, e.g. areas of low curvature and roughly horizontal or vertical slopes. This would immediately improve the realism of this solution.
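Here is a minimal sketch of the silhouette idea, assuming a static background image and grayscale frames stored as lists of rows. The shoulder guess uses an even cruder heuristic than the curvature test suggested above: it picks the row where the silhouette widens the most. The threshold, the function names and the tiny test image are all made up for illustration.

```python
def silhouette_mask(background, frame, threshold=30):
    """Mark pixels whose brightness differs from the background by more than threshold."""
    return [
        [abs(f - b) > threshold for f, b in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]


def row_widths(mask):
    """Number of silhouette pixels in each row, from top to bottom."""
    return [sum(row) for row in mask]


def shoulder_row(widths):
    """Crude guess: the row where the silhouette widens the most versus the row above."""
    jumps = [widths[i] - widths[i - 1] for i in range(1, len(widths))]
    return 1 + max(range(len(jumps)), key=lambda i: jumps[i])


# Synthetic 6x7 scene: a narrow head, wide shoulders, then a narrower torso.
background = [[100] * 7 for _ in range(6)]
person_columns = [range(3, 4), range(3, 4), range(1, 6), range(1, 6), range(2, 5), range(2, 5)]
frame = [row[:] for row in background]
for y, columns in enumerate(person_columns):
    for x in columns:
        frame[y][x] = 200
frame[0][0] = 110  # small lighting change that stays under the threshold

widths = row_widths(silhouette_mask(background, frame))
print(widths, shoulder_row(widths))
```

A real camera feed would need a background model that adapts to lighting changes, but the per-row width profile is already enough to anchor the top of a virtual dress.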

Another improvement would be to track features on the user’s T-shirt, so we can have a better estimate of the body pose, its size and maybe even the person’s sex. I’d start my search with Automatic Non-Rigid 3D Modeling from Video (Torresani and Hertzmann, 2004), since I remember being impressed with the results way back then: it handles occlusions and variations in illumination very nicely. In the picture below, you can see one of the researchers moving his hands in front of his T-shirt… the algorithm can still capture a 3D representation of the deforming T-shirt. Doing this in real-time may be challenging, but fast GPUs and multi-core systems should make it possible.

Still a Long Way to Go

With the method that we have suggested, there is one major sticking point that we have not addressed: content creation.  Assuming you have the ability to accurately track a person and render the image in real time, you still need a way to create the clothes in 3D.  This is not a simple task, as clothes can be highly variable in elasticity, reflectiveness, etc., which would make automation of the modeling process complex.

Before we see these Magic Mirrors in department stores like Sears or Macy’s that have hundreds of thousands of different apparel items each year, a method for automatic creation of clothing content will have to be developed.  And while automatic 3D content creation is going to take great strides in the next couple of years, the quality of 3D reconstruction needed for clothing is still a long way off.




My Computer Knows how to French Kiss

10 07 2008

By Christian Laforte and Joshua Koopferstock

You’re watching a feel-good romantic comedy. It comes to that scene: Mr. Hollywood Hunk and Ms. Beverly Hills Perfect10 are staring into each other’s eyes. They lean in. She blinks, slowly. You know what’s going to happen. As they come together for that trite, big-romantic-scene-of-the-movie kiss, you think to yourself, “All these Hollywood rom-coms are exactly the same!”

Your computer watches on silently, intently. It’s trying to learn how to French kiss.

Teaching computers how to recognize human actions is one of the biggest ongoing challenges in the field of computer vision. A new paper shows a first step in the right direction: recognizing specific human actions, such as kissing or answering the phone.

Learning realistic human actions from movies (PDF, video mpg, abstract)
Ivan Laptev,
Marcin Marszalek, Cordelia Schmid, Benjamin Rozenfeld
Published at CVPR 2008

How does it work? First, the authors needed examples of human actions, like actors kissing, answering the phone or getting out of a car.

Examples of actions recognized by Laptev et al:
Kiss, Answer the phone, Get out of a car

They took clips of Hollywood movies and annotated them, e.g. a long kiss starts at 1m53s. Doing this manually for dozens of movies would have been tedious, so the authors developed a technique that combines subtitles and movie scripts (e.g. http://www.dailyscript.com) automatically, a hard problem considering the variety of expressions and the ambiguities in the text. The authors shared their results at http://www.irisa.fr/vista/actions, along with detailed technical explanations.
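To give a feel for why combining the two sources helps, here is a small sketch: subtitles carry timestamps but no actions, scripts describe actions but carry no timestamps, so matching the dialogue lines lets an action inherit a time window. This is only my guess at the general idea, not the authors’ method; the data layout, the fuzzy-match threshold and the sample lines are all assumptions.

```python
from difflib import SequenceMatcher


def best_subtitle(script_line, subtitles, min_ratio=0.6):
    """Return the (start, end, text) subtitle closest to a script line, or None."""
    def score(subtitle):
        return SequenceMatcher(None, script_line.lower(), subtitle[2].lower()).ratio()

    best = max(subtitles, key=score)
    return best if score(best) >= min_ratio else None


def time_actions(script, subtitles, min_ratio=0.6):
    """Give each scripted action the gap between the surrounding matched dialogue lines."""
    timed, pending, last_end = [], [], None
    for kind, text in script:
        if kind == "action":
            pending.append(text)
            continue
        match = best_subtitle(text, subtitles, min_ratio)
        if match is None:
            continue  # dialogue that was cut or rewritten: no timing information
        start, end, _ = match
        timed.extend((action, last_end, start) for action in pending)
        pending, last_end = [], end
    timed.extend((action, last_end, None) for action in pending)
    return timed


# Hypothetical data: times are in seconds from the start of the movie.
subtitles = [(110, 112, "I love you."), (120, 122, "Don't go!")]
script = [
    ("dialogue", "I love you"),
    ("action", "They kiss."),
    ("dialogue", "Don't go."),
]

timed = time_actions(script, subtitles)
print(timed)
```

The window is only as tight as the surrounding dialogue, and scripts often differ from the final cut, which hints at why the authors call this a hard problem.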

Automated machine learning and computer vision techniques were then used to recognize what a specific action, like kissing, looks like in successive images. The computer can therefore notice that in most kisses, two regions of the image (e.g. eyes, lips and nose of the actor) are slowly approaching and touching each other.

We will cover interesting potential applications in a future post. In the meantime, here are a few additional technical details.

Technical details

Although the automatic generation of the data sets is itself interesting, I paid particular attention to the video classification problem for action recognition. First, sparse space-time features are detected using a space-time extension of the Harris operator, using multiple temporal and spatial scales to further improve accuracy. To characterize the motion and appearance of local features, histograms are computed in the space-time volume surrounding a point feature, somewhat like SIFT encodes 2D point features.

A spatio-temporal bag of features (BoF) is built from the features, arranged along several spatio-temporal grids shown empirically to produce good results. A non-linear support vector machine is then used to classify actions amongst the 8 possibilities. Basically, this allows the system to automatically learn what important visual features appear in a given sequence. For example, for the hand-shaking action, we would expect that some features would move up and down in time.
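The bag-of-features step can be sketched in a few lines. In this toy version each local feature is a (time, descriptor) pair with a made-up 2D "motion" descriptor, the codebook has three hand-picked codewords, the grid only splits the clip into temporal halves, and a nearest-example comparison with a chi-square distance stands in for the non-linear SVM. None of these values or names come from the paper.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[k])),
    )


def bag_of_features(descriptors, codebook):
    """Normalized histogram of codeword occurrences (all zeros for an empty cell)."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def grid_histogram(features, codebook, time_bins=2):
    """Concatenate one histogram per temporal cell; features are (t, descriptor), 0 <= t < 1."""
    cells = [[] for _ in range(time_bins)]
    for t, descriptor in features:
        cells[min(int(t * time_bins), time_bins - 1)].append(descriptor)
    histogram = []
    for cell in cells:
        histogram.extend(bag_of_features(cell, codebook))
    return histogram


def chi_square(h1, h2):
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def classify(histogram, labelled_histograms):
    """Stand-in for the SVM: return the label of the closest training histogram."""
    return min(labelled_histograms, key=lambda item: chi_square(histogram, item[1]))[0]


codebook = [(0.0, 1.0), (1.0, 0.0), (0.0, 0.0)]  # vertical motion, horizontal motion, still
shake = [(0.1, (0.0, 0.9)), (0.4, (0.1, 1.0)), (0.6, (0.0, 1.1)), (0.9, (0.0, 0.8))]
kiss = [(0.1, (0.9, 0.0)), (0.3, (1.0, 0.1)), (0.7, (0.0, 0.1)), (0.8, (0.1, 0.0))]
training = [
    ("shake_hands", grid_histogram(shake, codebook)),
    ("kiss", grid_histogram(kiss, codebook)),
]

query = [(0.2, (0.0, 1.0)), (0.3, (0.9, 0.1)), (0.8, (0.1, 0.9))]
print(classify(grid_histogram(query, codebook), training))
```

Splitting by time is what lets the histogram tell "approach, then stay still" apart from "move up and down throughout", which a single global histogram would blur together.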

Examples of spatio-temporal grids

This new technique outperforms previous ones in simple scenes, and works for natural movies with cluttered backgrounds. For the simpler KTH action dataset, the Laptev approach achieves an average classification accuracy of 91.8%, higher than any other published technique.

Actions in the KTH data sets

The action recognition in real-world video is much harder, so unsurprisingly, the accuracy is much lower, varying between 18% and 53% depending on the type of action. Still, this type of approach is promising and it’s reasonable to expect, in a few years, that improved versions will achieve near-human action recognition. What new applications will this technology make possible? We’ll explore the exciting possibilities in a future post.




Tracking a Crowd: Teaching Computers to Watch Many People at Once

27 06 2008

By Christian Laforte

Researchers just unveiled a promising solution to the challenging problem of automatically detecting and tracking multiple people in cluttered scenes using a single, potentially moving camera.

The paper was published at CVPR 2008 this week:

People-Tracking-by-Detection and People-Detection-by-Tracking (PDF)
Mykhaylo Andriluka, Stefan Roth, Bernt Schiele

Andriluka and his colleagues have also published demonstration videos: video 1, video 2.

Their solution relies on a model for pedestrian detection, which can identify articulated parts of a pedestrian (e.g. leg, arm, head) in a single image with a higher accuracy than previous techniques. The model is trained on a set of human-annotated videos. To further increase the accuracy, sequences of images are analyzed to detect tracks. This both decreases the number of false positives (i.e. detected pedestrians where no one is present) and enables the system to detect partially occluded people in a crowded scene. The system can also identify a probable pose for each tracked pedestrian and keep tracking him as he becomes temporarily occluded, e.g. passing in front of a car.
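The interplay between detection and tracking can be illustrated with a much simpler scheme than the one in the paper: greedily link each frame’s detections to existing tracks by bounding-box overlap, tolerate a few missed frames (occlusion), and discard tracks that stay too short (likely false positives). The thresholds, the box format and the function names below are assumptions made for this sketch.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def link_detections(frames, min_iou=0.3, max_missed=2):
    """Greedily attach each frame's detections to the track whose last box overlaps most."""
    active, finished = [], []
    for frame_index, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            last_box = track["boxes"][-1][1]
            best = max(unmatched, key=lambda box: iou(last_box, box), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                track["boxes"].append((frame_index, best))
                track["missed"] = 0
                unmatched.remove(best)
            else:
                track["missed"] += 1  # possibly occluded in this frame
            (still_active if track["missed"] <= max_missed else finished).append(track)
        new_tracks = [{"boxes": [(frame_index, box)], "missed": 0} for box in unmatched]
        active = still_active + new_tracks
    return finished + active


def confirmed_tracks(tracks, min_length=3):
    """Keep only tracks seen in enough frames; one-off detections are dropped."""
    return [t for t in tracks if len(t["boxes"]) >= min_length]


# A pedestrian walks right and is hidden in frame 2; frame 1 also has a spurious detection.
frames = [
    [(0, 0, 10, 20)],
    [(2, 0, 12, 20), (50, 50, 60, 70)],
    [],
    [(6, 0, 16, 20)],
]

tracks = link_detections(frames)
confirmed = confirmed_tracks(tracks)
print(len(tracks), [index for index, _ in confirmed[0]["boxes"]])
```

Even this naive version shows both effects described above: the spurious box never becomes a confirmed track, and the pedestrian keeps the same track across the frame where the detector saw nothing.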

Although the technology is not flawless (e.g. it still misses pedestrians farther away), it is very promising and the applications are numerous. The paper mentions video surveillance in airports and train stations. It is conceivable that in a couple of years, a computer could track thousands of people as they enter and leave public places, noticing unusual activities along the way (e.g. suitcases that are left on the ground for an extended period of time).

This kind of solution could also help reduce theft, e.g. in grocery self-checkouts. Finally, for the entertainment industry this paves the way to non-intrusive motion capture for entire crowds.

Can you think of other applications of this technology?
