Teaching machines to see
How do we know if we’re looking at the three-dimensional world or at a kind of trompe l’oeil image painted on the inside of a huge glass sphere? More to the point, how would a robot know?

Blessed with brains and the power of biological computation, humans can compute the most likely explanation for what we see. Our neural networks turn the fizz of photons hitting a curved screen into perception.

That’s awfully difficult to translate into code, says David Cox, who holds a joint appointment as assistant professor of molecular and cellular biology and of computer science at Harvard.

“Vision is the process of figuring out what’s out there in a 3-D world, from a set of 2-D images cast onto our retinas,” Cox explains. “It’s actually really hard, and the only reason it seems easy is that we’re seeing the world through the solution to the problem.”
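Cox's point that the 3-D world must be inferred from 2-D images is the classic "inverse problem" of vision, and it can be sketched with a toy pinhole-camera model. (This sketch is illustrative and not from the article; the function name and numbers are hypothetical.)

```python
# Toy pinhole-camera projection: many different 3-D scenes
# collapse onto the same 2-D image, so the image alone
# underdetermines what is "out there."

def project(point, focal_length=1.0):
    """Project a 3-D point (x, y, z) onto a 2-D image plane
    by dividing by depth, as a pinhole camera does."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)

# Two different 3-D points lying along the same ray from the camera:
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)   # twice as distant, twice as large

# Both land on the identical 2-D image coordinate.
print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```

A small object nearby and a large object far away can cast the same retinal image, which is why, as Cox says, recovering the 3-D world from 2-D input is genuinely hard.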

After all, evolution over hundreds of millions of years has given us a system that works rather well. When we look out at the world, Cox marvels, “we sort of just transparently see.”

“That’s one of the challenges for computer vision,” he says: “Our intuitions about what’s easy and what’s difficult are usually wrong, because all of our intuitions are coming by way of this biological system. When you sit down and try to write a computer program that does the same thing, you discover just how hard it is.”