Computers are getting very good at looking at a picture and telling us what's in it. In fact, they're getting so good that researchers are starting to turn their gaze skywards and bellow to the heavens, "Is that all you've got?!"
As it turns out, there is something new to tackle: the third dimension.
Arguably, being able to describe scenes in the real world is much more important than being able to describe a single picture. I suspect the main reason we haven't seen more research on this problem is the crushing amount of data you'd need to process.
When computer vision algorithms learn to recognise something, they're trained on thousands of examples. With images, that amounts to gigabytes or terabytes of data, but full 3D scenes make that number jump by several orders of magnitude.
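To get a rough feel for that jump, here is a back-of-envelope sketch in Python. Every number in it is my own assumption for illustration (a million training examples, 256 samples along each axis, uncompressed RGB, a dense voxel grid standing in for a "3D scene"); none of it comes from ImageNet or any real 3D dataset.

```python
# Back-of-envelope only: every number here is an assumption for illustration,
# not a figure from ImageNet or any real 3D dataset.

def dataset_bytes(samples, values_per_sample, bytes_per_value=1):
    """Raw, uncompressed size of a dataset in bytes."""
    return samples * values_per_sample * bytes_per_value

SAMPLES = 1_000_000   # hypothetical number of training examples
SIDE = 256            # hypothetical resolution along each axis

image_values = SIDE * SIDE * 3   # 2D RGB image: width x height x 3 channels
scene_values = SIDE ** 3 * 3     # dense RGB voxel grid: width x height x depth x 3

images_total = dataset_bytes(SAMPLES, image_values)
scenes_total = dataset_bytes(SAMPLES, scene_values)
ratio = scenes_total // images_total

print(f"2D images: {images_total / 1e9:.0f} GB")
print(f"3D scenes: {scenes_total / 1e12:.0f} TB")
print(f"ratio: {ratio}x")
```

Under these assumptions the extra factor is simply the grid's side length, so it grows as the resolution does. Real 3D data (point clouds, meshes, multi-view video) is stored very differently, so treat this as a sense of scale rather than a prediction.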
Ahh, but the beautiful thing about computing is that our ability to process information increases exponentially over time, so even if it's several thousand times harder, I have a feeling it will be less than a decade before we're in this position again, asking for an even harder problem. Hey, maybe we could throw time into the mix to make it four dimensions.
Every year, participants in the ImageNet Large Scale Visual Recognition Challenge try to code algorithms that can categorise these images with as few errors as possible. Seven years ago, this was a difficult task, but now computer vision is great at categorising images. [...] So the ImageNet team say it’s time for a fresh challenge in 2018. Although the details of this competition have yet to be decided, it will tackle a problem computer vision has yet to master: making systems that can classify objects in the real world, not just in 2D images, and describe them using natural language.