Vision is a remarkably difficult problem. One of my favourite "trick" images for demonstrating some of its challenging aspects is the one displayed above. In the image, there are two squares labelled A and B. Those squares are identical in colour, although it takes a bit of effort for most people to accept that. If you don't believe me, try loading the image into an image editing program and either sample the colour of each square or copy a region of each and paste them side by side.
Once satisfied that the two squares are actually identical in colour, it is worth stopping for a moment and thinking about why they look so remarkably different. In my experience, most people's initial reaction to this image is "how could our eyes screw up so badly?" If you think about it, though, it could equally be viewed as a mistake to see the two squares as identical in colour. It all comes down to the job one wishes to do. In most complex vision tasks, grasping the overall pattern of the scene is quite important, which makes seeing the two squares as different shades the more desirable perception. Attaining the perception that a human has effortlessly (in fact, it takes a fair bit of effort to overcome the apparent difference in shade of grey) is actually quite difficult for a computer.
Say, for example, that you wish to take an image of a checkerboard like the one in the above image, find all the exposed squares, and label each square as either a dark or a light square. Since we are allowing the possibility of occlusion, variable illumination, and shadows, clearly no simple global threshold can be used to label each pixel as either dark or light and then group like pixels with each other. If one attempted to do that, A and B would be labelled in the same manner and, therefore, one of them would be labelled incorrectly. So, perhaps you decide to be a little more complex and a little cleverer than that. You decide to locate the top left corner of your checkerboard (assuming that it is never occluded, which can sometimes be a hefty assumption, but we'll allow it) and then find the boundaries of the squares, starting from there, using either a blob tool (group all nearby pixels within a certain range of the starting pixel's value) or an edge finder. Then, once all the boundaries have been located, you compare the average pixel values of neighbouring squares to determine which ones are the light squares and which ones the dark. Unfortunately, there are several problems with this approach. The first is the possibility of falsely splitting a square because of a shadow lying across only part of it. The second is that the thresholds used to find the boundaries of each square will not work at all illumination levels. Algorithms do exist for trying to find appropriate thresholds dynamically, but they are beyond the scope of this brief discussion and are not completely reliable.
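To make the difference between the two labelling strategies concrete, here is a minimal sketch in Python. It is a toy model, not a real vision pipeline: it assumes the hard part (finding each square's boundaries) has already been done and that each square has been reduced to a single average grey value. The grey values, the board size, the shadow strength, and all function names are made up for illustration.

```python
# Toy model: each square is already segmented and reduced to one mean grey
# value (0-255). Both are hypothetical simplifications for illustration.
LIGHT, DARK = 210, 90          # illustrative grey values in full light
SIZE = 4                       # a 4x4 board keeps the example short
SHADOW_START_COL = 2           # columns 2 and 3 lie in shadow


def true_label(row, col):
    return "light" if (row + col) % 2 == 0 else "dark"


def observed_value(row, col):
    value = LIGHT if true_label(row, col) == "light" else DARK
    if col >= SHADOW_START_COL:
        value = value * 3 // 7   # shadow keeps 3/7 of the light: 210 -> 90
    return value


board = [[observed_value(r, c) for c in range(SIZE)] for r in range(SIZE)]


def label_by_global_threshold(board, threshold=150):
    # One cut-off for the whole image, ignoring illumination.
    return [["light" if v > threshold else "dark" for v in row] for row in board]


def label_by_neighbours(board):
    # Compare each square with the mean of its 4-connected neighbours.
    labels = []
    for r in range(SIZE):
        row_labels = []
        for c in range(SIZE):
            neighbours = [board[nr][nc]
                          for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                          if 0 <= nr < SIZE and 0 <= nc < SIZE]
            mean = sum(neighbours) / len(neighbours)
            row_labels.append("light" if board[r][c] > mean else "dark")
        labels.append(row_labels)
    return labels


truth = [[true_label(r, c) for c in range(SIZE)] for r in range(SIZE)]
global_labels = label_by_global_threshold(board)
local_labels = label_by_neighbours(board)

print(board[0][1], board[0][2])   # a lit dark square and a shadowed light square
print(global_labels == truth)     # False: the shadowed light squares are mislabelled
print(local_labels == truth)      # True on this toy board
```

The lit dark square and the shadowed light square come out with exactly the same grey value, just like A and B, so no single threshold can separate them. The neighbour comparison succeeds here only because the squares were handed to it already segmented and the shadow edge happens to fall on a square boundary; neither holds in general, which is exactly where the problems described above come in.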
Thus, you might decide to try one last clever trick to fix the shortcomings of your previous method. You model the size of the squares with predetermined values, so once you find one square you know where the others are, even if the edge or blob detector fails to find a boundary or finds an extra one. However, this will only work if the orientation and apparent size of the board are completely fixed, which is not a reasonable assumption for any but the most constrained environments.
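A rough sketch of that fixed-size model, and of how quickly it falls apart, might look like the following. Everything here is an assumption for illustration: the function names, the 8x8 board, the pixel coordinates, and the use of a simple rotation about the located corner to stand in for a board that has turned slightly.

```python
import math


def predict_corners(origin, square_size, rows=8, cols=8):
    """Predict the top-left corner of every square from one located corner.

    Assumes an axis-aligned board at a fixed, known scale (hypothetical model).
    """
    ox, oy = origin
    return {(r, c): (ox + c * square_size, oy + r * square_size)
            for r in range(rows) for c in range(cols)}


def rotate_about(point, origin, degrees):
    """Rotate a point about origin, standing in for a board that has turned."""
    angle = math.radians(degrees)
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    return (origin[0] + dx * math.cos(angle) - dy * math.sin(angle),
            origin[1] + dx * math.sin(angle) + dy * math.cos(angle))


origin, square_size = (100, 50), 40     # illustrative pixel values
predicted = predict_corners(origin, square_size)

far_corner = predicted[(7, 7)]          # square farthest from the located corner
actual = rotate_about(far_corner, origin, degrees=5)
error = math.dist(far_corner, actual)
print(f"prediction error at the far square: {error:.1f} px")
```

With these made-up numbers, a rotation of only five degrees moves the farthest corner by more than half a square width, so the model would start sampling pixels from the wrong squares. A change in apparent size causes the same kind of drift, growing with distance from the located corner.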
This is not an impossible task. It would not even be a particularly difficult vision task if one were able to constrain the position and illumination of the checkerboard, and constraining even one of the two would make it much simpler. However, I hope this simple discussion has made it clear just how complicated even a simple task like this can be, especially once the light levels and orientation are allowed to vary. When the task is scaled up to allow a wide assortment of objects, it becomes virtually intractable. I say virtually intractable because it clearly must not be truly intractable, since it is a task that nearly every person on the planet manages to accomplish every day. Figuring out just how that is done, though, is a very fun problem to cogitate on.