Basically there appears to be a missing level of complexity with these image generators: they probably see these images as 2D pixel grids rather than as projections of 3D space onto a 2D surface, and don't engage in any spatial reasoning about the 3D layout of a cat and its surrounding scene.
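For concreteness, here's a minimal sketch of the 3D-to-2D projection I mean, using the standard pinhole camera model (the focal length, principal point, and point coordinates are made-up illustration values, not anything from a real generator):

```python
import numpy as np

# Pinhole camera projection: a 3D point (X, Y, Z) in camera
# coordinates maps to pixel coordinates (u, v) on the image plane.
# f is the focal length in pixels; (cx, cy) is the principal point.
def project(points_3d, f=500.0, cx=320.0, cy=240.0):
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = f * X / Z + cx
    v = f * Y / Z + cy
    return np.stack([u, v], axis=1)

# Two points at the same (X, Y) but different depths land at
# different pixels: depth is collapsed into 2D position, which is
# exactly the information a pure pixel-grid view never models.
points = np.array([[0.5, 0.2, 2.0],
                   [0.5, 0.2, 4.0]])
print(project(points))
```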
We can think of it as a system that has developed a bunch of alien paint tools without understanding the underlying theory.
Could one of these models learn that? And, more importantly, by what factor would the required training resources grow?
Has anyone (publicly) tried ML training on stereoscopic image pairs, or Magic Eye-style autostereograms?
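I don't know of a canonical public answer, but as a toy sketch of what that training setup could look like: a small CNN that takes a left/right pair and regresses a per-pixel disparity map. Everything here is an assumption for illustration (the architecture, the supervised L1 loss, the random stand-in data), not any particular published method:

```python
import torch
import torch.nn as nn

# Toy stereo model: the left and right images are stacked on the
# channel axis (3 + 3 = 6 channels in) and the network regresses a
# single-channel disparity map. Ground-truth disparity would normally
# come from a synthetic renderer or a depth sensor.
class StereoNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),  # 1 channel: disparity
        )

    def forward(self, left, right):
        return self.net(torch.cat([left, right], dim=1))

model = StereoNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random stand-in data: a batch of 4 RGB stereo pairs at 64x64,
# with fake ground-truth disparity maps.
left = torch.rand(4, 3, 64, 64)
right = torch.rand(4, 3, 64, 64)
gt_disparity = torch.rand(4, 1, 64, 64)

for step in range(10):
    pred = model(left, right)
    loss = nn.functional.l1_loss(pred, gt_disparity)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of a setup like this is that the only way to predict disparity from a pair is to implicitly reason about depth, so it's a plausible way to force some 3D structure into the learned representation.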