Children’s visual experience may be key to better computer vision training

That is according to research by an interdisciplinary team at Penn State.

In their first two years of life, children experience a somewhat limited set of objects and faces, but they see them from many different viewpoints and under varied lighting conditions. Inspired by this insight from child development, the researchers introduced a new machine learning approach that uses spatial position information to train visual AI systems more efficiently. They found that AI models trained with the new method outperformed baseline models by up to 14.99%. They reported their findings in the May issue of the journal Patterns.

“Current approaches in AI use massive sets of random photographs from the Internet for training. In contrast, our strategy is informed by developmental psychology, which studies how children perceive the world,” said Lizhen Zhu, lead author and doctoral candidate in the Penn State College of Information Sciences and Technology.

The researchers developed a new contrastive learning algorithm. Contrastive learning is a type of self-supervised learning in which an AI system learns visual patterns by identifying when two images are derived from the same base image, which makes them a positive pair. However, these algorithms typically treat images of the same object taken from different perspectives as separate entities rather than as positive pairs. According to the researchers, taking environmental data, including location, into account allows the AI system to overcome this limitation and detect positive pairs regardless of changes in camera position or rotation, lighting angle or conditions, and focal length or zoom.
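The general idea can be illustrated with a short sketch. This is not the authors’ released code: the function names, the distance threshold and the InfoNCE-style loss below are illustrative assumptions about how spatially nearby views might be treated as positive pairs.

```python
# Hypothetical sketch: choose positive pairs from camera positions instead of
# from augmentations of a single image. Names and thresholds are assumptions.
import torch
import torch.nn.functional as F

def select_positive_pairs(camera_positions, pos_threshold=0.5):
    """Return index pairs (i, j) whose recorded camera positions lie within
    pos_threshold of each other. camera_positions: (N, 3) tensor of x, y, z."""
    dists = torch.cdist(camera_positions, camera_positions)        # (N, N)
    close = (dists < pos_threshold) & ~torch.eye(len(dists), dtype=torch.bool)
    return close.nonzero(as_tuple=False)                           # (P, 2)

def contrastive_loss(embeddings, pairs, temperature=0.1):
    """InfoNCE-style loss that treats spatially nearby views as positives."""
    z = F.normalize(embeddings, dim=1)                  # unit-norm embeddings
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-matches
    anchors, positives = pairs[:, 0], pairs[:, 1]
    # Each anchor's row is a softmax over all other frames; the target is the
    # spatially nearby frame rather than an augmented copy of the same image.
    return F.cross_entropy(sim[anchors], positives)

# Toy usage: 8 frames with random embeddings and camera positions.
emb = torch.randn(8, 32)
cam = torch.rand(8, 3) * 2.0
pairs = select_positive_pairs(cam)
if len(pairs) > 0:
    loss = contrastive_loss(emb, pairs)
```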

“Our hypothesis is that infants’ visual learning depends on location perception. To generate an egocentric dataset with spatiotemporal information, we set up virtual environments on the ThreeDWorld platform, a high-fidelity, interactive 3D physical simulation environment,” Zhu added. “This allowed the location of the viewing cameras to be manipulated and measured as if a child were walking through a house.”
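A walkthrough like this could yield one record per rendered frame. The sketch below, which does not use the actual ThreeDWorld API and whose field names are not the paper’s dataset schema, shows the kind of spatiotemporal metadata such a record might carry and how it could flag candidate positive pairs.

```python
# Assumed per-frame record from a simulated walkthrough; field names are
# illustrative, not taken from the authors' released datasets.
from dataclasses import dataclass

@dataclass
class EgocentricFrame:
    image_path: str           # rendered RGB frame on disk
    timestamp: float          # seconds since the start of the walkthrough
    camera_position: tuple    # (x, y, z) in the simulated house
    camera_rotation: tuple    # (pitch, yaw, roll) in degrees

def is_candidate_positive(a: EgocentricFrame, b: EgocentricFrame,
                          max_dist: float = 0.5, max_dt: float = 2.0) -> bool:
    """Treat two frames recorded close together in space and time as a
    candidate positive pair, regardless of viewing angle or lighting."""
    dist = sum((p - q) ** 2
               for p, q in zip(a.camera_position, b.camera_position)) ** 0.5
    return dist <= max_dist and abs(a.timestamp - b.timestamp) <= max_dt
```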

The scientists created three simulation environments: House14K, House100K and Apartment14K, where ‘14K’ and ‘100K’ refer to the approximate number of sample images taken in each environment. They then ran baseline contrastive learning models and models using the new algorithm through the simulations three times to see how well each classified the images. The team found that models trained with their algorithm outperformed baseline models on a variety of tasks. For example, on a task of recognizing the room in the virtual apartment, the augmented model achieved an average accuracy of 99.35%, a 14.99% improvement over the baseline model. The new datasets are available for other scientists to use in training through www.child-view.com.

“It is always difficult for models to learn in a new environment with a small amount of data. Our work represents one of the first attempts at more flexible and energy-efficient AI training using visual content,” said James Wang, distinguished professor of information sciences and technology and advisor to Zhu.

According to the scientists, the research has implications for the future development of advanced artificial intelligence systems aimed at navigating and learning from new environments.

“This approach would be particularly beneficial in situations where a resource-constrained team of autonomous robots needs to learn to navigate in a completely unknown environment,” Wang said. “To pave the way for future applications, we plan to refine our model to better take advantage of the spatial information and incorporate more diverse environments.”

Collaborators from Penn State’s Department of Psychology and Department of Computer Science and Engineering also contributed to this study. This work was supported by the US National Science Foundation as well as the Institute for Computational and Data Sciences at Penn State.