Making a Computer Observe Like a Human

by Magali Gruet


The human mind is an impressive machine. With a few sensory inputs, it can identify almost anything. For example, with just a glance at a stranger, you can guess their age, gender, and many other attributes. Often, you do this by identifying features, such as the clothes they are wearing or their height.

Machines can identify people, places, and things the same way. Teaching a machine to explicitly identify images, video, and other media through this type of feature identification is called representation learning. While many different methods exist to teach a machine in this way, they all have major flaws. Jiageng Zhu, a Ph.D. student in the USC Ming Hsieh Department of Electrical and Computer Engineering, tied together several of these representation learning methods to create a revolutionary way to train these systems.

In his recent paper, “Weakly Supervised Invariant Representation Learning Via Disentangling Known and Unknown Nuisance Factors,” Zhu utilizes the technique of weakly supervised learning. When a machine is taught to identify individual features using this method, the researcher does not label the exact features. The machine must learn by itself which features are important for identification (hence the lack of “supervision”). Zhu guided the program slightly by separating semantic, or useful, features from “nuisance” features. Nuisance features, such as the blurriness or shakiness of a camera, have no correlation whatsoever with the identity of the picture’s subject.
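
To make the idea concrete, here is a minimal sketch in PyTorch. It is illustrative only: the architecture, dimensions, and names below are assumptions, not Zhu’s actual model. An encoder splits its output into a semantic code and a nuisance code, but training only ever sees the class label, so the network must work out on its own which features matter.

```python
# A minimal sketch (not Zhu's actual architecture): the encoder splits its
# output into a "semantic" code and a "nuisance" code, but only the image's
# class label supervises training -- no individual feature is ever labeled.
import torch
import torch.nn as nn

class SplitEncoder(nn.Module):
    def __init__(self, in_dim=784, semantic_dim=32, nuisance_dim=32, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.semantic_head = nn.Linear(256, semantic_dim)   # meant to carry identity
        self.nuisance_head = nn.Linear(256, nuisance_dim)   # meant to absorb blur, shake, etc.
        self.classifier = nn.Linear(semantic_dim, num_classes)

    def forward(self, x):
        h = self.backbone(x)
        z_semantic = self.semantic_head(h)
        z_nuisance = self.nuisance_head(h)
        logits = self.classifier(z_semantic)  # only the semantic code reaches the label
        return logits, z_semantic, z_nuisance

# Weak supervision: the loss only ever sees the class label, so the network
# must decide for itself which features belong in the semantic code.
model = SplitEncoder()
x = torch.randn(8, 784)          # a toy batch of flattened images
y = torch.randint(0, 10, (8,))   # class labels -- the only supervision signal
logits, z_s, z_n = model(x)
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
```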

Under the tutelage of Wael AbdAlmageed, USC Information Sciences Institute Research Director and Associate Professor of Electrical and Computer Engineering, Zhu used the technique of disentangled representation to break down, or “disentangle,” the features into defined variables and encode them separately. This method mimics the intuition process of a human by breaking reasoning into “high” and “low” dimensions. When identifying a picture of a person, an example of high-dimensional reasoning would be observing the clothing, while low-dimensional reasoning would be singling out the type of pants. A program built this way should be able to connect the low-dimensional reasoning to the high-dimensional reasoning to come to a conclusion, such as estimating someone’s age based on the type of pants they wear.
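
A rough way to picture the disentangling step, again as a hedged sketch under assumed details rather than the paper’s model: the latent code is partitioned into one chunk per factor, and a decoder must rebuild the input from those chunks, pushing each chunk toward carrying its own piece of information. Additional disentanglement losses, omitted here, would encourage each chunk to align with a single factor.

```python
# An illustrative sketch of a partitioned ("disentangled") latent code, not the
# paper's exact model: each chunk of the code is reserved for one factor, and a
# decoder rebuilds the image from the concatenated chunks, so downstream losses
# can act on each factor separately.
import torch
import torch.nn as nn

class DisentangledAutoencoder(nn.Module):
    def __init__(self, in_dim=784, chunk_dim=16, num_factors=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, chunk_dim * num_factors))
        self.decoder = nn.Sequential(nn.Linear(chunk_dim * num_factors, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.chunk_dim = chunk_dim

    def forward(self, x):
        z = self.encoder(x)
        # Split the code into one chunk per factor (e.g., "clothing", "pants", ...).
        chunks = torch.split(z, self.chunk_dim, dim=1)
        recon = self.decoder(torch.cat(chunks, dim=1))
        return recon, chunks

model = DisentangledAutoencoder()
x = torch.randn(8, 784)
recon, chunks = model(x)
# Reconstruction keeps the code informative; extra disentanglement terms
# (not shown) would push each chunk toward a single, defined variable.
reconstruction_loss = nn.functional.mse_loss(recon, x)
```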

Zhu paired this technology with invariant representation, a technique that ensures the neural network model separates the content of things from their environment. For example, when identifying a cat or a dog, where the animal is should be kept separate from what the animal is, since the environment is independent of the animal itself.
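
One common way such invariance is enforced, shown here only as an assumed illustration and not necessarily the paper’s exact mechanism, is adversarial: an auxiliary head tries to predict the environment from the representation, and a gradient-reversal layer makes the encoder strip out whatever that head finds useful.

```python
# Illustrative sketch of invariance via gradient reversal (one standard trick,
# not necessarily the paper's mechanism): an environment head tries to guess
# where the photo was taken, and reversed gradients make the encoder remove
# that information from the representation.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output  # flip the gradient: the encoder fights the env head

class InvariantEncoder(nn.Module):
    def __init__(self, in_dim=784, z_dim=32, num_classes=2, num_envs=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
        self.content_head = nn.Linear(z_dim, num_classes)  # cat vs. dog
        self.env_head = nn.Linear(z_dim, num_envs)         # indoors, street, park, ...

    def forward(self, x):
        z = self.encoder(x)
        return self.content_head(z), self.env_head(GradReverse.apply(z))

model = InvariantEncoder()
x = torch.randn(8, 784)
animal = torch.randint(0, 2, (8,))   # content label: cat or dog
env = torch.randint(0, 3, (8,))      # environment label: where the photo was taken
content_logits, env_logits = model(x)
# Minimizing both terms drives the representation to predict the animal while
# giving the environment head nothing useful to work with.
loss = nn.functional.cross_entropy(content_logits, animal) + \
       nn.functional.cross_entropy(env_logits, env)
loss.backward()
```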

In tests, Zhu found that the method proposed in the paper – which combines weakly supervised learning, disentangled representation, and invariant representation – outperformed state-of-the-art methods, indicating that it is nearly ready for industry use. This type of algorithm is useful in situations that leave near-zero room for error, such as a self-driving car identifying hazards on the road.

“The ultimate goal is to make the neural network understand the world like a human,” said Zhu. And, AbdAlmageed added, to “make AI algorithms more robust when they go out of the lab and [are] deployed in the real world.”

Zhu will continue to work on this project during his time as a Ph.D. student at USC, honing the accuracy of the model. AbdAlmageed would also like to apply this work to the medical field, “extending the research to discover predictive information in healthcare records in order to personalize and improve quality of healthcare.”

The research project was funded in part by the Department of Defense’s (DoD) Defense Advanced Research Projects Agency (DARPA). Zhu presented his research at the 2022 European Conference on Computer Vision (ECCV). 

Published on March 1st, 2023

Last updated on March 1st, 2023
