Dynamics prediction – predicting future states of objects based on current and prior states – is something humans are quite good at. For example, “We can see a billiard ball moving on a billiards table, and we can somehow predict the movement of the ball at a later state, one second later, two seconds later, and so on,” said Hanchen Xie, a fourth-year Ph.D. student in the USC Thomas Lord Department of Computer Science and a Graduate Research Assistant at the USC Information Sciences Institute (ISI). The challenge of dynamics prediction is to teach a machine to predict this movement as well as humans do.
At the 40th International Conference on Machine Learning (ICML ’23), held July 23rd to July 29th in Honolulu, HI, Xie, who works with the VIMAL (Visual Intelligence and Multimedia Analytics Laboratory) research group at ISI, presented a paper that points out potential weaknesses of common dynamics prediction methods, and addresses these limitations.
The paper is co-written by VIMAL founding director Wael AbdAlmageed, Research Associate Professor at USC Viterbi’s Ming Hsieh Department of Electrical and Computer Engineering and Research Director at ISI. AbdAlmageed said, “Artificial intelligence methods have well-known generalization and interpretability limitations that must be addressed if we really aspire to truly intelligent machines. Our paper introduces a novel framework that uses symbolic representations to address some generalization limitations.”
Physics or Vision-Based: Two Ways of Teaching Machines to Predict Movements
There are different approaches for solving the dynamics prediction problem. One methodology models movement using physics. Parameters of the moving object – position, mass, and velocity – are extracted from an image or video, and the model uses physics to predict the movement. The drawback: developing a physics model that precisely describes complex real-world scenarios requires expert knowledge, sophisticated physics parameters, and a lot of information.
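To make the physics-based approach concrete, here is a minimal sketch, not from the paper itself: it assumes a ball's position and velocity have already been extracted from video, and rolls the state forward with illustrative friction and table-size values.

```python
# Minimal sketch of a physics-based predictor for a billiard ball.
# Assumes position and velocity were already extracted from an image
# or video; the friction coefficient and table size are illustrative.

def predict_ball(position, velocity, dt=0.1, steps=10,
                 friction=0.9, table=(2.0, 1.0)):
    """Roll the state forward `steps` timesteps with simple friction
    and elastic cushion bounces; return the predicted trajectory."""
    x, y = position
    vx, vy = velocity
    trajectory = []
    for _ in range(steps):
        x += vx * dt
        y += vy * dt
        # Reflect off the cushions (elastic bounce).
        if not 0.0 <= x <= table[0]:
            vx = -vx
            x = min(max(x, 0.0), table[0])
        if not 0.0 <= y <= table[1]:
            vy = -vy
            y = min(max(y, 0.0), table[1])
        # Friction slows the ball a little each step.
        vx *= friction
        vy *= friction
        trajectory.append((x, y))
    return trajectory

path = predict_ball(position=(0.5, 0.5), velocity=(1.0, 0.0))
```

Even this toy version shows the drawback described above: every real-world effect (spin, collisions, uneven felt) would need its own hand-modeled term and its own measured parameters.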
The second methodology is vision-based dynamics prediction. This relies on neural networks – computing models used in AI to identify relationships in datasets – for learning physics, and the underlying mechanism is approximated through data.
In vision-based dynamics prediction, the model takes the raw images as input, incorporating visual information of the objects and the environment. The neural network is trained with data that allows it to predict the movement of the objects. A benefit of this method is that it requires much less input data, and that input data can be more general and easier to obtain. But, there are drawbacks.
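As a toy illustration of the vision-based idea (this is an illustrative sketch, not the authors' model): a tiny linear model, trained by gradient descent on synthetic one-dimensional "images" of a moving dot, learns from raw pixels alone where the dot will appear next, with no physics equations written down anywhere.

```python
import random

# Toy sketch of vision-based dynamics prediction: a tiny linear model
# learns, from raw pixels alone, where a moving dot will be next.
# Everything here (grid size, training loop) is illustrative.

WIDTH = 8  # a 1-D "image" is a row of WIDTH pixels

def render(pos):
    """Raw image: a one-hot pixel row with the dot at position `pos`."""
    return [1.0 if i == pos else 0.0 for i in range(WIDTH)]

def make_example():
    """The dot moves one pixel right per frame; predict its next spot."""
    pos = random.randint(0, WIDTH - 3)
    frames = render(pos) + render(pos + 1)   # two stacked input frames
    return frames, float(pos + 2)            # target: next position

def train(steps=5000, lr=0.05):
    weights = [0.0] * (2 * WIDTH)
    bias = 0.0
    for _ in range(steps):
        x, y = make_example()
        pred = sum(w * v for w, v in zip(weights, x)) + bias
        err = pred - y
        for i, v in enumerate(x):            # gradient descent step
            weights[i] -= lr * err * v
        bias -= lr * err
    return weights, bias

def predict(weights, bias, frames):
    return sum(w * v for w, v in zip(weights, frames)) + bias

random.seed(0)
w, b = train()
frames = render(3) + render(4)   # the dot was at pixel 3, then 4 ...
next_pos = predict(w, b, frames)  # ... so the model should say ~5
```

The catch, as the next section explains, is that a model like this only knows the pixels it was trained on: change the dot's appearance or its direction of motion and the learned mapping no longer applies.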
When Your Training Doesn’t Match Your Testing
These models can be vulnerable to environment misalignment, which means the training data does not match the conditions encountered in testing. In other words, things happen in real-life tests that the model was never trained on (think: a self-driving car trained only on city data encounters rural roads in testing).
In their paper, the researchers explored two types of environment misalignment challenges: cross-domain and cross-context, which can significantly compromise the vision-based model’s capability.
The cross-domain challenge is when the video appearance in training is different than in testing. Xie gave a real-world example: “If we want to train a model to predict car collisions, we have lots of static images of cars, but we don’t have nearly as much dynamic data on car collisions. And we obviously cannot make cars collide. We can simulate collisions, but the simulation visual domain is different from the real domain. The question is: will the model trained in the simulation digital domain work in real life?”
The cross-context challenge focuses on another aspect of environment misalignment, where, even if the visual domain stays the same, the environment is different. Xie gave the example of a self-driving car: “Let’s say a self-driving car is trained in a very controlled environment. Objects are always in the same location on the road. There are line markings always on the left side of the road. The people always come from the left to the right, and so on.” But in real-life situations, objects might appear in places never seen before, road lines might not exist, and people come from different directions. The question is: will the same model work in this new environment?
What’s the Solution?
In the paper, Xie and his co-authors show that current state-of-the-art vision-based dynamics prediction models fail under both the cross-domain and the cross-context challenge. They also discuss a promising solution to the cross-domain challenge; however, it is a proof of concept, and the team hopes to develop more sophisticated alternatives.
The conceptual solution they propose involves performing semantic segmentation on the image [each pixel of the image is categorized and labeled], and then using the segmentation mask [the pixel-label “mask” generated by the semantic segmentation process] to replace the raw image as the input to the dynamics prediction model.
Xie said, “At least in the given example of a billiard game, we do not really care about the visual appearance of the object but care about the semantic meaning of the object.” In other words, they care about what the pixels of the ball image have been labeled as. He continued, “By doing so, we align different visual domains to the same abstract space.”
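The idea can be sketched in a few lines. This is an illustration of the concept only: the color-to-class mapping below is hypothetical (a real system would use a learned segmentation model), but it shows how two visually different domains collapse to the same abstract mask.

```python
# Illustrative sketch of the proof-of-concept idea: replace the raw
# image with its segmentation mask so visually different domains share
# one abstract input space. The color-to-class lookup is hypothetical;
# in practice a learned semantic segmentation model produces the mask.

CLASSES = {"background": 0, "ball": 1, "cushion": 2}

def segment(image, palette):
    """Map each raw pixel (a color name here, RGB values in practice)
    to a semantic class id, producing the segmentation mask."""
    return [[CLASSES[palette[pixel]] for pixel in row] for row in image]

# Two domains render the same scene with different appearances ...
sim_image  = [["green", "green"], ["white", "brown"]]
real_image = [["felt",  "felt"],  ["ivory", "wood"]]

sim_palette  = {"green": "background", "white": "ball", "brown": "cushion"}
real_palette = {"felt": "background", "ivory": "ball", "wood": "cushion"}

# ... but their segmentation masks are identical, so a dynamics model
# trained on one domain's masks transfers to the other domain.
sim_mask  = segment(sim_image, sim_palette)
real_mask = segment(real_image, real_palette)
```

Because both the simulated and the real scene produce the same mask, a dynamics model trained on simulated masks sees nothing new when handed real ones, which is exactly the alignment of visual domains Xie describes.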
For the cross-context challenge, the team has identified the problem, but not the solution. Xie said, “We haven’t found a good solution and it is still an open challenge.”
Xie presented the paper, A Critical View of Vision-Based Long-Term Dynamics Prediction Under Environment Misalignment, as a poster presentation at ICML ’23. It was co-written by Jiageng Zhu, Mahyar Khayatkhoei, Jiazhi Li, Mohamed E. Hussein, and Wael AbdAlmageed.
ICML is one of the fastest-growing AI conferences in the world. This year, the conference received a record-high 6,538 submissions (a 16% increase over last year’s record-high of 5,630), with an acceptance rate of 27.9%.
Published on August 1st, 2023
Last updated on August 15th, 2023