Thursday, April 18, 2024

New method improves the efficiency of vision transformer AI systems

Vision Transformers (ViTs) are powerful artificial intelligence (AI) technologies that can identify or categorize objects in an image, such as finding all of the cars or all of the pedestrians in a scene. However, ViTs face significant challenges related to both their computing power requirements and the transparency of their decision-making.

Researchers at North Carolina State University have now developed a new methodology that addresses both challenges while also improving the ViT’s ability to identify, classify, and segment objects in images.

Vision transformers face two challenges. First, transformer models are very complex. Relative to the amount of data being plugged into the AI, these models require significant computational power and use a large amount of memory. This is particularly problematic for ViTs because images contain so much data.

Second, it is difficult for users to understand exactly how ViTs make decisions. Depending on the application, understanding a ViT's decision-making process, also known as its model interpretability, can be very important.

Researchers have addressed both challenges in a new ViT methodology called “Patch-to-Cluster attention” (PaCa). “We address the challenge related to computational and memory demands by using clustering techniques, which allow the transformer architecture to better identify and focus on objects in an image,” says Tianfu Wu, corresponding author of a paper on the work.

“Clustering is when the AI lumps sections of the image together based on similarities it finds in the image data. This significantly reduces computational demands on the system. Before clustering, computational demands for a ViT are quadratic. For example, if the system breaks an image down into 100 smaller units, it would need to compare all 100 units to each other – which would be 10,000 complex functions.”
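The quadratic cost Wu describes can be sketched in plain NumPy. The patch features and sizes below are illustrative stand-ins (100 random patch vectors), not values from the paper; standard self-attention scores every patch against every other patch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim = 100, 64  # the article's example: 100 image patches

# In a real ViT these features come from embedding image patches;
# random vectors stand in here.
patches = rng.standard_normal((n_patches, dim))

# Standard self-attention: every patch is scored against every other patch,
# so the score matrix grows quadratically with the number of patches.
scores = patches @ patches.T / np.sqrt(dim)  # shape (100, 100)
print(scores.size)  # 10000 pairwise scores, matching Wu's arithmetic
```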

“By clustering, we’re able to make this a linear process, where each smaller unit only needs to be compared to a predetermined number of clusters. Let’s say you tell the system to establish 10 clusters; that would only be 1,000 complex functions,” Wu says. “Clustering also allows us to address model interpretability because we can look at how it created the clusters in the first place. What features did it decide were important when lumping these sections of data together? And because the AI is only creating a small number of clusters, we can look at those pretty easily.”

Researchers performed comprehensive testing of PaCa, comparing it to two state-of-the-art ViTs called Swin and PVT.

“We found that PaCa outperformed Swin and PVT in every way,” Wu says. “PaCa was better at classifying objects in images, better at identifying objects in images, and better at segmentation – essentially outlining the boundaries of objects in images. It was also more efficient, meaning that it was able to perform those tasks more quickly than the other ViTs.”

Next, the team is planning to scale up PaCa by training on larger, foundational data sets.

Journal reference:

  1. Ryan Grainger, Thomas Paniagua, Xi Song, Naresh Cuntoor, Mun Wai Lee, Tianfu Wu. PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers. arXiv, 2023; DOI: 10.48550/arXiv.2203.11987