Sunday, April 28, 2024

Novel tools help cut down on energy use in data centers

Artificial intelligence (AI) is becoming an increasingly important part of human life as it is applied to domains such as health, education, entertainment, and security. AI models can even help us understand and mitigate the effects of climate change: one developed by IBM and NASA, for example, tracks greenhouse gas emissions and forecasts extreme weather events.

However, these models – such as ChatGPT – also have a significant environmental impact, as they require large amounts of energy to run and train. This trend towards large-scale AI could result in data centers consuming up to 21% of the world’s electricity supply by 2030.

Like many data centers, the MIT Lincoln Laboratory Supercomputing Center (LLSC) has seen a significant uptick in the number of AI jobs running on its hardware. As energy consumption began to surge, computer scientists at LLSC started exploring options to optimize job execution for greater efficiency.

The LLSC team is developing techniques to reduce power draw, train models more efficiently, and make energy use transparent in their data center. These range from simple but effective changes, like power-capping hardware, to novel tools that can stop underperforming AI training runs early. Remarkably, they’ve found that these techniques have minimal impact on model performance.

“Energy-aware computing is not really a research area because everyone’s been holding on to their data,” says Vijay Gadepally, senior staff in the LLSC who leads energy-aware research efforts. “Somebody has to start, and we’re hoping others will follow.”

One approach they studied is power-capping: limiting the amount of power that graphics processing units (GPUs), which are notoriously power-hungry, are allowed to draw. Capping GPU power reduced energy consumption by about 12% to 15%, depending on the model being trained.

The downside of capping power is that it can increase task time: according to Gadepally, GPUs take around 3% longer to complete a task. However, this increase is often barely noticeable, since models are typically trained over days or even months.
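To give a concrete sense of what such a cap involves, here is a minimal sketch that lowers the power limit on every GPU in a machine. It assumes NVIDIA hardware and the nvidia-ml-py (pynvml) bindings, and it illustrates the general technique rather than the LLSC's own tooling; lowering a power limit normally requires administrator privileges.

```python
# Minimal sketch of GPU power-capping with NVIDIA's NVML bindings (pynvml).
# Illustrative only; not the LLSC's actual tooling.
import pynvml

CAP_WATTS = 150  # illustrative cap

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # NVML reports limits in milliwatts; respect the card's supported range.
        constraints = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        min_mw, max_mw = constraints[0], constraints[1]
        cap_mw = min(max(CAP_WATTS * 1000, min_mw), max_mw)
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)
        print(f"GPU {i}: power limit set to {cap_mw / 1000:.0f} W")
finally:
    pynvml.nvmlShutdown()
```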

In one experiment, the team trained the popular BERT language model with GPU power limited to 150 watts. Training time grew by two hours, from 80 to 82 hours, but the energy saved was equivalent to about a week of a U.S. household's consumption.
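For a rough sense of scale, the trade-off works out as follows. These are back-of-envelope figures, not numbers from the study; the household consumption is an assumed U.S. average of roughly 10,600 kWh per year.

```python
# Back-of-envelope arithmetic; the household figure is an assumed U.S. average,
# not a number reported by the LLSC team.
baseline_hours, capped_hours = 80, 82
slowdown = (capped_hours - baseline_hours) / baseline_hours
print(f"Relative slowdown: {slowdown:.1%}")  # 2.5%

household_kwh_per_year = 10_600              # assumed typical U.S. household
household_kwh_per_week = household_kwh_per_year / 52
print(f"A week of household energy: ~{household_kwh_per_week:.0f} kWh")  # ~204 kWh
```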

Additionally, the team built software that lets data center operators set power limits across the whole system or on a job-by-job basis. After implementing power constraints, the GPUs on LLSC supercomputers have been running about 30 degrees Fahrenheit cooler and at a more consistent temperature, reducing stress on the cooling system. This can also potentially increase the hardware’s reliability and service lifetime, thereby reducing the embodied carbon emissions associated with manufacturing new equipment.
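A job-by-job limit might be exposed to users as something like the following wrapper, which applies a cap for the duration of a single training job and restores the previous limit afterward. Again, this is a hypothetical sketch rather than the LLSC's actual software, and train_model() is a stand-in for a real training entry point.

```python
# Hypothetical job-level power cap (illustrative, not the LLSC's software):
# cap one GPU while a single job runs, then restore its previous limit.
import contextlib
import pynvml

@contextlib.contextmanager
def job_power_cap(gpu_index: int, cap_watts: int):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    previous_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
    try:
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_watts * 1000)
        yield
    finally:
        # Restore the original limit even if the job fails.
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, previous_mw)
        pynvml.nvmlShutdown()

# Usage: run one training job under a 150 W cap on GPU 0.
# with job_power_cap(gpu_index=0, cap_watts=150):
#     train_model()  # hypothetical training entry point
```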

LLSC researchers also found another place to cut energy consumption. When training AI models, developers aim to maximize accuracy, but finding the right hyperparameters can mean testing thousands of configurations. This process, known as hyperparameter optimization, is one that LLSC researchers identified as an area where energy waste can be reduced.

So the team developed a model that predicts how a given configuration is likely to perform, allowing underperforming runs to be stopped early. They found that this early stopping cut the energy used for model training by about 80%.
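The general idea can be sketched as a toy example: train every candidate configuration for a few epochs, predict where each one's accuracy curve is headed, and stop the ones unlikely to catch up. The predictor and the "training" below are deliberately crude placeholders, not the LLSC's actual model.

```python
# Toy sketch of early stopping in a hyperparameter search; illustrative only.
import random

def predicted_final_accuracy(history, total_epochs):
    """Crude predictor: extrapolate the most recent improvement to the end of training."""
    if len(history) < 2:
        return history[-1]
    rate = history[-1] - history[-2]
    return history[-1] + rate * (total_epochs - len(history))

def train_one_epoch(cfg, epoch):
    # Placeholder "training": a noisy, saturating accuracy curve per configuration.
    return min(0.99, cfg["quality"] * (1 - 0.9 ** epoch) + random.uniform(-0.01, 0.01))

def run_search(configs, total_epochs=50, check_epoch=5, keep_fraction=0.2):
    survivors = list(configs)
    histories = {id(c): [] for c in survivors}
    for epoch in range(1, total_epochs + 1):
        for cfg in survivors:
            histories[id(cfg)].append(train_one_epoch(cfg, epoch))
        if epoch == check_epoch:
            # Rank by predicted final accuracy and stop the underperformers early,
            # saving the energy they would spend on their remaining epochs.
            survivors.sort(
                key=lambda c: predicted_final_accuracy(histories[id(c)], total_epochs),
                reverse=True)
            survivors = survivors[:max(1, int(len(survivors) * keep_fraction))]
    return survivors

candidates = [{"lr": 10 ** random.uniform(-4, -2), "quality": random.uniform(0.5, 0.95)}
              for _ in range(20)]
print(run_search(candidates))
```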

The LLSC team wants to help other data centers apply these interventions and give their users energy-aware options. Doing so can significantly reduce both energy consumption and cost.