Meta, the tech giant previously known as Facebook, has designed and built the AI Research SuperCluster (RSC), a new supercomputer that is among the fastest AI supercomputers running today. Meta believes RSC will be the fastest AI supercomputer in the world when it is fully built out in mid-2022.
Meta aims to use it to train AI models with more than a trillion parameters, which could advance fields such as natural-language processing for tasks like identifying harmful content in real time.
The new AI supercomputer currently comprises 760 NVIDIA DGX A100 systems as its compute nodes (eight GPUs each), for a total of 6,080 NVIDIA A100 GPUs, linked through an NVIDIA Quantum InfiniBand network that can transmit data at 200 Gb/s. RSC's storage tier has 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
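The headline figures above can be sanity-checked with simple arithmetic; the eight-GPUs-per-node count is a published property of the NVIDIA DGX A100, and the storage numbers are the three tiers quoted above:

```python
# Back-of-envelope check of RSC's phase-one figures.
DGX_NODES = 760
GPUS_PER_DGX = 8  # each NVIDIA DGX A100 system contains eight A100 GPUs

total_gpus = DGX_NODES * GPUS_PER_DGX
print(total_gpus)  # 6080, matching the GPU count quoted above

# Storage tiers in petabytes, summed for a rough aggregate capacity.
flasharray_pb = 175   # Pure Storage FlashArray
cache_pb = 46         # Penguin Computing Altus cache tier
flashblade_pb = 10    # Pure Storage FlashBlade
print(flasharray_pb + cache_pb + flashblade_pb)  # 231 PB across the three tiers
```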
Meta’s early benchmarks have shown that RSC runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster than the prior system.
Meta says RSC has all the needed security and privacy controls in place to protect any training data it uses. Meta researchers can safely train models on encrypted user-generated data that is not decrypted until right before training.
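Meta has not published the exact mechanism, but the pattern described, keeping data encrypted at rest and decrypting each record only just before it enters training, can be sketched. This is an illustrative toy only: the XOR "cipher" and every name below are placeholders, and a real pipeline would use an authenticated cipher such as AES-GCM.

```python
# Toy sketch of the "decrypt right before training" pattern described above.
# NOT Meta's implementation; the XOR cipher is a stand-in for real cryptography.
import itertools

def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder symmetric cipher: XOR with a repeating key.
    (XOR is its own inverse, so this both encrypts and decrypts.)"""
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

def training_batches(encrypted_records, key):
    """Yield plaintext records one at a time, decrypting just-in-time so the
    bulk of the dataset is never held decrypted in memory at once."""
    for record in encrypted_records:
        yield toy_cipher(record, key)  # decrypted only right before use

key = b"secret"
encrypted = [toy_cipher(b"example record", key)]  # data stored encrypted
for plaintext in training_batches(encrypted, key):
    print(plaintext)  # b'example record'
```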
In a second phase later this year, RSC will expand to 16,000 GPUs, which Meta believes will deliver nearly 5 exaflops of mixed-precision AI performance. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 terabytes per second and exabyte-scale capacity to meet increased demand.
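The 5-exaflop figure is consistent with a simple peak-throughput estimate. NVIDIA's published spec for the A100 (not stated in the article, so an outside assumption here) is 312 TFLOPS of dense FP16/BF16 Tensor Core throughput per GPU:

```python
# Rough peak mixed-precision estimate for the planned 16,000-GPU build-out.
A100_MIXED_PRECISION_TFLOPS = 312  # NVIDIA's dense FP16/BF16 Tensor Core spec
GPUS = 16_000

peak_exaflops = GPUS * A100_MIXED_PRECISION_TFLOPS * 1e12 / 1e18
print(round(peak_exaflops, 3))  # 4.992 -- roughly the 5 exaflops quoted above
```

Real training workloads sustain well below theoretical peak, so this is an upper bound, not an expected throughput.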