For years, the U.S. Department of Energy (DOE) labs have been leaders in high-performance computing innovation and research. They’ve pioneered and deployed some of the most powerful supercomputers in the world. In fact, this past November, at the SC22 conference in Dallas, TX, the DOE demonstrated how exascale computing will accelerate scientific discoveries and innovation tenfold. It comes as no surprise, then, that industry experts are eager to see what the labs are working on. To answer the industry’s burning questions, HPCwire hosted the 2022 Great American Supercomputing Road Trip, visiting top national labs: Pacific Northwest National Laboratory, Idaho National Laboratory, NCAR, NREL, Los Alamos National Laboratory, Sandia National Laboratories, and NASA Ames. Read on to learn what these labs need from the market to help fuel their innovative projects.
National labs weigh four main characteristics when deciding on data center infrastructure: portability, repeatability, scalability, and capability. In plain terms, the infrastructure and design must be location-agnostic, easy to duplicate or scale, and proven to work efficiently. Historically, high-performance computing has been supported by on-premises data centers. On-premises data centers give the labs greater control over data handling and security, higher performance capabilities, and faster speeds. Additionally, the labs can benefit from lower power usage and reduced maintenance costs compared to cloud-based data centers.
However, integrating cloud-based or hybrid computing to help on-premises data centers scale is becoming more common in national labs. In fact, Pacific Northwest National Laboratory (PNNL) already leverages hybrid computing, combining on-premises and cloud resources, to manage its dense compute loads and projects. As PNNL HPC group leader Kevin Barker mentions, “we’re seeing a convergence of a lot of different modalities of computing, so that’s pushing us not only towards the cloud but also to edge computing.” This convergence of technologies is driving an evolution of on-premises computing as it aims to keep pace with high compute loads. Like their PNNL counterparts, experts at Sandia also envision cloud infrastructure as an opportunity to support future missions and expansions.
That said, it would be premature to put every investment into cloud computing technologies just yet. Various security concerns, from data breaches to account hijacking, have caused understandable hesitation and slow implementation. In addition, cloud storage can become a burdensome operational cost. Cloud providers usually centralize their data centers to serve a large customer base rather than placing data centers close to specific clients. By comparison, edge computing helps labs cut costs through data locality and minimal data transfer. More specifically, processing the data in the same location where it’s generated reduces the bandwidth needed to handle the data load. This helps mitigate the dreaded bandwidth challenges that most high-performance computing resources face.
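The bandwidth argument for edge processing can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the hourly data volume, and the 50x reduction factor are hypothetical values chosen for the example, not figures reported by any of the labs.

```python
def uplink_gbps(raw_gb_per_hour: float, reduction_factor: float = 1.0) -> float:
    """Average uplink bandwidth (Gbit/s) needed to move data off-site.

    reduction_factor is how many times smaller the data becomes after it is
    processed where it is generated (1.0 means raw data is shipped as-is).
    """
    if raw_gb_per_hour < 0 or reduction_factor < 1.0:
        raise ValueError("volume must be >= 0 and reduction_factor >= 1")
    gigabits_per_hour = raw_gb_per_hour * 8 / reduction_factor  # bytes -> bits
    return gigabits_per_hour / 3600  # per hour -> per second


# Hypothetical instrument producing 4,500 GB of raw data per hour.
RAW_GB_PER_HOUR = 4500.0

# Ship everything to a centralized data center.
centralized_gbps = uplink_gbps(RAW_GB_PER_HOUR)

# Process at the edge first, assuming a hypothetical 50x reduction.
edge_gbps = uplink_gbps(RAW_GB_PER_HOUR, reduction_factor=50.0)

print(f"centralized: {centralized_gbps:.2f} Gbit/s, edge: {edge_gbps:.2f} Gbit/s")
```

Under these assumed numbers, the sustained uplink drops from about 10 Gbit/s to about 0.2 Gbit/s. The real savings depend entirely on how much a given workload can reduce its data before sending it on.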
In short, on-premises computing is evolving to keep pace with advanced applications such as AI/ML. The national labs are looking for the next step forward so they can better store, manage, and compute dense loads of data without worrying about security or latency. As compute loads grow exponentially, so does the need for better liquid cooling. As the national labs look for ways to evolve their current computing architecture, they’ll need to address cooling concerns to ensure their supercomputers break exascale records, not global emissions records. Luckily, advanced liquid cooling technologies, such as microconvective cooling, give labs efficient cooling that supports compute research with optimal performance. Microconvective technology can be integrated at various levels, from integrated die cooling to fully sealed cold plates.
Just a decade ago, artificial intelligence and machine learning were a pipe dream. Now, they power some of the fastest supercomputers in the world. In June of 2022, the world’s fastest supercomputer, Frontier at Oak Ridge National Laboratory, broke the exaflop barrier, operating faster than the next seven best supercomputers combined. Machine learning accelerators now saturate the market as manufacturers aim to build the next supercomputer on the horizon. Los Alamos National Laboratory is in the process of installing and developing Crossroads, its next-generation weapons simulation supercomputer. AI/ML has become commonplace for developing new simulations and fueling research initiatives. The industry is seeing an increase in purpose-specific ML accelerators to help meet the demand for supercomputers. Sandia lab experts are looking for new ML technologies to better support their programming goals in the future.
However, AI/ML growth is largely attributable to edge computing technologies. Edge computing facilitates AI/ML applications by reducing the latency and connectivity issues that arise when fetching and processing data from a centralized server. More specifically, AI/ML applications perform extremely well when data is processed close to where it’s created, and edge computing is a key technology for their successful implementation. With the increased popularity of edge computing, we’re seeing labs rely more and more on AI/ML applications to process large volumes of data.
In addition, AI/ML technologies require advanced liquid cooling solutions to support their processing power. With edge computing on the rise, the labs must find scalable liquid cooling solutions to support AI/ML technologies. SmartPlate System, a liquid cooling solution that fits in an air-cooled form factor, is an easily scalable solution that requires little to no infrastructure changes. With advanced liquid cooling, labs can sustain their AI/ML modeling initiatives. In summary, as the demand for AI/ML continues to grow, we can expect the labs to rely heavily on edge computing and liquid cooling to support their projects moving forward.
Accelerators are the bread and butter of high-performance computing. They have played a large role in the evolution of scientific computing for over a decade, and that role is expected to continue into 2023. After discussions with some of the DOE labs, it’s clear that they’re looking for more efficient accelerators, along with heterogeneous solutions, to support their advanced technologies. In fact, many labs are keeping an eye on RISC-V, an open-source instruction set architecture (ISA). Originating in a 2010 project at UC Berkeley’s Parallel Computing Laboratory, RISC-V aims to simplify the instructions given to the processor to accomplish tasks. Some enterprises are leveraging the open-source architecture to create custom processors designed to handle new advanced technologies, such as AI/ML and VR.
Unlike legacy ISAs such as ARM, RISC-V offers enterprises simplicity, modularity, and extensibility, allowing for a more robust software ecosystem. Pacific Northwest National Laboratory (PNNL) is even researching and creating its own custom accelerators as part of its work for the U.S. government. Its research and development team is looking into how it can leverage and integrate third-party IP such as RISC-V in the future.
In addition, most of the DOE labs rely deeply on heterogeneous computing. In April 2022, the DOE announced $20 million in basic exploratory research to find high-impact approaches to extreme-scale science and heterogeneous computing. Barbara Helland, DOE Associate Director of Science for Advanced Scientific Computing Research, says, “we need new innovative ideas to develop effective approaches and enable technologies to realize the full potential of scientific heterogeneous computing from these emerging technologies.” Heterogeneous computing traditionally refers to a system that uses more than one type of processor, such as a CPU paired with a GPU, DPU, VPU, ASIC, or other accelerator. Specialized, purpose-driven accelerators are heavily utilized across the majority of labs, if not all of them, and we expect this approach to continue to grow. Many labs hope for better heterogeneous architectures whose parts work together more seamlessly, from manufacturing to packaging. This would give them project-based, purpose-specific flexibility in scientific computing.
Alongside the desire for heterogeneous architectures is the need for standardization and tighter integration. Standardization would allow the labs to pick and choose from different vendors without worrying about vendor-specific modifications. Tighter integration would also help address the latency issues that often occur in heterogeneous solutions. However, standardizing commodity or ancillary products could be a double-edged sword. On the one hand, it could help facilitate innovation across various companies, vendors, and products. On the other hand, it raises the question of whether the market is sophisticated enough to begin standardizing. While it sounds like a logical next step, standardizing too early could stand in opposition to innovation.
Liquid cooling isn’t a new concept for the DOE labs. Liquid cooling offers the labs significant improvements in power usage effectiveness (PUE), along with energy and cost savings, compared to traditional air-cooling methods. And simply put, supercomputing runs extremely hot. Hosting some of the largest supercomputers in the world, the labs rely on liquid cooling to support future supercomputer projects. In fact, Sandia used liquid cooling to cool its supercomputer, Attaway, efficiently, resulting in the Albuquerque campus earning a U.S. Department of Energy Sustainability Award in 2021. In addition, Los Alamos will be using graywater to cool its next supercomputer, Crossroads. Graywater cooling systems create a closed-loop data center cooling system, which uses non-potable water as a coolant and offers significantly better performance than traditional data center cooling. This cooling technology allows Los Alamos to reduce its energy costs and environmental impact while still providing its data centers with the highest performance capabilities.
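For readers unfamiliar with the metric, PUE is the ratio of the total energy a facility draws to the energy that reaches the IT equipment, so a value closer to 1.0 means less overhead spent on cooling and power delivery. The sketch below simply applies that ratio. The two facilities and their energy figures are hypothetical and are not measurements from Sandia, Los Alamos, or any other lab.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total facility energy cannot be below IT energy")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical facilities, each delivering 1,000 kWh to the IT equipment.
IT_KWH = 1000.0
air_cooled_pue = pue(total_facility_kwh=1500.0, it_equipment_kwh=IT_KWH)
liquid_cooled_pue = pue(total_facility_kwh=1100.0, it_equipment_kwh=IT_KWH)

# Overhead is everything that does not reach the IT equipment.
air_overhead_kwh = (air_cooled_pue - 1.0) * IT_KWH
liquid_overhead_kwh = (liquid_cooled_pue - 1.0) * IT_KWH

print(f"air-cooled PUE: {air_cooled_pue:.2f}, overhead: {air_overhead_kwh:.0f} kWh")
print(f"liquid-cooled PUE: {liquid_cooled_pue:.2f}, overhead: {liquid_overhead_kwh:.0f} kWh")
```

In this made-up comparison, the same IT load costs 500 kWh of overhead in the first facility and about 100 kWh in the second, which is the kind of gap a lower PUE represents.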
As the labs plan their new supercomputer projects, they’ll be concerned about geographical and load scalability. However, with scalability comes the need for more power. Unfortunately, bigger and faster means hotter data centers. The waste heat of a supercomputer can damage the entire system if it is not properly removed. The labs will need to ensure that their liquid cooling solutions are easily scalable as they move forward with innovation. Some lab systems, such as Sandia’s newest high-performance computing cluster, Manzano, use microconvective liquid cooling technology, enabling them to run servers in room temperatures over 130°F. Processors over 1,000W TDP also become feasible with this cooling approach. Liquid cooling solutions that are flexible and scalable, such as microconvective technology, offer the labs a targeted approach that can be integrated at any level. Unlike immersion cooling, microconvective cooling doesn’t require submerging components in a liquid such as oil to cool hot spots on the processors. This makes it extremely easy to scale and modify components during testing and research.
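To put the 1,000W and 130°F figures in perspective, a basic energy balance (heat removed = mass flow × specific heat × coolant temperature rise) gives a rough sense of the coolant flow involved. This is a generic back-of-the-envelope sketch, not a model of microconvective cooling or of any vendor’s specifications. It assumes plain water as the coolant, and the 10 K temperature rise is a hypothetical value.

```python
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # rounded approximation


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def coolant_flow_l_per_min(heat_w: float, delta_t_k: float) -> float:
    """Water flow (L/min) needed to carry away heat_w watts with a
    coolant temperature rise of delta_t_k kelvin (Q = m_dot * c_p * dT)."""
    if heat_w < 0 or delta_t_k <= 0:
        raise ValueError("heat must be >= 0 and temperature rise must be > 0")
    mass_flow_kg_per_s = heat_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


# A 1,000 W processor with a hypothetical 10 K coolant temperature rise.
flow = coolant_flow_l_per_min(heat_w=1000.0, delta_t_k=10.0)

# The 130 degrees F room temperature mentioned above, in Celsius.
room_c = fahrenheit_to_celsius(130.0)

print(f"flow: {flow:.2f} L/min per processor, room: {room_c:.1f} C")
```

Under these assumptions, each 1,000W processor needs on the order of 1.4 L/min of water, and a 130°F room is roughly 54°C. Halving the allowed temperature rise doubles the required flow, which is one reason scalable coolant distribution matters as clusters grow.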
In addition, the labs have a strong responsibility to create sustainable supercomputers. Progress always comes at a cost; however, our environment is too high a price to pay. If supercomputers designed to help society end up hurting the environment instead, is it worth it? As leaders in the computing industry, the labs will help set the baseline for sustainable data centers. Advanced liquid cooling technologies can support lab innovation without sacrificing performance.
Overall, the labs are calling on the industry to help them innovate. The industry is constantly evolving, and new technologies can have a dramatic impact on accelerating growth. AI/ML shows just how much a new technology can influence a market. The lab experts are always watching for the next technology to accelerate the industry. Craig Vineyard, a resource scientist at Sandia, even encourages innovators to tell the labs not only what their technology is good at but also how the labs should be using it. Innovation requires transparent conversations and teamwork to produce life-changing technology.
Check out HPCwire’s full 2022 Great American Supercomputing Road Trip podcast series here.