Not Lonely at the Top: ORNL Leads Open Science with the Frontier Exascale Supercomputer
A Q&A with Justin Whitt
Twice a year since 1993, the famed TOP500 project has meticulously measured and ranked the world's fastest nondistributed supercomputer systems. The latest list, published in June 2022, highlighted a new milestone in computing as Oak Ridge National Laboratory's Frontier HPE Cray EX soared into the number one spot as the first supercomputer ever to demonstrate true exascale speeds.
Exascale represents a threshold of one quintillion calculations per second: 10^18 floating point operations per second, or one exaflop, as measured by the Linpack benchmark test. Frontier sailed past that mark with a demonstrated performance of 1.1 exaflops and a theoretical peak of 2 exaflops.
Frontier is housed at ORNL's Oak Ridge Leadership Computing Facility under the auspices of the DOE Office of Science and is the product of more than a decade of collaboration among industry, academia, and national labs. The powerful system will be tested further and be fully operational for early science use during 2022 and then made ready for open science in 2023.
Here, CT talks with Oak Ridge Leadership Computing Facility Director Justin Whitt for more about Frontier — both the impressive technology and the advanced problem solving to be shared by the open science community.
Mary Grush: Just how fast is Frontier and how is its performance described in terms of the Linpack benchmark and ranked for the TOP500?
Justin Whitt: Frontier has debuted as the world's fastest computer — number one on the TOP500 list as measured by the Linpack benchmark. It came in with a speed of 1.1024 exaflops, which means that it did, indeed, break the exaflop barrier — a goal that we've been working toward for the past decade. So that's a pretty big development. Frontier is capable of doing over one quintillion calculations every second. One way to think about this is that it's capable of multiplying one number by another number more than a quintillion times each second.
When the numbers are that big, they really do start to lose meaning. Another way to imagine this: if every person on planet Earth could do one calculation per second, it would take four years to do what Frontier does in one second.
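[Editor's note: A quick back-of-the-envelope check of that analogy in Python. The population figure is our assumption; Whitt quotes only round numbers:]

```python
# Rough check of the "four years" analogy.
# Assumes a 2022 world population of about 7.9 billion people,
# each doing one calculation per second (hypothetical figures).

SECONDS_PER_YEAR = 365 * 24 * 60 * 60      # ~31.5 million seconds
frontier_flops = 1.1024e18                 # Frontier's measured Linpack speed
world_population = 7.9e9

seconds_needed = frontier_flops / world_population
print(f"~{seconds_needed / SECONDS_PER_YEAR:.1f} years")   # ~4.4 years
```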
Grush: I know that ORNL should feel right at home in the TOP500's highest rankings. It seems like you usually are there, with Summit and previous machines. But this year is different, because Frontier has broken that exascale barrier.
Whitt: Yes, Oak Ridge has a considerable history of deploying systems that end up being ranked as the fastest computer in the world — going back to Jaguar and Titan, and most recently Summit and now Frontier.
But the Department of Energy and other scientific communities have had their sights set on breaching that exaflop barrier, going back ten or even twelve years. Early on, DOE commissioned a study to look at the question of how we would get there. The study pointed out some significant challenges. How do you keep that much hardware up all at once, without losing a fundamental component (like one of the processors)? How do you focus all that hardware on a single problem given the computing environments of the day, considering that you would need billion-way parallelism to do it? And finally, even if you could keep the hardware up and harness it to focus on a problem, it might have taken somewhere between 100 and 200 megawatts of power, something that would not be reasonable.
So, over the past ten years we've seen significant investments to seed and evaluate technologies in light of these major challenges. And industry has stepped up in public/private partnerships, helping to close the gaps in those three major challenge areas.
Today, we have a system that's resilient enough to keep a massive amount of hardware up and running; we have a programming environment that allows users to harness that hardware and focus it on their individual problems; and, for an exaflop of computing power, we're running at only about 15 megawatts.
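[Editor's note: Dividing the figures Whitt quotes gives a rough sense of that efficiency. A back-of-the-envelope sketch, not an official Green500 measurement:]

```python
# Rough power efficiency implied by the figures quoted above.
linpack_flops = 1.1024e18      # 1.1024 exaflops
power_watts = 15e6             # ~15 megawatts, as quoted

gigaflops_per_watt = linpack_flops / power_watts / 1e9
print(f"~{gigaflops_per_watt:.0f} gigaflops per watt")   # ~73 GF/W
```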
Grush: Who was involved in Frontier's genesis (what government agencies, academic or commercial partners, labs…)?
Whitt: There were, of course, a few different efforts, but the one I'm most familiar with is DOE's Exascale Computing Project, particularly the PathForward program. The federal government partnered with companies including Intel, AMD, IBM, NVIDIA, and Hewlett Packard Enterprise to develop the technologies needed to reach exascale. These technologies were then refined and prototyped for DOE's leadership computing facilities.
Grush: Could you describe the computer architecture and significant technology elements of Frontier? Is the computer a Cray?
Whitt: It is a Cray. Cray merged with Hewlett Packard Enterprise a few years ago. So we started this journey with Cray, and Cray and Hewlett Packard Enterprise have been tremendous partners through the years.
As a Cray system, Frontier relies on distinctly new technologies. The CPU is an AMD EPYC. The compute power, though, comes primarily from the GPUs: the system has 9,408 nodes, each of which has one CPU and four AMD MI250X GPUs.
The high-speed network is a proprietary Cray (now HPE) network called Slingshot. It has new switching technologies and brand new, just-off-the-assembly-line network interface cards that allow each of those 9,408 nodes to communicate with the others at high speed.
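[Editor's note: For readers keeping count, a quick tally of the hardware totals implied by those per-node figures:]

```python
# Hardware totals implied by the node configuration described above.
nodes = 9_408
cpus_per_node = 1      # one AMD EPYC CPU per node
gpus_per_node = 4      # four AMD MI250X GPUs per node

print(f"CPUs: {nodes * cpus_per_node:,}")   # CPUs: 9,408
print(f"GPUs: {nodes * gpus_per_node:,}")   # GPUs: 37,632
```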
Grush: What are some examples of the kinds of problems this computer can handle — let's say some outstanding problems that are "too hard" for most classical computers?
Whitt: At ORNL we have researchers from a broad spectrum of scientific domains beginning to use Frontier. As a Department of Energy lab, we end up doing a lot of modeling and simulation associated with topics around energy and new materials: new energy sources, how we store energy, and new technologies and materials related to batteries. We have researchers working on how to make better photovoltaics for solar cells… So, we do a lot of energy-driven applications, which is not surprising considering that we are a Department of Energy national lab.
In other disciplines we also have researchers looking at different elements of biology; some looking at protein docking and how that drives the efficacy of different drugs for treatment of disease.
The last few generations of computers have turned out to be tremendous resources for machine learning and artificial intelligence, in addition to the modeling and simulation that we've done for years. So we have a lot of researchers who are doing things like using the computers to digest thousands and thousands of pathology reports for a certain disease — they use the AI part of the computer's capabilities to look for inferences across all these reports. A human couldn't draw these inferences across massive amounts of data, but a supercomputer can. All of that is very exciting for research potential.
Grush: Given that, and thinking forward to 2023, when Frontier is ready for open science, what are the ways the open science community will share access to Frontier's resources?
Whitt: Between the Leadership Computing Facility here at Oak Ridge and the Exascale Computing Project, we are working to have 24 different pieces of scientific software that will be available, scaled to the full machine and ready to do science on Day One of full user operation — on January 1, 2023 in calendar terms.
Most of the community code academic researchers use and rely upon will be ready on Day One. That's when we'll begin taking allocations on Frontier through our two DOE user programs: the INCITE program and the ALCC program. Researchers can apply for time on the system through either of those programs.
Grush: Especially considering our higher education readership, I'd like to ask, what are some of the opportunities for research and education institutions to be involved with the open science that will surround Frontier?
Whitt: A large proportion of the researchers who use the system come from higher education institutions, and they come in one of three ways. The first two are the allocation programs I mentioned, ALCC and INCITE, which accept responses to a call for proposals, put those proposals through peer review, and award time on the system based on that review. The third way is through our Director's Discretionary Program, held here at the Leadership Computing Facility, where you can apply directly. Director's Discretionary awards are generally intended to be seed awards: if you are not yet ready to run at full scale on the system, you can get a seed allocation to explore the potential and prepare to apply through either INCITE or ALCC.
Grush: Are there ways in which you'd predict Frontier will "pay it forward" with technology transfer and/or research about this platform?
Whitt: If you look at the current TOP500 list, there are already a few other systems that use the Frontier boards. In fact, I saw four such systems on that list. So you are already starting to see the proliferation of those technologies, which is a great thing.
Of course, we don't like to build one-off systems where the technology is so customized that it's never used again. We see a lot of value in co-designing technology that will get picked up through industry and academia. If these early indications are any sign, I think we've been very successful this round, with Frontier.
It's also interesting, with Frontier, that we have this really advanced computing system that's highly instrumented, with sensors throughout — more instrumentation and sensors than any other system we've ever deployed. All of these sensors are collecting massive amounts of data that computer scientists will use over a long period of time. You may find publications around that data, particularly around resiliency, maybe five to seven years into the life of the system. I think that kind of data collection will become more and more valuable on modern systems like Frontier.
Grush: Looking back on the development of Frontier, besides the achievement of bursting through the exascale barrier, is there anything else specifically that you'd point to as one of the most important strides Frontier has made?
Whitt: Yes, I think that's the energy efficiency we've been able to achieve. A decade ago, as I've mentioned, we were looking at 100 to 200 megawatts… and now we're running at exaflop speeds on only 15 megawatts. That kind of efficiency makes it feasible to deploy these technologies, at smaller scales, in a variety of data centers.
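[Editor's note: In round numbers, that's roughly an order-of-magnitude gain over the early projections:]

```python
# Efficiency gain relative to the decade-old 100-200 MW projections.
projected_low_mw, projected_high_mw = 100, 200
actual_mw = 15

print(f"~{projected_low_mw / actual_mw:.0f}x to "
      f"~{projected_high_mw / actual_mw:.0f}x better than projected")
# ~7x to ~13x better than projected
```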
Grush: Finally, are you prepared for a potentially huge amount of activity around Frontier as a world-renowned supercomputer ready for open science? I'm guessing this level of supercomputing is one area where it's not going to be lonely at the top…
Whitt: I'd agree with that! Just working with the Exascale Computing Project, and given that we already have 24 science teams working on code, we think there is going to be a tremendous burst of activity around Frontier, especially when we move it into full production mode on January 1.
[Editor's note: Images courtesy Oak Ridge National Laboratory]