Taking on the ITER Challenge, Scientists Look to Innovative Algorithms, Petascale Computers
Related Articles |
by Michelle Sipics
The promise of fusion as a clean, self-sustaining and essentially limitless energy source has become a mantra for the age, held out by many scientists as a possible solution to the world’s energy crisis and a way to reduce the amounts of greenhouse gases released into the atmosphere by more conventional sources of energy. If self-sustaining fusion reactions can be realized and maintained long enough to produce electricity, the technology could potentially revolutionize energy generation and use.
ITER, initially short for International Thermonuclear Experimental Reactor, is now the official, non-acronymic name (meaning “the way” in Latin) of what is undoubtedly the largest undertaking of its kind. Started as a collaboration between four major parties in 1985, ITER has evolved into a seven-party project that finally found a physical home last year, when it was announced that the ITER fusion reactor would be built in Cadarache, in southern France. (The participants are the European Union, Russia, Japan, China, India, South Korea, and the United States.) In May, the seven initialed an agreement documenting the negotiated terms for the construction, operation, and decommissioning of the ITER tokamak, signifying another milestone for both the project itself and its eventual goal of using fusion to facilitate large-scale energy generation for the world.
Problems remain, however—notably the years, and perhaps decades, of progress needed to attain such a goal. In fact, even simulating the proposed ITER tokamak is currently out of reach. But according to David Keyes, a computational mathematician at Columbia University and acting director of the Institute for Scientific Computing Research (ISCR) at Lawrence Livermore National Laboratory, the ability to perform such simulations may be drawing closer.
Hardware 3, Software 9
“Fusion scientists have been making useful characterizations about plasma fusion devices, physics, operating regimes and the like for over 50 years,” Keyes says. “However, to simulate the dynamics of ITER for a typical experimental ‘shot’ over scales of interest with today’s most commonly used algorithmic technologies would require approximately 1024 floating-point operations.” That sounds bleak, given the 280.6 Tflop/s (1012 flops/s) benchmark performance of the IBM BlueGene/L at Lawrence Livermore National Laboratory—as of June the fastest supercomputer in the world. But Keyes is optimistic: “We expect that with proper algorithmic ingenuity, we can reduce this to 1015 flops.”
Optimizing the algorithms used, in other words, could lower the computing power required for some ITER simulations by an astounding nine orders of magnitude. Even more exciting, those newly feasible simulations would be at the petascale—ready to run on the petaflop/s supercomputers widely expected within a few years.
The ingenuity envisioned by Keyes even has a roadmap. Together with Stephen Jardin of the Princeton Plasma Physics Laboratory, Keyes developed a breakdown that explains where as many as 12 orders of magnitude of speedup will come from over the next decade: 1.5 from increased parallelism, 1.5 from greater processor speed and efficiency, four from adaptive gridding, one from higher-order elements, one from field-line following coordinates, and three from implicit algorithms.
The increased parallelism, processor speeds, and efficiency that together account for three orders of magnitude are hardware rather than software improvements; Keyes and Jardin based their roadmap estimates simply on trends over the past decade that are extrapolated in a document known as the semiconductor industry roadmap. Keyes quickly gets to the software side of things by explaining the potential four orders of magnitude improvement from adaptive gridding.
Conceptually, he says, “adaptive gridding means not putting observers where they’re not needed.” In a uniform grid whose resolution is dictated by the smallest feature, “start taking away points where the solution is smooth, until further loss would endanger its resolution.” In certain cases, he says, the volume of the domain that dictates the finest resolution may be as little as 1% and the refinement may be required in only one or two dimensions.
Using high-order elements, Keyes says, could also lower the performance requirements for some ITER simulations—in essence, a given accuracy can be obtained with fewer discretization variables and less overall work. He gives an example of a fourth-order method in which error decreases like h4 (h being a typical mesh spacing), compared with a second-order method in which the error decreases like h2.
“Suppose we can tolerate an error of about 10–4,” he says. “We can get this with about 10 points in the high-order method, but it takes about 100 points with the low-order method [per dimension].” In higher spatial dimensions, he says, “this ratio of ten in savings powers up.”
Keyes identifies an additional advantage: “Higher-order methods link together fewer points with greater coupling.” As a result, “there are fewer memory loads and stores to do, and more flops to do between loads and stores, which plays to the advantage of today’s computer architectures.”
However, he cautions: “The best algebraic solution methods we have today have a much greater difficulty with high-order discretizations than with the simplest low-order discretizations.”
Discretizing the domain with field-line-following coordinates can contribute an order of magnitude by lowering the resolution requirements in certain directions, says Keyes. In a tokamak the most rapid changes in the fields tend to occur across magnetic flux surfaces rather than along them.
For a final three orders of magnitude, Jardin and Keyes turned to implicit algorithms. Explaining the basis for the potential improvement, Keyes cites the CFL (Courant–Friedrichs–Lewy) criterion, in particular the finding from the landmark 1928 paper that “small errors can be amplified to arbitrarily large errors in timestepping algorithms in which the computational timestep exceeds certain size.” A large number of computational timesteps are required to get to a certain physical time limit, Keyes explains. And the size of the maximum computational timestep relates to two other parameters: the smallest physical mesh step and the fastest wavespeed supported by the differential equation.
“For magnetohydrodynamic problems typical of tokamaks, we should not have to resolve the fastest waves, which for ITER are as much as a million times faster than time scales of interest,” he says. “To avoid the catastrophic effects of numerical instability, however, we need either implicit methods, or frightfully many explicit timesteps.” Implicit methods are expensive, he continues, but not nearly as expensive as timestepping beneath the CFL stability limit. If an implicit timestep costs a thousand times as much computation as an explicit timestep, but a million fewer implicit timesteps are required, three orders of magnitude can be saved.
But why focus on algorithms? Why not concentrate on hardware improvement to achieve faster supercomputers? Pointing to the breakdown of the 12 orders of magnitude of speedup to be gained, Keyes gives a final score: hardware 3, software 9.
“Since the beginning of computing, algorithmic innovation has been much, much more important than improvements in computer hardware engineering,” Keyes says. “Pick any starting point, any year you want in any simulation field, and I bet someone can make the case that algorithmic advances have outpaced Moore’s law in getting more science done per second.”
Moore’s Law for the Petascale Era
Hardware advances are clearly a driving force behind what scientists call “the dawn of the petascale era.” But in some ways, says Caltech computer scientist Thomas Sterling, hardware improvements are a double-edge sword.
“It’s ironic that the improvements garnered through Moore’s law also make it increasingly difficult to benefit from the same technology for high-end computing,” he says, citing the exponentially increasing memory capacity in DRAM (dynamic random access memory) chips. “Even with improved data access to these components, the amount of time required, measured in processor clock cycles, to touch every byte on a memory chip once continues to increase,” he says, “in essence slowing down at least those simulations that are memory constrained.”
Moore’s original assertion, in a 1965 article in Electronics Magazine titled “Cramming More Components onto Integrated Circuits,” was that the density of transistors on a chip doubles approximately every two years. Frequent casual references in the media have directly linked the “law” with performance, but, as Sterling’s statement reflects, the number of transistors on a chip is not directly related to the number of flop/s a system can perform. The number of transistors is important, but the overall performance of the system also depends on how efficiently each transistor can be utilized.
The memory constraints that Sterling refers to, often called the “memory wall, are a case in point of the good news/bad news element of Moore’s law. Current DRAM chips can accommodate huge numbers of transistors and capacitors, giving them enormous memory capacity—but researchers have yet to devise a rapid method for accessing this data. The resulting memory wall is one of the largest obstacles to high efficiency in large-scale systems.
“It takes approximately 100 times longer to access an 8-byte word from main memory than to perform an arithmetic operation on it in the CPU,” Keyes explains. “Hence, most CPUs operate at only slightly over 1% efficiency on average, with much higher efficiencies—close to 100%—on codes structured to exploit cache locality. Accessing cache typically takes only a few processor cycles.”
Nanotech memory, which would store information in carbon nanotubes, should make all memory appear as close as cache once it becomes available for on-chip integration, Keyes says. But “nanotech memory is not even on the drawing board of the BlueGene designers at IBM.”
When it comes to addressing such problems, Sterling sees a few possibilities as worth consideration. One specific change that he describes as not popular but probably inevitable is a move to merge CMOS logic and DRAM memory on the same chip.
“That could greatly expand the available local memory bandwidth while reducing the access latency,” Sterling says. He also cites significant reductions in power consumption as a benefit of such a change, through the avoidance of going off-chip and the use of simpler processors. (Because of the energy required to operate and cool a machine’s components, power consumption is an extremely important issue in high-end computing.)
Achieving a Hardware/Software Balance
Sterling doesn’t disagree with Keyes’s assertion that software can greatly improve the performance of high-end computers—at least once current systems have been redesigned to give programmers a better hardware basis from which to start.
“Many consider the parallel computing problem to be largely a software problem, and there are good reasons for that viewpoint,” he says. “However, in spite of a decade of investment by government and industry in system software and tools for parallel processing, the big breakthrough has yet to be realized.”
During an interview at the 2006 NEC User Group meeting in Toronto, Sterling stressed that nearly all current high-performance computers are designed with essentially no scalability, forcing programmers to address latency, contention for resources, and similar problems in excruciating detail.
“We are not, in fact, programming parallel computers but rather attempting to domesticate large herds of sequential processors, not designed for the scalable domain,” he says. “Without components designed by intent to enable parallel computing, it’s no wonder that our current processor ensembles are such a challenge to coerce into effective parallel computing.”
Advances to be made in architecture design, according to Sterling, will address some of the major difficulties with software development for large-scale systems.
“It’s the balance between hardware and software that needs to be addressed, and from such an improved relationship can come the derivation of both,” he says.
In the race to deploy the world’s first relatively general-purpose petascale system, Sterling’s current best guess is that the U.S. will get there first. “But I would not be disappointed should the honors go to Japan, the EU, or to anyone else for that matter,” he says. “The world needs such systems, and many of them.”
The ITER project will certainly be among the beneficiaries of such systems. Keyes points out that if he is right and algorithmic improvements can bring ITER’s simulation needs down to the petascale, “we could simulate one ITER operating cycle in about one second. That’s probably enough for making reasonable progress in designing and operating ITER.”
Still, he admits, although specific applications can be suggested for petascale machines, there is nothing particularly special about petaflop/s. “It is just one milepost along a long path that has stretched from one Gflop/s in 1998 to 100 Tflop/s in 2005. People get inordinately excited when the prefix changes, as when a car turns 100,000 miles,” he says. “But each mile is the same as the one before.”
Regardless, as high-end machines roll toward 1,000,000,000,000,000, you can be sure mathematicians will be watching.
Michelle Sipics spent the summer as an intern at SIAM News and is now a contributing editor.
« Back to September 2006 Index