Alumni Project

National Computational Infrastructure for Lattice Gauge Theory

R. Brower (Boston U.), N. Christ (Columbia U.), M. Creutz (BNL), P. Mackenzie (Fermilab), J. Negele (MIT), C. Rebbi (Boston U.), S. Sharpe (U. Washington), R. Sugar (UCSB), and W. Watson, III (JLab)

Summary

The goal of our research is to obtain a quantitative understanding of the physical phenomena encompassed by quantum chromodynamics (QCD), the fundamental theory governing the strong interactions. Achievement of this goal requires terascale numerical simulations. The SciDAC Program is enabling U.S. theoretical physicists to develop the software and prototype the hardware they need to carry out these simulations.

The long-term goals of high energy and nuclear physicists are to identify the fundamental building blocks of matter and to determine the interactions among them. Remarkable progress has been made through the development of the Standard Model of High Energy Physics, which provides fundamental theories of the strong, electromagnetic and weak interactions. However, our understanding of the Standard Model is incomplete because it has proven extremely difficult to determine many of the predictions of quantum chromodynamics (QCD), the component of the Standard Model that describes the strong interactions. Determining these predictions requires terascale numerical simulations.

The study of the Standard Model is at the core of the Department of Energy's experimental programs in high energy and nuclear physics, and lattice QCD calculations are essential to this research effort. Recent advances in algorithms and calculational methods, coupled with progress on massively parallel computers, have created opportunities for major advances in the next few years. U.S. theoretical physicists must move quickly to take advantage of these opportunities in order to support the experimental programs in a timely fashion, and to keep pace with the ambitious plans of theoretical physicists in Europe and Japan. For this reason, the entire U.S. lattice QCD community has joined together in the SciDAC Program to build the computational infrastructure needed for the next generation of calculations.

Computational facilities capable of sustaining tens of teraflops are needed to meet our near-term scientific goals. By taking advantage of simplifying features of lattice QCD calculations, such as regular grids and uniform, predictable communications, it is possible to construct computers for lattice QCD that are far more cost-effective than general-purpose supercomputers. We are targeting a price/performance of $1M per sustained teraflops by 2004. We have identified two computer architectures that promise to meet the needs of lattice QCD. One is the QCDOC, the latest generation of the highly successful Columbia/RIKEN/Brookhaven National Laboratory (BNL) special-purpose computers, which is being developed at Columbia University in partnership with IBM. The other is commodity clusters, which are being specially optimized for lattice QCD at Fermi National Accelerator Laboratory (FNAL) and Thomas Jefferson National Accelerator Facility (JLab). We propose to create a distributed topical computing facility for lattice QCD with major hardware located at BNL, FNAL and JLab. Initially, BNL will focus on the QCDOC, while FNAL and JLab will concentrate on clusters.
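To illustrate why regular grids lead to uniform, predictable communications, the following is a minimal Python sketch, not code from the project. It assumes a four-dimensional lattice with periodic boundaries that is divided into identical local sublattices, one per node. The function names (site_index, neighbor, halo_sites) and the 4 x 4 x 4 x 8 sublattice size are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def site_index(coords: Sequence[int], dims: Sequence[int]) -> int:
    """Map lattice coordinates to a linear index (first coordinate fastest)."""
    index = 0
    stride = 1
    for c, n in zip(coords, dims):
        index += c * stride
        stride *= n
    return index


def neighbor(coords: Sequence[int], mu: int, step: int,
             dims: Sequence[int]) -> Tuple[int, ...]:
    """Return the site reached by moving `step` sites in direction `mu`.

    Periodic wrap-around is assumed, so every site has the same number
    of neighbors and the access pattern is identical everywhere.
    """
    shifted = list(coords)
    shifted[mu] = (shifted[mu] + step) % dims[mu]
    return tuple(shifted)


def halo_sites(local_dims: Sequence[int]) -> List[int]:
    """Number of boundary sites a node exchanges across each face.

    The face perpendicular to direction mu holds volume / local_dims[mu]
    sites, so the message sizes are fixed before the run starts.
    """
    volume = math.prod(local_dims)
    return [volume // n for n in local_dims]


if __name__ == "__main__":
    local = (4, 4, 4, 8)  # hypothetical local sublattice on one node
    print(site_index((1, 2, 3, 4), local))
    print(neighbor((3, 0, 0, 0), 0, +1, local))
    print(halo_sites(local))
```

Because the neighbor of every site follows from fixed arithmetic, and the size of every face is known in advance, the network traffic is the same on every node at every step. This regularity is the kind of simplifying feature that a machine built for lattice QCD can exploit.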

Under the SciDAC Program, we have designed and are in the process of implementing a QCD Applications Program Interface (QCD API), which will provide a unified programming environment to achieve high efficiency on the multi-terascale computer architectures we have targeted. The QCD API has three layers. At the lowest level are the message passing and linear algebra routines essential for all QCD applications. These have been written, and are being optimized for the QCDOC and clusters. The middle layer provides a data parallel language which will enable new applications to be developed rapidly and to run with an efficiency beyond the reach of generic C or C++ codes. C and C++ versions are currently available in serial form. Robust parallel implementations are scheduled for completion by June 2003. The top layer of the QCD API consists of highly optimized versions of the small number of subroutines that dominate all QCD calculations. They are scheduled for completion in the spring of 2003.
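The three-layer structure described above can be pictured with a toy Python sketch. It is not the QCD API: every function name here (mat_vec, shift, multiply_field, hop_forward) is hypothetical, the lattice is one-dimensional, and the periodic shift merely stands in for message passing between nodes. The sketch only shows how a site-local routine, a whole-field operation, and a composite kernel can be stacked.

```python
from typing import List

Vector = List[complex]        # a 3-component vector on one site
Matrix = List[List[complex]]  # a 3x3 matrix on one site


# Lowest layer (assumed): site-local linear algebra and data movement.
def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Multiply a 3x3 matrix by a 3-component vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def shift(field: List[Vector], step: int) -> List[Vector]:
    """Periodic shift of a 1D field; a stand-in for message passing."""
    n = len(field)
    return [field[(x + step) % n] for x in range(n)]


# Middle layer (assumed): data parallel operations on whole fields.
def multiply_field(links: List[Matrix], field: List[Vector]) -> List[Vector]:
    """Apply the matrix on each site to the vector on the same site."""
    return [mat_vec(u, v) for u, v in zip(links, field)]


# Top layer (assumed): a composite kernel built from the layers below.
def hop_forward(links: List[Matrix], field: List[Vector]) -> List[Vector]:
    """Compute result(x) = U(x) * field(x + 1) on a periodic 1D lattice."""
    return multiply_field(links, shift(field, 1))


if __name__ == "__main__":
    identity: Matrix = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    links = [identity for _ in range(4)]
    field = [[complex(x), 0j, 0j] for x in range(4)]
    print(hop_forward(links, field))
```

In a layered design of this kind, application code written against the middle layer does not change when the lowest layer is re-implemented for a different machine, which is one plausible reading of the "unified programming environment" described above.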

The SciDAC Program is also supporting the construction of prototype clusters for the study of QCD. The objectives of this work are to determine optimal configurations for the multi-teraflops clusters we propose to build in the next few years; to provide platforms for testing the SciDAC software; and to enable important research in QCD. Clusters based on commodity components offer many advantages. Market forces are producing rapid gains in processor and memory performance. The market for interconnects is growing and the technological options are increasing. We are exploring Myrinet, Gigabit Ethernet and field-programmable gate arrays. The SciDAC-funded work on cluster development is being undertaken collaboratively by Fermilab and JLab/MIT. FNAL has recently constructed a cluster with 128 dual Pentium 4 nodes and a Myrinet interconnect, while JLab is building one with 256 Pentium 4 nodes arranged in a mesh architecture with Gigabit Ethernet interconnects. Both machines will be in operation by the spring of 2003. In addition to providing important stepping stones towards the construction of terascale clusters, they will provide testbeds for the SciDAC software, and will be very useful research tools for the U.S. lattice QCD community.

The QCDOC plays a central role in our overall plans. Development work, which is being carried out by physicists at Columbia University and computer engineers at IBM, is funded outside the SciDAC Program. Compute nodes will consist of individual chips which contain processor, network, and memory. These chips will be arranged in a mesh-style network to form multi-teraflops computers.

Design of the QCDOC chip has been completed, and a 128-node prototype is scheduled for completion early this spring. We propose to build a development machine capable of sustaining 1.5 teraflops in the summer of 2003. This machine will be used to test the stability of the QCDOC hardware and operating system, and the capabilities of the SciDAC software. Upon completion, it will be the most powerful computing facility in existence dedicated to the study of lattice QCD. In 2004, we plan to build a QCDOC at BNL capable of sustaining 10 teraflops, followed by clusters of the same capability at FNAL and JLab in 2005 and 2006. These machines will be linked into a distributed topical computing facility for QCD, which will enable major progress in our understanding of the fundamental laws of nature.
