Alumni Project

Scalable Systems Software for Terascale Computer Centers

Coordinator: Al Geist, ORNL

Summary

The nation's premier scientific computing centers face a crisis: they must rewrite all their home-grown systems software to scale to the multi-teraflop systems being installed in their centers. The goal of the Scalable Systems Software project is to fundamentally change the way future high-end systems software is developed, making it more cost-effective and robust. The research involves two efforts. The first is getting the DOE centers, NSF centers, and industry to agree collectively on standardized interfaces between system components. The second is producing a compliant, fully integrated suite of systems software that all terascale computer centers can use to manage and utilize their computational resources cost-effectively. The first release of the suite was made in November 2003.

System administrators and managers of terascale computer centers are facing a crisis. The nation's premier scientific computing centers all use incompatible, ad hoc sets of systems tools (see Figure 1), and these tools were not designed to scale to the multi-teraflop systems being installed in these centers today. One solution would be for each center to take its home-grown software and rewrite it to be scalable. However, this would cause a tremendous duplication of effort and delay the availability of terascale computers for scientific discovery. The Scalable Systems Software project aims to provide a much more timely and cost-effective solution by bringing together representatives from the major computer centers and industry to collectively define standardized interfaces between system components. At the same time, the group is producing a fully integrated suite of systems software components that the nation's largest scientific computing centers can use.
The scalable systems software suite is being designed to support computers that scale to very large physical sizes without requiring the number of support staff to scale along with the machine. But this research goes beyond creating a collection of separate scalable components. By defining a software architecture and the interfaces between system components, the Scalable Systems Software project is creating an interoperable framework for the components. This makes it much easier and more cost-effective for supercomputer centers to adapt, update, and maintain the components to keep pace with new hardware and software. A well-defined interface allows a site to replace or customize individual components as needed. Defining interfaces across the entire system software architecture ties the components together into an integrated whole and improves the long-term usability and manageability of terascale systems at supercomputer centers across the country. The systems interfaces are being standardized through a process similar to the one used to successfully define the Message Passing Interface (MPI) standard: an open forum of university, laboratory, and industry representatives who meet regularly to propose and vote on pieces of the standard.
Figure 2 shows the progress to date on producing scalable components and defining standardized interfaces between them. Bold lines represent working interfaces; light lines represent interfaces in progress. Component colors indicate which of the project's four multi-lab working groups is responsible for each component. In November 2003 the first release of a complete, integrated set of scalable systems components was made. This distribution used the popular OSCAR packaging and installation technology. A second release is scheduled for March 2004. This past year, the system administrators at ANL decided to switch Chiba City to run our scalable systems suite exclusively. In January 2004 the suite underwent scale tests on the 2,560-processor Tungsten cluster at NCSA. Our research has produced software that provides communication services between components over multiple protocols, as well as a flexible authentication scheme that secures the overall system. Work continues to harden the working prototypes, improve integration, and increase scalability toward the target of 10,000-processor systems.

Impact: The Scalable Systems Software project is a catalyst for fundamentally changing the way future high-end systems software is developed and distributed. It will reduce facility management costs by reducing the need to support home-grown software, making higher-quality systems tools available, and allowing new machines to be brought into service faster and kept running. The project will also help scientific applications use machines more effectively by providing scalable job launch, standardized job monitoring and management software, and allocation tools for the cost-effective management and utilization of terascale computer resources.

For further information on this subject contact: