Petascale Data Storage

Addressing the challenges of petascale computing for scientific discovery on information storage capacity, performance, concurrency, reliability, availability, and manageability

Garth Gibson
Carnegie Mellon University

With the advent of new experimental facilities and more powerful supercomputers, researchers now face the task of managing, sharing and analyzing petabytes of data. The Petascale Data Storage Institute brings together data storage and management expertise to meet the high-performance storage requirements of today’s DOE terascale computational science. At the same time, it will identify and resolve the storage capacity, performance, concurrency, reliability, availability and manageability problems arising from petascale computing infrastructures for scientific discovery, and set solutions to them in motion.

This project will educate the community on best practices for using large-scale storage systems efficiently, and will jump-start the community's preparation for using petascale systems effectively. To engage the scientific computing community in the emerging problems of petascale storage system performance, the Institute will develop and chair an annual petascale storage workshop in conjunction with a major scientific computing conference, such as the annual SC conference. The project will also engage the academic computer science community by targeting the USENIX Conference on File and Storage Technologies or, as appropriate, the IEEE Conference on Mass Storage Systems and Technologies. Other workshops will be sought or accepted as appropriate.

To communicate the techniques, mechanisms, programming practices and tools to the broader communities of scientific computing, academic computer science, and industrial storage systems development, the Institute will develop and conduct multiple tutorials. Included in the scope of these tutorials will be advice to scientific discovery application developers on the strategies for maximizing the effectiveness of petascale storage access. Target venues for these tutorials include conferences such as SCxy, FAST, Storage Networking World, USENIX Annual Technical Conference, LISA, DSN, IEEE MSST and others.

The effective development of solutions for the petascale systems of the next decade depends on the development of the human resources that will be needed to design, operate and manage these systems. This Institute will create classroom materials covering the scope of the Institute and deploy them in at least the graduate programs at all three of the Institute’s university members. Possible courses to be offered include advanced operating and distributed systems, advanced storage systems, security systems, and advanced scientific algorithms.

Petascale computing infrastructures for scientific discovery make enormous demands on information storage capacity, performance, concurrency, reliability, availability, and manageability. The last decade has shown that parallel file systems can barely keep pace with high-performance computing along these dimensions; this poses a critical challenge when petascale requirements are considered. The Petascale Data Storage Institute will focus on the data storage problems found in petascale scientific computing environments, with special attention to community issues such as interoperability, community buy-in, and shared tools. Drawing on its members' experience with applications and their diverse expertise in file and storage systems, the Institute will enable a group of researchers to collaborate extensively on requirements, standards, algorithms, and development and performance tools. Mechanisms for petascale storage, together with the Institute's results, will be made available to the petascale computing community. The Institute will hold periodic workshops and develop educational materials on petascale data storage for science.

SciDAC Institute: Computer Science

Project Title: Petascale Data Storage Institute

Principal Investigator: Garth Gibson
Affiliation: Carnegie Mellon University

Project Webpage:

Participating Institutions and Co-Investigators:
Carnegie Mellon University - Garth Gibson (PI)
Lawrence Berkeley National Laboratory - William Kramer
Los Alamos National Laboratory - Gary Grider
Oak Ridge National Laboratory - Philip Roth
Pacific Northwest National Laboratory - Evan Felix
Sandia National Laboratories - Lee Ward
University of California at Santa Cruz - Darrell Long
University of Michigan at Ann Arbor - Peter Honeyman

Funding Partners: Office of Science, Office of Advanced Scientific Computing Research

Budget and Duration: Approximately $2.2 million per year for five years [1]


[1] Subject to acceptable progress review and the availability of appropriated funds.

