Alumni Project

The Scientific Data Management Center

PI:
Coordinating PIs:
Area Leaders:

Summary

The Scientific Data Management Center (SDM) focuses on the application of known and emerging data management technologies to scientific applications. The Center's goals are to integrate and deploy software-based solutions for the efficient and effective management of large volumes of data generated by scientific applications. Our purpose is not only to achieve efficient storage and access to the data using specialized indexing, compression, and parallel storage and access technology, but also to enhance the effective use of the scientist's time by eliminating unproductive simulations, by providing specialized data-mining techniques, by streamlining time-consuming tasks, and by automating the scientist's workflows.

Our approach is to provide an integrated scientific data management framework where components can be chosen by the scientists and applied to their specific domains. By overcoming the data management bottlenecks and unnecessary information-technology overhead through the use of this integrated framework, scientists are freed to concentrate on their science and achieve new scientific insights.

Scientific exploration and discovery typically take place in two phases: data collection/generation and data analysis. In the data collection/generation phase, large volumes of data are generated by simulation programs running on supercomputers or collected from experimental instruments. This requires efficient parallel data systems that can keep up with the volumes of data generated. In the data analysis phase, it is necessary to have efficient indexes and effective analysis tools to find and focus on the information that can be extracted from the data, and the knowledge learned from that information. Because of the large volume of data, it is also useful to perform analysis as the data are generated.
For example, a scientist running a thousand-time-step three-dimensional simulation can benefit from analyzing the data generated by the individual steps in order to steer the simulation, saving unnecessary computation and accelerating the discovery process. This requires sophisticated workflow tools, as well as efficient dataflow capabilities to move large volumes of data between the analysis components.

For these reasons we use an integrated framework that provides a scientific workflow capability, supports data mining and analysis tools, and accelerates storage access and data searching. This framework hides the details of the underlying parallel and indexing technology and streamlines the assembly of modules using process automation technologies.

Accomplishments

Over the last three years we have adopted, improved, and applied various data management technologies to several scientific application areas. We chose to concentrate on typical scenarios provided to us by scientists from different disciplines. By working with these scenarios we not only learned the important aspects of the data management problems from the scientist's point of view, but also provided solutions that led to actual results. The successful results achieved so far include:

• More than a tenfold speedup in writing and reading NetCDF files was achieved by developing Parallel NetCDF software on top of MPI-IO, using the General Parallel File System (GPFS) from IBM. Performance comparable to that of HDF5 has been demonstrated. This was applied to Astrophysics data (FLASH code) as well as Climate Modeling simulations.
• An improved version of PVFS is now offered by cluster vendors, including Dell, HP, Atipa, and Cray.
• A new method for signal separation of observational data was achieved by combining Principal Component Analysis (PCA) and Independent Component Analysis (ICA).
This was used to accurately identify El Niño signals in observed data that included a volcano signal in a Climate application. The ICA software was packaged to be used with other applications. Similar techniques are now being applied to a Fusion application for the purpose of identifying the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak.
• A new specialized method for indexing mesh data using bitmaps was used to achieve more than a tenfold speedup in generating regions and tracking them over time. The key to this achievement is that the method works just as efficiently for selection conditions over multiple measures, a problem not previously solved by any known indexing technique. This bitmap-based indexing method was applied to Combustion applications, as well as to indexing over collisions (events) in High Energy Physics applications.
• More than a tenfold improvement to VTK for visualizing NetCDF files was achieved by the development of a software layer on top of Parallel NetCDF. This method was applied to Astrophysics data.
• An integrated framework, called ASPECT (Adaptable Simulation Product Exploration and Control Tool), was developed for simulation data exploration. It provides pluggable analysis tools such as PCA, ICA, bitmap indexing, and a suite of statistical tools based on the R package. This framework is being applied to a Terascale Supernova Astrophysics application.
• A workflow system was adopted and enhanced for scientific applications, including access to Web-based services and databases. This system is designed to streamline repetitive scientific workflows, such as running simulations over multiple time steps. It was applied to a Biology application for the analysis of microarray data, a process that requires a series of component invocations over the Web and has to be repeated hundreds of times.
Automating this scientific workflow was shown to increase productivity by as much as an order of magnitude.

The SDM Framework

The SDM technologies we use fall into three general categories:
This led us to organize the technologies into the three-layer framework shown in the figure below. In this figure, the SEA layer is immediately on top of hardware, operating systems, file systems, and mass storage systems. The SEA layer provides parallel disk access technology (PVFS, ROMIO, MPI-IO), parallel structured data access (Parallel NetCDF), and a Storage Resource Manager (SRM), a software layer that manages multifile requests from the mass storage system (currently HPSS).

On top of the SEA layer is the DMA layer, consisting of various indexing (bitmap index), data analysis (PCA, ICA), and statistical analysis components (based on the R package), as well as a parallel visualization component (parallel VTK). This layer also provides an integration framework (ASPECT) that allows the analysis components to interact.

The SPA layer, shown on top of the DMA layer, provides the ability to compose workflows from the components in the DMA layer. It also contains the technology to wrap any component as a Web service, thus allowing a uniform method for invoking legacy components as services.

From previous experience, we have realized that the concept of layering components is practical and useful. In particular, hiding the details of the SEA layer is a desirable feature, as it simplifies the scientist's task. In addition, providing a flexible framework for specifying a scientific workflow made of underlying (usually existing) components is a powerful method for setting up repetitive tasks. This framework also provides the ability to audit and track the scientific process, which is essential for verifying the correctness of complex tasks and recovering from failures.
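The idea of composing a workflow from existing components, while auditing each step and recovering from failures, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions and not the Center's workflow system, whose design this summary does not detail. The names `Workflow`, `run`, and `flaky_mean` are hypothetical, and the two steps stand in for real analysis components.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# A step is a (name, function) pair; each function takes the previous output.
Step = Tuple[str, Callable[[Any], Any]]


@dataclass
class Workflow:
    """Hypothetical runner: executes steps in order and records an audit trail."""
    steps: List[Step]
    results: Dict[str, Any] = field(default_factory=dict)       # outputs of completed steps
    audit: List[Tuple[str, str]] = field(default_factory=list)  # (step name, status)

    def run(self, data: Any) -> Any:
        for name, func in self.steps:
            if name in self.results:
                # Completed in an earlier run: reuse the stored output.
                data = self.results[name]
                self.audit.append((name, "skipped"))
                continue
            try:
                data = func(data)
            except Exception as exc:
                self.audit.append((name, f"failed: {exc}"))
                raise
            self.results[name] = data
            self.audit.append((name, "ok"))
        return data


# Stand-in component that fails on its first call, to exercise recovery.
calls = {"n": 0}


def flaky_mean(values):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return sum(values) / len(values)


wf = Workflow(steps=[
    ("select", lambda xs: [x for x in xs if x > 1.0]),
    ("summarize", flaky_mean),
])

try:
    wf.run([0.5, 2.0, 4.0])           # first attempt fails in "summarize"
except RuntimeError:
    pass

result = wf.run([0.5, 2.0, 4.0])      # second attempt resumes after "select"
```

In this sketch the second call does not repeat the completed "select" step, and `wf.audit` records which steps succeeded, failed, or were skipped. A real system would persist the results and the audit trail outside the process; keeping them in memory is a simplification made here for brevity.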