BLCR: Serial, Parallel and Distributed Checkpoint/Restart for Linux Clusters

Paul Hargrove
Lawrence Berkeley National Laboratory

Researchers in Lawrence Berkeley National Laboratory's Future Technologies Group have developed a new open-source, system-level, preemptive implementation of checkpoint/restart for Linux clusters as part of the SciDAC Scalable Systems Software ISIC. The goal is to support checkpointing of a wide range of scientific applications without requiring modifications to the application code. This poster highlights the capabilities of the current version, which can checkpoint and restart single-threaded processes, multi-threaded processes, and distributed MPI jobs. These checkpointing capabilities are available both as a stand-alone tool and as an integrated part of the Scalable Systems Software Suite.