== Implementations for applications ==
=== Save State ===
{{unreferenced section|date=January 2024}}
One of the earliest and still most common forms of application checkpointing is the "save state" feature of interactive applications, in which the user saves the state of all variables and other data and then either continues working or exits the application and restores the saved state at a later time. It is typically implemented as a "save" command or menu option in the application. It also became standard practice for an application to ask users who exit with unsaved work whether they want to save it first.
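The save command and the exit prompt described above can be sketched as follows. This is a minimal illustration rather than the method of any particular application: the JSON file format, the function names and the <code>ask_user</code> callback are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(state: dict, path: Path) -> None:
    """Write the state to a temporary file, then rename it over the target.

    Renaming last means an interrupted save leaves the previous file intact.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_name, path)


def load_state(path: Path) -> dict:
    """Restore a previously saved state."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def confirm_exit(has_unsaved_changes: bool, ask_user) -> bool:
    """Return True if the state should be saved before exiting.

    ask_user is a hypothetical callback that shows a prompt and returns a bool.
    """
    return has_unsaved_changes and ask_user("Save changes before exiting?")
```

Writing to a temporary file first is a design choice of this sketch: a crash during the save then cannot corrupt the only existing copy of the user's work.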
 
This functionality became extremely important for usability in applications where a task cannot be completed in one sitting (such as a video game expected to take dozens of hours) or where the work is spread over a long period of time (such as entering rows of data into a spreadsheet).

The drawback of save state is that the operator of the program must request the save. For non-interactive programs, including automated or batch-processed workloads, checkpointing therefore had to be automated as well.

=== Checkpoint/Restart ===
{{unreferenced section|date=January 2024}}
As batch applications began to handle tens to hundreds of thousands of transactions, where each transaction might process one record from one file against several different files, it became imperative that an application could be restarted partway through instead of rerunning the entire job from scratch. Thus the "checkpoint/restart" capability was born: after a number of transactions had been processed, a "snapshot" or "checkpoint" of the state of the application could be taken. If the application failed before the next checkpoint, it could be restarted by giving it the checkpoint information and the position in the transaction file of the last successfully completed transaction. The application could then resume from that point.

Checkpointing tends to be expensive, so it was generally not done after every record but at an interval that balanced the cost of taking a checkpoint against the value of the computer time needed to reprocess a batch of records. The number of records processed between checkpoints might thus range from 25 to 200, depending on cost factors, the relative complexity of the application and the resources needed to successfully restart the application.
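The scheme can be illustrated with a short sketch. Everything in it is an assumption made for the example: the checkpoint file format, the interval of 100 records, the running total standing in for real transaction processing, and the <code>fail_at</code> parameter, which exists only to simulate a crash.

```python
import json
import os
from pathlib import Path

CHECKPOINT_EVERY = 100  # assumed interval, within the 25 to 200 range above


def write_checkpoint(path: Path, next_index: int, state: dict) -> None:
    """Record the application state and the index of the next record to process."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"next_index": next_index, "state": state}))
    os.replace(tmp, path)  # rename last, so a crash never leaves a partial file


def run_batch(records, checkpoint_path: Path, fail_at=None) -> dict:
    """Process all records, resuming from the checkpoint file if one exists."""
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
        start, state = saved["next_index"], saved["state"]
    else:
        start, state = 0, {"total": 0}

    for i in range(start, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated failure at record {i}")
        state["total"] += records[i]  # stands in for one transaction
        if (i + 1) % CHECKPOINT_EVERY == 0:
            write_checkpoint(checkpoint_path, i + 1, state)

    checkpoint_path.unlink(missing_ok=True)  # job finished; requires Python 3.8+
    return state
```

If a run fails at record 250, the latest checkpoint covers records 0 to 199, so a rerun with the same checkpoint path reprocesses only records 200 onward. The 50 records handled after the last checkpoint are the reprocessing cost that the interval trades against the cost of checkpointing more often.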

=== Fault Tolerance Interface (FTI) ===
{{more citations needed section|date=January 2024}}
FTI is a library that aims to provide computational scientists with an easy way to perform checkpoint/restart in a scalable fashion.<ref>Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., & Matsuoka, S. (2011, November). FTI: high performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 32). ACM.</ref> FTI combines local storage with replication and erasure-coding techniques to provide several levels of reliability and performance. It provides application-level checkpointing, in which users select the data that needs to be protected, improving efficiency and avoiding wasted space, time and energy. It offers a direct data interface, so users do not need to deal with file or directory names; all metadata is managed by FTI transparently. If desired, users can dedicate one process per node to fault-tolerance work so that it overlaps with the scientific computation and post-checkpoint tasks are executed asynchronously.
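The idea of application-level checkpointing, in which the program registers only the data worth protecting, can be sketched in a few lines. The class below is purely hypothetical and is not FTI's interface; the class name, method names and the use of <code>pickle</code> for storage are assumptions made for illustration.

```python
import pickle
from pathlib import Path


class ProtectedData:
    """Hypothetical registry of the attributes an application wants checkpointed."""

    def __init__(self, checkpoint_path: Path):
        self._path = checkpoint_path
        self._protected = {}  # name -> (owner object, attribute name)

    def protect(self, name: str, owner, attribute: str) -> None:
        """Register one attribute; anything not registered is never saved."""
        self._protected[name] = (owner, attribute)

    def checkpoint(self) -> None:
        """Save the current values of all registered attributes."""
        snapshot = {
            name: getattr(owner, attribute)
            for name, (owner, attribute) in self._protected.items()
        }
        tmp = self._path.with_name(self._path.name + ".tmp")
        tmp.write_bytes(pickle.dumps(snapshot))
        tmp.replace(self._path)  # the caller never handles file names directly

    def recover(self) -> bool:
        """Restore registered attributes; return False if no checkpoint exists."""
        if not self._path.exists():
            return False
        snapshot = pickle.loads(self._path.read_bytes())
        for name, value in snapshot.items():
            owner, attribute = self._protected[name]
            setattr(owner, attribute, value)
        return True
```

An application would register, for example, its iteration counter and solution array but leave out scratch buffers that can be recomputed, which keeps each checkpoint small.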

=== Berkeley Lab Checkpoint/Restart (BLCR) ===
The Future Technologies Group at [[Lawrence Berkeley National Laboratory]] developed BLCR, a hybrid kernel/user implementation of checkpoint/restart. The goal is to provide a robust, production-quality implementation that checkpoints a wide range of applications without requiring changes to application code.<ref>Hargrove, P. H., & Duell, J. C. (2006, September). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.</ref> BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. The work is divided into four main areas: Checkpoint/Restart for Linux (CR), Checkpointable MPI Libraries, Resource Management Interface to Checkpoint/Restart, and Development of Process Management Interfaces.

=== DMTCP ===
DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets.<ref>Ansel, J., Arya, K., & Cooperman, G. (2009, May). DMTCP: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.</ref> It does not modify the user's program or the operating system. Among the applications supported by DMTCP are [[Open MPI]], [[Python (programming language)|Python]], [[Perl]], and many other [[programming language]]s and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g. no OpenGL or video). Among the Linux features supported by DMTCP are open [[file descriptor]]s, pipes, sockets, signal handlers, process ID and thread ID virtualization (which ensures that old PIDs and TIDs continue to work upon restart), ptys, FIFOs, process group IDs, session IDs, terminal attributes, and [[mmap]]/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.<ref>{{Cite web | url=https://github.com/dmtcp/dmtcp/blob/master/contrib/infiniband/README |title =GitHub - DMTCP/DMTCP: DMTCP: Distributed MultiThreaded CheckPointing.|website =[[GitHub]]|date = 2019-07-11}}</ref>

=== Collaborative checkpointing ===
Some protocols perform collaborative checkpointing by storing fragments of the checkpoint in nearby nodes.<ref>{{Cite journal|last1=Walters|first1=J. P.|last2=Chaudhary|first2=V.|date=2009-07-01|title=Replication-Based Fault Tolerance for MPI Applications|journal=IEEE Transactions on Parallel and Distributed Systems|volume=20|issue=7|pages=997–1010|doi=10.1109/TPDS.2008.172|issn=1045-9219|citeseerx=10.1.1.921.6773|s2cid=2086958}}</ref> This avoids the cost of writing to a parallel file system, which often becomes a bottleneck in large-scale systems, and uses storage that is closer to the computation.{{cn|date=January 2024}} The approach has found use particularly in large-scale supercomputing clusters. The challenge is to ensure that the nearby nodes holding fragments of the checkpoint are still available when the checkpoint is needed to recover from a failure.{{cn|date=January 2024}}
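The availability challenge can be made concrete with a toy model. The sketch below is not the cited protocol: it assumes that nodes are arranged in a ring, that the checkpoint is split into one fragment per node, and that each fragment is copied to a fixed number of consecutive neighbours. The nodes are modelled as in-memory dictionaries and all function names are hypothetical.

```python
def split(data: bytes, n: int) -> list:
    """Cut data into n fragments of (almost) equal size."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]


def distribute(data: bytes, node_count: int, copies: int = 2) -> list:
    """Place fragment i on nodes i, i+1, ..., i+copies-1 (modulo node_count).

    Returns one dict per node, mapping fragment index to fragment bytes.
    """
    stores = [{} for _ in range(node_count)]
    for index, fragment in enumerate(split(data, node_count)):
        for offset in range(copies):
            stores[(index + offset) % node_count][index] = fragment
    return stores


def recover(stores: list, failed: set) -> bytes:
    """Reassemble the checkpoint using only nodes that have not failed."""
    node_count = len(stores)
    fragments = []
    for index in range(node_count):
        holders = [
            node for node in range(node_count)
            if node not in failed and index in stores[node]
        ]
        if not holders:
            raise RuntimeError(f"fragment {index} lost: all holders failed")
        fragments.append(stores[holders[0]][index])
    return b"".join(fragments)
```

With two copies per fragment, the checkpoint survives the loss of any single node, but the simultaneous failure of two adjacent nodes makes one fragment unrecoverable. Protocols must weigh this risk against the extra storage and network traffic that additional copies require.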

=== Docker ===
[[Docker (software)|Docker]] offers an experimental checkpoint and restore mechanism for running containers, which relies on [[CRIU]].<ref>{{Cite web | url=https://criu.org/Docker |title = Docker - CRIU}}</ref>

=== CRIU ===
[[CRIU]] (Checkpoint/Restore In Userspace) is a Linux software tool that checkpoints and restores running applications from user space.<ref>{{Cite web |title=CRIU |url=https://criu.org/Main_Page |access-date=2024-10-15 |website=criu.org |language=en}}</ref>