Abstract

Automating fault tolerance support is a relevant problem, because fault tolerance is both in demand for large-scale computations and hard to implement manually. General approaches exist, but they lack the efficiency required in high-performance computing. Particular approaches are more efficient because they exploit peculiarities of the subject domain and the applied algorithm to reduce the overhead of fault tolerance support. Parallel programming systems such as LuNA make it possible to implement fault tolerance support in constructed programs automatically or semi-automatically, and to do so more efficiently than general approaches, by exploiting peculiarities of the computational model on which the system is based. This work considers the problem of automated fault tolerance support in the LuNA system. Some preliminary results are presented, including an adaptation of the checkpointing technique and an analysis of the problem.

Pages
43-55