Solidor research

Aleth

A scalable distributed system integrating fault tolerance and coherence for the execution of long-running parallel applications

Members

Michel Banâtre, Alain Gefflaut, Anne-Marie Kermarrec, Christine Morin

Motivations

Scalable shared memory architectures have a high probability of failure because of their large number of components. Fault tolerance mechanisms are therefore needed to execute long-running parallel applications on such architectures. In a first study [1, 2, 3] we showed that the class of Cache Only Memory Architectures (COMA) is well suited to the implementation of a backward error recovery strategy that tolerates any single node failure.

Contribution

The proposed solution relies on two COMA features:

the data replication mechanisms provided by COMAs are used to create recovery data in the node memories, and
the absence of a fixed physical location for data in a COMA simplifies the reconfiguration of the architecture after a permanent node failure.

Availability in COMAs is provided by an extension of the coherence protocol, which manages the multiple copies of data held in different nodes. Implementing the proposed protocol requires few hardware modifications to the COMA architectures described in the literature. The extended coherence protocol has been evaluated by simulation. Performance results show that our approach is scalable. The cost of the fault tolerance mechanisms has been evaluated for failure-free executions.

The solution that we have proposed for COMA machines is applicable to other architectures. In particular, a shared virtual memory system offers mechanisms similar to those exploited by the extended coherence protocol in a COMA: workstation memories are used as caches, and data have no fixed physical location. The data replication mechanisms that COMAs implement in hardware for cache lines are implemented in software, at page granularity, in a shared virtual memory system.
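The idea of keeping recovery data as replicated copies in node memories can be illustrated with a small sketch. This is a toy model under stated assumptions, not the extended coherence protocol itself: the class `Cluster`, its methods, and the rule of placing the second recovery copy on the next live node are all hypothetical, and the hardware-level coherence states are not modelled. It only shows why two recovery copies on distinct nodes are enough to roll back after any single node failure.

```python
class Cluster:
    """Toy model of backward error recovery with replicated recovery data.

    Hypothetical illustration only: names and placement rule are assumptions.
    """

    def __init__(self, n_nodes):
        if n_nodes < 2:
            raise ValueError("at least two nodes are needed for replication")
        self.alive = set(range(n_nodes))
        self.current = {}   # item -> (owner node, current value)
        self.recovery = {}  # item -> {node: value saved at last checkpoint}

    def write(self, node, item, value):
        # Data have no fixed location: the writer simply becomes the owner.
        self.current[item] = (node, value)

    def checkpoint(self):
        # Save two recovery copies of every item on two distinct live nodes.
        alive = sorted(self.alive)
        self.recovery = {}
        for item, (owner, value) in self.current.items():
            partner = alive[(alive.index(owner) + 1) % len(alive)]
            self.recovery[item] = {owner: value, partner: value}

    def fail(self, node):
        # Single node failure: roll every item back to a surviving copy.
        self.alive.discard(node)
        restored = {}
        for item, copies in self.recovery.items():
            survivors = sorted((n, v) for n, v in copies.items() if n in self.alive)
            new_owner, value = survivors[0]
            restored[item] = (new_owner, value)
        self.current = restored
        # Re-create the lost copies so a later failure is also tolerated.
        self.checkpoint()
```

In this sketch, a value written after the last checkpoint is discarded on failure and the item is restored from the copy held by a surviving node, which then becomes its owner.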

Availability in shared virtual memory systems. We have implemented a recoverable shared virtual memory, called Icare, by extending the coherence protocol of the Myoan shared virtual memory on the Intel Paragon machine and on the Astrolab platform. The first performance results obtained with the Paragon implementation show only a slight degradation. Some executions are even more efficient on top of the recoverable shared virtual memory than on the standard version of Myoan. This can be explained by the fact that pages replicated at checkpoint time are later used by the processors whose memory holds one of the page recovery copies: page faults are, in effect, anticipated at checkpoint time. Starting from these observations, we have defined data replication policies that aim first at increasing efficiency in normal operation and, second, at minimizing the perturbations observed when processes are restarted after a failure. Keeping a history of the processes that have accessed a page makes it possible to decrease application execution time in normal operation. By creating affinity between a process and a node memory at checkpoint time, it is also possible to balance the load of the faulty node across the surviving nodes and to avoid the transient increase in network traffic when processes rebuild their working sets after system restart.
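A history-based replication policy of the kind described above might look like the following sketch. It is an assumption-laden illustration, not the policy implemented in Icare: the names `PageHistory` and `choose_replica_node`, the history depth, and the fallback rule are hypothetical. The idea shown is only that the recovery copy of a page is placed on a node that recently accessed the page, so that the copy created at checkpoint time can also serve a later access.

```python
from collections import deque


class PageHistory:
    """Most recent distinct nodes that accessed a page (hypothetical helper)."""

    def __init__(self, depth=4):
        self.accessors = deque(maxlen=depth)

    def record(self, node):
        # Move the node to the most-recent position, without duplicates.
        if node in self.accessors:
            self.accessors.remove(node)
        self.accessors.append(node)


def choose_replica_node(history, owner, nodes):
    """Pick the node that receives the second recovery copy of a page."""
    # Prefer the most recent accessor other than the owner: it is the node
    # most likely to need the page again after the checkpoint.
    for node in reversed(history.accessors):
        if node != owner and node in nodes:
            return node
    # Fallback (assumption): lowest-numbered live node other than the owner.
    others = sorted(n for n in nodes if n != owner)
    if not others:
        raise ValueError("no other node available for the recovery copy")
    return others[0]
```

For example, if nodes 2, 3 and 1 accessed a page in that order and node 1 owns it, the sketch places the recovery copy on node 3.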

Publications

Anne-Marie Kermarrec and Christine Morin. Création d'affinité mémoire pour l'efficacité dans ICARE. MPR'96, Journées de recherche sur la mémoire partagée répartie, Bordeaux, May 1996.
Anne-Marie Kermarrec. Contrôle de la réplication des données dans une mémoire virtuelle partagée recouvrable efficace. Technique et Science Informatiques 15 (5), 1996.
Alain Gefflaut, Christine Morin, Michel Banâtre. Tolerating Node Failures in Cache Only Memory Architectures. In Proceedings of Supercomputing'94, 1994.
Christine Morin, Alain Gefflaut, Michel Banâtre. COMA: an Opportunity for Building Fault-tolerant Scalable Shared Memory Multiprocessors. In Proceedings of the 23rd International Symposium on Computer Architecture, 1996.
Alain Gefflaut. Proposition et évaluation d'une architecture multiprocesseur extensible à mémoire partagée tolérante aux fautes. Doctoral thesis, Université de Rennes 1, January 1995.

Last updated: 17 02 2000