Members
Michel Banâtre, Alain Gefflaut, Anne-Marie Kermarrec, Christine Morin
Motivations
Scalable shared memory architectures have a high probability of
failure due to their large number of components. Hence, fault tolerance
mechanisms are needed to allow the execution of long-running parallel
applications on such architectures. In an initial study [1, 2, 3], we
showed that the class of Cache Only Memory Architectures
(COMA) is well suited to implementing a backward error
recovery strategy that tolerates any single node failure.
Contribution
The proposed solution relies on two COMA features:
- data replication mechanisms provided by COMAs are used to create
recovery data in the node memories, and
- the absence of a fixed physical location for data in a COMA simplifies
the reconfiguration of the architecture in the event of a permanent node
failure.
Availability in COMAs is provided by an extension of the
coherence protocol, which manages the multiple copies of data held in
different nodes. Implementing the proposed protocol requires
few hardware modifications to the COMA architectures described
in the literature. The extended coherence protocol has been evaluated
by simulation. Performance results show that our approach is scalable.
The cost of the fault tolerance mechanisms has been evaluated for failure-free
executions. The solution that we have proposed for COMA machines
is also applicable to other architectures. In particular, a shared virtual
memory system offers mechanisms similar to those exploited
by the extended coherence protocol in a COMA: workstation memories
are used as caches, and data have no fixed physical location. Data
replication mechanisms that are implemented in hardware for cache
lines in COMAs are implemented in software at a page granularity
in a shared virtual memory system.
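As a rough illustration of the idea described above, the following Python sketch models recovery data kept in ordinary node memories. It is a toy model under stated assumptions, not the extended coherence protocol itself: the names `ToyMachine`, `checkpoint`, and `fail_and_recover`, the choice of exactly two recovery copies per item (one on the owner, one on its ring neighbour), and the round-robin redistribution after a failure are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node memory holding current data and recovery copies."""
    ident: int
    current: dict = field(default_factory=dict)   # item -> working value
    recovery: dict = field(default_factory=dict)  # item -> value at last checkpoint


class ToyMachine:
    def __init__(self, n_nodes):
        if n_nodes < 2:
            raise ValueError("at least two nodes are needed to survive one failure")
        self.nodes = {i: Node(i) for i in range(n_nodes)}

    def write(self, node_id, item, value):
        # Simplification: an item has one current copy, held by its last writer.
        for node in self.nodes.values():
            node.current.pop(item, None)
        self.nodes[node_id].current[item] = value

    def read(self, item):
        for node in self.nodes.values():
            if item in node.current:
                return node.current[item]
        raise KeyError(item)

    def checkpoint(self):
        """Give every current item two recovery copies on distinct nodes."""
        ids = sorted(self.nodes)
        if len(ids) < 2:
            raise RuntimeError("cannot replicate recovery data on a single node")
        snapshot = [(node.ident, item, value)
                    for node in self.nodes.values()
                    for item, value in node.current.items()]
        for node in self.nodes.values():
            node.recovery.clear()
        for owner, item, value in snapshot:
            partner = ids[(ids.index(owner) + 1) % len(ids)]  # assumed placement
            self.nodes[owner].recovery[item] = value
            self.nodes[partner].recovery[item] = value

    def fail_and_recover(self, failed_id):
        """Drop a permanently failed node and roll back to the last checkpoint."""
        del self.nodes[failed_id]
        restored = {}
        for node in self.nodes.values():
            node.current.clear()              # discard uncheckpointed work
            restored.update(node.recovery)    # surviving copies of an item agree
            node.recovery.clear()
        ids = sorted(self.nodes)
        # Data have no fixed home, so restored items can live on any survivor.
        for k, item in enumerate(sorted(restored)):
            self.nodes[ids[k % len(ids)]].current[item] = restored[item]
        if len(ids) >= 2:
            self.checkpoint()                 # re-create two recovery copies
```

In this model, a value written after the last checkpoint is lost when a node fails, while every checkpointed item survives because at least one of its two recovery copies sits on a surviving node. Reconfiguration is simple for the reason given in the text: no item is tied to a particular node, so the survivors can hold the restored data.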
Availability in shared virtual memory systems. We have implemented
a recoverable shared virtual memory, called Icare, by
extending the coherence protocol of the Myoan shared virtual
memory on the Intel Paragon machine and on the Astrolab platform. The
first performance results obtained with the Paragon
implementation show only a slight degradation. Some executions are
even faster on the recoverable shared virtual memory
than on the standard version of Myoan. This can be explained
by the fact that pages replicated at checkpoint time are subsequently used
by the processors whose memory holds one of the page's recovery
copies. In effect, page faults are anticipated at checkpoint time.
Starting from these observations, we have defined data replication
policies that aim first at increasing efficiency in normal operation
and, second, at minimizing the disruption observed when processes
are restarted after a failure. Maintaining a history of the processes
that have accessed a page makes it possible to decrease application execution
time in normal operation. By creating affinity between a process
and a node memory at checkpoint time, it is also possible to balance
the load of the faulty node across the surviving nodes and to avoid the transient
increase in network traffic when processes rebuild their working
sets after system restart.
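One way to picture a history-based replication policy is the sketch below. It is only an assumption about how such a policy could look, not the policy implemented in Icare: the class `AccessHistory`, its methods, and the rule "place the recovery copy on the non-owner node that accessed the page most often" are hypothetical, and nodes are assumed to be identified by integers.

```python
from collections import Counter, defaultdict


class AccessHistory:
    """Per-page record of which nodes have accessed the page."""

    def __init__(self):
        self._counts = defaultdict(Counter)  # page -> Counter of node accesses

    def record(self, page, node):
        self._counts[page][node] += 1

    def choose_replica_node(self, page, owner, nodes):
        """Pick where the extra recovery copy of `page` goes at checkpoint time.

        The owner already holds one copy, so it is excluded. Among the other
        nodes, the most frequent past accessor wins; ties go to the lowest id.
        """
        candidates = [n for n in nodes if n != owner]
        if not candidates:
            raise ValueError("no node other than the owner is available")
        counts = self._counts.get(page, Counter())
        return max(candidates, key=lambda n: (counts[n], -n))


history = AccessHistory()
for node in (0, 0, 0, 2, 2, 3):
    history.record(page=7, node=node)

# Page 7 is owned by node 0; node 2 is its most frequent other accessor.
replica_node = history.choose_replica_node(page=7, owner=0, nodes=[0, 1, 2, 3])
```

Placing the copy on a node that is likely to touch the page again is what lets a checkpoint double as a prefetch. The same affinity information could guide where processes of a failed node are restarted, so that their working sets are already partly local.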
Publications
- Anne-Marie Kermarrec and Christine Morin. Création
d'affinité mémoire pour l'efficacité dans
ICARE. MPR'96, Journées de recherche sur la mémoire
partagée répartie, Bordeaux, May 1996.
PostScript version available (48K).
- Anne-Marie Kermarrec. Contrôle de la
réplication des données dans une mémoire
virtuelle partagée recouvrable efficace. Technique
et Science Informatiques 15 (5), 1996.
PostScript version available (66K).
- Alain Gefflaut, Christine Morin, Michel Banâtre.
Tolerating Node Failures in Cache Only Memory Architectures.
In Proceedings of Supercomputing'94, 1994.
Abstract and PostScript version (67K) available.
- Christine Morin, Alain Gefflaut, Michel Banâtre.
COMA: an Opportunity for Building Fault-tolerant Scalable
Shared Memory Multiprocessors. In Proceedings of the
23rd International Symposium on Computer Architecture,
1996.
Abstract and PostScript version (92K) available.
- Alain Gefflaut. Proposition et évaluation
d'une architecture multiprocesseur extensible à mémoire
partagée tolérante aux fautes. PhD thesis,
Université de Rennes 1, January 1995.