Title: Tolerating Node Failures in Cache Only Memory Architectures.
Authors: Alain Gefflaut, Christine Morin, Michel Banâtre.
Authors' address: IRISA, Campus de Beaulieu,
35042 Rennes Cedex,
FRANCE

Abstract: COMAs (Cache Only Memory Architectures) are an interesting class of large-scale shared-memory multiprocessors. They extend the concepts of cache memories and shared virtual memory by using the local memories of the nodes as large caches for a single shared address space. Because of their large number of components, these architectures are particularly susceptible to hardware failures, so fault tolerance mechanisms have to be introduced to ensure high availability. In this paper, we propose an implementation of backward error recovery in a COMA that minimizes performance degradation and requires few hardware modifications. This implementation uses the features of a COMA to implement a stable storage abstraction using the standard memories of the architecture. Recovery data are replicated and mixed with current data in the node memories, and both kinds of data are managed transparently by an extended coherence protocol.

Paper available in PostScript form (67K).