Members
Gilbert Cabillic, Isabelle Puaut
Objectives
The proliferation of inexpensive and powerful workstations has continued at a rapid rate over the last few years, and machine performance is likely to keep increasing for several years to come, with faster processors and multiprocessor machines. Studies have shown that for a large fraction of their lifetime these machines are used for small tasks (reading e-mail, editing files), and that their average idle percentage is at least 90%, even during peak hours. One possible use of these idle cycles is to run parallel applications on networks of workstations. The network of workstations can then be considered as a parallel computer, or hypercomputer, whose performance is similar to that of a parallel machine at a much lower cost. Numerous research projects have tried to exploit the computing power of networks of workstations, such as PVM and MPI, based on the message-passing paradigm, and Linda and Mermaid, based on the shared-memory paradigm. To be fully usable, an environment for parallel computing on networks of workstations should include the following facilities:
- Support for multiple programming paradigms: two classes of programming paradigms for parallel applications can be distinguished: the message-passing paradigm, in which processes communicate through message exchanges, and the shared-memory paradigm, in which processes communicate by reading and writing shared data items (see the sketch after this list). Although any application can be written using either paradigm, studies have shown that application performance can be improved by combining message passing and shared memory.
- Support for heterogeneity: existing computing environments often include computers with various architectures and performance levels (parallel machines, workstations) and different operating systems; more computing power is therefore available to environments that support heterogeneous machines.
- Support for load balancing and application reconfiguration: the notion of ownership is frequently present when workstations are used to execute parallel applications. Workstation owners do not want their machine to be overloaded by the execution of parallel applications, or simply want exclusive access to their machine while they are working. Reconfiguration mechanisms are thus required to balance the load across machines and to allow parallel computations to coexist with other applications.
- Support for fault tolerance: as the number of nodes taking part in a parallel computation increases, the probability of a node failure also increases. A failure can be a hardware fault or a reboot performed by the owner of the workstation. Without provision for tolerating such failures, the entire computation must be restarted.
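To make the contrast between the two paradigms concrete, the fragment below exchanges a single integer between two processes with MPI, one of the message-passing systems cited above; under the shared-memory paradigm the same exchange would simply be a write followed by a read of a shared variable, with no explicit send or receive. It is a generic illustration, not Stardust code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Message passing: process 0 explicitly ships the datum to process 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 explicitly receives it; the two processes share no memory. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}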
Contributions
Stardust is an environment that provides all these facilities. Processes executing on Stardust can communicate both through message passing and through (page-based) distributed shared memory (DSM). Library calls are provided for sending/receiving a message, as well as for creating a shared-memory region and mapping it onto a process address space. Stardust runs on a heterogeneous computing environment composed of a parallel machine (the Intel Paragon) and PCs connected by an ATM network.
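This page does not name those library calls, so the fragment below uses ordinary POSIX shared memory as a single-machine analogue of the create-and-map style of interface: a process creates a named region, maps it into its address space, and from then on reads and writes it like regular memory. Stardust's page-based DSM provides a similar style of interface, except that the region is shared by processes running on different machines.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Create a named 4 KB shared region... */
    int fd = shm_open("/dsm_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return 1;

    /* ...and map it into the process address space. */
    int *counters = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (counters == MAP_FAILED)
        return 1;

    /* Once mapped, the region is accessed like ordinary memory; in a DSM such
       as Stardust's, updates are propagated to processes on other machines. */
    counters[0] += 1;
    printf("counter = %d\n", counters[0]);

    munmap(counters, 4096);
    close(fd);
    shm_unlink("/dsm_demo");
    return 0;
}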
Stardust includes transparent mechanisms for converting data between different types of hosts. These mechanisms are used for converting both the contents of communication buffers and the contents of shared virtual memory regions. In addition, to cope with the processors' different instruction formats, each program developed with Stardust is compiled for every type of architecture.
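As a minimal illustration of the kind of conversion involved (the helper names below are ours, not Stardust's), the fragment converts a buffer of 32-bit integers to and from a canonical byte order, which is what allows a little-endian PC and a big-endian host to interpret the same message buffer or DSM page in the same way. Byte order is only one of the differences such a mechanism must handle; data sizes and alignments can also differ between architectures.

#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a buffer of 32-bit integers to the canonical (network, big-endian)
   byte order before it leaves the host... */
static void to_canonical(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = htonl(buf[i]);
}

/* ...and back to the local byte order when it arrives. */
static void from_canonical(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = ntohl(buf[i]);
}

int main(void)
{
    uint32_t msg[2] = { 1, 2 };
    to_canonical(msg, 2);    /* sender side */
    from_canonical(msg, 2);  /* receiver side */
    return msg[0] == 1 ? 0 : 1;
}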
In addition to its support for heterogeneity, Stardust includes a mechanism used both for checkpointing and for load balancing. The basic principle of this mechanism is to consider an application state only when all the application processes have reached a synchronization point (a synchronization barrier) in their main program (C function main). The application state can then be saved onto disk to tolerate node failures, or moved to other nodes to balance the system load. An architecture-independent standard representation of data is used for saving an application state. No architecture-dependent data (e.g., thread stacks) is saved onto disk (or, respectively, transferred to other nodes). Hence, migration of applications between heterogeneous hosts is possible.
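The sketch below conveys the general idea using MPI primitives rather than Stardust's own, so it is not Stardust's implementation and the checkpoint file names are invented: every process reaches a barrier at a known point in main, and only then does each process write its application-level data to disk. Because all processes are stopped at the same well-known program point, no thread stacks or other architecture-dependent state needs to be captured.

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank;
    uint32_t data[N] = {0};   /* application state, kept in plain data structures */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
        /* ... compute on data[] ... */

        /* Synchronization barrier: every process is now at a known place in
           main, so the application state is exactly the contents of data[]. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Coordinated checkpoint: each process saves its part of the state;
           no stacks or registers are written.  A real system would first
           convert data[] to an architecture-independent representation. */
        if (step % 10 == 0) {
            char name[64];
            snprintf(name, sizeof name, "ckpt_rank%d_step%d", rank, step);
            FILE *f = fopen(name, "wb");
            if (f) {
                fwrite(data, sizeof(uint32_t), N, f);
                fclose(f);
            }
        }
    }

    MPI_Finalize();
    return 0;
}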
Publications
- Gilbert Cabillic and Isabelle Puaut. Stardust: an Environment for Parallel Programming on Networks of Heterogeneous Workstations. Journal of Parallel and Distributed Computing, 40(1), January 1997.
- Gilbert Cabillic and Isabelle Puaut. Stardust: an environment for parallel programming on networks of heterogeneous workstations. IRISA Technical Report No. 1006, April 1996.
- Gilbert Cabillic and Isabelle Puaut. Dealing with heterogeneity in Stardust: an environment for parallel programming on networks of heterogeneous workstations. In Proceedings of the Euro-Par'96 conference, August 1996.
- Gilbert Cabillic and Isabelle Puaut. Répartition de charge dans Stardust : un environnement pour l'exécution d'applications parallèles en milieu hétérogène (Load balancing in Stardust: an environment for executing parallel applications on heterogeneous platforms). In Actes de l'école d'été Placement dynamique et répartition de charge (Proceedings of the summer school on dynamic placement and load balancing), Collection didactique Inria, Presqu'île de Giens, France, July 1996.
- Christine Morin and Isabelle Puaut. A survey of recoverable distributed shared memory systems. IEEE Transactions on Parallel and Distributed Systems, 8(9), September 1997. Also appears as IRISA Technical Report No. 975, December 1995.
- Gilbert Cabillic, Gilles Muller and Isabelle Puaut. The performance of consistent checkpointing in distributed shared memory systems. In Proceedings of the 14th Symposium on Reliable Distributed Systems, September 1995.