In my talk, I will present Kerrighed, a single system image operating system for high performance computing on clusters. Kerrighed targets ease of programming, high performance and high availability. Ease of programming is achieved through support for both the message passing and the shared memory programming models. Kerrighed takes advantage of the underlying hardware performance by providing global management of all cluster resources (processors, memory and disks).
Kerrighed also provides dynamic resource management to make cluster configuration changes transparent to applications and to guarantee system availability in the presence of node failures. Kerrighed offers a checkpointing facility for both sequential and parallel applications. Several kinds of applications can take advantage of Kerrighed. We currently target scientific applications such as numerical simulations (including OpenMP, MPI and POSIX multithreaded applications).
Kerrighed is implemented as an extension to the Linux operating system (a set of Linux modules and a small patch to the kernel). Kerrighed is independent of the cluster interconnection network.
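To make the shared memory model concrete, the sketch below is an ordinary POSIX multithreaded program of the kind Kerrighed targets; nothing in the code refers to the cluster, and the point of a single system image is that such unmodified code can in principle be spread across nodes by the operating system. The program itself is a generic illustration, not Kerrighed-specific code.

/* A plain POSIX-threads program: threads communicate through ordinary
 * shared memory (the partial[] array).  On a single system image such
 * as Kerrighed, the same unmodified code could have its threads placed
 * on different cluster nodes.  Generic sketch only. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N        1000000

static double partial[NTHREADS];    /* shared between all threads */

static void *work(void *arg)
{
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += NTHREADS)
        s += 1.0 / (i + 1);         /* each thread sums its own slice */
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);

    double total = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("harmonic sum = %f\n", total);
    return 0;
}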
Christine Morin holds a research director position at INRIA (http://www.inria.fr).
She carries out her research activities in
the PARIS project-team (http://www.irisa.fr/paris) at the IRISA (http://www.irisa.fr)
research center (the INRIA research unit in Rennes).
She currently leads a research activity aiming at designing and building a single
system image operating system, called Kerrighed (formerly
Gobelins), for high performance computing on clusters (http://www.kerrighed.org).
She has made contributions to the design of
fault-tolerant shared memory multiprocessor architectures (SMP, COMA, clusters)
and to the design of distributed systems.
Christine Morin received an engineering degree from the Institut National des
Sciences Appliquées (INSA) of Rennes, France, in 1987, and Master's and PhD degrees
in computer science from the University of Rennes 1 in 1987 and 1990, respectively.
She received the "Habilitation à diriger des recherches" in computer science
from the University of Rennes 1 in 1998.
The Open Source Cluster Application Resources (OSCAR) is a cluster software
stack providing a complete infrastructure for cluster
computing. The OSCAR project started in April 2000 with its first public release
a year later as a self-installing compilation of "best
practices" for high-performance classic Beowulf cluster computing. Since
its inception approximately three years ago, OSCAR has matured to
include cluster installation, maintenance, and operation capabilities and as
a result has become one of the most popular cluster computing
packages worldwide. In the past year, OSCAR has begun to expand into other cluster
paradigms, including a diskless cluster solution (Thin OSCAR) and a high-availability
version with fault-tolerant capabilities (HA-OSCAR). In this talk, I will discuss
the current status of the OSCAR project, including some preliminary information
on its two latest incarnations - Thin OSCAR and HA-OSCAR.
Dr. Stephen L. Scott is a senior research scientist in the Network and Cluster
Computing Group of the Computer Science and Mathematics Division of Oak Ridge
National Laboratory - USA. Stephen's responsibilities include research and development
efforts in high performance scalable cluster computing. His primary research interest
is in experimental systems with a focus on high performance, scalable, distributed,
heterogeneous, and parallel computing. Stephen is a founding member and on the
steering committee of The Open Cluster Group (OCG), a consortium of research
and industry organizations dedicated to making cluster computing practical for high performance
computing. He is also a founding member, version 2 release manager, and past
working group chair of the OCG's primary working group, Open Source Cluster
Application Resources (OSCAR). This working group is dedicated to bringing current
"best practices" in cluster computing to all users via a self-installing
software suite. He is also a contributor to the Parallel Virtual Machine (PVM)
and Heterogeneous Adaptable Reconfigurable NEtworked SystemS (HARNESS) research
efforts at ORNL. Stephen has a Ph.D. and M.S. in computer science and is a member
of the ACM, the IEEE Computer Society, and the IEEE Task Force on Cluster Computing.
Personal www.csm.ornl.gov/~sscott
Cluster tools www.csm.ornl.gov/ClusterPowerTools
TORC www.csm.ornl.gov/torc
PVM www.csm.ornl.gov/pvm
HARNESS www.csm.ornl.gov/harness
OCG www.OpenClusterGroup.org
OSCAR www.OpenClusterGroup.org/OSCAR
The presentation will address some of the recent innovations in the Linux kernel,
especially those related to the new 2.6 kernel, and we will analyze how
these can affect and improve clustering and distributed/parallel computing.
Andrea Arcangeli works for SuSE as a kernel developer, on many parts of the Linux
kernel including memory management, the scheduler, the I/O subsystem, the x86-64
port, and networking. His primary objective is to make Linux ever more reliable,
performant, responsive and scalable.
High-Availability (HA) clustering is a clustering technique where services are provided by the cluster as a whole, rather than by individual servers. Failures of individual nodes and services are recovered from using redundancy in the cluster. The Linux-HA project is the oldest and best-known open source HA project on Linux.
This talk will discuss the Linux-HA project - its capabilities, limitations
and future plans. In addition, we will also discuss the Open Cluster
Framework project, which is defining clustering APIs (HA and HPC) for Linux.
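To give a flavour of the basic mechanism, the sketch below shows the heartbeat principle on which such HA clusters rely: each node periodically announces that it is alive, and a peer that stays silent for longer than a configured dead time is declared failed and its services are taken over. This is a hypothetical, simplified illustration (the port number and timeouts are made up for the sketch), not the Linux-HA heartbeat implementation.

/* Monitor side of a simplistic heartbeat scheme (illustration only,
 * NOT Linux-HA code): the peer node sends a small UDP datagram every
 * few seconds; if nothing arrives for DEADTIME seconds the peer is
 * declared dead and a takeover would be triggered (typically acquiring
 * the service IP address and starting the failed services). */
#include <stdio.h>
#include <time.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>

#define HB_PORT  9694   /* hypothetical port chosen for this sketch */
#define DEADTIME 30     /* seconds of silence before declaring the peer dead */

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(HB_PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&addr, sizeof addr);

    /* wake up at least once a second so the dead-time check runs regularly */
    struct timeval tv = { 1, 0 };
    setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    time_t last_seen = time(NULL);
    for (;;) {
        char buf[64];
        if (recvfrom(s, buf, sizeof buf, 0, NULL, NULL) > 0)
            last_seen = time(NULL);            /* heard from the peer */
        if (time(NULL) - last_seen > DEADTIME) {
            puts("peer silent too long: taking over its services");
            last_seen = time(NULL);            /* do not repeat the takeover */
        }
    }
}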
Alan Robertson has been an active developer and project leader for High-Availability
Linux for the last several years. He maintains the Linux-HA project web site
at http://linux-ha.org, and has been a key developer for the open source heartbeat
program. He worked for SuSE for a year, then joined IBM's Linux Technology Center
in March 2001.
Alan also jointly leads the Open Cluster Framework effort (http://opencf.org/) to define standard APIs for clustering, and provide an open source reference implementation of these APIs.
Before joining SuSE, he was a Distinguished Member of Technical Staff at Bell Labs, where he worked for 21 years in a variety of roles, including developing products, designing communication controllers and providing leading-edge computing support.
He obtained an MS in Computer Science from Oklahoma State University in 1978 and a BS in Electrical Engineering from OSU in 1976.
For decades, through its missions for the EDF Group, the EDF R&D Division has been developing and maintaining scientific applications. At the end of the 1990s, EDF R&D undertook a deep and broad rethinking of the software architecture of its applications and the organization of its computing facilities. This reorganization of scientific computing has made PC clusters viable target machines for departmental or project use.
The CALIBRE project was launched in 2000 in order to spread PC cluster technology at EDF R&D. Its objectives were to study the technical feasibility of such a platform, evaluate its associated cost of ownership (TCO), develop expertise and build a service offering. The deployment of clusters has now reached beyond the R&D division.
This talk will present the results of the CALIBRE project and the roadmap for
the deployment of clusters at EDF.
Jean-Yves Berthou has been a researcher at EDF R&D since 1997. He regularly
teaches computer science at various French Universities and Engineering Schools.
He received a Ph.D in computer science from "Pierre et Marie Curie"
University (PARIS VI) in 1993. He also worked for two years at the CEA, the French
Atomic Energy Commission, as an expert in High Performance Computing. His current
research and teaching deal mainly with Parallel Programming and Software Architecture
for scientific computing.
Jean-Yves Berthou is currently the head of the Applied Scientific Computing Group at EDF R&D. The group is mainly in charge of Software Architecture, Code Optimization, High Performance Computing, and Cluster and Grid Computing.
CLIC, which stands for "Cluster LInux pour le Calcul", is a project
sponsored by the French government. Its aim
is to popularize the use of clusters with an easy-to-install GPL clustering
suite.
To meet this requirement, CLIC combines Mandrake Linux expertise in easy installation
and HPC-specific hardware support with a growing set of tools developed by researchers,
in a Linux distribution for 32- and 64-bit processors.
These tools include management and deployment tools, developed at the ID
laboratory and already used on its clusters, as well as contributions that allow
research teams from various fields (bioinformatics, astronomy, ...)
to easily distribute their software through RPMs.
Compagnie Générale de Géophysique (CGG) is a leading supplier of geophysical products and services to the worldwide oil and gas industry. CGG's Sercel® subsidiary produces seismic sources and data acquisition equipment. Based in Paris, France, CGG works worldwide on land and offshore to gather seismic data. CGG's processing and reservoir services offer seismic data management and processing as well as reservoir geophysics activities.
As seismic processing requires the capacity to handle terabytes of data
and is also very compute-intensive, CGG has always been a major European supercomputing
player. CGG now operates clusters at very large scale to process our seismic
data.
Our basic goal is to achieve a better price/performance ratio than the NUMA systems
that clusters were supposed to replace. Due to the nature of clusters, many interesting
challenges had to be overcome by our design and our operational mode in order to achieve
the best TCO and performance:
· based on commodity hardware, cluster nodes have limited reliability
· clusters are not well balanced for intensive calculations
· clusters are complex systems, and large-scale industrial operation is far from trivial
· the evolution of computer hardware makes clusters obsolete in a few months;
short-term leases are expensive and obsolescence is disruptive for large processing
centers
CGG today operates more than 15,000 CPUs in clusters worldwide. This huge computing
power, combined with CGG's Geocluster, allows us to efficiently produce quality
products for all our oil and gas clients, day after day.
Jean-Yves Blanc is the IT Architect for the Processing & Reservoir Business
Unit. He defines the IT strategy and architects the processing centers' IT infrastructures
in collaboration with the operational units. J-Y. Blanc joined CGG in 1992.
Prior to his current position within CGG, he was the head of the Parallel Computing
Development Group. He holds a Ph.D. in Applied Mathematics from the Institut
Polytechnique de Grenoble, France. He reports to Laurent Vercelli, manager
of the IT Industrialisation Department. CGG is headquartered in Paris, France.