Activity


Towards Blob-Based Convergence Between HPC and Big Data

The ever-increasing computing power available on HPC platforms raises major challenges for the underlying storage systems, both in terms of I/O performance requirements and scalability. Metadata management has been identified as a key factor limiting the performance of POSIX file systems. Due to their lower metadata overhead, blobs (Binary Large Objects) have been proposed as an alternative to traditional POSIX file systems for answering the storage needs of data-intensive HPC applications. Yet, interest in blobs extends far beyond HPC, as they are also used as a low-level primitive for providing higher-level storage abstractions such as key-value stores or relational databases in cloud environments.
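
To make the contrast with POSIX concrete, a blob store exposes a flat namespace of named objects with byte-range reads and writes, avoiding directory-tree traversal and per-file POSIX metadata. The following minimal sketch illustrates this interface shape (hypothetical names and an in-memory stand-in for illustration, not the actual API of RADOS or Týr):

  class BlobStore:
      """Flat namespace of blobs addressed by id and byte range."""
      def __init__(self):
          self._objects = {}  # blob id -> bytearray (in-memory stand-in)

      def write(self, blob_id, offset, data):
          """Write bytes at an arbitrary offset, growing the blob if needed."""
          blob = self._objects.setdefault(blob_id, bytearray())
          if len(blob) < offset + len(data):
              blob.extend(b"\0" * (offset + len(data) - len(blob)))
          blob[offset:offset + len(data)] = data

      def read(self, blob_id, offset, size):
          """Read a byte range: no path lookup, no stat(), no permissions."""
          return bytes(self._objects[blob_id][offset:offset + size])

  store = BlobStore()
  store.write("checkpoint-042", 0, b"simulation state")
  print(store.read("checkpoint-042", 0, 10))  # b'simulation'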

Activity in 2017

Sub-goal 1: We explored the following question: could blobs be an enabling factor for storage convergence between these two worlds? The objective is to leverage standard Big Data benchmarks as well as real-world HPC applications to answer it. Through extensive experiments, we seek to show that blobs are a solid base for HPC and Big Data storage convergence, by demonstrating that the same storage systems (namely RADOS and Týr) applied in both contexts can yield significant performance advantages over the state of the art on both sides. This work is done in collaboration with ANL, UPM (Spain) and DKRZ (Germany).

Sub-goal 2: Building on the previous goal, we intend to further demonstrate the applicability of blobs in an HPC context, which is less studied in the literature. To that end, we focus on one use case: distributed logging. We show that blobs can bring highly efficient support for applications such as computational steering, while offering unprecedented scalability on large-scale platforms. We leverage the local storage of the Theta supercomputer to deploy the storage system directly on the compute nodes.

Results: The first sub-goal led to an extensive study leveraging a set of 5 benchmarks from the SparkBench suite as well as 4 real-world data-intensive HPC applications. The results show that all 4 applications are able to work unmodified atop object storage. More importantly, they show that the same storage systems (Týr or RADOS) can be used in both HPC and Big Data contexts, with significant performance advantages in both. This clearly confirms the applicability and relevance of blob-based storage in a storage convergence context. These results were first published in a preliminary paper at the Cluster 2017 conference. A follow-up paper introducing the full set of experiments and giving a first positive answer to the question asked has been published in Future Generation Computer Systems (FGCS).

The second sub-goal was achieved through a 3-month visit of Pierre Matri (Inria) at ANL, under the direct supervision of Philip Carns and Rob Ross. Through extensive experiments on the Theta supercomputer with up to 120,000 cores, we demonstrated the performance benefits of using blobs for distributed logging on HPC systems. We also showed that the near-limitless scaling of blob storage systems brings significant performance benefits at large scale when compared with traditional, file-based storage. This work led to a publication at ICDCS 2018.
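
The core idea behind blob-based distributed logging is that many writers append records to a shared blob and the storage system assigns each append its offset, so no file-system-level locking or log reconstruction is needed. The sketch below illustrates the principle with a thread lock standing in for the store's ordering guarantee; this is an illustration only, not Týr's actual multi-writer append protocol:

  import threading

  class BlobLog:
      """Toy append-only log blob; the lock stands in for the store's
      internal ordering of concurrent appends."""
      def __init__(self):
          self._data = bytearray()
          self._lock = threading.Lock()

      def append(self, record):
          """Atomically append a record; return its offset in the blob."""
          with self._lock:
              offset = len(self._data)
              self._data.extend(record)
              return offset

  log = BlobLog()
  offsets = []
  def writer(rank):
      offsets.append(log.append(f"step data from rank {rank};".encode()))

  threads = [threading.Thread(target=writer, args=(r,)) for r in range(4)]
  for t in threads: t.start()
  for t in threads: t.join()
  print(sorted(offsets))  # non-overlapping offsets, one per writer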

Activity in 2018

Sub-goal 1: Such a convergence through blobs raises two main questions that we intend to answer in the coming year. First, how can we further optimize object storage for convergence? Indeed, the conclusions from these preliminary experiments show that the performance advantage of blobs over the state of the art can be further improved by leveraging the technologies available on both platforms. These include RDMA, which is increasingly available on Big Data platforms (for example through RoCE), and advanced memory architectures such as the NVRAM available on next-generation supercomputers.

Sub-goal 2: Second, how can this model be applied to converging data processing frameworks? While such frameworks originated on clouds, there is growing interest in the HPC community for artificial intelligence, through frameworks such as TensorFlow. This gives an unprecedented opportunity to take convergence one level higher, showing how convergent frameworks can be integrated with convergent storage to offer users a unified experience on HPC and clouds for such use cases.

Results: The first sub-goal is under ongoing investigation, using tools for HPC systems such as Mercury and Argobots, developed at Argonne National Laboratory. As part of the Mochi microservices framework, we intend to demonstrate that distributed logging can scale beyond what we demonstrated possible with Týr, and can be seamlessly integrated with the existing applicative stack. The second sub-goal drives the choice of applications to which to apply such a storage solution, exploring the potential of distributed logs to serve as a key data structure for machine learning applications harnessing the power of next-generation HPC systems.

People: Pierre Matri, Phil Carns, Alexandru Costan, Maria S. Pérez, Rob Ross, Gabriel Antoniu

Status: Running


Elastic storage systems

Goals: As large-scale computing infrastructures such as supercomputers and clouds keep growing in size, efficient resource utilization becomes a major concern for efficient data processing. Naturally, energy and cost savings can be obtained by reducing idle resources. Malleability, the ability for resource managers to dynamically increase or reduce the resources allocated to jobs, appears as a promising means to progress towards this goal. However, state-of-the-art parallel and distributed file systems have not been designed with malleability in mind, mainly because of the supposedly high cost of the data transfers associated with resizing a distributed file system. This project aims to evaluate the gains that such a malleable file system would bring to current applications.

Activity in 2017

Sub-goal: We explored the decommission of nodes in a distributed file system that relies on replication for fault tolerance.

Results: A realistic lower bound on the time needed to decommission nodes has been modeled. An experimental campaign was then launched to study decommission in HDFS, showing that the process is efficient when the network is the bottleneck, but could be sped up by up to a factor of 3 when storage is the bottleneck. The paper presenting these findings, How Fast Can One Scale Down a Distributed File System?, was accepted at the IEEE BigData 2017 conference.
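
To give an idea of the shape of such a bound (a simplified illustration under stated assumptions, not the paper's exact model): if x of N nodes leave a cluster storing D bytes in total, the leaving nodes host about xD/N bytes that the N - x remaining nodes must both receive over the network and write to storage, giving

  \[
    t_{decommission} \;\geq\; \max\left( \frac{xD/N}{(N-x)\,B_{net}},\;
                                         \frac{xD/N}{(N-x)\,B_{disk}} \right)
  \]

where B_net and B_disk are per-node network and storage write bandwidths. Whichever term dominates identifies the bottleneck resource, which matches the network-versus-storage observation above.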

People: Nathanaël Cheriere, Matthieu Dorier, Gabriel Antoniu, Rob Ross

Activity in 2018

Sub-goal 1: We explored the commission of nodes in a distributed file system that relies on replication for fault tolerance.

Sub-goal 2: Building on the previous goals and results, we applied them to practical tools: a benchmark to evaluate the potential of distributed storage system malleability on a given platform, and a component that eases the implementation of malleability in distributed storage systems by scheduling all data transfers needed by rescaling operations.

Results: A realistic lower bound on the time needed to commission nodes has been modeled. An experimental study highlighted that the commission mechanism of HDFS is not optimized for speed. The results are available as a research report. A paper regrouping the design of lower bounds for decommission and commission, How Fast Can One Resize a Distributed File System?, has been submitted to the journal JPDC.

A benchmark, Pufferbench, has been designed with two goals in mind: first, to measure the duration of actual rescaling operations on a given platform; second, to quickly prototype and optimize commission and decommission algorithms. The benchmark helped show that the previously established lower bounds can be approached by real-world implementations: on average, the commission and decommission times obtained with Pufferbench were within 16% of the lower bound. A paper presenting the results, Pufferbench: Evaluating and Optimizing Malleability of Distributed Storage, has been submitted to PDSW-DISCS 2018.
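
The heart of such a benchmark is to plan the data movements that a rescaling operation implies, then time them on the target platform. The sketch below shows this core loop in a much-simplified form (an illustration in the spirit of Pufferbench, with a hypothetical hash-based placement, not its actual implementation):

  import time

  def plan_rebalance(objects, old_nodes, new_nodes):
      """Map each object to a node by hashing; list the replicas to move."""
      moves = []
      for obj in objects:
          src = old_nodes[hash(obj) % len(old_nodes)]
          dst = new_nodes[hash(obj) % len(new_nodes)]
          if src != dst:
              moves.append((obj, src, dst))
      return moves

  def time_rescaling(objects, old_nodes, new_nodes, transfer):
      """Time the transfer phase of one rescaling operation."""
      moves = plan_rebalance(objects, old_nodes, new_nodes)
      start = time.perf_counter()
      for obj, src, dst in moves:
          transfer(obj, src, dst)  # plug in the real data path here
      return time.perf_counter() - start, len(moves)

  objs = [f"chunk-{i}" for i in range(10000)]
  elapsed, n = time_rescaling(objs, list(range(8)), list(range(6)),
                              lambda obj, src, dst: None)  # no-op demo
  print(f"{n} moves, {elapsed:.4f}s")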

A library aiming to ease the implementation of commission and decommission mechanisms is being developed during a 3-month visit of Nathanaël Cheriere at ANL, under the supervision of Rob Ross. Extensive experiments will be conducted on the Theta and Cooley systems.

People: Nathanaël Cheriere, Matthieu Dorier, Gabriel Antoniu, Rob Ross


Exploiting the Dragonfly Topology to Improve Communication Operations

Goals: High-radix direct network topologies such as Dragonfly have been proposed for petascale and exascale supercomputers because they ensure fast interconnection and reduce the cost of the network compared with traditional network topologies. The design of new machines such as Theta with a Dragonfly network presents an opportunity to further improve the performance of distributed applications by making algorithms aware of the topology. Indeed, current algorithms do not consider the topology and thus miss numerous optimization opportunities that it creates. This project aims to explore ways to exploit the strengths of the Dragonfly network topology in order to propose and evaluate optimized algorithms for global communication operations such as AllGather and Scatter.

People: Matthieu Dorier, Nathanael Cheriere, Shadi Ibrahim, Rob Ross, Gabriel Antoniu

Activity in 2016

Sub-goal: Studying Scatter and Allgather algorithms adapted to the Dragonfly topology using the CODES network simulation framework.

Results: We studied and extended existing algorithms for collective communication operations and used CODES, an event-driven simulator, to evaluate them. The simulations show expected results for AllGather, as well as surprising ones for Scatter: naive algorithms perform outstandingly well on Dragonfly because they exploit the characteristics of the routers in the network. In particular, the Scatter operation could be accelerated by a factor of up to 2 using a hardware-aware algorithm. These results were accepted as a poster for the ACM Student Research Competition at SC 2016, and will be submitted to IPDPS 2017.
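
The rationale for topology-aware Scatter on Dragonfly is to aggregate traffic over the scarce inter-group (global) links: the root sends one combined message to a leader in each group, which then scatters locally. The simplified count below illustrates the expected reduction in global-link messages (a back-of-the-envelope model with hypothetical machine dimensions, not the CODES simulations themselves), which makes the good performance of the naive algorithm all the more surprising:

  def naive_scatter_global_msgs(n_groups, ranks_per_group):
      # Root sends one individual message per destination rank; every rank
      # outside the root's group costs one traversal of the global links.
      return (n_groups - 1) * ranks_per_group

  def hierarchical_scatter_global_msgs(n_groups, ranks_per_group):
      # Root sends one aggregated message to a leader in each other group;
      # leaders then scatter locally over cheap intra-group links.
      return n_groups - 1

  g, r = 36, 128  # hypothetical Dragonfly dimensions
  print(naive_scatter_global_msgs(g, r))         # 4480 global-link messages
  print(hierarchical_scatter_global_msgs(g, r))  # 35 global-link messages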

Activity in 2017

Sub-goal: The Dragonfly model studied in the 2016 activity is a theoretical model that does not exactly match the topology and the hardware of current supercomputers such as Argonne's Theta machine. The objective of this year is to study the Scatter and Allgather algorithms proposed in the 2016 activity in the context of a model matching the Theta supercomputer, still using the CODES network simulation framework.

Results: We studied the previously proposed algorithms on the network used by the Theta supercomputer (ANL) using CODES. The simulations show unexpected results: contrary to the previous study on the subject, topology-aware algorithms do not bring a significant improvement in run time. However, very simple algorithms based on non-blocking communications perform better than the state of the art. For instance, Scatter was sped up by up to a factor of 7 compared with state-of-the-art algorithms. These results will be submitted to CCGrid 2018.
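
The "very simple" non-blocking approach amounts to the root posting all of its sends at once and letting the network progress them concurrently, rather than sending to destinations one after the other. A minimal mpi4py sketch of this principle follows (an illustration only, not the exact algorithm evaluated in the study):

  from mpi4py import MPI
  import numpy as np

  comm = MPI.COMM_WORLD
  rank, size = comm.Get_rank(), comm.Get_size()
  chunk = 1024  # bytes per destination rank

  recv = np.empty(chunk, dtype=np.uint8)
  if rank == 0:
      data = np.arange(size * chunk, dtype=np.uint8).reshape(size, chunk)
      recv[:] = data[0]
      # Post every send immediately; the network progresses them together.
      reqs = [comm.Isend([data[i], MPI.BYTE], dest=i) for i in range(1, size)]
      MPI.Request.Waitall(reqs)
  else:
      comm.Recv([recv, MPI.BYTE], source=0)
  print(rank, recv[:4])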

Status: Closing


In Situ Visualization of HPC Simulations using Dedicated Cores

General goals: Large-scale simulations running on leadership-class supercomputers generate massive amounts of data for subsequent analysis and visualization. As the performance of storage systems shows its limits, an alternative consists in embedding visualization and analysis algorithms within the simulation (in situ visualization). Our goal within this context is to explore the potential benefits of using Damaris, a middleware for I/O forwarding and post-processing using dedicated cores, to offload in situ visualization while sharing resources with the running simulation.
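
The dedicated-core approach boils down to partitioning the MPI ranks on each node: most ranks run the simulation, while one rank per node is reserved for I/O and visualization and exchanges data with its neighbors through shared memory. A minimal mpi4py sketch of this partitioning follows (a simplified illustration of the pattern, not Damaris's actual API):

  from mpi4py import MPI

  world = MPI.COMM_WORLD
  # Group ranks that share a node, then reserve one rank per node as the
  # dedicated I/O/visualization core.
  node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)
  is_dedicated = node_comm.Get_rank() == 0

  # Give simulation ranks and dedicated ranks separate communicators.
  role_comm = world.Split(color=1 if is_dedicated else 0, key=world.Get_rank())

  if is_dedicated:
      # Here: receive simulation data via node-local shared memory and run
      # the in situ analysis/visualization without blocking the simulation.
      print(f"rank {world.Get_rank()}: dedicated core")
  else:
      # Here: run the simulation and hand data to the node's dedicated core.
      print(f"rank {world.Get_rank()}: simulation core")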

People: Matthieu Dorier, Lokman Rahmani, Tom Peterka, Gabriel Antoniu, Roberto Sisneros, Dave Semeraro

Activity in 2013

Sub-goals: The goal of this year is to enable in situ visualization capabilities as a software layer on top of Damaris. This software layer, called Damaris/Viz, makes it possible to connect a running simulation to the VisIt visualization software and to process data using Python scripts. This interface will be evaluated with real simulations on different large-scale platforms. The second goal is to investigate how Damaris could be interfaced with a broader range of visualization software.

Results: (Figure: image generated through in situ visualization of the CM1 tornado simulation with Damaris and VisIt.) Damaris/Viz was evaluated with the CM1 atmospheric simulation on Grid'5000 and on NCSA's Blue Waters with up to 6,400 cores, and with the Nek5000 CFD simulation. Using Damaris as a bridge to existing visualization packages makes it possible to (1) reduce code modifications to a minimum for existing simulations, (2) gather the capabilities of several visualization tools behind a unified data management interface, (3) use dedicated cores to hide the runtime impact of in situ visualization, and (4) use memory efficiently through a shared-memory-based communication model. This work led to a publication at the IEEE LDAV 2013 (Large-scale Data Analysis and Visualization) conference, as well as a demo at the Inria booth at SC13.

Activity for 2014

Sub-goal: In situ visualization using Damaris/Viz still poses the problem of large amounts of data that need to be processed at a high rate, while only a small part of the data might be of interest to the end user. Our goal is to improve Damaris/Viz by enabling automated detection of interesting features in the datasets, in order to reduce the visualization payload and thus further improve the performance of in situ visualization.

Results: To meet the above goal, we used ITL (Information Theory Library) and DIY to integrate the efficient computation of information-theory-based metrics. Implementation and evaluation are in progress on Grid'5000 with various simulations including CM1. A joint paper on this topic is in preparation.
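
The underlying idea is to score regions of the simulation output with an information-theoretic metric such as Shannon entropy and forward only the highest-scoring regions to the visualization pipeline. A much-simplified NumPy sketch follows (an illustration of the principle; the actual work relies on ITL and DIY):

  import numpy as np

  def block_entropy(block, bins=32):
      """Shannon entropy of a block's value histogram, in bits."""
      hist, _ = np.histogram(block, bins=bins)
      p = hist[hist > 0] / hist.sum()
      return float(-(p * np.log2(p)).sum())

  field = np.random.default_rng(0).normal(size=(256, 256))
  field[64:128, 64:128] += np.linspace(0, 8, 64)  # inject a structured region
  blocks = [(i, j, field[i:i + 64, j:j + 64])
            for i in range(0, 256, 64) for j in range(0, 256, 64)]
  ranked = sorted(blocks, key=lambda b: block_entropy(b[2]), reverse=True)
  for i, j, b in ranked[:4]:  # forward only the top blocks downstream
      print(i, j, round(block_entropy(b), 2))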

Activity for 2015

Sub-goal 1: In situ visualization is only one aspect of a bigger challenge that consists of building workflows for HPC simulations at Exascale. This year's activity will focus on this challenge and propose ways to address it using existing tools developed by Inria and ANL, including Damaris, FlowVR, Decaf and Swift.

Results: Following discussions between partners, we published a paper at ISAV 2015 describing the aforementioned tools and the lessons learned from using them in production environments with real applications. We intend to follow up with a larger survey of existing research in the field of in situ analysis and visualization, and workflow management systems.

Sub-goal 2: While our experiments with Damaris and VisIt on Grid'5000 were conclusive, VisIt turns out to be difficult to deploy on machines such as Blue Waters, due to the necessary interactivity with the user. We intend to solve this by switching from VisIt to ParaView in the Smart Visualization workflow defined in previous years.

Results: The integration of ParaView and its in situ interface Catalyst is in progress, with promising preliminary experiments on Blue Waters.

Activity for 2016

Sub-goal 1: Within the Decaf workflow system, we will first look at ways to enable cycles in workflows, which will eventually enable feedback-driven in situ analysis with more human interaction. Finally, we will investigate how to couple Damaris, FlowVR and Decaf to enable new types of coupled tasks.

Results: The Decaf runtime was rewritten to enable cycles and to support multiple executables within the same workflow, making it easier for users to integrate their code with Decaf. We developed a steering workflow to guide a molecular dynamics simulation (Gromacs). The workflow combines Decaf (workflow runtime), FlowVR (connection with specialized MD visualization) and Damaris/Viz (VisIt visualization) in a cooperative manner.

Sub-goal 2: Data rate mismatch between coupled components is of great concern for in situ workflows. Current solutions are local buffering or frame drops, both implemented for instance in FlowVR. We will explore a complementary approach based on data staging on other nodes to reduce the number of frame drops.

Results: Mehmet Aktaş, from Rutgers University, spent 10 weeks at ANL to couple Decaf with DataSpaces. The result is a new two-layer buffering mechanism in Decaf: the first layer stores data in local memory, while the second layer sends data to DataSpaces servers to store it on dedicated nodes.
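
The principle of the mechanism is simple: frames are kept in local memory while space remains, and spill to the remote staging layer instead of being dropped once the local buffer is full. A minimal sketch follows (an illustration of the idea, not the actual Decaf/DataSpaces code):

  from collections import deque

  class TwoLayerBuffer:
      def __init__(self, local_capacity, staging_put):
          self.local = deque()            # layer 1: node-local memory
          self.capacity = local_capacity
          self.staging_put = staging_put  # layer 2: e.g. staging servers

      def push(self, frame):
          if len(self.local) < self.capacity:
              self.local.append(frame)    # fast path: keep the frame locally
          else:
              self.staging_put(frame)     # overflow: stage instead of drop

  staged = []
  buf = TwoLayerBuffer(2, staged.append)
  for f in ("f0", "f1", "f2", "f3"):
      buf.push(f)
  print(list(buf.local), staged)  # ['f0', 'f1'] ['f2', 'f3']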

Status: Closed.


Mitigating I/O Interference in Concurrent HPC Applications

General goals: With million-core supercomputers comes the problem of interference between distinct applications accessing a shared file system in a concurrent manner. Our work aims to investigate and quantify this interference effect, and to mitigate I/O interference through a novel approach that uses cross-application communication and coordination: CALCioM.

People: Matthieu Dorier, Gabriel Antoniu, Shadi Ibrahim, Rob Ross, Dries Kimpe, Orcun Yildiz

Activity in 2013

Sub-goal: This year is dedicated to investigating the interference phenomenon on large-scale platforms: the French Grid'5000 testbed and Argonne's BlueGene/P machine Surveyor. Our target is to show the importance of mitigating I/O interference, and to propose new scheduling strategies that take advantage of cross-application communication.

Results: Experiments conducted during Matthieu Dorier's internship led to a better understanding of the I/O interference phenomenon, and to the implementation of a prototype of the CALCioM approach, which currently includes 3 scheduling strategies. As a result of this work, a paper was accepted at IEEE IPDPS 2014.
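
To illustrate the kind of coordination CALCioM enables (a toy example of one possible strategy, not the paper's actual protocol): applications announce their I/O phases, and a policy such as serialization grants the shared file system to one application at a time, rather than letting concurrent phases slow each other down through interference.

  import threading

  class Coordinator:
      """Grants the shared file system to one application at a time (a
      toy 'serialize' strategy); without coordination, concurrent phases
      would interfere and each take longer."""
      def __init__(self):
          self.lock = threading.Lock()
          self.clock = 0.0  # simulated time on the shared file system

      def do_io(self, app, seconds):
          with self.lock:   # announce the phase and wait for our turn
              start, self.clock = self.clock, self.clock + seconds
              print(f"{app}: exclusive I/O from t={start:.1f} to t={self.clock:.1f}")

  coord = Coordinator()
  apps = [threading.Thread(target=coord.do_io, args=(f"app{i}", 2.0))
          for i in range(3)]
  for t in apps: t.start()
  for t in apps: t.join()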

Activity for 2014

Sub-goal 1: Having exemplified the interference phenomenon on synthetic benchmarks, we are now interested in showing how often such interference occurs and the nature of the applications involved in this phenomenon. This investigation will be done through the analysis of traces produced by the Darshan library on ANL's Intrepid BlueGene/P system.

Results: We developed Darshan-Ruby and Darshan-Web. Darshan-Ruby is a Ruby wrapper for ANL's Darshan library. Darshan-Web is a web platform for online analysis of Darshan log files, based on Ruby on Rails, D3.js and AJAX technologies.

Sub-goal 2: Our second goal is to find a way to improve CALCioM by modeling and predicting I/O patterns. This prediction should be made at run time, with no prior knowledge of the application, and should converge toward an accurate model of the application's I/O within a few iterations only.

Results: To this end, we developed Omnisc'IO, an approach that leverages formal grammars to model and predict the I/O behavior of HPC applications. Omnisc'IO was evaluated with four real applications: CM1, Nek5000, LAMMPS and GTC, and our results led to a paper at SC'14.
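
To illustrate runtime I/O prediction in a much-simplified form (Omnisc'IO itself builds a grammar-based model of the access sequence; the toy predictor below only remembers which access tends to follow which):

  from collections import defaultdict

  class IOPredictor:
      """Learn (offset delta, size) transitions at run time, with no
      prior knowledge of the application, and predict the next access."""
      def __init__(self):
          self.table = defaultdict(lambda: defaultdict(int))
          self.prev = None

      def observe(self, delta, size):
          if self.prev is not None:
              self.table[self.prev][(delta, size)] += 1
          self.prev = (delta, size)

      def predict(self):
          """Most frequent successor of the current access, or None."""
          nxt = self.table.get(self.prev)
          return max(nxt, key=nxt.get) if nxt else None

  p = IOPredictor()
  for _ in range(3):  # a simple repeating write pattern
      for access in [(0, 4096), (4096, 4096), (4096, 512)]:
          p.observe(*access)
  print(p.predict())  # (0, 4096): the pattern is about to restart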

Activity for 2015

Sub-goal 1: Our plan is to further develop and evaluate Omnisc'IO; in particular, we plan to investigate ways to further optimize the size of the grammar model generated by Omnisc'IO and to improve its prediction capabilities.

Results: Omnisc'IO has been improved and now uses a new algorithm that keeps its model's size constant over time, drastically reducing memory requirements. A paper extending our previous SC paper with new results has been submitted to TPDS (IEEE Transactions on Parallel and Distributed Systems).

Sub-goal 2: Our plan is to investigate the different factors that contribute to interference in HPC systems and to explore a model to predict this interference. This work is a potential subject for a summer internship of Orçun Yildiz at ANL (summer 2015).

Results: Orcun Yildiz visited ANL for 3 months (July to September) to investigate the root causes of I/O interference in HPC systems. An extensive experimental campaign was conducted on Grid'5000, leading to results published at the IPDPS 2016 conference.

Status: Closed.


Streaming in the clouds

Goals: In recent years, Big Data has become an important aspect of scientific discovery, a process referred to as the Fourth Paradigm. From the wide spectrum of applications and acquisition methods, the ones that generate the largest amounts of data fall into the category of streaming data (e.g., sensor networks, wide-area observatories, telescopes, or physical experiments such as CERN's LHC). As the amount of acquired information grows and the sources that generate data become geographically distributed, highly elastic and scalable infrastructures are required. An interesting option for executing such applications comes from cloud computing, which has emerged in the last years as the default solution for supporting on-demand scalable processing environments around the globe. Cloud infrastructures provide, alongside compute capacity, multiple options for storing data. The question that arises is: what are the best options for using clouds for such geographically distributed stream processing?

People: Radu Tudoran, Kate Keahey, Gabriel Antoniu, Alexandru Costan

Activity in 2013

Sub-goals: Investigate different data streaming scenarios using several platforms (FutureGrid and Microsoft Azure) in order to understand the performance delivered by different cloud storage and compute options.

Results: This work was carried out during Radu Tudoran's summer internship at ANL. It consists in experimenting with and analyzing different cloud solutions (Azure, Nimbus, OpenStack) and storage options to understand how clouds can support scalable stream processing and at what cost. The ATLAS experiment from CERN was used as the application scenario. Our results show that, despite initial expectations, copying data into virtual cloud storage attached to a local node (Copy & Compute strategy) delivers better performance than streaming data directly to the compute nodes (Stream & Compute strategy). The results were verified both on a commercial cloud infrastructure, Microsoft Azure, and on a scientific one, FutureGrid, using the Nimbus and OpenStack cloud software stacks. A joint paper is currently being submitted to the IEEE/ACM CCGRID 2014 conference.

Activity in 2014

Sub-goal 1: We further investigated several strategies for efficient data transfer, leveraging cloud infrastructure capabilities to support large-scale data dissemination across geographically distributed sites. The novelty comes from the use of self-adaptation combined with network and compute parallelism to provide reasonable transfer times. Using monitoring-based performance modeling, we plan to predict the best combination of protocol (e.g., memory-to-memory, FTP, BitTorrent) and transfer parameters (e.g., flow count, multicast enhancement, replication degree) in order to maximize throughput or minimize cost, according to application requirements.
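
A sketch of what such monitoring-based selection could look like (hypothetical throughput model and parameters for illustration, not the actual prototype): predict the transfer time of each (protocol, flow count) combination from recent monitoring samples, then pick the cheapest combination that meets the application's deadline.

  def predict_throughput(samples, flows):
      """Naive model: per-flow throughput from monitoring, scaled by the
      flow count with a hypothetical diminishing-returns exponent."""
      per_flow = sum(samples) / len(samples)
      return per_flow * flows ** 0.8

  def choose_plan(size_gb, deadline_s, monitored, cost_per_flow):
      best = None
      for proto, samples in monitored.items():
          for flows in (1, 2, 4, 8, 16):
              t = size_gb / predict_throughput(samples, flows)
              cost = flows * cost_per_flow[proto] * t  # flow-seconds pricing
              if t <= deadline_s and (best is None or cost < best[0]):
                  best = (cost, proto, flows, t)
      return best

  monitored = {"memory-to-memory": [0.9, 1.1], "ftp": [0.4, 0.5]}  # GB/s/flow
  print(choose_plan(200, 120, monitored,
                    {"memory-to-memory": 2.0, "ftp": 0.5}))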

Results: We implemented a prototype illustrating these design principles and presented its validation through large scale experiments in a paper accepted at the IEEE/ACM CCGrid 2014 conference.

Sub-goal 2: We leveraged the above strategies in a proposal for a dedicated transfer service targeting cloud providers. We argue that the adoption of such a uniform approach for large-scale transfers brings several benefits for both users and the cloud providers who offer it. For users of multi-site or federated clouds, our proposal is able to decrease the variability of transfers and increase throughput by up to three times compared with baseline user options, while benefiting from the well-known high availability of cloud-provided services. For cloud providers, such a service can decrease the energy consumption within a datacenter by up to half compared with user-based transfers.

Results: We developed a Transfer as a Service (TaaS) offering, implemented in the Azure cloud. Its benefits were explained in a paper presented at the IEEE SRDS 2014 conference.

Status: Closed.


Big Data Management for Scientific Workflows on Multi-Site HPC Clouds

Goals: We plan to develop a framework for the efficient management of data-intensive workflows in geographically dispersed HPC clouds. We focus on building a solution that enables information exchange between the computation and storage layers for purposeful co-scheduling of tasks and data. Since workflow data can reach huge sizes and need to support fine-grained data striping, metadata becomes a critical issue.

People: Luis Pineda Morales, Kate Keahey, Gabriel Antoniu, Alexandru Costan

Activity in 2015

Sub-goal 1: In a first phase, our work focuses on the development of a hierarchical, multi-site metadata registry that handles both file locations and workflow attributes. The experimental validation will be done using real-life workflows (BuzzFlow and Montage).
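
A sketch of what such a hierarchical registry might look like (hypothetical structure for illustration, not the implementation validated in the paper): each site keeps a local registry for cheap intra-site lookups, and a global index records which site owns each entry, so a local miss costs only one extra wide-area hop.

  class SiteRegistry:
      def __init__(self, name):
          self.name, self.entries = name, {}  # file -> (location, attrs)

  class MultiSiteRegistry:
      def __init__(self, sites):
          self.sites = {s.name: s for s in sites}
          self.global_index = {}              # file -> owning site name

      def register(self, site, path, location, attrs):
          self.sites[site].entries[path] = (location, attrs)
          self.global_index[path] = site

      def lookup(self, from_site, path):
          local = self.sites[from_site].entries.get(path)
          if local:                           # cheap intra-site hit
              return from_site, local
          owner = self.global_index[path]     # one extra wide-area hop
          return owner, self.sites[owner].entries[path]

  reg = MultiSiteRegistry([SiteRegistry("eu-west"), SiteRegistry("us-east")])
  reg.register("us-east", "/mosaic/tile-07.fits", "blob://42",
               {"task": "montage"})
  print(reg.lookup("eu-west", "/mosaic/tile-07.fits"))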

Sub-goal 2: We intend to leverage our metadata registry to provide support for elastic scaling of scientific applications on clouds. This support includes supplying information to an elastic provisioner (Phantom) to drive scheduling decisions based on certain policies. Such policies will be defined as a result of experimental evaluation on a real-life pipeline workflow for spatial data synthesis.

Results: We developed a multi-site metadata management tool and tested it with real-life workflows in a geo-distributed environment on the Azure cloud; the results were published in a paper at the IEEE Cluster 2015 conference. Also, Luis Pineda-Morales did a summer internship at ANL, evaluating a GIS scientific application to derive models for elastic scaling on clouds; a poster was accepted at the SC 2015 conference.

Status: Closed.

