Wednesday, November 5, 2014

Storage Utilization in the Long Tail of Science

Shameless Advertising
If the things I discuss in this blog post sound like something you'd like to do professionally, you're in luck!  NERSC is looking to fill a position in user services in conjunction with the Joint Genome Institute.  The folks at NERSC are a world-class bunch, and this position is, in my biased opinion, one of the best roles to have in the world of HPC and bioinformatics.

This particular position is at the exciting crossroads of next-generation sequencing and high-performance computing, and you'll get to work on some of the largest scientific datasets being generated.  DOE and NERSC are great employers and organizations, and I seriously can't say enough good things about them.


Since changing careers and moving up to the San Francisco Bay Area in July, I haven't had nearly as much time to post interesting things here on my blog—I guess that's the startup life. That isn't to say that my life in DNA sequencing hasn't been without interesting observations to explore though; the world of high-throughput sequencing is becoming increasingly dependent on high-performance computing, and many of the problems being solved in genomics and bioinformatics are stressing aspects of system architecture and cyberinfrastructure that haven't gotten a tremendous amount of exercise from the more traditional scientific domains in computational research.

Take, for example, the biggest and baddest DNA sequencer on the market: over the course of a three-day run, it outputs around 670 GB of raw (but compressed) sequence data, and this data is spread out over 1,400,000 files. This would translate to an average file size of around 500 KB, but the reality is that the file sizes are a lot less uniform:

Figure 1. File size distribution of a single flow cell output (~770 gigabases) on Illumina's highest-end sequencing platform

After some basic processing (which involves opening and closing hundreds of these files repeatedly and concurrently), these data files are converted into very large files (tens or hundreds of gigabytes each) which then get reduced down to data that is more digestible over the course of hundreds of CPU hours. As one might imagine, this entire process is very good at taxing many aspects of file systems, and on the computational side, most of this IO-intensive processing is not distributed and performance benefits most from single-stream, single-client throughput.

As a result of these data access and processing patterns, the storage landscape in the world of DNA sequencing and bioinformatics is quite different from conventional supercomputing. Some large sequencing centers do use the file systems we know and love (and hate) like GPFS at JGI and Lustre at Sanger, but it appears that most small- and mid-scale sequencing operations are relying heavily on network-attached storage (NAS) for both receiving raw sequencer data and being a storage substrate for all of the downstream data processing.

I say all of this because these data patterns—accessing large quantities of small files and large files with a high degree of random IO—is a common trait in many scientific applications used in the "long tail of science." The fact is, the sorts of IO for which parallel file systems like Lustre and GPFS are designed are tedious (if not difficult) to program, and for the majority of codes that don't require thousands of cores to make new discoveries, simply reading and writing data files in a naïve way is "good enough."

The Long Tail

This long tail of science is also using up a huge amount of the supercomputing resources made available to the national open science community; to illustrate, 98% of all jobs submitted to the XSEDE supercomputers in 2013 used 1024 or fewer CPU cores, and these modest-scale jobs represented over 50% of all the CPU time burned up on these machines.

Figure 2. Cumulative job size distribution (weighted by job count and SUs consumed) for all jobs submitted to XSEDE compute resources in 2013

The NSF has responded to this shift in user demand by awarding Comet, a 2 PF supercomputer designed to run these modest-scale jobs. The Comet architecture limits its full-bisection bandwidth interconnectivity to groups of 72 nodes, and these 72-node islands will actually have enough cores to satisfy 99% of all the jobs submitted to XSEDE clusters in 2013 (see above). By limiting the full-bisection connectivity to smaller islands and using less rich connectivity between islands, the cost savings in not having to buy so many mid-tier and core switches are then turned into additional CPU capacity.

What the Comet architecture doesn't address, however, is the question of data patterns and IO stress being generated by this same long tail of science—the so-called 99%. If DNA sequencing is any indicator of the 99%, parallel file systems are actually a poor choice for high-capacity, mid-scale jobs because their performance degrades significantly when facing many small files. Now, the real question is, are the 99% of HPC jobs really generating and manipulating lots of small files in favor of the large striped files that Lustre and GPFS are designed to handle? That is, might the majority of jobs on today's HPC clusters actually be better served by file systems that are less scalable but handle small files and random IO more gracefully?

Some colleagues and I set out to answer this question last spring, and a part of this quest involved looking at every single file on two of SDSC's Data Oasis file systems. This represented about 1.7 PB of real user data spread across two Lustre 2.4 file systems—one designed for temporary scratch data and the other for projects storage—and we wanted to know if users' data really consisted of the large files that Lustre loves or if, like job size, the 99% are really working with small files.  Since SDSC's two national resources, Gordon and Trestles, restrict the maximum core count for user jobs to modest-scale submissions, these file systems should contain files representative of long-tail users.

Scratch File Systems

At the roughest cut, files can be categorized based on whether their size is on the order of bytes and kilobytes (size < 1024*1024 bytes), megabytes (< 1024 KB), gigabytes (<1024 MB), and terabytes (< 1024 GB). Although pie charts are generally a terrible way to show relative compositions, this is how the files on the 1.2 PB scratch file system broke down:

Figure 3. Fraction of file count consumed by files of a given size on Data Oasis's scratch file system for Gordon

The above figure shows the number of files on the file system classified by their size, and there are clearly a preponderance of small files less than a gigabyte in size. This is not terribly surprising as the data is biased towards smaller files; that is, you can fit a thousand one-megabyte files in the same space that a single one-gigabyte file would take up. Another way to show this data is by how much file system capacity is taken up by files of each size:

Figure 4. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon

This makes it very apparent that the vast majority of the used space on this scratch file system—a total of 1.23 PB of data—are taken up by files on the order of gigabytes and megabytes. There were only seventeen files that were a terabyte or larger in size.

Incidentally, I don't find it too surprising that there are so few terabyte-sized files; even in the realm of Hadoop, median job dataset sizes are on the order of a dozen gigabytes (e.g., Facebook has reported that 90% of its jobs read in under 100 GB of data). Examining file sizes with much finer granularity reveals that the research data on this file system isn't even of Facebook scale though:

Figure 5. Number of files of a given size on Data Oasis's scratch file system for Gordon.  This data forms the basis for Figure 3 above

While there are a large number of files on the order of a few gigabytes, it seems that files on the order of tens of gigabytes or larger are far more scarce. Turning this into relative terms,

Figure 6. Cumulative distribution of files of a given size on Data Oasis's scratch file system for Gordon

we can make more meaningful statements. In particular,

  • 90% of the files on this Lustre file system are 1 megabyte or smaller
  • 99% of files are 32 MB or less
  • 99.9% of files are 512 MB or less
  • and 99.99% of files are 4 GB or less

The first statement is quite powerful when you consider the fact that the default stripe size in Lustre is 1 MB. The fact that 90% of files on the file system are smaller than this means that 90% of users' files really gain no advantages by living on Lustre. Furthermore, since this is a scratch file system that is meant to hold temporary files, it would appear that either user applications are generating a large amount of small files, or users are copying in large quantities of small files and improperly using it for cold storage. Given the quota policies for Data Oasis, I suspect there is a bit of truth to both.

Circling back a bit though, I said earlier that comparing just the quantity of files can be a bit misleading since a thousand 1 KB files will take up the same space as a single 1 MB file. We can also look at how much total space is taken up by files of various sizes.

Figure 7. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon.  This is just a more finely diced version of the data presented in Figure 4 above.

The above chart is a bit data-dense so it takes some staring at to understand what's going on. First looking at the purple line, we can pull out some pretty interesting facts:

  • Half of the file system's used capacity (50%) is consumed by files that are 1 GB or less in size
  • Over 20% of the file system's used capacity is taken up by files smaller than 64 MB
  • About 10% of the capacity is used by files that are 64 GB or larger

The blue boxes represent the derivative of that purple line—that is, how much space is taken up by files of only one specific size. The biggest chunk of the file system (141 TB) is taken up by 4 GB files, but it appears that there is a substantial range of file sizes that take up very similarly sized pieces of the pie. 512 MB files take up a total of 139 TB; 1 GB, 2 GB, and 8 GB files all take up over 100 TB of total space each as well. In fact, files ranging from 512 MB to 8 GB comprise 50% of the total file system capacity.

Why the sweet spot for space-consuming files is between 512 MB and 8 GB is unclear, but I suspect it's more caused by the human element in research. In my own research, I worked with files in this range simply because it was enough data to be statistically meaningful while still small enough to quickly re-analyze or transfer to a colleague. For file sizes above this range, the mass of the data made it difficult to manipulate using the "long-tail" cyberinfrastructure available to me. But, perhaps as more national-scale systems comes online to meet the needs of these sorts of workloads, this sweet spot will creep out to larger file sizes.

Projects Storage

The above discussion admittedly comes with a lot of caveats.  In particular, the scratch file system we examined was governed by no hard quotas which did lead some people to leave data resident for longer than they probably should have.  However, the other file system we analyzed was SDSC's Data Oasis projects storage which was architected for capacity over performance and featured substantially more disks per OSS.  This projects storage also came with 500 GB quotas by default, forcing users to be a little more mindful of what was worth keeping.

Stepping back to the coarse-grained kilobyte/megabyte/gigabyte/terabyte pie charts, here is how projects storage utilization compared to scratch storage:

Figure 8. Fraction of file count consumed by files of a given size on Data Oasis's projects file system (shared between Gordon and Trestles users)

On the basis of file counts, it's a bit surprising that users seem to store more smaller (kilobyte-sized) files in their projects space than their scratch space.  This may imply that the beginning and end data bookending simulations aren't as large as the intermediate data generated during the calculation.  Alternately, it may be a reflection of user naïveté; I've found that newer users were often afraid to use the scratch space because of the perception that their data may vanish from there without advanced notice.  Either way, gigabyte-sized files comprised a few hundredths of a percent of files, and terabyte-sized files were more scarce still on both file systems.  The trend was uniformly towards smaller sizes on projects space.

As far as space consumed by these files, the differences remain subtle.

Figure 9. Fraction of file system capacity consumed by files of a given size on Data Oasis's projects file system

There appears to be a trend towards users keeping larger files in their projects space, and the biggest change is the decrease in megabyte-sized files in favor of gigabyte-sized files.  However, this trend is very small and persists across a finer-grained examination of file size distributions:

Figure 10. File system capacity consumed by files of a given size on Data Oasis's projects file system

Half of the above plot is the same data shown above, making this plot twice as busy and confusing.  However there's a lot of interesting data captured in it, so it's worth the confusing presentation.  In particular, the overall distribution of mass with respect to the various file sizes is remarkably consistent between scratch and projects storage.  We see the same general peak of file size preference in the 1 GB to 10 GB range, but there is a subtle bimodal divide in projects storage that reveals preference for 128MB-512MB and 4GB-8GB files which manifests in the integrals (red and purple lines) that show a visibly greater slope in these regions.

The observant reader will also notice that the absolute values of the bars are smaller for projects storage and scratch storage; this is a result of the fact that the projects file system is subject to quotas and, as a result, is not nearly as full of user data.  To complicate things further, the projects storage represents user data from two different machines (each with unique job size policies, to boot), whereas the scratch storage is only accessible from one of those machines.  Despite these differences though, user data follows very similar distributions between both file systems.


It is probably unclear what to take away from these data, and that is with good reason.  There are fundamentally two aspects to quantifying storage utilizations--raw capacity and file count--because they represent two logically separate things.  There is some degree of interchangeability (e.g., storing a whole genome in one file vs. storing each chromosome its own file), and this is likely contributing to the broad peak in file size between 512 MB and 8 GB.  With that being said, it appears that the typical long-tail user stores a substantial amount of decidedly "small" files on Lustre, and this is exemplified by the fact that 90% of the files resident on the file systems analyzed here are 1 MB or less in size.

This alone suggests that large parallel file systems may not actually be the most appropriate choice for HPC systems that are designed to support a large group of long-tail users.  While file systems like Lustre and GPFS certainly provide a unique capability in that some types of medium-sized jobs absolutely require the IO capabilities of parallel file systems, there are a larger number of long-tail applications that do single-thread IO, and some of these perform IO in such an abusive way (looking at you, quantum chemistry) that they cannot run on file systems like Lustre or GPFS because of the number of small files and random IO they use.

So if Lustre and GPFS aren't the unequivocal best choice for storage in long-tail HPC, what are the other options?

Burst Buffers

I would be remiss if I neglected to mention burst buffers here since they are designed, in part, to address the limitations of parallel file systems.  However, their actual usability remains unproven.  Anecdotally, long-tail users are generally not quick to alter the way they design their jobs to use cutting-edge technology, and my personal experiences with Gordon (and its 300 TB of flash) were that getting IO-nasty user applications to effectively utilize the flash was often a very manual process that introduced new complexities, pitfalls, and failure modes.  Gordon was a very experimental platform though, and Cray's new DataWarp burst buffer seems to be the first large-scale productization of this idea.  It will be interesting to see how well it works for real users when the technology starts hitting the floor for open science in mid-2016, if not sooner.

High-Performance NAS

An emerging trend in HPC storage is the use of high-performance NAS as a complementary file system technology in HPC platforms.  Traditionally, NAS has been a very poor choice for HPC applications because of the limited scalability of the typical NAS architecture--data resides on traditional local file system with network service being provided by an additional software layer like NFS, and the ratio of storage capacity to network bandwidth out of the NAS is very high.

The emergence of cheap RAM and enterprise SSDs has allowed some sophisticated file systems like ZFS and NetApp's WAFL to demonstrate very high performance, especially in delivering very high random read performance, by using both RAM and flash as a buffer between the network and spinning rust.  This allows certain smaller-scale jobs to enjoy substantially better performance when running on flash-backed NAS than a parallel file system.  Consider the following IOP/metadata benchmark run on a parallel file system and a NAS head with SSDs for caching:

Figure 11. File stat rate on flash-backed NAS vs. a parallel file system as measured by the mdtest benchmark

A four-node job that relies on statting many small files (for example, an application that traverses a large directory structure such as the output of one of the Illumina sequencers I mentioned above) can achieve a much higher IO rate on a high-performance NAS than on a parallel file system.  Granted, there are a lot of qualifications to be made with this statement and benchmarking high-performance NAS is worth a post of its own, but the above data illustrate a case where NAS may be preferable over something like Lustre.

Greater Context

Parallel file systems like Lustre and GPFS will always play an essential role in HPC, and I don't want to make it sound like they can be universally replaced by high-performance NAS.  They are fundamentally architected to scale out so that increasing file system bandwidth does not require adding new partitions or using software to emulate a single namespace.  In fact, the single namespace of parallel file systems makes the management of the storage system, its users, and the underlying resources very flexible and straightforward.  No volume partitioning needs to be imposed, so scientific applications' and projects' data consumption do not have to align with physical hardware boundaries.

However, there are cases where a single namespace is not necessary at all; for example, user home directories are naturally partitioned with fine granularity and can be mounted in a uniform location while physically residing on different NAS heads with a simple autofs map.  In this example, leaving user home directories on a pool of NAS filers offers two big benefits:

  1. Full independence of the underlying storage mitigates the impact of one bad user.  A large job dropping multiple files per MPI process will crush both Lustre and NFS, but in the case of Lustre, the MDS may become unresponsive and block IO across all users' home directories.
  2. Flash caches on NAS can provide higher performance on IOP-intensive workloads at long-tail job sizes.  In many ways, high-performance NAS systems have the built-in burst buffers that parallel file systems are only now beginning to incorporate.
Of course, these two wins come at a cost:
  1. Fully decentralized storage is more difficult to manage.  For example, balancing capacity across all NAS systems is tricky when users have very different data generation rates that they do not disclose ahead of time.
  2. Flash caches can only get you so far, and NFS will fall over when enough IO is thrown at it.  I mentioned that 98% of all jobs use 1024 cores or fewer (see Figure 1), but 1024 cores all performing heavy IO on a typical capacity-rich, bandwidth-poor NAS head will cause it to grind to a halt.
Flash-backed high-performance NAS is not an end-all storage solution for long-tail computational science, but it also isn't something to be overlooked outright.  As with any technology in the HPC arena, its utility may or may not match up well with users' workloads, but when it does, it can deliver less pain and better performance than parallel file systems.


As I mentioned above, the data I presented here was largely generated as a result of an internal project in which I participated while at SDSC.  I couldn't have cobbled this all together without the help of SDSC's HPC Systems group, and I'm really indebted to +Rick+Haisong, and +Trevor for doing a lot of the heavy lifting in terms of generating the original data, getting systems configured to test, and figuring out what it all meant when the dust settled (even after I had left!).  SDSC's really a world-class group of individuals.

Sunday, June 29, 2014

Exascale in perspective: RSC's 1.2 petaflop rack

Russian supercomputing manufacturer RSC generated some buzz at ISC'14 last week when they showed their 1.2 PF-per-rack Xeon Phi-based platform.  I was aware of this system from when they first announced it a few months prior, and I referenced it in a piece of a blog post I was writing about the scarier aspects of exascale computing.  Given my impending career change though, it is unclear that I will have the time to ever finish that post before it becomes outdated.  Since RSC is back in the spotlight though, I thought I'd post the piece I wrote up to illustrate how wacky this 1.2 PF rack really is in terms of power consumption.  Power consumption, of course, is the limiting factor standing between today and the era of exascale computing.

So, to put a 400 kW, 1.2 PF rack into perspective, here is that piece:

The Importance of Energy Efficiency

Up through the petascale era in which we currently live, raw performance of high-performance components--processors, RAM, and interconnect--were what limited the ultimate performance of a given high-end machine.  The first petaflop machine, Los Alamos' Roadrunner, derived most of its FLOPs from high-speed PowerXCell 8i processors pushing 3.2 GHz per core.  Similarly, the first 10 PF supercomputer, RIKEN's K computer, derived its performance from its sheer size of 864 cabinets.  Although I don't mean to diminish the work done by the engineers that actually got these systems to deliver this performance, the petascale era really was made possible by making really big systems out of really fast processors.

By contrast, Exascale represents the first milestone where the limitation does not lie in making these high-performance components faster; rather, performance is limited by the amount of electricity that can be physically delivered to a processor and the amount of heat that can be extracted from it.  This limitation is what has given rise to these massively parallel processors that eschew a few fast cores for a larger number of low-powered ones.  By keeping clock speeds low and densely packing many (dozens or hundreds) of compute cores on a single silicon die, these massively parallel processors are now realizing power efficiencies (flops per watt) that are an order of magnitude higher than what traditional CPUs can deliver.

The closest technology on the market that will probably resemble the future's exaflop machines are based on accelerators--either NVIDIA GPUs or Intel's MICs.  The goal will be to jam as many of these massively parallel processors into as small a space and with as tight of an integration as possible.  Recognizing this trend, NERSC has opted to build what I would call the first "pre-exascale" machine in its NERSC-8 procurement which will feature a homogeneous system of manycore processors.

However, such pre-exascale hardware doesn't actually exist yet, and NERSC-8 won't appear until 2016.  What does exist, though, is a product by Russia's RSC Group called PetaStream: a rack packed with 1024 current-generation Xeon Phi (Knight's Corner) coprocessors that has a peak performance of 1.2 PF/rack.  While this sounds impressive, it also highlights the principal challenge of exascale computing: power consumption.  One rack of RSC PetaStream is rated for 400 kW, delivering 3 GFLOPs/watt peak.  Let's put this into perspective.

Kilowatts, megawatts, and gigawatts in perspective

During a recent upgrade to our data center infrastructure, three MQ DCA220SS-series diesel generators were brought in for the critical systems.  Each is capable of producing 220 kVA according to the spec sheets.
Three 220 kVA diesel generators plugged in during a PM at SDSC
It would take three of these diesel generators to power a single rack of RSC's PetaStream.  Of course, these backup diesel generators aren't a very efficient way of generating commercial power, so this example is a bit skewed.

Let's look at something that is used to generate large quantities of commercial power instead.  A GE 1.5-77 wind turbine, which is GE's most popular model, is advertised as delivering 1.5 megawatts at wind speeds above 15 miles per hour.

GE 1.5 MW wind turbine.   Source: NREL
Doing the math, this means that the above pictured turbine would be able to power only three racks of RSC PetaStream on a breezy day.

To create a supercomputer with a peak capability of an exaflop using RSC's platform, you'd need over 800 racks of PetaStream and over 300 MW of power to turn it all on.  That's over 200 of the above GE wind turbines and enough electrity to power about 290,000 homes in the U.S.  Wind farms of this size do exist; for example,

300 MW Stateline Wind Farm.  Source: Wikimedia Commons
the Stateline Wind Farm, which was built on the border between Oregon and Washington, has a capacity of about 300 MW.  Of course, wind farms of this capacity cannot be built in any old place.

Commercial nuclear power plants can be built in a variety of places though, and they typically generate on the order of 1 gigawatt (GW) of power per reactor.  In my home state of New Jersey, the Hope Creek Nuclear Generating Station has a single reactor that was built to deliver about 1.2 GW of power:

1.2 GW Hope Creek nuclear power station.  The actual reactor is housed in the concrete cylinder to the bottom left.  Courtesy of the Nuclear Regulatory Commission.

This is enough to power almost 4 exaflops of PetaStream.  Of course, building a nuclear reactor for every exaflop supercomputer would be extremely costly, given the multi-billion dollar cost of building reactors like this.  Clearly, the energy efficiency (flops/watt) of computing technology needs to improve substantially before we can arrive at the exascale era.

Tuesday, June 24, 2014

Perspectives on the Current State of Data-Intensive Scientific Computing

I recently had the benefit of being invited to attend two workshops in Oakland, CA, hosted by the U.S. Department of Energy (DOE), that shared the common theme of emerging trends in data-intensive computing: the Joint User Forum on Data-Intensive Computing and the High Performance Computing Operational Review.  My current employment requires that I stay abreast of all topics in data-intensive scientific computing (I wish there was an acronym to abbreviate this...DISC perhaps?) so I didn't go in with the expectation of being exposed to a world of new information.  As it turned out though, I did gain a very insightful perspective on how data-intensive scientific computing (DISC), and I daresay Big Data, is seen from the people who operate some of the world's largest supercomputers.

The DOE perspective is surprisingly realistic, application-oriented, and tightly integrated with high-performance computing.  There was the obligatory discussion of Hadoop and how it may be wedged into machines at LLNL with Magpie, ORNL with Spot Hadoop, and SDSC with myHadoop, of course, and there was also some discussion of real production use of Hadoop on bona fide Hadoop clusters at some of the DOE labs.  However, Hadoop played only a minor role in the grand scheme of the two meetings for all of the reasons I've outlined previously.

Rather, these two meetings had three major themes that crept into all aspects of the discussion:
  1. Scientific workflows
  2. Burst buffers
  3. Data curation
I found this to be a very interesting trend, as #1 and #2 (workflows and burst buffers) aren't topics I'd heard come up at any other DISC workshops, forums, or meetings I've attended.  The connection between DISC and workflows wasn't immediately evident to me, and burst buffers are a unique aspect of cyberinfrastructure that have only been thrust into the spotlight with the NERSC-8/LANL Trinity RFP last fall.  However, all three of these topics will become central to both data-intensive scientific computing and, by virtue of their ability to produce data, exascale supercomputers.

Scientific workflows

Workflows are one of those aspects of scientific computing that have been easy to dismiss as the toys of computer scientists because traditional problems in high-performance computing have typically been quite monolithic in how they are run.  SDSC's own Kepler and USC's Pegasus systems are perhaps the most well-known and highly engineered workflow management systems, and I have to confess that when I'd first heard of them a few years ago, I thought they seemed like a very complicated way to do very simple tasks.

As it turns out though, both data-intensive scientific computing and exascale computing (by virtue of the output size of exaflop calculations) tend to follow patterns that look an awful lot like map/reduce at a very abstract level.  This is a result of the fact that most data-intensive problems are not processing giant monoliths of tightly coupled and inter-related data; rather, they are working on large collections of generally independent data.  Consider the recent talk I gave about a large-scale genomic study on which I consulted; the general data processing flow was
  1. Receive 2,190 input files, 20 GB each, from a data-generating instrument
  2. Do some processing on each input file
  3. Combine groups of five input files into 438 files, each 100 GB in size
  4. Do more processing 
  5. Combine 438 files into 25 overlapping groups to get 100 files, each 2.5 GB in size
  6. Do more processing
  7. Combine 100 files into a single 250 GB file
  8. Perform statistical analysis on this 250 GB file for scientific insight
The natural data-parallelism inherent from the data-generating instrument means that any collective insight to be gleaned from this data requires some sort of mapping and reduction, and the process of managing this large volume of distributed data is where scientific workflows become a necessary part of data-intensive scientific computing.  Managing terabytes or petabytes of data distributed across thousands or millions of logical records (whether they be files on a file system, rows in a database, or whatever else) very rapidly becomes a problem that nobody will want to do by hand.  Hadoop/HDFS delivers an automated framework for managing these sorts of workflows if you don't mind rewriting all of your processing steps against the Hadoop API and building out HDFS infrastructure, but if this is not the case, alternate workflow management systems begin to look very appealing.

The core debate was not whether or not workflow management systems were a necessary component in DISC; rather, I observed two salient, open questions:
  1. The systems in use at DOE (notably Fireworks and qdo) are primarily used to work around deficiencies in current HPC schedulers (e.g., Moab and SLURM) in that they cannot handle scheduling hundreds of thousands of tiny jobs concurrently.  Thus, should these workflow managers be integrated into the scheduler to address these shortcomings at their source?
  2. How do we stop every user from creating his or her own workflow manager scripts and adopt an existing solution instead?  Should one workflow manager rule them all, or should a Darwinian approach be taken towards the current diverse landscape of existing software?
Question #1 is a highly technical question that has several dimensions; ultimately though, it's not clear to me that there is enough incentive for resource manager and scheduler developers to really dig into this problem.  They haven't done this yet, and I can only assume that this is a result of the perceived domain-specificity and complexity of each workflow.  In reality, a large number of workflows can be accommodated by two simple features: support for directed acyclic graphs (DAGs) of tasks and support for lightweight, fault-tolerant task scheduling within a pool of reserved resources.  Whether or not anyone will rise to the challenge of incorporating this in a usable way is an open question, but there certainly is a need for this in the emerging realm of DISC.

Question #2 is more interesting to me since this problem of multiple people cooking up different but equivalent solutions to the same problems is pervasive throughout computational and computer science. This is in large part due to the fatal assumption held by many computer scientists that good software can be simply "thrown over the fence" to scientists and it will be adopted.  This has never worked; rather, the majority of widely adopted software technologies in HPC have been a result of the standardization of a landscape of similar but non-standard tools.  This is something I touched on in a previous post when outlining the history of MPI and OpenMP's successes.

I don't think the menagerie of workflows' developers are ready to settle on a standard, as the field is not mature enough to have a holistic understanding of all of the issues that workflows need to solve.  Despite the numerous presentations and discussions of various workflow solutions being used across DOE's user facilities, my presentation was the only one that considered optimizing workflow execution for the underlying hardware.  Given that the target audience of these talks were users of high-performance computing, the lack of consideration given to the performance aspects of workflow optimization is a testament to this immaturity.

Burst buffers

For those who haven't been following the details of one of DOE's more recent procurement rounds, the NERSC-8 and Trinity request for proposals (RFP) explicitly required that all vendor proposals include a burst buffer to address the capability of multi-petaflop simulations to dump tremendous amounts of data in very short order.  The target use case is for petascale checkpoint-restart, where the memory of thousands of nodes (hundreds of terabytes of data) needs to be flushed to disk in an amount of time that doesn't dominate the overall execution time of the calculation.

The concept of what a burst buffer is remains poorly defined.  I got the sense that there are two outstanding definitions:
  • The NERSC burst buffer is something more tightly integrated on the compute side of the system and may be a resource that can be allocated on a per-job basis
  • The Argonne burst buffer is something more tightly integrated on the storage side of the system and acts in a fashion that is largely transparent to the user.  This sounded a lot like the burst buffer support being explored for Lustre.
In addition, Los Alamos National Labs (LANL) is exploring burst buffers for the Trinity procurement, and it wasn't clear to me if they had chosen a definition or if they are exploring all angles.  One commonality is that DOE is going full-steam ahead on providing this burst buffer capability in some form or another, and solid-state storage is going to be a central enabling component.

Personally, I find the NERSC burst buffer concept a lot more interesting since it provides a more general purpose flash-based resource that can be used in novel ways.  For example, emerging software-defined storage platforms like EMC's Vipr can potentially provide very fine-grained access to flash as-needed to make better overall use of the underlying SSDs in HPC environments serving a broad user base (e.g., NERSC and the NSF centers).  Complementing these software technologies are emerging hardware technologies like DSSD's D5 product which will be exposing flash to compute systems in innovative ways at hardware, interconnect, and software levels.

Of course, the fact that my favorite supercomputer provides dynamically allocatable SSDs in a fashion not far removed from these NERSC burst buffers probably biases me, but we've demonstrated unique DISC successes enabled by our ability to pile tons of flash on to single compute nodes.  This isn't to say that the Argonne burst buffer isn't without merit; given that the Argonne Leadership Computing Facility (ALCF) caters to capability jobs rather than capacity jobs, their user base is better served by providing a uniform, transparent burst I/O capability across all nodes.  The NERSC burst buffer, by comparison, is a lot less transparent and will probably be much more susceptible to user disuse or misuse.  I suspect that when the dust settles, both takes on the burst buffer concept will make their way into production use.

A lot of the talk and technologies surrounding burst buffers are shrouded in NNSA secrecy or vendor non-disclosures, so I'm not sure what more there is to be said.  However, the good folks at HPCwire ran an insightful article on burst buffers after the NERSC-8 announcement for those who are interested in more detail.

Data curation

The final theme that bubbled just beneath the surface of the DOE workshops was the idea that we are coming upon an era where scientists can no longer save all their data from all their calculations in perpetuity.  Rather, someone will have to become the curator of the scientific data being generated by computations and figure out what is and is not worth keeping, and how or where that data should be stored and managed.  This concept of selectively retaining user data manifested in a variety of discussions ranging from in-place data sharing and publication with Globus Plus and science DMZs to transparently managing online data volumes with hierarchical storage management (HSM).  However, the common idea was that scientists are going to have to start coming to grips with data management themselves, as facilities will soon be unable to cope with the entirety of their users' data.

This was a particularly interesting problem to me because it very closely echoed the sentiments that came about from Datanami's recent LeverageBIGDATA event which had a much more industry-minded audience.  The general consensus is that several fields are far ahead of the pack in terms of addressing this issue; the high-energy physics community has been filtering data at its genesis (e.g., ignoring the data from uninteresting collision events) for years now, and enterprises seem comfortable with retaining marketing data for only as long as it is useful.  By comparison, NERSC's tape archive has not discarded user data since its inception several decades ago; each new tape system simply repacks the previous generation's tape to roll all old data forward.

All of the proposed solutions for this problem revolve around metadata.  The reality is that not all user data has equal importance, and there is a need to provide a mechanism for users (or their applications) to describe this fact.  For example, the principal use case for the aforementioned burst buffers is to store massive checkpoint-restart files; while these checkpoints are important to retain while a calculation is running, they have limited value after the calculation has completed.  Rather than rely on a user to manually recognize that these checkpoints can be deleted, the hope is that metadata attributes can be attached to these checkpoint files to indicate that they are not critical data that must be retained forever for automated curation systems to understand.

The exact way this metadata would be used to manage space on a file system remains poorly defined.  A few examples of exactly how metadata can be used to manage data volume in data-intensive scientific computing environments include
  • tagging certain files or directories as permanent or ephemeral, signaling that the file system can purge certain files whenever a cleanup is initiated;
  • tagging certain files with a set expiration date, either as an option or by default.  When a file ages beyond a certain point, it would be deleted;
  • attributing a sliding scale of "importance" to each file, so that files of low importance can be transparently migrated to tape via HSM
Some of these concepts are already implemented, but the ability for users and applications to attach extensible metadata to files in a file system-agnostic way does not yet exist.  I think this is a significant gap in technology that will need to be filled in very short order as pre-exascale machines begin to demonstrate the ability to generate tremendous I/O loads.  Frankly, I'm surprised this issue hasn't been solved in a broadly deployable way yet.

The good news here is that the problem of curating digital data is not new; it is simply new to high-performance computing.  In the spirit of doing things the right way, DOE invited the director of LANL's Research Library to attend the workshops, and she provided valuable insights into how methods of digital data curation may be applied to these emerging challenges in data-intensive scientific computing.

Final Thoughts

The products of the working groups' conventions at the HPC Operational Review are being assembled into a report to be delivered to DOE's Office of Science, and it should be available online at the HPCOR 2014 website as well as the usual DOE document repository in a few months.  Hopefully it will reflect what I feel was the essence of the workshop, but at any rate, it should contain a nice perspective on how we can expect the HPC community to address the new demands emerging from data-intensive scientific computing (DISC) community.

In the context of high-performance computing, 
  • Workflow management systems will continue to gain importance as data sets become larger, more parallel, and more unwieldy.
  • Burst buffers, in one form or another, will become the hardware solution to the fact that all exascale simulations will become data-intensive problems.
  • Data curation frameworks are the final piece of the puzzle and will provide the manageability of data at rest.
None of these three legs are fully developed, and this is simply an indication of data-intensive scientific computing's immaturity relative to more traditional high-performance computing:  
  • Workflows need to converge on some sort of standardized API or feature set in order to provide the incentive to users to abandon their one-off solutions.
  • Burst buffer technology has diverged into two solutions centered at either the compute or storage side of a DISC platform; both serve different workloads, and the underlying hardware and software configurations remain unfinished.
  • Effective data curation requires a metadata management system that will allow both users and their applications to identify the importance of data to automate sensible data retention policy enforcement and HSM.
Of course, I could be way off in terms of what I took away from these meetings seeing as how I don't really know what I'm talking about.  Either way, it was a real treat to be invited out to hang out with the DOE folks for a week; I got to meet some of my personal supercomputing heroes, share war stories, and make some new pals.

I also got to spend eight days getting to know the Bay Area.  So as not to leave this post entirely without a picture,

I also learned that I have a weird fascination with streetcars.  I'm glad I was introduced to supercomputers first.