Wednesday, April 29, 2015

More Conjecture on KNL's Near Memory

The Platform ran an interesting collection of conjectures on how KNL's on-package MCDRAM might be used this morning, and I recommend reading through it if you're following the race to exascale.  I was originally going to write this commentary as a Google+ post, but it got a little long, so pardon the lack of a proper lead-in here.

I appreciated Mr. Funk's detailed description of how processor caches interact with DRAM, and how this might translate into KNL's caching mode.  However, he underplays exactly why MCDRAM (and the GDDR on KNC) exists on these manycore architectures in his discussion on how MCDRAM may act as an L3 cache.  On-package memory is not simply another way to get better performance out of the manycore processor; rather, it is a hard requirement for keeping all 60+ cores (and their 120+ 512-bit vector registers, 1.8+ MB of L1 data cache, etc) loaded.  Without MCDRAM, it would be physically impossible for these KNL processors to achieve their peak performance due to memory starvation.  By extension, Mr. Funk's assumption that this MCDRAM will come with substantially lower latency than DRAM might not be true.

As a matter of fact, the massive parallelism game is not about latency at all; it came about as a result of latencies hitting a physical floor.  So, rather than drive clocks up to lower latency and increase performance, the industry has been throwing more but slower clocks at a given problem to mask the latencies of data access for any given worker.  While one thread may be stalled due to a cache miss on a Xeon Phi core, the other three threads are keeping the FPU busy to achieve the high efficiency required for performance.  This is at the core of the Xeon Phi architecture (as well as every other massively parallel architecture including GPUs and Blue Gene), so it is unlikely that Intel has sacrificed their power envelope to actually give MCDRAM lower latency than the off-package DRAM on KNL nodes.

At an architectural level, accesses to MCDRAM still needs to go through memory controllers like off-package DRAM.  Intel hasn't been marketing the MCDRAM controllers as "cache controllers," so it is likely that the latencies of memory access are on par with the off-package memory controllers.  There are simply more of these parallel MCDRAM controllers (eight) operating relative to off-package DRAM controllers (two), again suggesting that bandwidth is the primary capability.

Judging by current trends in GPGPU and KNC programming, I think it is far more likely that this caching mode acts at a much higher level, and Intel is providing it as a convenience for (1) algorithmically simple workloads with highly predictable memory access patterns, and (2) problems that will fit entirely within MCDRAM.  Like with OpenACC, I'm sure there will be some problems where explicitly on/off-package memory management (analogous to OpenACC's copyin, copyout, etc) aren't necessary and cache mode will be fine.  Intel will also likely provide all of the necessary optimizations in their compiler collection and MKL to make many common operations (BLAS, FFTs, etc) work well in cache mode as they did for KNC's offload mode.

However, to answer Mr. Funk's question of "Can pre-knowledge of our application’s data use--and, perhaps, even reorganization of that data--allow our application to run still faster if we instead use Flat Model mode," the answer is almost unequivocally "YES!"  Programming massively parallel architectures has never been easy, and magically transparent caches rarely deliver reliable, high performance.  Even the L1 and L2 caches do not work well without very deliberate application design to accommodate wide vectors; cache alignment and access patterns are at the core of why, in practice, it's difficult to get OpenMP codes working with high efficiency on current KNC processors.  As much as I'd like to believe otherwise, the caching mode on KNL will likely be even harder to effectively utilize, and explicitly managing the MCDRAM will be an absolute requirement for the majority of applications.

Wednesday, January 28, 2015

Thoughts on the NSF Future Directions Interim Report

The National Academies recently released an interim report entitled Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science and Engineering in 2017-2020 as a part of a $723,000 award commissioned to take a hard look at where the NSF's supercomputing program is going.  Since releasing the interim report, the committee has been soliciting feedback and input from the research community to consider as they draft their final report, and I felt compelled to put some of my thoughts into a response.

NSF's HPC programs are something I hold near and dear since I got my start in the industry by supporting two NSF-owned supercomputers.  I put a huge amount of myself into Trestles and Gordon, and I still maintain that job encompassed the most engaging and rewarding work I've ever done.  However, the NSF's lack of a future roadmap for its HPC program made my future feel perpetually uncertain, and this factored heavily in my decision to eventually pursue other opportunities.

Now that I am no longer affiliated with NSF, I wanted to delineate some of the problems I observed during my time on the inside with the hope that someone more important than me really thinks about how they can be addressed.  The report requested feedback in nine principal areas, so I've done my best to contextualize my thoughts with the committee's findings.

With that being said, I wrote this all up pretty hastily.  Some of it may be worded strongly, and although I don't mean to offend anybody, I stand by what I say.  That doesn't mean that my understanding of everything is correct though, so it's probably best to assume that I have no idea what I'm talking about here.

Finally, a glossary of terms may make this more understandable:

  • XD is the NSF program that funds XSEDE; it finances infrastructure and people, but it does not fund supercomputer procurements or operations
  • Track 1 is the program that funded Blue Waters, the NSF's leadership-class HPC resource
  • Track 2 is the program that funds most of the XSEDE supercomputers.  It funded systems like Ranger, Keeneland, Gordon, and Stampede

1. How to create advanced computing infrastructure that enables integrated discovery involving experiments, observations, analysis, theory, and simulation.

Answering this question involves a few key points:
  1. Stop treating NSF's cyberinfrastructure as a computer science research project and start treating it like research infrastructure operation.  Office of Cyberinfrastructure (OCI) does not belong in Computer & Information Science & Engineering (CISE).
  2. Stop funding cyberinfrastructure solely through capital acquisition solicitations and restore reliable core funding to NSF HPC centers.  This will restore a community that is conducive to retaining expert staff.
  3. Focus OCI/ACI and raise the bar for accountability and transparency.   Stop funding projects and centers that have no proven understanding of operational (rather than theoretical) HPC.
  4. Either put up or give up.  The present trends in funding lie on a road to death by attrition.  
  5. Don't waste time and funding by presuming that outsourcing responsibility and resources to commercial cloud or other federal agencies will effectively serve the needs of the NSF research community.
I elaborate on these points below.

2. Technical challenges to building future, more capable advanced computing systems and how NSF might best respond to them.

"Today’s approach of federating distributed compute- and data-intensive resources to meet the increasing demand for combined computing and data capabilities is technically challenging and expensive."
This is true.
"New approaches that co-locate computational and data resources might reduce costs and improve performance. Recent advances in cloud data center design may provide a viable integrated solution for a significant fraction of (but not all) data- and compute-intensive and combined workloads."
This strong statement is markedly unqualified and unsubstantiated.  If it is really recommending that the NSF start investing in the cloud, consider the following:
  • Cloud computing resources are designed for burst capabilities and are only economical when workloads are similarly uneven.  In stark contrast, most well-managed HPCs see constant, high utilization which is where the cloud becomes economically intractable.
  • The suggestion that cloud solutions can "improve performance" is unfounded.  At a purely technological level, the cloud will never perform as well as unvirtualized HPC resources, period.  Data-intensive workloads and calculations that require modest inter-node communication will suffer substantially.

In fact, if any cost reduction or performance improvement can be gained by moving to the cloud, I can almost guarantee that incrementally more can be gained by simply addressing the non-technological aspects of the current approach of operating federated HPC.  Namely, the NSF must
  1. Stop propping up failing NSF centers who have been unable to demonstrate the ability to effectively design and operate supercomputers. 
  2. Stop spending money on purely experimental systems that domain scientists cannot or will not use.

The NSF needs to re-focus its priorities and stop treating the XD program like a research project and start treating it like a business.  Its principal function should be to deliver a product (computing resources) to customers (the research community).  Any component that is not helping domain scientists accelerate discovery should be strongly scrutinized.  Who are these investments truly satisfying?
"New knowledge and skills will be needed to effectively use these new advanced computing technologies."
This is a critical component of XD that is extremely undervalued and underfunded.  Nobody is born with the ability to know how to use HPC resources, and optimization should be performed on users in addition to code.  There is huge untapped potential in collaborative training between U.S. federal agencies (DOE, DOD) and European organizations (PRACE).  If there is bureaucratic red tape in the way, it needs to be dealt with at an official level or circumvented at the grassroots level.

3. The computing needs of individual research areas.

XDMoD shows this.  The principal workloads across XSEDE are from traditional domains like physics and chemistry, and the NSF needs to recognize that this is not going to change substantially over the lifetime of a program like XD.

Straight from XDMoD for 2014.  MPS = math and physical sciences, BIO = biological sciences, GEO = geosciences.  NSF directorate is not a perfect alignment; for example, I found many projects in BIO were actually chemistry and materials science.

While I wholeheartedly agree that new communities should be engaged by lowering the barriers to entry, these activities cannot be done at a great expense of undercutting the resources required by the majority of XD users.

The cost per CPU cycle should not be deviating wildly between Track 2 awards because the ROI on very expensive cycles will be extremely poor.  If the NSF wants to fund experimental systems, it needs to do that as an activity that is separate from the production resources.  Alternatively, only a small fraction of each award should be earmarked for new technologies that represent a high risk; the Stampede award was a fantastic model of how a conservative fraction of the award (10%) can fund an innovative and high-risk technology.

4. How to balance resources and demand for the full spectrum of systems, for both compute- and data-intensive applications, and the impacts on the research community if NSF can no longer provide state-of-the-art computing for its research community.

"But it is unclear, given their likely cost, whether NSF will be able to invest in future highest-tier systems in the same class as those being pursued by the Department of Energy, Department of Defense, and other federal mission agencies and overseas."
The NSF does not have the budget to support leadership computing.  This is clear even from a bird's eye view: DOE ASCR's budget for FY2012 was $428 million and, by comparison, NSF ACI's budget was only $211 million.  Worse yet, despite having half the funding of its DOE counterpart, the NSF owned HPC resources at seven universities in FY2012 compared to ASCR's three centers.

Even if given the proper funding, the NSF's practice of spreading Track 2 awards across many universities to operate its HPC assets is not conducive to operating leadership computing.  The unpredictable nature of Track 2 awards has resulted in very uneven funding for NSF centers which, quite frankly, is a terrible way to attract and retain the highly knowledgeable world-class staff that is necessary to operate world-class supercomputers.

5. The role of private industry and other federal agencies in providing advanced computing infrastructure.

The report makes some very troubling statements in reference to this question.
"Options for providing highest-tier capabilities that merit further exploration include purchasing computing services from federal agencies…"
This sounds dirty.  Aren't there are regulations in place that restrict the way in which money can flow between the NSF and DOE?  I'm also a little put off by the fact that this option is being put forth in a report that is crafted by a number of US DOE folks whose DOE affiliations are masked by university affiliations in the introductory material.
"…or by making arrangements with commercial services (rather than more expensive purchases by individual researchers)."
Providing advanced cyberinfrastructure for the open science community is not a profitable venture.  There is no money in HPC operations.  I do not see any "leadership" commercial cloud providers offering the NSF a deal on spare cycles, and the going rate for commercial cloud time is known to be far more expensive than deploying HPC resources in-house at the national scale.

6. The challenges facing researchers in obtaining allocations of advanced computing resources and suggestions for improving the allocation and review processes.

"Given the “double jeopardy” that arises when researchers must clear two hurdles—first, to obtain funding for their research proposal and, second, to be allocated the necessary computing resources—the chances that a researcher with a good idea can carry out the proposed work under such conditions is diminished."
XD needs to be more tightly integrated with other award processes to mitigate the double jeopardy issue.  I have a difficult time envisioning the form which this integration would take, but the NSF GRF's approach of prominently featuring NSF HPC resources as a part of the award might be a good start.  As an adaptive proposal reviewer within XSEDE and a front-line interface with first-time users, I found that having the NSF GRF bundle XSEDE time greatly reduced the entry barrier for new users and made it easier for us reviewers to stratify the proposals.  Another idea may be to invite NSF center staff to NSF contractors' meetings (if such things exist; I know they do for DOE BES) to show a greater amount of integration across NSF divisions.

In addition, the current XSEDE allocation proposal process is extremely onerous.  The document that describes the process is ridiculously long and contains of obscure requirements that serve absolutely no purpose.  For example, all XSEDE proposals require a separate document detailing the scaling performance of their scientific software.  Demonstrating an awareness of the true costs of performing certain calculations has its merits, but a detailed analysis of scaling is not even relevant for the majority of users who run modest-scale jobs or use off-the-shelf black-box software like Gaussian.  The only thing these obscure requirements do is prevent new users, who are generally less familiar with all of the scaling requirements nonsense, from getting any time.  If massive scalability is truly required by an application, the PI needs to be moved over to the Track 1 system (Blue Waters) or referred to INCITE.

As a personal anecdote, many of us center staff found ourselves simply short-circuiting the aforementioned allocations guide and providing potential new users with a guide to the guide.  It was often sufficient to provide a checklist of minutia whose absence would result in an immediate proposal rejection and allow the PIs to do what they do best—write scientific proposals for their work.  Quite frankly, the fact that we had to provide a guide to understanding the guide to the allocations process suggests that the allocations process itself is grossly over-engineered.

7. Whether wider and more frequent collection of requirements for advanced computing could be used to inform strategic planning and resource allocation; how these requirements might be used; and how they might best be collected and analyzed.

The XD program has already established a solid foundation for reporting the popularity and usability of NSF HPC resources in XDMoD.  The requirements of the majority are evolving more slowly than computer scientists would have everyone believe.

Having been personally invested in two Track 2 proposals, I have gotten the impression that the review panels who select the destiny of the NSF's future HPC portfolio are more impressed by cutting edge, albeit untested and under-demanded, proposals.  Consequentially, taking a "functional rather than a technology-focused or structural approach" to future planning will result in further loss of focus.  Instead of delivering conservatively designed architectures that will enjoy guaranteed high utilization, functional approaches will give way to computer scientists on review panels dictating what resources domain scientists should be using to solve their problems.  The cart will be before the horse.

Instead, it would be far more valuable to include more operational staff in strategic planning.  The people on the ground know how users interact with systems and what will and won't work.  As with the case of leadership computing, the NSF does not have the financial commitment to be leading the design of novel computing architectures at large scales.  Exotic and high-risk technologies should be simply left out of the NSF's Track 2 program, incorporated peripherally but funded through other means (e.g., MRIs), or incorporated in the form of a small fraction of a larger, lower-risk resource investment.

A perspective of the greater context of this has been eloquently written by Dr. Steven Gottlieb.  Given his description of the OCI conversion to ACI, it seems like taking away the Office of Cyberinfrastructure's (OCI's) autonomy and placing it under Computer & Information Science & Engineering (CISE) exemplifies an ongoing and significant loss of focus within NSF.  This changed reflected the misconception that architecting and operating HPC resources for domain sciences is a computer science discipline.

This is wrong.

Computer scientists have a nasty habit of creating tools that are intellectually interesting but impractical for domain scientists.  These tools get "thrown over the wall," never to be picked up, and represent an overall waste of effort in the context of operating HPC services for non-computer scientists.  Rather, operating HPC resources for the research community requires experienced technical engineers with a pragmatic approach to HPC.  Such people are most often not computer scientists, but former domain scientists who know what does and doesn't work for their respective communities.

8. The tension between the benefits of competition and the need for continuity as well as alternative models that might more clearly delineate the distinction between performance review and accountability and organizational continuity and service capabilities.

"Although NSF’s use of frequent open competitions has stimulated intellectual competition and increased NSF’s financial leverage, it has also impeded collaboration among frequent competitors, made it more difficult to recruit and retain talented staff, and inhibited longer-term planning."
Speaking from firsthand experience, I can say that working for an NSF center is a life of a perpetually uncertain future and dicing up FTEs into frustratingly tiny pieces.  While some people are driven by competition and fundraising (I am one of them), an entire organization built up to support multi-million dollar cyberinfrastructure cannot be sustained this way.

At the time I left my job at an NSF center, my salary was covered by six different funding sources at levels ranging from 0.05 to 0.30 FTEs.  Although this officially meant that I was only 30% committed to directly supporting the operation of one of our NSF supercomputers, the reality was that I (and many of my colleagues) simply had to put in more than 100% of my time into the job.  This is a very high-risk way to operate because committed individuals get noticed and almost invariably receive offers of stable salaries elsewhere.  Retaining talent is extremely difficult when you have the least to offer, and the current NSF funding structure makes it very difficult for centers to do much more than continually hire entry-level people to replace the rising stars who find greener pastures.

Restoring reliable, core funding to the NSF centers would allow them to re-establish a strong foundation that can be an anchor point for other sites wishing to participate in XD.  This will effectively cut off some of the current sites operating Track 2 machines, but frankly, the NSF has spread its HPC resources over too many sites at present and is diluting its investments in people and infrastructure.  The basis for issuing this core funding could follow a pattern similar to that of XD where long-term (10-year) funding is provisioned with a critical 5-year review.

If the NSF cannot find a way to re-establish reliable funding, it needs to accept defeat and stop trying to provide advanced cyberinfrastructure.  The current method of only funding centers indirectly through HPC acquisitions and associated operations costs is unsustainable for two reasons:
  • The length of these Track 2 awards (typically 3 years of operations) makes future planning impossible.  Thus, this current approach forces centers to follow high-risk and inadequately planned roadmaps.
  • All of the costs associated with maintaining world-class expertise and facilities have to come from someone else's coffers.  Competitive proposals for HPC acquisitions simply cannot afford to request budgets that include strong education, training, and outreach programs, so these efforts wind up suffering.

9. How NSF might best set overall strategy for advanced computing-related activities and investments as well as the relative merits of both formal, top-down coordination and enhanced, bottom-up process.

Regarding the top-down coordination, the NSF should drop the Track 2 program's current solicitation model where proposers must have a vendor partner to get in the door.  This is unnecessarily restrictive and fosters an unhealthy ecosystem where vendors and NSF centers are both scrambling to pair up, resulting in high-risk proposals.  Consider the implications:
  1. Vendors are forced to make promises that they may not be able to fulfill (e.g., Track 2C and Blue Waters).  Given these two (of nine) solicitations resulted in substantial wastes of time and money (over 20% vendor failure rate!), I find it shocking that the NSF continues to operate this way.
  2. NSF centers are only capable of choosing the subset of vendors who are willing to play ball with them, resulting in a high risk of sub-optimal pricing and configurations for the end users of the system.

I would recommend a model, similar to many European nations', where a solicitation is issued for a vendor-neutral proposal to deploy and support a program that is built around a resource.  A winning proposal is selected based on not only the system features, its architecture, and the science it will support, but the plan for training, education, collaboration, and outreach as well.  Following this award, the bidding process for a specific hardware solution begins.

This addresses the two high-risk processes mentioned above and simultaneously eliminates the current qualification in Track 2 solicitations that no external funding can be included in the proposal.  By leaving the capital expenses out of the selection process, the NSF stands to get the best deal from all vendors and other external entities independent of the winning institution.

Bottom-up coordination is much more labor-intensive because it requires highly motivated people at the grassroots to participate.  Given the NSF's current inability to provide stable funding for highly qualified technical staff, I cannot envision how this would actually come together.

Wednesday, November 5, 2014

Storage Utilization in the Long Tail of Science


Since changing careers and moving up to the San Francisco Bay Area in July, I haven't had nearly as much time to post interesting things here on my blog—I guess that's the startup life. That isn't to say that my life in DNA sequencing hasn't been without interesting observations to explore though; the world of high-throughput sequencing is becoming increasingly dependent on high-performance computing, and many of the problems being solved in genomics and bioinformatics are stressing aspects of system architecture and cyberinfrastructure that haven't gotten a tremendous amount of exercise from the more traditional scientific domains in computational research.

Take, for example, the biggest and baddest DNA sequencer on the market: over the course of a three-day run, it outputs around 670 GB of raw (but compressed) sequence data, and this data is spread out over 1,400,000 files. This would translate to an average file size of around 500 KB, but the reality is that the file sizes are a lot less uniform:

Figure 1. File size distribution of a single flow cell output (~770 gigabases) on Illumina's highest-end sequencing platform

After some basic processing (which involves opening and closing hundreds of these files repeatedly and concurrently), these data files are converted into very large files (tens or hundreds of gigabytes each) which then get reduced down to data that is more digestible over the course of hundreds of CPU hours. As one might imagine, this entire process is very good at taxing many aspects of file systems, and on the computational side, most of this IO-intensive processing is not distributed and performance benefits most from single-stream, single-client throughput.

As a result of these data access and processing patterns, the storage landscape in the world of DNA sequencing and bioinformatics is quite different from conventional supercomputing. Some large sequencing centers do use the file systems we know and love (and hate) like GPFS at JGI and Lustre at Sanger, but it appears that most small- and mid-scale sequencing operations are relying heavily on network-attached storage (NAS) for both receiving raw sequencer data and being a storage substrate for all of the downstream data processing.

I say all of this because these data patterns—accessing large quantities of small files and large files with a high degree of random IO—is a common trait in many scientific applications used in the "long tail of science." The fact is, the sorts of IO for which parallel file systems like Lustre and GPFS are designed are tedious (if not difficult) to program, and for the majority of codes that don't require thousands of cores to make new discoveries, simply reading and writing data files in a naïve way is "good enough."

The Long Tail

This long tail of science is also using up a huge amount of the supercomputing resources made available to the national open science community; to illustrate, 98% of all jobs submitted to the XSEDE supercomputers in 2013 used 1024 or fewer CPU cores, and these modest-scale jobs represented over 50% of all the CPU time burned up on these machines.

Figure 2. Cumulative job size distribution (weighted by job count and SUs consumed) for all jobs submitted to XSEDE compute resources in 2013

The NSF has responded to this shift in user demand by awarding Comet, a 2 PF supercomputer designed to run these modest-scale jobs. The Comet architecture limits its full-bisection bandwidth interconnectivity to groups of 72 nodes, and these 72-node islands will actually have enough cores to satisfy 99% of all the jobs submitted to XSEDE clusters in 2013 (see above). By limiting the full-bisection connectivity to smaller islands and using less rich connectivity between islands, the cost savings in not having to buy so many mid-tier and core switches are then turned into additional CPU capacity.

What the Comet architecture doesn't address, however, is the question of data patterns and IO stress being generated by this same long tail of science—the so-called 99%. If DNA sequencing is any indicator of the 99%, parallel file systems are actually a poor choice for high-capacity, mid-scale jobs because their performance degrades significantly when facing many small files. Now, the real question is, are the 99% of HPC jobs really generating and manipulating lots of small files in favor of the large striped files that Lustre and GPFS are designed to handle? That is, might the majority of jobs on today's HPC clusters actually be better served by file systems that are less scalable but handle small files and random IO more gracefully?

Some colleagues and I set out to answer this question last spring, and a part of this quest involved looking at every single file on two of SDSC's Data Oasis file systems. This represented about 1.7 PB of real user data spread across two Lustre 2.4 file systems—one designed for temporary scratch data and the other for projects storage—and we wanted to know if users' data really consisted of the large files that Lustre loves or if, like job size, the 99% are really working with small files.  Since SDSC's two national resources, Gordon and Trestles, restrict the maximum core count for user jobs to modest-scale submissions, these file systems should contain files representative of long-tail users.

Scratch File Systems

At the roughest cut, files can be categorized based on whether their size is on the order of bytes and kilobytes (size < 1024*1024 bytes), megabytes (< 1024 KB), gigabytes (<1024 MB), and terabytes (< 1024 GB). Although pie charts are generally a terrible way to show relative compositions, this is how the files on the 1.2 PB scratch file system broke down:

Figure 3. Fraction of file count consumed by files of a given size on Data Oasis's scratch file system for Gordon

The above figure shows the number of files on the file system classified by their size, and there are clearly a preponderance of small files less than a gigabyte in size. This is not terribly surprising as the data is biased towards smaller files; that is, you can fit a thousand one-megabyte files in the same space that a single one-gigabyte file would take up. Another way to show this data is by how much file system capacity is taken up by files of each size:

Figure 4. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon

This makes it very apparent that the vast majority of the used space on this scratch file system—a total of 1.23 PB of data—are taken up by files on the order of gigabytes and megabytes. There were only seventeen files that were a terabyte or larger in size.

Incidentally, I don't find it too surprising that there are so few terabyte-sized files; even in the realm of Hadoop, median job dataset sizes are on the order of a dozen gigabytes (e.g., Facebook has reported that 90% of its jobs read in under 100 GB of data). Examining file sizes with much finer granularity reveals that the research data on this file system isn't even of Facebook scale though:

Figure 5. Number of files of a given size on Data Oasis's scratch file system for Gordon.  This data forms the basis for Figure 3 above

While there are a large number of files on the order of a few gigabytes, it seems that files on the order of tens of gigabytes or larger are far more scarce. Turning this into relative terms,

Figure 6. Cumulative distribution of files of a given size on Data Oasis's scratch file system for Gordon

we can make more meaningful statements. In particular,

  • 90% of the files on this Lustre file system are 1 megabyte or smaller
  • 99% of files are 32 MB or less
  • 99.9% of files are 512 MB or less
  • and 99.99% of files are 4 GB or less

The first statement is quite powerful when you consider the fact that the default stripe size in Lustre is 1 MB. The fact that 90% of files on the file system are smaller than this means that 90% of users' files really gain no advantages by living on Lustre. Furthermore, since this is a scratch file system that is meant to hold temporary files, it would appear that either user applications are generating a large amount of small files, or users are copying in large quantities of small files and improperly using it for cold storage. Given the quota policies for Data Oasis, I suspect there is a bit of truth to both.

Circling back a bit though, I said earlier that comparing just the quantity of files can be a bit misleading since a thousand 1 KB files will take up the same space as a single 1 MB file. We can also look at how much total space is taken up by files of various sizes.

Figure 7. File system capacity consumed by files of a given size on Data Oasis's scratch file system for Gordon.  This is just a more finely diced version of the data presented in Figure 4 above.

The above chart is a bit data-dense so it takes some staring at to understand what's going on. First looking at the purple line, we can pull out some pretty interesting facts:

  • Half of the file system's used capacity (50%) is consumed by files that are 1 GB or less in size
  • Over 20% of the file system's used capacity is taken up by files smaller than 64 MB
  • About 10% of the capacity is used by files that are 64 GB or larger

The blue boxes represent the derivative of that purple line—that is, how much space is taken up by files of only one specific size. The biggest chunk of the file system (141 TB) is taken up by 4 GB files, but it appears that there is a substantial range of file sizes that take up very similarly sized pieces of the pie. 512 MB files take up a total of 139 TB; 1 GB, 2 GB, and 8 GB files all take up over 100 TB of total space each as well. In fact, files ranging from 512 MB to 8 GB comprise 50% of the total file system capacity.

Why the sweet spot for space-consuming files is between 512 MB and 8 GB is unclear, but I suspect it's more caused by the human element in research. In my own research, I worked with files in this range simply because it was enough data to be statistically meaningful while still small enough to quickly re-analyze or transfer to a colleague. For file sizes above this range, the mass of the data made it difficult to manipulate using the "long-tail" cyberinfrastructure available to me. But, perhaps as more national-scale systems comes online to meet the needs of these sorts of workloads, this sweet spot will creep out to larger file sizes.

Projects Storage

The above discussion admittedly comes with a lot of caveats.  In particular, the scratch file system we examined was governed by no hard quotas which did lead some people to leave data resident for longer than they probably should have.  However, the other file system we analyzed was SDSC's Data Oasis projects storage which was architected for capacity over performance and featured substantially more disks per OSS.  This projects storage also came with 500 GB quotas by default, forcing users to be a little more mindful of what was worth keeping.

Stepping back to the coarse-grained kilobyte/megabyte/gigabyte/terabyte pie charts, here is how projects storage utilization compared to scratch storage:

Figure 8. Fraction of file count consumed by files of a given size on Data Oasis's projects file system (shared between Gordon and Trestles users)

On the basis of file counts, it's a bit surprising that users seem to store more smaller (kilobyte-sized) files in their projects space than their scratch space.  This may imply that the beginning and end data bookending simulations aren't as large as the intermediate data generated during the calculation.  Alternately, it may be a reflection of user naïveté; I've found that newer users were often afraid to use the scratch space because of the perception that their data may vanish from there without advanced notice.  Either way, gigabyte-sized files comprised a few hundredths of a percent of files, and terabyte-sized files were more scarce still on both file systems.  The trend was uniformly towards smaller sizes on projects space.

As far as space consumed by these files, the differences remain subtle.

Figure 9. Fraction of file system capacity consumed by files of a given size on Data Oasis's projects file system

There appears to be a trend towards users keeping larger files in their projects space, and the biggest change is the decrease in megabyte-sized files in favor of gigabyte-sized files.  However, this trend is very small and persists across a finer-grained examination of file size distributions:

Figure 10. File system capacity consumed by files of a given size on Data Oasis's projects file system

Half of the above plot is the same data shown above, making this plot twice as busy and confusing.  However there's a lot of interesting data captured in it, so it's worth the confusing presentation.  In particular, the overall distribution of mass with respect to the various file sizes is remarkably consistent between scratch and projects storage.  We see the same general peak of file size preference in the 1 GB to 10 GB range, but there is a subtle bimodal divide in projects storage that reveals preference for 128MB-512MB and 4GB-8GB files which manifests in the integrals (red and purple lines) that show a visibly greater slope in these regions.

The observant reader will also notice that the absolute values of the bars are smaller for projects storage and scratch storage; this is a result of the fact that the projects file system is subject to quotas and, as a result, is not nearly as full of user data.  To complicate things further, the projects storage represents user data from two different machines (each with unique job size policies, to boot), whereas the scratch storage is only accessible from one of those machines.  Despite these differences though, user data follows very similar distributions between both file systems.


It is probably unclear what to take away from these data, and that is with good reason.  There are fundamentally two aspects to quantifying storage utilizations--raw capacity and file count--because they represent two logically separate things.  There is some degree of interchangeability (e.g., storing a whole genome in one file vs. storing each chromosome its own file), and this is likely contributing to the broad peak in file size between 512 MB and 8 GB.  With that being said, it appears that the typical long-tail user stores a substantial amount of decidedly "small" files on Lustre, and this is exemplified by the fact that 90% of the files resident on the file systems analyzed here are 1 MB or less in size.

This alone suggests that large parallel file systems may not actually be the most appropriate choice for HPC systems that are designed to support a large group of long-tail users.  While file systems like Lustre and GPFS certainly provide a unique capability in that some types of medium-sized jobs absolutely require the IO capabilities of parallel file systems, there are a larger number of long-tail applications that do single-thread IO, and some of these perform IO in such an abusive way (looking at you, quantum chemistry) that they cannot run on file systems like Lustre or GPFS because of the number of small files and random IO they use.

So if Lustre and GPFS aren't the unequivocal best choice for storage in long-tail HPC, what are the other options?

Burst Buffers

I would be remiss if I neglected to mention burst buffers here since they are designed, in part, to address the limitations of parallel file systems.  However, their actual usability remains unproven.  Anecdotally, long-tail users are generally not quick to alter the way they design their jobs to use cutting-edge technology, and my personal experiences with Gordon (and its 300 TB of flash) were that getting IO-nasty user applications to effectively utilize the flash was often a very manual process that introduced new complexities, pitfalls, and failure modes.  Gordon was a very experimental platform though, and Cray's new DataWarp burst buffer seems to be the first large-scale productization of this idea.  It will be interesting to see how well it works for real users when the technology starts hitting the floor for open science in mid-2016, if not sooner.

High-Performance NAS

An emerging trend in HPC storage is the use of high-performance NAS as a complementary file system technology in HPC platforms.  Traditionally, NAS has been a very poor choice for HPC applications because of the limited scalability of the typical NAS architecture--data resides on traditional local file system with network service being provided by an additional software layer like NFS, and the ratio of storage capacity to network bandwidth out of the NAS is very high.

The emergence of cheap RAM and enterprise SSDs has allowed some sophisticated file systems like ZFS and NetApp's WAFL to demonstrate very high performance, especially in delivering very high random read performance, by using both RAM and flash as a buffer between the network and spinning rust.  This allows certain smaller-scale jobs to enjoy substantially better performance when running on flash-backed NAS than a parallel file system.  Consider the following IOP/metadata benchmark run on a parallel file system and a NAS head with SSDs for caching:

Figure 11. File stat rate on flash-backed NAS vs. a parallel file system as measured by the mdtest benchmark

A four-node job that relies on statting many small files (for example, an application that traverses a large directory structure such as the output of one of the Illumina sequencers I mentioned above) can achieve a much higher IO rate on a high-performance NAS than on a parallel file system.  Granted, there are a lot of qualifications to be made with this statement and benchmarking high-performance NAS is worth a post of its own, but the above data illustrate a case where NAS may be preferable over something like Lustre.

Greater Context

Parallel file systems like Lustre and GPFS will always play an essential role in HPC, and I don't want to make it sound like they can be universally replaced by high-performance NAS.  They are fundamentally architected to scale out so that increasing file system bandwidth does not require adding new partitions or using software to emulate a single namespace.  In fact, the single namespace of parallel file systems makes the management of the storage system, its users, and the underlying resources very flexible and straightforward.  No volume partitioning needs to be imposed, so scientific applications' and projects' data consumption do not have to align with physical hardware boundaries.

However, there are cases where a single namespace is not necessary at all; for example, user home directories are naturally partitioned with fine granularity and can be mounted in a uniform location while physically residing on different NAS heads with a simple autofs map.  In this example, leaving user home directories on a pool of NAS filers offers two big benefits:

  1. Full independence of the underlying storage mitigates the impact of one bad user.  A large job dropping multiple files per MPI process will crush both Lustre and NFS, but in the case of Lustre, the MDS may become unresponsive and block IO across all users' home directories.
  2. Flash caches on NAS can provide higher performance on IOP-intensive workloads at long-tail job sizes.  In many ways, high-performance NAS systems have the built-in burst buffers that parallel file systems are only now beginning to incorporate.
Of course, these two wins come at a cost:
  1. Fully decentralized storage is more difficult to manage.  For example, balancing capacity across all NAS systems is tricky when users have very different data generation rates that they do not disclose ahead of time.
  2. Flash caches can only get you so far, and NFS will fall over when enough IO is thrown at it.  I mentioned that 98% of all jobs use 1024 cores or fewer (see Figure 1), but 1024 cores all performing heavy IO on a typical capacity-rich, bandwidth-poor NAS head will cause it to grind to a halt.
Flash-backed high-performance NAS is not an end-all storage solution for long-tail computational science, but it also isn't something to be overlooked outright.  As with any technology in the HPC arena, its utility may or may not match up well with users' workloads, but when it does, it can deliver less pain and better performance than parallel file systems.


As I mentioned above, the data I presented here was largely generated as a result of an internal project in which I participated while at SDSC.  I couldn't have cobbled this all together without the help of SDSC's HPC Systems group, and I'm really indebted to +Rick+Haisong, and +Trevor for doing a lot of the heavy lifting in terms of generating the original data, getting systems configured to test, and figuring out what it all meant when the dust settled (even after I had left!).  SDSC's really a world-class group of individuals.