Thursday, August 3, 2017

Understanding I/O on the mid-2017 iMac

My wife recently bought me a brand new mid-2017 iMac to replace my ailing, nine-year-old HP desktop.  Back when I got the HP, I was just starting to learn about how computers really worked and didn't understand much about how the CPU connected to all of the other ports that came off the motherboard--everything that sat between the SATA ports and the CPU itself was a no-man's land of mystery to me.

Between then and now though, I've somehow gone from being a poor graduate student doing molecular simulation to a supercomputer I/O architect.  Combined with the fact that my new iMac had a bunch of magical new ports that I didn't understand (USB-C ports that can tunnel PCIe, USB 3.1, and Thunderbolt??), I figured I'd sit down and see if I could actually figure out exactly how the I/O subsystem on this latest Kaby Lake iMac was wired up.

I'll start out by saying that the odds were in my favor--over the last decade, the I/O subsystem of modern computers has gotten a lot simpler as more of the critical components (like the memory controllers and PCIe controllers) have moved on-chip.  As CPUs become more tightly integrated, individual CPU cores, system memory, and PCIe peripherals can all talk to each other without having to cross a bunch of proprietary middlemen like in days past.  Having to understand how the front-side bus clock is related to the memory channel frequency all gets swept under the rug that is the on-chip network, and I/O (that is, moving data between system memory and stuff outside of the CPU) is a lot easier.

With all that said, let's cut to the chase.  Here's a block diagram showing exactly how my iMac is plumbed, complete with bridges to external interfaces (like PCIe, SATA, and so on) and the bandwidths connecting them all:

Aside from the AMD Radeon GPU, just about every I/O device and interface hangs off of the Platform Controller Hub (PCH) through a DMI 3.0 connection.  When I first saw this, I was a bit surprised by how little I understood; PCIe makes sense since that is the way almost all modern CPUs (and their memory) talk to the outside world, but I'd never given the PCH a second thought, and I didn't even know what DMI was.

As with any complex system though, the first step towards figuring out how it all works is to break it down into simpler components.  Here's what I figured out.

Understanding the PCH

In the HPC world, all of the performance-critical I/O devices (such as InfiniBand channel adapters, NICs, SSDs, and GPUs) are directly attached to the PCIe controller on the CPU.  By comparison, the PCH is almost a non-entity in HPC nodes since all it does is provide low-level administration interfaces like a USB and VGA port for crash carts.  It had never occurred to me that desktops, which are usually optimized for universality over performance, would depend so heavily on the rinky-dink PCH.

Taking a closer look at the PCIe devices that talk to the Sunrise Point PCH:

we can see that the PCH chip provides PCIe devices that act as

  • a USB 3.0 controller
  • a SATA controller
  • a HECI controller (which acts as an SMBus controller)
  • a LPC controller (which acts as an ISA controller)
  • a PCI bridge (0000:00:1b) (to which the NVMe drive, not a real PCI device, is attached)
  • a PCIe bridge (0000:00:1c) that breaks out three PCIe root ports
Logically speaking, these PCIe devices are all directly attached to the same PCIe bus (domain #0000, bus #00; abbreviated 0000:00) as the CPU itself (that is, the host bridge device #00, or 0000:00:00).  However, we know that the PCH, by definition, is not integrated directly into the on-chip network of the CPU (that is, the ring that allows each core to maintain cache coherence with its neighbors).  So how can this be?  Shouldn't there be a bridge that connects the CPU's bus (0000:00) to a different bus on the PCH?

Clearly the answer is no, and this is a result of Intel's proprietary DMI interface which connects the CPU's on-chip network to the PCH in a way that is transparent to the operating system.  Exactly how DMI works is still opaque to me, but it acts like an invisible PCIe bridge that glues together physically separate PCIe buses into a single logical bus.  The major limitation to DMI as implemented on Kaby Lake is that it only has the bandwidth to support four lanes of PCIe Gen 3.

Given that DMI can only support the traffic of a 4x PCIe 3.0 device, there is an interesting corollary: the NVMe device, which attaches to the PCH via a 4x PCIe 3.0 link itself, can theoretically saturate the DMI link.  In such a case, all other I/O traffic (such as that coming from the SATA-attached hard drive and the gigabit NIC) is either choked out by the NVMe device or competes with it for bandwidth.  In practice, very few NVMe devices can actually saturate a PCIe 3.0 4x link though, so unless you replace the iMac's NVMe device with an Optane SSD, this shouldn't be an issue.
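
To put a number on "four lanes of PCIe Gen 3," here is a minimal sketch of the arithmetic.  It assumes the standard per-lane signaling rates and line encodings for each PCIe generation and ignores packet and protocol overhead, so real-world throughput will be somewhat lower.  The function and variable names are my own, not from any tool.

```python
# Per-lane signaling rate (GT/s) and line-code efficiency for each PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gbit(gen, lanes):
    """Usable one-direction bandwidth in Gbit/s, ignoring protocol overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt * efficiency * lanes


def gbit_to_gbyte(gbit):
    """Convert Gbit/s to GB/s."""
    return gbit / 8


x4 = pcie_bandwidth_gbit(3, 4)
print(f"PCIe 3.0 x4: {x4:.1f} Gbit/s = {gbit_to_gbyte(x4):.2f} GB/s")
```

This is where the "about 4 GB/sec" figure for a 4x link comes from: a hair under 32 Gbit/sec once the 128b/130b encoding is accounted for.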

Understanding Alpine Ridge

The other mystery component in the I/O subsystem is the Thunderbolt 3 controller (DSL6540), called Alpine Ridge.  These are curious devices that I still admittedly don't understand fully (they play no role in HPC) because, among other magical properties, they can tunnel PCIe to external devices.  For example, the Thunderbolt to Ethernet adapters widely available for MacBooks are actually fully fledged PCIe NICs, wrapped in a neat white plastic package, that tunnel PCIe signaling over a cable.  In addition, they can somehow deliver this PCIe signaling, DisplayPort, and USB 3.1 through a single self-configuring physical interface.

It turns out that being able to run multiple protocols over a single cable is a feature of the USB-C physical specification, which is a completely separate standard from USB 3.1.  However, the PCIe magic that happens inside Alpine Ridge is a result of an integrated PCIe switch which looks like this:

The Alpine Ridge PCIe switch connects up to the PCH with a single PCIe 3.0 4x and provides four downstream 4x ports for peripherals.  If you read the product literature for Alpine Ridge, it advertises two of these 4x ports for external connectivity; the remaining two 4x ports are internally wired up to two other controllers:

  • an Intel 15d4 USB 3.1 controller.  Since USB 3.1 runs at 10 Gbit/sec, this 15d4 USB controller should support at least two USB 3.1 ports that can talk to the upstream PCH at full speed
  • a Thunderbolt NHI controller.  According to a developer document from Apple, NHI is the native host interface for Thunderbolt and is therefore the true heart of Alpine Ridge.
The presence of the NHI on the PCIe switch is itself kind of interesting; it's not a peripheral device so much as a bridge that allows non-PCIe peripherals to speak native Thunderbolt and still get to the CPU memory via PCIe.  For example, Alpine Ridge also has a DisplayPort interface, and it's likely that DisplayPort signals enter the PCIe subsystem through this NHI controller.

Although Alpine Ridge delivers some impressive I/O and connectivity options, it has some pretty critical architectural qualities that limit its overall performance in a desktop.  Notably,

  • Apple recently added support for external GPUs that connect to MacBooks through Thunderbolt 3.  While this sounds really awesome in the sense that you could turn a laptop into a gaming computer on demand, note that the best bandwidth you can get between an external GPU and the system memory is about 4 GB/sec, or the performance of a single PCIe 3.0 4x link.  This pales in comparison to the 16 GB/sec bandwidth available to the AMD Radeon which is directly attached to the CPU's PCIe controller in the iMac.
  • Except in the cases where Thunderbolt-attached peripherals are talking to each other via DMA, they appear to all compete with each other for access to the host memory through the single PCIe 4x upstream link.  4 GB/sec is a lot of bandwidth for most peripherals, but this does mean that an external GPU and a USB 3.1 external SSD or a 4K display will be degrading each other's performance.
In addition, Thunderbolt 3 advertises 40 Gbit/sec performance, but PCIe 3.0 4x only provides 32 Gbit/sec.  Thus, it doesn't look like you can actually get 40 Gbit/sec from Thunderbolt all the way to system memory under any conditions; the peak Thunderbolt performance is only available between Thunderbolt peripherals.

Overall Performance Implications

The way I/O in the iMac is connected definitely introduces a lot of performance bottlenecks that would make this a pretty scary building block for a supercomputer.  The fact that the Alpine Ridge's PCIe switch has a 4:1 taper to the PCH, and the PCH then further tapers all of its peripherals to a single 4x link to the CPU, introduces a lot of cases where performance of one component (for example, the NVMe SSD) can depend on what another device (for example, a USB 3.1 peripheral) is doing.  The only component which does not compromise on performance is the Radeon GPU, which has a direct connection to the CPU and its memory; this is how all I/O devices in typical HPC nodes are connected.
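
The tapering described above can be made concrete with a toy model.  This is only a sketch: it uses the round bandwidth figures quoted in this post (4 GB/sec for a 4x link, 16 GB/sec for the GPU's link, 40 Gbit/sec for Thunderbolt 3), the component names are labels I made up, and it says nothing about how bandwidth is actually arbitrated when devices contend.

```python
# Nominal one-direction link bandwidths in GB/s, using the round numbers quoted above.
# Each entry maps a component to (upstream parent, bandwidth of the link to that parent).
LINKS = {
    "radeon_gpu": ("cpu", 16.0),              # PCIe 3.0 x16, straight to the CPU
    "pch": ("cpu", 4.0),                      # DMI 3.0, about the same as PCIe 3.0 x4
    "nvme_ssd": ("pch", 4.0),                 # PCIe 3.0 x4
    "alpine_ridge": ("pch", 4.0),             # PCIe 3.0 x4 upstream link
    "tb3_peripheral": ("alpine_ridge", 5.0),  # Thunderbolt 3 at 40 Gbit/s
}


def path_to_cpu(component):
    """Return the bandwidths of every link crossed on the way to the CPU."""
    bandwidths = []
    while component != "cpu":
        component, bandwidth = LINKS[component]
        bandwidths.append(bandwidth)
    return bandwidths


def bottleneck(component):
    """The slowest link on the path bounds the bandwidth to system memory."""
    return min(path_to_cpu(component))


def taper(downstream_bandwidths, upstream_bandwidth):
    """Oversubscription ratio of a switch: total downstream over upstream."""
    return sum(downstream_bandwidths) / upstream_bandwidth


for name in ("radeon_gpu", "nvme_ssd", "tb3_peripheral"):
    print(f"{name}: at most {bottleneck(name):.0f} GB/s to system memory")

# Alpine Ridge: four downstream x4 ports share one x4 upstream link.
print(f"Alpine Ridge taper: {taper([4.0] * 4, 4.0):.0f}:1")
```

Only the GPU's path avoids the shared 4x links; everything else is bounded by 4 GB/sec before any contention is even considered.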

With all that being said, the iMac's I/O subsystem is a great design for its intended use.  It effectively trades peak I/O performance for extreme I/O flexibility; whereas a typical HPC node would ensure enough bandwidth to operate an InfiniBand adapter at full speed while simultaneously transferring data to a GPU, it wouldn't support plugging in a USB 3.1 hard drive or a 4K monitor.

Plugging USB 3 hard drives into an HPC node is surprisingly annoying.  I've had to do this for bioinformaticians, and it involves installing a discrete PCIe USB 3 controller alongside high-bandwidth network controllers.

Curiously, as I/O becomes an increasingly prominent bottleneck in HPC though, we are beginning to see very high-performance and exotic I/O devices entering the market.  For example, IBM's BlueLink is able to carry a variety of protocols at extreme speeds directly into the CPU, and NVLink over BlueLink is a key technology enabling scaled-out GPU nodes in the OpenPOWER ecosystem.  Similarly, sophisticated PCIe switches are now proliferating to meet the extreme on-node bandwidth requirements of NVMe storage nodes.

Ultimately though, PCH and Thunderbolt aren't positioned well to become HPC technologies.  If nothing else, I hope this breakdown helps illustrate how performance, flexibility, and cost drive the system design decisions that make desktops quite different from what you'd see in the datacenter.

Appendix: Deciphering the PCIe Topology

Figuring out everything I needed to write this up involved a little bit of anguish.  For the interested reader, here's exactly how I dissected my iMac to figure out how its I/O subsystem was plumbed.

Foremost, I had to boot my iMac into Linux to get access to dmidecode and lspci since I don't actually know how to get at all the detailed device information from macOS.  From this,

ubuntu@ubuntu:~$ lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Device 591f
           +-01.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
           +-14.0  Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
           +-16.0  Intel Corporation Sunrise Point-H CSME HECI #1
           +-17.0  Intel Corporation Sunrise Point-H SATA controller [AHCI mode]
           +-1b.0-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
           +-1c.0-[03]----00.0  Broadcom Limited BCM43602 802.11ac Wireless LAN SoC
           +-1c.1-[04]--+-00.0  Broadcom Limited NetXtreme BCM57766 Gigabit Ethernet PCIe
           |            \-00.1  Broadcom Limited BCM57765/57785 SDXC/MMC Card Reader

we see a couple of notable things right away:

  • there's a single PCIe domain, numbered 0000
  • everything branches off of PCIe bus number 00
  • there are a bunch of PCIe bridges hanging off of bus 00 (which connect to bus numbers 01, 02, etc.)
  • there are a bunch of PCIe devices hanging off both bus 00 and the other buses such as device 0000:00:14 (a USB 3.0 controller) and device 0000:01:00 (the AMD/ATI GPU)
  • at least one device (the GPU) has multiple PCIe functions (0000:01:00.0, a video output, and 0000:01:00.1 an HDMI audio output)
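
The addresses in this listing all follow the domain:bus:device.function pattern, with every field written in hexadecimal.  As a small illustration (the class and function names here are my own, not part of lspci), they can be pulled apart like so:

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


def parse_pci_address(text):
    """Split 'dddd:bb:dd.f' (all fields hexadecimal) into its four parts."""
    domain, bus, device_function = text.split(":")
    device, function = device_function.split(".")
    return PciAddress(int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


gpu_video = parse_pci_address("0000:01:00.0")
gpu_audio = parse_pci_address("0000:01:00.1")

# Two functions of one physical device share domain, bus, and device number.
same_device = gpu_video[:3] == gpu_audio[:3]
print(gpu_video, gpu_audio, same_device)
```

Comparing the first three fields is enough to tell that the video and HDMI audio outputs are two functions of the same GPU.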

But lspci -t -v actually doesn't list everything that we know about.  For example, we know that there are bridges that connect bus 00 to the other buses, but we need to use lspci -Dv to actually see the information those bridges provide to the OS:

ubuntu@ubuntu:~$ lspci -vD
0000:00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
DeviceName: SATA
Subsystem: Apple Inc. Device 0180
0000:00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Kernel driver in use: pcieport
0000:00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31) (prog-if 30 [XHCI])
Subsystem: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
Kernel driver in use: xhci_hcd
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c0) (prog-if 00 [VGA controller])
Subsystem: Apple Inc. Ellesmere [Radeon RX 470/480]
Kernel driver in use: amdgpu
0000:01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
Kernel driver in use: snd_hda_intel
This tells us more useful information:

  • Device 0000:00:00 is the PCIe host bridge--this is the endpoint that all PCIe devices use to talk to the CPU and, by extension, system memory (since the system memory controller lives on the same on-chip network that the PCIe controller and the CPU cores do)
  • The PCIe bridge connecting bus 00 and bus 01 (0000:00:01) is integrated into the PCIe controller on the CPU.  In addition, the PCI ID for this bridge is the same as the one used on Intel Skylake processors--not surprising, since Kaby Lake is an optimization (not re-architecture) of Skylake.
  • The two PCIe functions on the GPU--0000:01:00.0 and 0000:01:00.1--are indeed a video interface (as evidenced by the amdgpu driver) and an audio interface (snd_hda_intel driver).  Their bus id (01) also indicates that they are directly attached to the Kaby Lake processor's PCIe controller--and therefore enjoy the lowest latency and highest bandwidth available to system memory.
Finally, the Linux kernel's sysfs interface provides a very straightforward view of every PCIe device's connectivity by presenting them as symlinks:

ubuntu@ubuntu:/sys/bus/pci/devices$ ls -l
... 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
... 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
... 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
... 0000:00:16.0 -> ../../../devices/pci0000:00/0000:00:16.0
... 0000:00:17.0 -> ../../../devices/pci0000:00/0000:00:17.0
... 0000:00:1b.0 -> ../../../devices/pci0000:00/0000:00:1b.0
... 0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0
... 0000:00:1c.1 -> ../../../devices/pci0000:00/0000:00:1c.1
... 0000:00:1c.4 -> ../../../devices/pci0000:00/0000:00:1c.4
... 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
... 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
... 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
... 0000:00:1f.4 -> ../../../devices/pci0000:00/0000:00:1f.4
... 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
... 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
... 0000:02:00.0 -> ../../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0
... 0000:03:00.0 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0
... 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.0
... 0000:04:00.1 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.1
... 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0
... 0000:06:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0
... 0000:06:01.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:01.0
... 0000:06:02.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0
... 0000:06:04.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:04.0
... 0000:07:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0
... 0000:08:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:08:00.0

This topology, combined with the lspci outputs above, reveals that most of the I/O peripherals are either directly provided by or hang off of the Sunrise Point chip, whose root ports appear as 0000:00:1b.0 and 0000:00:1c.{0,1,4}.  There is another fan-out of PCIe ports hanging off of the Alpine Ridge chip (the switch at 0000:05:00.0 behind root port 0000:00:1c.4, with downstream ports 0000:06:0{0,1,2,4}.0), and what's not shown are the native Thunderbolt (NHI) connections, such as DisplayPort, on the other side of the Alpine Ridge.  Although I haven't looked very hard, I did not find a way to enumerate these Thunderbolt NHI devices.
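
Because each symlink target spells out the full chain of bridges above a device, the parent/child relationships can be recovered mechanically.  Here's a rough sketch that does so for a handful of targets copied from the listing above (with the leading ../../../devices/ stripped off); the helper names are my own.

```python
# A few symlink targets from the listing above, minus the "../../../devices/" prefix.
TARGETS = [
    "pci0000:00/0000:00:01.0/0000:01:00.0",
    "pci0000:00/0000:00:1b.0/0000:02:00.0",
    "pci0000:00/0000:00:1c.4/0000:05:00.0",
    "pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0",
    "pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:08:00.0",
]


def parent_map(targets):
    """Map each path component to the component directly above it."""
    parents = {}
    for target in targets:
        parts = target.split("/")
        for parent, child in zip(parts, parts[1:]):
            parents[child] = parent
    return parents


def ancestors(device, parents):
    """Walk upward from a device to the root of its PCI domain."""
    chain = []
    while device in parents:
        device = parents[device]
        chain.append(device)
    return chain


parents = parent_map(TARGETS)
print(" <- ".join(["0000:07:00.0"] + ancestors("0000:07:00.0", parents)))
```

On a live Linux system the same strings could be gathered by calling os.path.realpath on each entry under /sys/bus/pci/devices rather than pasting them in by hand.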

There remain a few other open mysteries to me as well; for example, lspci -vv reveals the PCIe lane width of most PCIe-attached devices, but it does not obviously display the maximum lane width for each connection.  Furthermore, the USB, HECI, SATA, and LPC bridges hanging off the Sunrise Point do not list a lane width at all, so I still don't know exactly what level of bandwidth is available to these bridges.
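
One avenue that may help here, though I haven't verified it on this machine: the LnkCap and LnkSta lines in lspci -vv report the supported and negotiated link width respectively, and newer Linux kernels expose the same information as per-device sysfs attributes.  The sketch below assumes those attributes (current_link_width, max_link_width, and the matching speed files) exist on the running kernel, and records None wherever a device doesn't provide them.

```python
import os

# sysfs attribute names; assumed to be present only on sufficiently new kernels.
LINK_ATTRIBUTES = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_info(device_dir):
    """Read the link attributes of one device directory, or None if missing."""
    info = {}
    for name in LINK_ATTRIBUTES:
        try:
            with open(os.path.join(device_dir, name)) as handle:
                info[name] = handle.read().strip()
        except OSError:
            info[name] = None
    return info


def report(sysfs_root="/sys/bus/pci/devices"):
    """Print the link attributes of every PCI device under sysfs_root."""
    for address in sorted(os.listdir(sysfs_root)):
        print(address, read_link_info(os.path.join(sysfs_root, address)))


if __name__ == "__main__" and os.path.isdir("/sys/bus/pci/devices"):
    report()
```

Devices that are integrated into the PCH and don't sit behind a real PCIe link would presumably show up with no link attributes at all, which is consistent with what lspci reports for them.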

If anyone knows more about how to peel back the onion on some of these bridges, or if I'm missing any important I/O connections between the CPU, PCH, or Alpine Ridge that are not enumerated via PCIe, please do let me know!  I'd love to share the knowledge and make this more accurate if possible.

Saturday, May 27, 2017

A less-biased look at tape versus disks

Executive Summary

Tape isn't dead despite what object store vendors may tell you, and it still plays an important role in both small- and large-scale storage environments.  Disk-based object stores certainly have eroded some of the areas where tape has historically been the obvious choice, but in the many circumstances where low latency is not required and high cost cannot be tolerated, tape remains a great option.

This post is a technical breakdown of some of the misconceptions surrounding the future of tape in the era of disk-based object stores as expressed in a recent blog post from an object store vendor's chief marketing officer.  Please note that the opinions stated below are mine alone and not a reflection of my employer or the organizations and companies mentioned.  I also have no direct financial interests in any tape, disk, or object store vendors or technologies.


IBM 701 tape drive--what many people picture when they hear about tape-based storage.  It's really not still like this, I promise.
Scality, an object store software vendor whose product relies on hard disk-based (HDD-based) storage, recently posted a marketing blog post claiming that tape is finally going to die and disk is the way of the future.  While I don't often rise to the bait of marketing material, tape takes a lot more flak than it deserves because of how old a technology it is.  There is no denying that tape is old--it actually precedes the first computers by decades, and digital tape recording goes back to the early 1950s.  Like it or not though, tape technology is about as up-to-date as HDD technology (more on this later), and you're likely still using tape on a regular basis.  For example, Google relies on tape to archive your everyday data, including Gmail, because, in terms of cost per bit and power consumption, tape will continue to beat disk for years to come.  So in the interests of sticking up for tape for both its good and bad, let's walk through Scality's blog post, authored by their chief of marketing Paul Turner, and tell the other side of the story.

1. Declining Tape Revenues

Mr. Turner starts by pointing out that "As far back as 2010, The Register reported a 25% decline in tape drive and media sales."  This decrease is undeniably true:

Market trends for LTO tape, 2008-2015.  Data from the Santa Clara Consulting Group, presented at MSST 2016 by Bob Fontana (IBM)

Although tape revenue has been decreasing, an increasing amount of data is landing on tape.  How can these seemingly contradictory trends be reconciled?

The reality is that the tape industry at large is not technologically limited like CPUs, flash storage, or even spinning disk.  Rather, the technology that underlies both the magnetic tape media and the drive heads that read and write this media is actually lifted over from the HDD industry.  That is, the hottest tape drives on the market today are using technology that the HDD industry figured out years ago.  As such, even if HDD innovation completely halted overnight, the tape industry would still be able to release new products for at least one or two more technology generations.

This is all to say that the rate at which new tape technologies reach market is not limited by the rate of innovation in the underlying storage technology.  Tape vendors simply lift over HDD innovations into new tape products when it becomes optimally profitable to do so, so declining tape revenues simply mean that the cadence of the tape technology refresh will stretch out.  While this certainly widens the gap between HDD and tape and suggests a slow down-ramping of tape as a storage medium, you cannot simply extrapolate these market trends in tape down to zero.  The tape industry simply doesn't work like that.

2. The Shrinking Tape Vendor Ecosystem

Mr. Turner goes on to cite an article published in The Register about Oracle's EOL of the StorageTek line of enterprise tape:
"While this falls short of a definitive end-of-life statement, it certainly casts serious doubt on the product’s future. In fairness, we’ll note that StreamLine is a legacy product family originally designed and built for mainframes. Oracle continues to promote the open LTO tape format, which is supported by products from IBM, HPE, Quantum, and SpectraLogic."
To be fair, Mr. Turner deserves credit for pointing out that StorageTek (which was being EOL'ed) and LTO are different tape technologies, and Oracle continues to support LTO.  But let's be clear here--the enterprise (aka mainframe) tape market has been roughly only 10% of the global tape market by exabytes shipped, and even then, IBM and Oracle have been the only vendors in this space.  Oracle's exit from the enterprise tape market is roughly analogous to Intel recently EOL'ing Itanium with the 9700-series Kittson chips in that a boutique product is being phased out in favor of a product that hits a much wider market.

3. The Decreasing Cost of Disk

Mr. Turner goes on to cite a Network Computing article:
"In its own evaluation of storage trends, including the increasing prevalence of cloud backup and archiving, Network Computing concludes that “…tape finally appears on the way to extinction.” As evidence, they cite the declining price of hard disks,"
Hard disk prices decrease on a cost per bit basis, but there are a few facts that temper the impact of this trend:

Point #1: HDDs include both the media and the drive that reads the media.  This makes the performance of HDDs scale a lot more quickly than tape, but it also means HDDs have a price floor of around $40 per device.  The costs of the read heads, voice coil, and drive controller are not decreasing.  When compared to the tape cartridges of today (whose cost floor is limited by the magnetic tape media itself) or the archival-quality flash of tomorrow (think of how cheaply thumb drives can be manufactured), HDD costs don't scale very well.  And while one can envision shipping magnetic disk platters that rely on external drives to drive the cost per bit down, such a solution would look an awful lot like a tape archive.

Point #2: The technology that underpins the bit density of hard drives has been rapidly decelerating.  The ultra high-density HDDs of today seem to have maxed out at around 1 terabit per square inch using perpendicular magnetic recording (PMR) technology, so HDD vendors are just cramming more and more platters into individual drives.  As an example, Seagate's recently unveiled 12 TB PMR drives contain an astounding eight platters and sixteen drive heads; their previous 10 TB PMR drives contained seven platters, and their 6 TB PMR drives contained five platters.  Notice a trend?
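
To spell the trend out, here is the capacity per platter for the three drives just mentioned.  This is nothing more than division on the figures quoted above; the variable names are mine.

```python
# (capacity in TB, number of platters) for the three PMR drives cited above.
DRIVES = [(6, 5), (10, 7), (12, 8)]


def tb_per_platter(capacity_tb, platters):
    """Average capacity carried by each platter."""
    return capacity_tb / platters


for capacity_tb, platters in DRIVES:
    density = tb_per_platter(capacity_tb, platters)
    print(f"{capacity_tb:>2} TB over {platters} platters = {density:.2f} TB per platter")

# Growth from the 6 TB drive to the 12 TB drive.
capacity_growth = 12 / 6
per_platter_growth = tb_per_platter(12, 8) / tb_per_platter(6, 5)
print(f"capacity grew {capacity_growth:.2f}x, per-platter capacity only {per_platter_growth:.2f}x")
```

Drive capacity doubled between the 6 TB and 12 TB models, but each platter holds only about a quarter more; the rest of the gain came from adding platters.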

There are truly new technologies that radically change the cost-per-bit trajectory for hard drives, including shingled magnetic recording (SMR), heat-assisted magnetic recording (HAMR), and bit-patterned media (BPM).  However, SMR's severe performance limitations for non-sequential writes make it a harder sell as a wholesale replacement for tape.  HAMR and BPM hold much more universal promise, but they simply don't exist as products yet and therefore simply don't compete with tape.  Furthermore, considering our previous discussion of how tape technology evolves, the tape industry has the option to adopt these very same technologies to drive down the cost-per-bit of tape by a commensurate amount.

4. The Decreasing Cost of Cloud

Mr. Turner continues citing the Network Computing article, making the bold claim that two other signs of the end of tape are
"...the ever-greater affordability of cloud storage,"
This is deceptive.  The cloud is not a charitable organization; their decreasing costs are a direct reflection of the decreasing cost per bit of media, which are savings that are realized irrespective of whether the media is hosted by a cloud provider or on-premise.  To be clear, the big cloud providers are definitely also reducing their costs by improving their efficiencies at scale; however, these savings are transferred to their customers only to the extent that they can be price competitive with each other.  My guess, which is admittedly uneducated, is that most of these cost savings are going to shareholders, not customers.
"and the fact that cloud is labor-free."
Let's be real here--labor is never "free" in the context of data management.  It is true that you don't need to pay technicians to swap disks in your datacenter if you have no tape (or no datacenter).  However, it's a bit insulting to presume that the only labor done by storage engineers is replacing disks.  Storage requires babysitting regardless of whether it lives in the cloud or on-premise, and regardless of whether it is backed by tape or disk.  It needs to be integrated with the rest of a company's infrastructure and operations, and this is where the principal opex of storage should be spent.  Any company that is actually having to scale personnel linearly with storage is doing something terribly wrong, and making the choice to migrate to the cloud to save opex is likely putting a band-aid over a much bigger wound.

Finally, this cloud-tape argument conflates disk as a technology and cloud as a business model.  There's nothing preventing tape from existing in the cloud; in fact, the Oracle Cloud does exactly this and hosts archival data in StorageTek archives at absolute rock-bottom prices--$0.001/GB per month, which shakes out to $1,000 per month to host a petabyte of archive.  Amazon Glacier also offers a tape-like performance and cost balance relative to its disk-based offerings.  The fact that you don't have to see the tapes in the cloud doesn't mean they don't exist and aren't providing you value.
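
For the record, the petabyte figure is just the per-gigabyte price scaled up.  The sketch below assumes the quoted price is billed per gigabyte per month and uses decimal units (1 PB = 1,000,000 GB); the function name is mine.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10^6 GB


def monthly_cost(price_per_gb_month, capacity_pb):
    """Monthly bill in dollars for a given capacity at a flat per-GB price."""
    return price_per_gb_month * capacity_pb * GB_PER_PB


print(f"${monthly_cost(0.001, 1):,.0f} per month for 1 PB")
print(f"${monthly_cost(0.001, 1) * 12:,.0f} per year for 1 PB")
```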

5. The Performance of Archival Disk over Tape

The next argument posed by Mr. Turner is the same one that people have been using to beat up on tape for decades:
"...spotlighting a tape deficit that’s even more critical than price: namely, serial--and glacially slow--access to data."
This was a convincing argument back in the 1980s, but to be frank, it's really tired at this point.  If you are buying tape for low latency, you are doing something wrong.

As I discussed above, tape's benefits lie in its
  1. rock-bottom cost per bit, achievable because it uses older magnetic recording technology and does not package the drive machinery with the media like disk does, and
  2. total cost of ownership, which is due in large part to the fact that it does not draw power when data is at rest.
I would argue that if 
  1. you don't care about buying the cheapest bits possible (for example, if the cost of learning how to manage tape outweighs the cost benefits of tape at your scale), or
  2. you don't care about keeping power bills low (for example, if your university foots the power bill)
there are definitely better options for mass storage than tape.  Furthermore, if you need to access any bit of your data at nearline speeds, you should definitely be buying nearline storage media.  Tape is absolutely not nearline, and it would just be the wrong tool for the job.

However, tape remains the obvious choice in cases where data needs to be archived or a second copy has to be retained.  Consider the following anecdotes:
In both cases--offline second copy and offline archive--storing data in nearline storage often just doesn't make economic sense since the data is not being frequently accessed.

However, it is critical to point out that there are scales at which using tape does not make great sense. Let's break these scales out and look at each:

At small scales where the number of cartridges is on the same order as the number of drives (e.g., a single drive with a handful of cartridges), tape is not too difficult to manage.  At these scales, such as those which might be found in a small business' IT department, performing offline backups of financials to tape is a lot less expensive than continually buying external USB drives and juggling them.

At large scales where the number of cartridges is far larger than the number of drives (e.g., in a data-driven enterprise or large-scale scientific computing complex), tape is also not too difficult to manage.  The up-front cost of tape library infrastructure and robotics is amortized by the annual cost of media, and sophisticated data management software (more on this below!) prevents humans from having to juggle tapes manually.

At medium scales, tape can be painful.  If the cost of libraries and robotics is difficult to justify when compared to the cost of the media (and therefore has a significant impact on the net $/GB of tape), you wind up having to pay people to do the job of robots in managing tapes.  This is a dangerous way to operate, as you are tickling the upper limits of how far you can scale people and you have to carefully consider how much runway you've got before you are better off buying robotics, disks, or cloud-based resources.
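
The effect of library and robotics costs on the net $/GB can be sketched with a simple fixed-plus-variable cost model.  Every dollar figure and capacity below is a made-up placeholder chosen only to show the shape of the curve, not a real price, and the model ignores drives, power, and staff entirely.

```python
def net_cost_per_gb(library_cost, cartridges, cartridge_cost, cartridge_capacity_gb):
    """Net $/GB when a fixed library cost is spread over all of the media."""
    total_cost = library_cost + cartridges * cartridge_cost
    total_capacity_gb = cartridges * cartridge_capacity_gb
    return total_cost / total_capacity_gb


# Hypothetical placeholder values, for illustration only.
LIBRARY_COST = 100_000   # up-front library and robotics, in dollars
CARTRIDGE_COST = 30      # dollars per cartridge
CARTRIDGE_GB = 6_000     # capacity per cartridge, in GB

media_only = net_cost_per_gb(0, 1, CARTRIDGE_COST, CARTRIDGE_GB)
medium = net_cost_per_gb(LIBRARY_COST, 100, CARTRIDGE_COST, CARTRIDGE_GB)
large = net_cost_per_gb(LIBRARY_COST, 10_000, CARTRIDGE_COST, CARTRIDGE_GB)

print(f"media only:        ${media_only:.4f}/GB")
print(f"100 cartridges:    ${medium:.4f}/GB")
print(f"10,000 cartridges: ${large:.4f}/GB")
```

With these placeholders the library dominates the net cost at a hundred cartridges and nearly disappears at ten thousand, which is the squeeze that medium-scale sites find themselves in.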

6. The Usability of Archival Disk over Tape

The Scality post then begins to paint with broad strokes:
"To access data from a disk-based archive, you simply search the index, click on the object or file you want, and presto, it’s yours.  By contrast, pulling a specific file from tape is akin to pulling teeth. First, you physically comb through a pile of cartridges, either at a remote site or by having them trucked to you."
The mistake that Mr. Turner makes here is conflating disk media with archival software.  Tape archives come with archival software just like disk archives do.  For example, HPSS indexes metadata from objects stored on tape in a DB2 database.  There's no "pulling teeth" to "identify a cartridge that seems to contain what you're looking for" and no "manually scroll[ing] through to pinpoint and retrieve the data."

Data management software systems including HPSS, IBM's Spectrum Protect, Cray's TAS, and SGI's DMF all provide features that can make your tape archive look an awful lot like an object store if you want them.  The logical semantics of storing data on disks versus tape are identical--you put some objects into an archive, and you get some objects out later.  The only difference is the latency of retrieving data on a tape.

That said, these archival software solutions also allow you to use both tape and disk together to ameliorate the latency hit of retrieving warmer data from the archive based on heuristics, data management policies, or manual intervention.  In fact, they provide S3 interfaces too, so you can make your tapes and disk-based object stores all look like one archive--imagine that!

What this all boils down to is that the perceived usability of tape is a function of the software on top of it, not the fact that it's tape and not disk.

7. Disks Enable Magical Business Intelligence

The Scality post tries to drive the last nail in the coffin of tape by conjuring up tales of great insight enabled by disk:
"...mountains of historical data are a treasure trove of hidden gems—patterns and trends of purchasing activity, customer preferences, and user behavior that marketing, sales, and product development can use to create smarter strategies and forecasts."
"Using disk-based storage, you can retrieve haystacks of core data on-demand, load it into analytic engines, and emerge with proverbial “needles” of undiscovered business insight."
which is to imply that tape is keeping your company stupid, and migrating to disk will propel you into a world of deep new insights:

Those of us doing statistical analysis on a daily basis keep this xkcd comic taped to our doors and pinned to our cubes.  We hear it all the time.

This is not to say that the technological sentiment expressed by Mr. Turner is wrong; if you have specific analyses you would like to perform over massive quantities of data on a regular basis, hosting that data in offline tape is a poor idea.  But if you plan on storing your large archive on disk because you might want to jump on the machine learning bandwagon someday, realize that you may be trading significant, guaranteed savings on media for a very poorly defined opportunity cost.  This tradeoff may be worth the risk in some early-stage, fast-moving startups, but it is unappetizing in more conservative organizations.

I also have to point out that "[g]one are the days when data was retained only for compliance and auditing" is being quite dramatic and disconnected from the realities of data and lifecycle management.  A few anecdotes:

  • Compliance: The United States Department of Energy and the National Science Foundation both have very specific guidance regarding the retention and management of data generated during federally funded research.  At the same time, extra funding is generally not provided to help support this data management, so eating the total cost of ownership of storing such data on disk over tape can be very difficult to justify when there is no funding to maintain compliance, let alone perform open-ended analytics on such data.
  • Auditing: Keeping second copies of data critical to business continuity is often a basic requirement in demonstrating due diligence.  In data-driven companies and enterprises, it can be difficult to rationalize keeping the second archival copy of such data nearline.  Again, it comes down to figuring out the total cost of ownership.
That said, the sentiment expressed by Mr. Turner is not wrong, and there are a variety of cases where keeping archival data nearline has clear benefits:
  • Cloud providers host user data on disk because they cannot predict when a user may want to look at an e-mail they received in 2008.  While it may cost more in media, power, and cooling to keep all users' e-mails nearline, being able to deliver desktop-like latency to users in a scalable way can drive significantly larger returns.  The technological details driving this use case have been documented in a fantastic whitepaper from Google.
  • Applying realtime analytics to e-commerce is a massive industry that is only enabled by keeping customer data nearline.  Cutting through the buzz and marketing surrounding this space, it's pretty darned neat that companies like Amazon, Netflix, and Pandora can actually suggest things to me that I might actually want to buy or consume.  These sorts of analytics could not happen if my purchase history was archived to tape.

Tape's like New Jersey - Not Really That Bad

Mr. Turner turns out to be the Chief Marketing Officer of Scality, a company that relies on disk to sell its product.  The greatest amount of irony, though, comes from the following statement of his:
"...Iron Mountain opines that tape is best. This is hardly a surprising conclusion from a provider of offsite tape archive services. It just happens to be incorrect."
Takeoff from Newark Liberty International Airport--what most people think of New Jersey.  It's really not all like this, I promise.
I suppose I shouldn't have been surprised that a provider of disk-dependent archival storage should conclude that tape is dead and disks are the future, and I shouldn't have risen to the bait.  But, like my home state of New Jersey, tape is a great punching bag for people with a cursory knowledge of it.  Just like how Newark Airport is what shapes most people's opinions of New Jersey, old images of reel-to-reel PDP-11s and audio cassettes make it easy to trash tape as a digital storage medium.  And just as I will always feel unduly compelled to stick up for my home state, I can't help but fact-check people who want to beat up tape.

The reality is that tape really isn't that terrible, and there are plenty of aspects to it that make it a great storage technology.  Like everything in computing, understanding its strengths (its really low total cost) and weaknesses (its high access latency) is the best way to figure out if the costs of deploying or maintaining a tape-based archive make it a better alternative to disk-based archives.  For very small-scale or large-scale offline data archive, tape can be very cost effective.  As the Scality blog points out though, if you're somewhere in between, or if you need low-latency access to all of your data for analytics or serving user data, disk-based object storage may be a better value overall.

Many of Mr. Turner's points, if boiled down to their objective kernels, are not wrong.  Tape is on a slow decline in terms of revenue, and this may stretch out the cadence of new tape technologies hitting the market.  However, there will always be demand for high-endurance, low-cost, offline archive no matter how good object stores become, and I have a difficult time envisioning a way in which tape completely implodes in the next ten years.  It may be the case that, just like how spinning disk is rapidly disappearing from home PCs, tape may become even more of a boutique technology that primarily exists as the invisible backing store for a cloud-based archival solution.  I just don't buy into the doom and gloom, and I'll bet blog posts heralding the doom of tape will keep coming for years to come.

Sunday, March 12, 2017

Reviewing the state of the art of burst buffers

Just over two years ago I attended my first DOE workshop as a guest representative of the NSF supercomputing centers, and I wrote a post that summarized my key observations of how the DOE was approaching the increase in data-intensive computing problems.  At the time, the most significant thrusts seemed to be
  1. understanding scientific workflows to keep pace with the need to process data in complex ways
  2. deploying burst buffers to overcome the performance limitations of spinning disk relative to the increasing scale of simulation data
  3. developing methods and processes to curate scientific data
Here we are now two years later, and these issues still take center stage in the discussion surrounding the future of  data-intensive computing.  The DOE has made significant progress in defining its path forward in these areas though, and in particular, both the roles of burst buffers and scientific workflows have a much clearer focus on DOE’s HPC roadmap.  Burst buffers in particular are becoming a major area of interest since they are now becoming commercially available, so in the interests of updating some of the incorrect or incomplete thoughts I wrote about two years ago, I thought I'd write about the current state of the art in burst buffers in HPC.

Two years ago I had observed that there were two major camps in burst buffer implementations: one that is more tightly integrated with the compute side of the platform that utilizes explicit allocation and use, and another that is more closely integrated with the storage subsystem and acts as a transparent I/O accelerator.  Shortly after I made that observation though, Oak Ridge and Lawrence Livermore announced their GPU-based leadership systems, Summit and Sierra, which would feature a new type of burst buffer design altogether that featured on-node nonvolatile memory.

This CORAL announcement, combined with the deployment of production, large-scale burst buffers at NERSC, Los Alamos, and KAUST, has led me to re-think my taxonomy of burst buffers.  Specifically, it really is important to divide burst buffers into their hardware architectures and software usage modes; different burst buffer architectures can provide the same usage modalities to users, and different modalities can be supported by the same architecture.

For the sake of laying it all out, let's walk through the taxonomy of burst buffer hardware architectures and burst buffer software usage modalities.

Burst Buffer Hardware Architectures

First, consider your typical medium- or large-scale HPC system architecture without a burst buffer:

In this design, you have

  • Compute Nodes (CN), which might be commodity whitebox nodes like the Dell C6320 nodes in SDSC's Comet system or Cray XC compute blades
  • I/O Nodes (ION), which might be commodity Lustre LNET routers (commodity clusters), Cray DVS nodes (Cray XC), or CIOD forwarders (Blue Gene)
  • Storage Nodes (SN), which might be Lustre Object Storage Servers (OSSes) or GPFS Network Shared Disk (NSD) servers
  • The compute fabric (blue lines), which is typically Mellanox InfiniBand, Intel OmniPath, or Cray Aries
  • The storage fabric (red lines), which is typically Mellanox InfiniBand or Intel OmniPath

Given all these parts, there are a bunch of different places you can stick flash devices to create a burst buffer.  For example...

ION-attached Flash

You can put SSDs inside IO nodes, resulting in an ION-attached flash architecture that looks like this:

Gordon, which was the first large-scale deployment of what one could call a burst buffer, had this architecture.  The flash was presented to the compute nodes as block devices using iSCSI, and a compute node could have anywhere between zero and sixteen SSDs mounted to it entirely via software.  More recently, the Tianhe-2 system at NUDT also deployed this architecture and exposes the flash to user applications via their H2FS middleware.

Fabric-attached Flash

A very similar architecture is to add specific burst buffer nodes on the compute fabric that don't route I/O, resulting in a fabric-attached flash architecture:

Like the ION-attached flash design of Gordon, the flash is still embedded within the compute fabric and is logically closer to the compute nodes than the storage nodes.  Cray's DataWarp solution uses this architecture.

Because the flash is still on the compute fabric, this design is very similar to ION-attached flash and the decision to choose it over the ION-attached flash design is mostly non-technical.  It can be more economical to embed flash directly in I/O nodes if those nodes have enough peripheral ports (or physical space!) to support the NICs for the compute fabric, the NICs for the storage fabric, and the flash devices.  However as flash technology moves away from being attached via SAS and towards being directly attached to PCIe, it becomes more difficult to stuff that many high-performance peripherals into a single box without imbalancing something.  As such, it is likely that fabric-attached flash architectures will replace ION-attached flash going forward.

Fortunately, any burst buffer software designed for ION-attached flash designs will also probably work on fabric-attached flash designs just fine.  The only difference is that the burst buffer software will no longer have to compete against the I/O routing software for on-node resources like memory or PCIe bandwidth.

CN-attached Flash

A very different approach to building burst buffers is to attach a flash device to every single compute node in the system, resulting in a CN-attached flash architecture:

This design is neither superior nor inferior to the ION/fabric-attached flash design.  The advantages it has over ION/fabric-attached flash include

  • Extremely high peak I/O performance - The peak performance scales linearly with the number of compute nodes, so the larger your job, the more performance your job can have.
  • Very low variation in I/O performance - Because each compute node has direct access to its locally attached SSD, contention on the compute fabric doesn't affect I/O performance.
However, these advantages come at a cost:
  • Limited support for shared-file I/O -  Because each compute node doesn't share its SSD with other compute nodes, having many compute nodes write to a single shared file is not a straightforward process.  The solutions to this issue range from such N-1 style I/O being simply impossible (the default case), to relying on I/O middleware like the SCR library to manage data distribution, to relying on sophisticated I/O services like Intel CPPR to essentially journal all I/O to the node-local flash and flush it to the parallel file system asynchronously.
  • Data movement outside of jobs becomes difficult - Burst buffers allow users to stage data into the flash before their job starts and stage data back to the parallel file system after their job ends.  However in CN-attached flash, this staging will occur while someone else's job might be using the node.  This can cause interference, capacity contention, or bandwidth contention.  Furthermore, it becomes very difficult to persist data on a burst buffer allocation across multiple jobs without flushing and re-staging it.
  • Node failures become more problematic - The point of writing out a checkpoint file is to allow you to restart a job in case one of its nodes fails.  If your checkpoint file is actually stored on one of the nodes that failed, though, the whole checkpoint gets lost when a node fails.  Thus, it becomes critical to flush checkpoint files to the parallel file system as quickly as possible so that your checkpoint file is safe if a node fails.  Realistically though, most application failures are not caused by node failures; a study by LLNL found that 85% of job interrupts do not take out the whole node.
  • Performance cannot be decoupled from job size - Since you get more SSDs by requesting more compute nodes, there is no way to request only a few nodes and a lot of SSDs.  While this is less an issue for extremely large HPC jobs whose I/O volumes typically scale linearly with the number of compute nodes, data-intensive applications often have to read and write large volumes of data but cannot effectively use a huge number of compute nodes.
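
The coupling between job size and I/O performance can be sketched with a toy bandwidth model.  This is only an illustration under simplifying assumptions (no contention, perfectly linear scaling), and the function names and per-node numbers are hypothetical rather than measurements from any real system.

```python
def cn_attached_bandwidth(num_nodes, ssd_gbs_per_node):
    """Peak bandwidth when every compute node has its own SSD."""
    return num_nodes * ssd_gbs_per_node


def shared_pool_bandwidth(num_nodes, injection_gbs_per_node, pool_gbs):
    """Peak bandwidth to a fixed pool of fabric/ION-attached flash.

    A job is limited by whichever is smaller: what its nodes can inject
    into the fabric, or what the shared flash pool can absorb.
    """
    return min(num_nodes * injection_gbs_per_node, pool_gbs)


# Hypothetical figures (GB/s), chosen only to show where the curves cross.
SSD_GBS = 2.0         # assumed per-node SSD bandwidth
INJECTION_GBS = 10.0  # assumed per-node network injection bandwidth
POOL_GBS = 1000.0     # assumed aggregate bandwidth of the shared flash pool

for nodes in (16, 128, 1024):
    local = cn_attached_bandwidth(nodes, SSD_GBS)
    shared = shared_pool_bandwidth(nodes, INJECTION_GBS, POOL_GBS)
    print(f"{nodes:>5} nodes: CN-attached {local:7.1f} GB/s, shared pool {shared:7.1f} GB/s")
```

Under these assumed numbers, a small job gets more bandwidth from the shared pool than from its handful of local SSDs, while a very large job eventually outruns the pool.  A job that needs lots of I/O but few nodes has no knob to turn in the CN-attached case.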
If you take a step back and look at what these strengths and weaknesses play to, you might be able to envision what sort of supercomputer design might be best suited for this type of architecture:
  • Relatively low node count, so that you aren't buying way more SSD capacity or performance than you can realistically use given the bandwidth of the parallel file system to which the SSDs must eventually flush
  • Relatively beefy compute nodes, so that the low node count doesn't hurt you and so that you can tolerate running I/O services to facilitate the asynchronous staging of data and middleware to support shared-file I/O
  • Relatively beefy network injection bandwidth, so that asynchronous stage in/out doesn't severely impact the MPI performance of the jobs that run before/after yours
There are also specific application workloads that are better suited to this CN-attached flash design:
  • Relatively large job sizes on average, so that applications routinely use enough compute nodes to get enough I/O bandwidth.  Small jobs may be better off using the parallel file system directly, since parallel file systems can usually deliver more I/O bandwidth to smaller compute node counts.
  • Relatively low diversity of applications, so that any applications that rely on shared-file I/O (which is not well supported by CN-attached flash, as we'll discuss later) can either be converted into using the necessary I/O middleware like SCR, or can be restructured to use only file-per-process or not rely on any strong consistency semantics.
And indeed, if you look at the systems that are planning on deploying this type of CN-attached flash burst buffer in the near future, they all fit this mold.  In particular, the CORAL Summit and Sierra systems will be deploying these burst buffers at extreme scale, and before them, Tokyo Tech's Tsubame 3.0 will as well.  All of these systems derive the majority of their performance from GPUs, leaving the CPUs with the capacity to implement more functionality of their burst buffers in software on the CNs.

Storage Fabric-attached Flash

The last notable burst buffer architecture involves attaching the flash on the storage fabric rather than the compute fabric, resulting in SF-attached flash:

This is not a terribly popular design because
  1. it moves the flash far away from the compute node, which is counterproductive to low latency
  2. it requires that the I/O forwarding layer (the IONs) support enough bandwidth to saturate the burst buffer, which can get expensive
However, for those HPC systems with custom compute fabrics that are not amenable to adding third-party burst buffers, this may be the only possible architecture.  For example, the Argonne Leadership Computing Facility has deployed a high-performance GPFS file system as a burst buffer alongside their high-capacity GPFS file system in this fashion because it is impractical to integrate flash into their Blue Gene/Q's proprietary compute fabric.  Similarly, sites that deploy DDN's Infinite Memory Engine burst buffer solution on systems with proprietary compute fabrics (e.g., Cray Aries on Cray XC) will have to deploy their burst buffer nodes on the storage fabric.

Burst Buffer Software

Ultimately, all of the different burst buffer architectures still amount to sticking a bunch of SSDs into a supercomputing system.  If this was all it took to make a burst buffer, though, burst buffers wouldn't be very interesting.  Thus, there is another half of the burst buffer ecosystem: the software and middleware that transform a pile of flash into an I/O layer that applications can actually use productively.

In the absolute simplest case, this software layer can just be an XFS file system atop RAIDed SSDs that is presented to user applications as node-local storage.  And indeed, this is what SDSC's Gordon system did; for many workloads such as file-per-process I/O, it is a suitable way to get great performance.  However, as commercial vendors have gotten into the burst buffer game, they have all started using this software layer to differentiate their burst buffer solutions from their competitors'.  This has resulted in modern burst buffers now having a lot of functionality that allows users to do interesting new things with their I/O.

Because this burst buffer differentiation happens entirely in software, it should be no surprise that these burst buffer software solutions look a lot like the software-defined storage products being sold in the enterprise cloud space.  The difference is that burst buffer software can be optimized specifically for HPC workloads and technologies, resulting in much nicer and more accessible ways in which they can be used by HPC applications.

Common Software Features

Before getting too far, it may be helpful to enumerate the features common to many burst buffer software solutions:
  • Stage-in and stage-out - Burst buffers are designed to make a job's input data already be available on the burst buffer immediately when the job starts, and to allow the flushing of output data to the parallel file system after the job ends.  To make this happen, the burst buffer service must give users a way to indicate what files they want to be available on the burst buffer when they submit their job, and they must also have a way to indicate what files they want to flush back to the file system after the job ends.
  • Background data movement - Burst buffers are also not designed to be long-term storage, so their reliability can be lower than the underlying parallel file system.  As such, users must also have a way to tell the burst buffer to flush intermediate data back to the parallel file system while the job is still running.  This should happen using server-to-server copying that doesn't involve the compute node at all.
  • POSIX I/O API compatibility - The vast majority of HPC applications rely on the POSIX I/O API (open/close/read/write) to perform I/O, and most job scripts rely on tools developed for the POSIX I/O API (cd, ls, cp, mkdir).  As such, all burst buffers provide the ability to interact with data through the POSIX I/O API so that they look like regular old file systems to user applications.  That said, the POSIX I/O semantics might not be fully supported; as will be described below, you may get an I/O error if you try to perform I/O in a fashion that is not supported by the burst buffer.
With all this being said, there are still a variety of ways in which these core features can be implemented into a complete burst buffer software solution.  Specifically, burst buffers can be accessed through one of several different modes, and each mode provides a different balance of peak performance and usability.
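
To make the stage-in/stage-out flow described above concrete, here is a minimal sketch that uses two temporary directories as stand-ins for the parallel file system and the burst buffer.  Real burst buffers drive this through the job scheduler and vendor-specific interfaces, not through user code like this; the function names (stage, run_job, example_job, demo) are hypothetical and exist only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def stage(files, src_root, dst_root):
    """Copy the named files from one file system root to another."""
    for name in files:
        dst = Path(dst_root) / name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(src_root) / name, dst)


def run_job(pfs, bb, stage_in, stage_out, job):
    """Stage in, run the job against the burst buffer, then stage out."""
    stage(stage_in, pfs, bb)      # happens before the job starts
    job(Path(bb))                 # the job only touches the burst buffer
    stage(stage_out, bb, pfs)     # happens after the job ends


def example_job(bb):
    """Hypothetical job: read an input deck and write a checkpoint."""
    text = (bb / "input.txt").read_text()
    (bb / "checkpoint.dat").write_text(text.upper())


def demo():
    """Run the whole flow with temp directories standing in for Lustre and flash."""
    with tempfile.TemporaryDirectory() as pfs, tempfile.TemporaryDirectory() as bb:
        (Path(pfs) / "input.txt").write_text("hello")
        run_job(pfs, bb, ["input.txt"], ["checkpoint.dat"], example_job)
        return (Path(pfs) / "checkpoint.dat").read_text()
```

The job itself never touches the parallel file system.  The user only declares which files move in beforehand and which move out afterwards, and that declaration is what a burst buffer service needs in order to do the copying on the user's behalf.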

Transparent Caching Mode

The most user-friendly burst buffer mode, which I call transparent caching mode, uses flash to simply act as a giant cache for the parallel file system.  Applications see the burst buffer as a mount point on their compute nodes.  This mount point mirrors the contents of the parallel file system, and any changes I make to one will appear on the other.  For example,

$ ls /mnt/lustre/glock
bin  project1  project2  public_html  src

### Burst buffer mount point contains the same stuff as Lustre
$ ls /mnt/burstbuffer/glock
bin  project1  project2  public_html  src

### Create a file on Lustre...
$ touch /mnt/lustre/glock/hello.txt

$ ls /mnt/lustre/glock
bin  hello.txt  project1  project2  public_html  src

### ...and it automatically appears on the burst buffer.
$ ls /mnt/burstbuffer/glock
bin  hello.txt  project1  project2  public_html  src

### However its contents are probably not on the burst buffer's flash
### yet since we haven't read its contents through the burst buffer
### mount point, which is what would cause it to be cached

However, if I access a file through the burst buffer mount (/mnt/burstbuffer/glock) rather than the parallel file system mount (/mnt/lustre/glock),
  1. if hello.txt is already cached on the burst buffer's SSDs, it will be read directly from flash
  2. if hello.txt is not already cached on the SSDs, the burst buffer will read it from the parallel file system, cache its contents on the SSDs, and return its contents to me
Similarly, if I write to hello.txt via the burst buffer mount, my data will be cached to the SSDs and will not immediately appear on the parallel file system.  It will eventually flush out to the parallel file system, or I could tell the burst buffer service to explicitly flush it myself.
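
The read-through and write-back behavior described above can be sketched as a toy cache.  This is a simplified model and not how any particular product is implemented: plain dictionaries stand in for the parallel file system and the flash, and the class and method names are hypothetical.

```python
class TransparentCache:
    """Toy read-through, write-back cache in front of a backing store."""

    def __init__(self, pfs):
        self.pfs = pfs        # dict standing in for the parallel file system
        self.flash = {}       # dict standing in for the burst buffer's SSDs
        self.dirty = set()    # files written to flash but not yet flushed

    def read(self, path):
        if path not in self.flash:
            # Cache miss: fetch from the parallel file system and keep a copy.
            self.flash[path] = self.pfs[path]
        return self.flash[path]

    def write(self, path, data):
        # Writes land on flash only; the parallel file system lags behind.
        self.flash[path] = data
        self.dirty.add(path)

    def flush(self, path=None):
        # Flush one file, or everything dirty if no path is given.
        targets = [path] if path is not None else sorted(self.dirty)
        for p in targets:
            if p in self.dirty:
                self.pfs[p] = self.flash[p]
                self.dirty.discard(p)


lustre = {"hello.txt": b"old"}
bb = TransparentCache(lustre)
bb.read("hello.txt")             # miss: read from Lustre, then cached on flash
bb.write("hello.txt", b"new")    # visible through the burst buffer mount...
stale = lustre["hello.txt"]      # ...but Lustre still holds the old contents
bb.flush("hello.txt")            # explicit flush brings the two back in sync
```

The window between write and flush is where the two mount points can disagree.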

This transparent caching mode is by far the easiest, since it looks exactly like the parallel file system for all intents and purposes.  However if you know that your application will never read any data more than once, it's far less useful in this fully transparent mode.  As such, burst buffers that implement this mode provide proprietary APIs that allow you to stage-in data, control the caching heuristics, and explicitly flush data from the flash to the parallel file system.  

DDN's Infinite Memory Engine and Cray's DataWarp both implement this transparent caching mode, and, in principle, it can be implemented on any of the burst buffer architectures outlined above.

Private PFS Mode

Although the transparent caching mode is the easiest to use, it doesn't give users a lot of control over what data does or doesn't need to be staged into the burst buffer.  Another access mode involves creating a private parallel file system on-demand for jobs, which I will call private PFS mode.  It provides a new parallel file system that is only mounted on your job's compute nodes, and this mount point contains only the data you explicitly copy to it:

### Burst buffer mount point is empty; we haven't put anything there,
### and this file system is private to my job
$ ls /mnt/burstbuffer

### Create a file on the burst buffer file system...
$ dd if=/dev/urandom of=/mnt/burstbuffer/mydata.bin bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.776115 s, 13.5 MB/s

### ...it appears on the burst buffer file system...
$ ls -l /mnt/burstbuffer
-rw-r----- 1 glock glock 10485760 Jan  1 00:00 mydata.bin

### ...and Lustre remains entirely unaffected
$ ls /mnt/lustre/glock
bin  project1  project2  public_html  src

This is a little more complicated than transparent caching mode because you must now manage two file system namespaces: the parallel file system and your private burst buffer file system.  However this gives you the option to target your I/O to one or the other, so that a tiny input deck can stay on Lustre while your checkpoints are written out to the burst buffer file system.

In addition, the burst buffer private file system is strongly consistent; as soon as you write data out to it, you can read that data back from any other node in your compute job.  While this is true of transparent caching mode if you always access your data through the burst buffer mount point, you can run into trouble if you accidentally try to read a file from the original parallel file system mount point after writing out to the burst buffer mount.  Since private PFS mode provides a completely different file system and namespace, it's a bit harder to make this mistake.

Cray's DataWarp implements private PFS mode, and the Tsubame 3.0 burst buffer will be implementing private PFS mode using on-demand BeeGFS.  This mode is most easily implemented on fabric/ION-attached flash architectures, but Tsubame 3.0 is demonstrating that it can also be done on CN-attached flash.

Log-structured/Journaling Mode

As probably the least user-friendly but highest-performing use mode, log-structured (or journaling) mode burst buffers present themselves to users like a file system, but they do not support the full extent of file system features.  Under the hood, writes are saved to the flash not as files, but as records that contain a timestamp, the data to be written, and the location in the file to which the data should be written.  These logs are continually appended as the application performs its writes, and when it comes time to flush the data to the parallel file system, the logs are replayed to effectively reconstruct the file that the application was trying to write.

This can perform extremely well since even random I/O winds up being restructured as sequentially appended I/O.  Furthermore, there can be as many logs as there are writers; this allows writes to happen with zero lock contention, since contended writes are resolved when the data is replayed and flushed.

Unfortunately, log-structured writes make reading very difficult, since the read can no longer seek directly to a file offset to find the data it needs.  Instead, the log needs to be replayed to some degree, effectively forcing a flush to occur.  Furthermore, if the logs are spread out across different logical flash domains (as would happen in CN-attached flash architectures), read-back may require the logs to be centrally collected before the replay can happen, or it may require inter-node communication to coordinate who owns the different bytes that the application needs to read.
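
The append-then-replay idea can be sketched in a few lines.  This is a minimal illustration under simplifying assumptions, not any vendor's implementation: each log is an in-memory list, a global counter stands in for a timestamp source, and the names log_write and replay are hypothetical.

```python
import itertools

_clock = itertools.count()   # stand-in for a real timestamp source


def log_write(log, offset, data):
    """Append a record (timestamp, offset, data) instead of writing in place."""
    log.append((next(_clock), offset, bytes(data)))


def replay(logs):
    """Reconstruct file contents by applying all records in timestamp order."""
    records = sorted(itertools.chain.from_iterable(logs), key=lambda r: r[0])
    out = bytearray()
    for _, offset, data in records:
        end = offset + len(data)
        if end > len(out):
            out.extend(b"\0" * (end - len(out)))   # zero-fill any holes
        out[offset:end] = data                      # later records overwrite earlier ones
    return bytes(out)


# Two writers, each appending to its own log with no locking between them.
log_a, log_b = [], []
log_write(log_a, 4, b"BBBB")   # random offsets still become sequential appends
log_write(log_b, 0, b"AAAA")
log_write(log_b, 2, b"cc")     # overlaps an earlier write; resolved at replay time
contents = replay([log_a, log_b])
```

Nothing in log_a or log_b alone says what byte 3 of the file holds.  Answering that means merging and replaying both logs, which is why reads are so much more expensive than writes in this mode.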

What this amounts to is functionality that may present itself like a private parallel file system burst buffer, but behaves very differently on reads and writes.  For example, attempting to read the data that exists in a log that doesn't belong to the writer might generate an I/O error, so applications (or I/O middleware) probably need to have very well-behaved I/O to get the full performance benefits of this mode.  Most extreme-scale HPC applications already do this, so log-structured/journaling mode is a very attractive approach for very large applications that rely on extreme write performance to checkpoint their progress.

Log-structured/journaling mode is well suited for CN-attached flash since logs do not need to live on a file system that presents a single shared namespace across all compute nodes.  In practice, the IBM CORAL systems will probably provide log-structured/journaling mode through IBM's burst buffer software.  Oak Ridge National Laboratory has also demonstrated a log-structured burst buffer system called BurstMem on a fabric-attached flash architecture.  Intel's CPPR library, to be deployed with the Argonne Aurora system, may also implement this functionality atop the 3D XPoint to be embedded in each compute node.

Other Modes

The above three modes are not the only ones that burst buffers may implement, and some burst buffers support more than one of the above modes.  For example, Cray's DataWarp, in addition to supporting private PFS and transparent caching modes, also has a swap mode that allows compute nodes to use the flash as swap space to prevent hard failures for data analysis applications that consume non-deterministic amounts of memory.  In addition, Intel's CPPR library is targeting byte-addressable nonvolatile memory which would expose a load/store interface, rather than the typical POSIX open/write/read/close interface, to applications.


Burst buffers, practically speaking, remain in their infancy, and there is a lot of room for the landscape I've outlined here to change.  For example, the common software features I highlighted (staging, background data movement, and POSIX API support) are still largely implemented via proprietary, non-standard APIs at present.  There is effort to get burst buffer vendors to agree to a common API, and as this process proceeds, features may appear or disappear as customers define what is and isn't a worthwhile differentiating feature.

On the hardware front, the burst buffer ecosystem is also in flux.  ION-attached flash is where burst buffers began, but as discussed above, it is likely to be replaced by dedicated fabric-attached flash servers.  In addition, the emergence of storage-class memory (that is, byte-addressable nonvolatile memory) will also add a new dimension to burst buffers that may make one architecture the clear winner over the others.  At present though, both fabric-attached and CN-attached burst buffers have their strengths and weaknesses, and neither is at risk of disappearing in the next five years.

As more extreme-scale systems begin to hit the floor and users figure out what does and doesn't work across the diversity of burst buffer hardware and software features, the picture is certain to become clearer.  Once that happens, I'll be sure to post another update.