Monday, June 20, 2016

An uninformed perspective on TaihuLight's design

Note: What follows are my own personal thoughts, opinions, and analyses.  I am not a computer scientist and I don't really know anything about processor design or application performance, so it is safe to assume I don't know what I'm talking about.  None of this represents the views of my employer, the U.S. government, or anyone except me.

China's new 93 PF TaihuLight system is impressive given the indigenous processor design and its substantial increase in its HPL score over the #2 system, Tianhe-2.  The popular media has started covering this new system and the increasing presence of Chinese systems on Top500, suggesting that China's string of #1 systems may be a sign of shifting tides.  And maybe it is.  China is undeniably committed to investing in supercomputing and positioning itself as a leader in extreme-scale computing.

That being said, the TaihuLight system isn't quite the technological marvel and threat to the HPC hegemony that it may seem at first glance.  The system features some critically limiting design choices that make it smell like a supercomputer that was designed to be #1 on Top500, not to solve scientific problems.  This probably sounds like sour grapes at this point, so let's take a look at some of the details.

Back-of-the-envelope math

Consider the fact that each TaihuLight node delivers 3,062 GFLOPS (that's 3 TFLOPS) and has 136.51 GB/sec of memory bandwidth. This means that in the time it takes the processor to load two 64-bit floats from memory, it could theoretically perform over 350 floating point operations. But it won't, because in that time it has only loaded the two operands for one single FLOP.
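
The arithmetic here is easy to check with a few lines of Python.  This is only a sketch of the back-of-the-envelope estimate: the peak rate and memory bandwidth are the per-node figures quoted above, and the function and variable names are made up for illustration.

```python
# Back-of-the-envelope check: how many FLOPs could a node perform in the
# time it takes to pull one pair of 64-bit operands from main memory?
# The two hardware figures are the per-node numbers quoted in the text;
# everything else (names, structure) is illustrative.

PEAK_FLOPS = 3062e9        # 3,062 GFLOPS per node
MEM_BANDWIDTH = 136.51e9   # 136.51 GB/sec per node
BYTES_PER_DOUBLE = 8       # one 64-bit float


def flops_per_operand_pair(peak_flops, bandwidth, n_operands=2):
    """FLOPs the processor could issue while n_operands doubles are loaded."""
    seconds_per_load = n_operands * BYTES_PER_DOUBLE / bandwidth
    return peak_flops * seconds_per_load


def flops_per_byte(peak_flops, bandwidth):
    """Peak FLOPs available per byte moved from memory."""
    return peak_flops / bandwidth


if __name__ == "__main__":
    pair = flops_per_operand_pair(PEAK_FLOPS, MEM_BANDWIDTH)
    ratio = flops_per_byte(PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"FLOPs per operand pair: {pair:.0f}")
    print(f"FLOPs per byte:         {ratio:.1f}")
```

Run as a script, this prints roughly 359 FLOPs per operand pair and 22.4 FLOPs per byte, which is where the "over 350" figure comes from.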

Of course, this is an oversimplification of how CPUs work.  Caches exist to feed the extremely high operation rate of modern processors, and where there are so many cores that their caches can't be fed fast enough, we see technologies like GDDR DRAM and HBM (on accelerators) and on-package MCDRAM (on KNL) appearing so that dozens or hundreds of cores can all retrieve enough floating-point operands from memory to sustain high rates of floating point calculations.

However, the ShenWei SW26010 chips in the TaihuLight machine have neither GDDR nor MCDRAM; they rely on four DDR3 controllers delivering an aggregate 136 GB/sec to keep all 256 compute elements fed with data.  Dongarra's report on the TaihuLight design briefly mentions this high skew:

"The ratio of floating point operations per byte of data from memory on the SW26010 is 22.4 Flops(DP)/Byte transfer, which shows an imbalance or an overcapacity of floating point operations per data transfer from memory. By comparison the Intel Knights Landing processor with 7.2 Flops(DP)/Byte transfer."

This measure of "Flops(DP)/Byte transfer" is called arithmetic intensity, and it is a critical optimization parameter when writing applications for manycore architectures.  Highly optimized GPU codes can show arithmetic intensities of around 10 FLOPS/byte, but such applications are often the exception; there are classes of problems that simply do not have high arithmetic intensities.  This diagram, which I stole from the Performance and Algorithms Research group at Berkeley Lab, illustrates the spectrum:

To put this into perspective in the context of hardware, let's look at the #3 supercomputer, the Titan system at Oak Ridge National Lab.  The GPUs on which it is built (NVIDIA's K20X) each have a GDDR5-based memory subsystem that can feed the 1.3 TFLOP GPUs at 250 GB/sec.  This means that Titan's FLOPS/byte ratio is around 5.3, or over 4x lower (more balanced) than the 22 FLOPS/byte of TaihuLight's SW26010 chips.

This huge gap means that an application that is perfectly balanced to run on a Titan GPU--that is, an application with an arithmetic intensity of 5.3--can keep a Titan GPU's floating point units fully busy, but will be memory-bound at roughly a quarter of peak on one of TaihuLight's SW26010 processors.  Put simply, despite being theoretically capable of doing 3 TFLOPS of computing, TaihuLight's processors would only be able to deliver about 1/4th of that, or roughly 0.75 TFLOPS, to this application.  Because of the severely limited per-node memory bandwidth, this 93 PFLOP system would perform like a 23 PFLOP system on an application that, given an arithmetic intensity of 5.3, would be considered highly optimized by most standards.
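
To make that estimate concrete, here is a minimal roofline-style sketch in Python.  It assumes the simplest possible model, in which the attainable FLOP rate is the smaller of the processor's peak and the memory bandwidth multiplied by the application's arithmetic intensity.  The hardware figures are the ones quoted above; the function and dictionary names are invented for illustration, and real applications will hit other limits that this ignores.

```python
# Simple roofline-style estimate: the attainable FLOP rate is capped either
# by the processor's peak or by how fast memory can deliver operands.
# Hardware figures are the ones quoted in the text; names are illustrative.


def attainable_gflops(peak_gflops, bandwidth_gb_s, arithmetic_intensity):
    """min(peak, bandwidth * intensity), in GFLOPS, GB/sec, and FLOPS/byte."""
    return min(peak_gflops, bandwidth_gb_s * arithmetic_intensity)


CHIPS = {
    # name: (peak GFLOPS, memory bandwidth in GB/sec)
    "SW26010 (TaihuLight)": (3062.0, 136.51),
    "K20X (Titan)": (1300.0, 250.0),
}

if __name__ == "__main__":
    intensity = 5.3  # FLOPS/byte: the application "balanced" for Titan
    for name, (peak, bandwidth) in CHIPS.items():
        rate = attainable_gflops(peak, bandwidth, intensity)
        print(f"{name}: {rate:.0f} GFLOPS attainable, {rate / peak:.0%} of peak")
```

Under this model the SW26010 tops out at about 724 GFLOPS (around 24% of its peak) at an arithmetic intensity of 5.3, while the K20X is limited only by its own 1.3 TFLOPS peak.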

Of course, the indigenous architecture also means that application developers will have to rely on indigenous implementations or ports of performance runtimes like OpenMP and OpenACC, libraries like BLAS, and ISA-specific vector intrinsics.  The maturity of this software stack for the ShenWei-64 architecture remains unknown.

What is interesting

This all isn't to say that the TaihuLight system isn't a notable achievement; it is the first massive-scale deployment of a CPU-based manycore processor, it is the first massive-scale deployment of EDR InfiniBand, and its CPU design is extremely interesting in a number of ways.

The CPU block diagrams included in Dongarra's report are a bit like a Rorschach test; my esteemed colleagues at The Next Platform astutely pointed out its similarities to KNL, but my first reaction was to compare it with IBM's Cell processor:

IBM Cell BE vs. ShenWei SW26010.  Cell diagram stolen from NAS; SW26010 diagram stolen from the Dongarra report.

The Cell processor was ahead of its time in many ways and arguably the first manycore chip targeted at HPC.  It had
  • a single controller core (the PPE) with L1 and L2 caches
  • eight simpler cores (the SPEs) on an on-chip network with no L2 cache, but an embedded SRAM scratchpad
and by comparison, the SW26010 has
  • a single controller core (the MPE) with L1 and L2 caches
  • sixty-four simpler cores (the CPEs) on an on-chip network with no L2 cache, but an embedded SRAM scratchpad
Of course, the similarities are largely superficial and there are vast differences between the two architectures, but the incorporation of heterogeneous (albeit very similar) cores on a single package is quite bold and is a design point that may play a role in exascale processor designs:

What an exascale processor might look like, as stolen from Kathy Yelick

which may feature a combination of many lightweight cores (not unlike the CPE arrays on the TaihuLight processor) accompanied by a few capable cores (not unlike the MPE cores).

The scratchpad SRAM present on all of the CPE cores is also quite intriguing, as it is a marked departure from the cache-oriented design of on-package SRAM that has dominated CPU architectures for decades.  The Dongarra report doesn't detail how the scratchpad SRAM is used by applications, but it may offer a unique new way to perform byte-granular loads and stores that do not necessarily waste a full cache line's worth of memory bandwidth if the application knows that memory access is to be unaligned.
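
To illustrate the kind of waste a cache-line-based memory system can incur, here is a small Python sketch that counts how many cache lines a single small access touches and what fraction of the transferred bytes is actually useful.  The 64-byte line size is an assumption (a common value on other CPUs), not a figure from the Dongarra report, and the function names are hypothetical.

```python
# How much of the data moved by a cache-line-based memory system is useful
# when a small access straddles a line boundary?
# LINE_BYTES = 64 is an ASSUMED, typical line size, not an SW26010 figure.

LINE_BYTES = 64


def lines_touched(address, size, line_bytes=LINE_BYTES):
    """Number of cache lines covered by `size` bytes starting at `address`."""
    first = address // line_bytes
    last = (address + size - 1) // line_bytes
    return last - first + 1


def useful_fraction(address, size, line_bytes=LINE_BYTES):
    """Requested bytes divided by bytes actually transferred from memory."""
    return size / (lines_touched(address, size, line_bytes) * line_bytes)


if __name__ == "__main__":
    for address in (0, 60):
        print(f"8-byte load at address {address}: "
              f"{lines_touched(address, 8)} line(s), "
              f"{useful_fraction(address, 8):.2%} of transferred bytes used")
```

With these assumed numbers, an aligned 8-byte load uses 12.5% of the one line it pulls in, and an unaligned load at address 60 straddles two lines and uses only 6.25% of the bytes moved.  A scratchpad that the application fills explicitly could, in principle, avoid paying for the bytes it doesn't need.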

This is a rather forward-looking design decision that makes the CPU look a little more like a GPU.  Some experimental processor designs targeting exascale have proposed eschewing deep cache hierarchies in favor of similar scratchpads:

The Traleika Glacier processor design, featuring separate control and execution blocks and scratchpad SRAM.  Adapted from the Traleika Glacier wiki page.

Whether or not we ever hear about how successful or unsuccessful these processor features are remains to be seen, but there may be valuable lessons to be learned ahead of the first generation of exascale processors from architectures like those in the TaihuLight system.


At a glance, it is easy to call out the irony in the U.S. government's decision to ban the sale of Intel's KNL processors to the Chinese now that the TaihuLight system is public.  It is clear that China is in a position to begin building extreme-scale supercomputers without the help of Intel, and it is very likely that the U.S. embargo accelerated this effort.  As pondered by a notable pundit in the HPC community,

And this may have been the case.  However, despite the TaihuLight system's #1 position and very noteworthy Linpack performance and efficiency, it is not the massive disruptor that puts the U.S. in the back seat.  Underneath TaihuLight's shiny, 93-petaflop veneer are some cut corners that substantially lower its ability to reliably deliver the performance and scientific impact commensurate with its Linpack score.  As pointed out by a colleague wiser than me, Intel's impending KNL chip is the product of years of effort, and it is likely that it will be years before ShenWei's chip designs and fabs are able to really deliver a fully balanced, competitive, HPC-oriented microarchitecture.

With that being said, TaihuLight is still a massive system, and even if its peak Linpack score is not representative of its actual achievable performance in solving real scientific problems, it is undeniably a leadership system.  Even if applications can only realize a small fraction of its Linpack performance, there is a lot of discovery to be made in petascale computing.

Further, the SW26010 processor itself features some bold design points, and being able to test a heterogeneous processor with scratchpad SRAM at extreme scale may give China a leg up in the exascale architecture design space.  Only time will tell if these opportunities are pursued, or if TaihuLight follows its predecessors into an existence of disuse in a moldy datacenter caused by a high electric bill, poor system design, and lack of software.

Monday, July 20, 2015

On "active learning" and teaching science

Nature ran an article last week by Dr. Mitchell Waldrop titled "Why we are teaching science wrong, and how to make it right" (or alternatively, "The science of teaching science") which really ground my gears.  The piece puts forward this growing trend of "active learning" where, rather than traditional lecture-based course instruction, students are put in a position where they must apply subject matter to solve open-ended problems.  In turn, this process of applying knowledge leads students to walk away with a more meaningful understanding of the material and demonstrate a much longer retention of the information.

It bothers me that the article seems to conflate "life sciences" with "science."  The fact that students more effectively learn material when they are required to engage with the information over rote memorization and regurgitation is not new.  This "active learning" methodology may seem revolutionary to life science (six of eight advocates quoted are of the life sciences), but the fact of the matter is that this method has been the foundation of physics and engineering education for literally thousands of years.  "Active learning," which seems to be a re-branding of the Socratic method, is how critical thinking skills are developed.  If this concept of education by application is truly new to the life sciences, then that is a shortcoming that is not endemic throughout the sciences as the article's title would suggest.

The article goes on to highlight a few reasons why adoption of the Socratic method in teaching "science" is slow going, but does so while failing to acknowledge two fundamental facts about education and science: effective education takes time, and scientists are not synonymous with educators.

I have had the benefit of studying under some of the best educators I have ever known.  The views I express below are no doubt colored by this, and perhaps all of science is truly filled with ineffective educators.  However, as a former materials scientist now working in the biotech industry, I have an idea that the assumptions expressed in this article (which mirror the attitudes of the biologists with whom I work) are not as universal throughout science as Dr. Waldrop would have us think.  With that being said, I haven't taught anything other than workshops for the better part of a decade, so the usual caveats about my writing apply here--I don't know what I'm talking about, so take it all with a grain of salt.

Effective education takes time

The article opens with an anecdote about how Tammy Tobin, a biology professor at Susquehanna University, has her third- and fourth-year students work through a mock viral outbreak.  While this is an undoubtedly memorable exercise that gives students a chance to apply what they learned in class, the article fails to acknowledge that one cannot actually teach virology or epidemiology this way.  This exercise is only effective for third- and fourth-year students who have spent two or three years obtaining the foundational knowledge that allows them to translate the lessons learned from this mock outbreak to different scenarios--that is, to actually demonstrate higher-order cognitive understanding of the scientific material.

As I said above though, this is not a new or novel concept.  In fact, all engineering and applied sciences curricula accredited by ABET are required to include a course exactly like this Susquehanna University experience.  In this course, called the capstone design component, students spend their last year at university working in a collaborative setting with their peers to tackle an applied project like designing a concrete factory or executing an independent research program.  As a result, it is a fact that literally every single graduate of an accredited engineering undergraduate degree program in the United States has gone through an "active learning" project where they have to apply their coursework knowledge to solving a real-world problem.

In all fairness, the capstone project requirement is just a single course that represents a small fraction (typically less than 5%) of students' overall credits towards graduation.   This is a result of a greater fact that the article completely ignores--education takes time.  Professor Tobin's virus outbreak exercise had students looking at flight schedules to Chicago to ensure there were enough seats for a mock trip to ground zero, but realize that students were paying tuition money to do this.  In the time it took students to book fake plane tickets, how much information about epidemiology could have been conveyed in lecture format?  When Prof. Tobin says her course "looked at the intersection of politics, sociology, biology, even some economics," is that really appropriate for a virology course?

This is not to say that the detail with which Prof. Tobin's exercise was executed was a waste of time, tuition dollars, or anything else; as the article rightly points out, the students who took this course are likely to have walked away from it with a more meaningful grasp of applied virology and epidemiology than they would have otherwise.  However, the time it takes to execute these active learning projects at such a scale cuts deeply into the two- or three-year curriculum that most programs have to provide all of the required material for a four-year degree.  This is why "standard lectures" remain the prevailing way to teach scientific courses--lectures are informationally dense, and the "active learning" component comes in the form of homework and projects that are done outside of the classroom.

While the article implies that homework and exercises in this context are just "cookbook exercises," I get the impression that such is only true in the life sciences.  Rote memorization in physics and engineering is simply not valued, and this is why students are typically allowed to bring cheat sheets full of equations, constants, and notes with them into exams.  Rather than providing cookbook exercises, assignments and examinations require that students be able to apply the physical concepts learned in lecture to solve problems.  This is simply how physics and engineering are taught, and it is a direct result of the fact that there are not enough hours in a four-year program to forego lecturing and still effectively convey all of the required content.

And this is not to say that lecturing has to be completely one-way communication; the Socratic method can be extremely effective in lectures.  The article cites a great example of this when describing a question posed by Dr. Sarah Leupen to her students:  What would happen if the sensory neurons in your legs stopped working as you were walking down the street?  Rather than providing all of the information to answer the question before posing the question itself, posing the question first allows students to figure out the material themselves through discussion.  The discussion is guided towards the correct answer by the lecturer's careful choice of follow-up questions to students' hypotheses to further stimulate critical thinking.

Of course, this Socratic approach in class can waste a tremendous amount of time if the lecturer is not able to effectively dial into each student's aptitudes when posing questions.  In addition, this only works for small classroom sizes; in practice, the discussion is often dominated by a minority of students and the majority simply remain unengaged.  Being able to keep all students engaged, even in a small-classroom setting, requires a great deal of skill in understanding people and how to motivate them.   Finding the right balance of one-sided lecturing and Socratic teaching is an exercise in careful time economics which can change every week.  As a result, it is often easier to simply forego the Socratic method and just lecture; however, this is not always a matter of stodginess or laziness as the article implies, but simply of weighing the costs given a fixed amount of material and a fixed period of time.

"Active learning" can be applied in a time-conservative way; this is the basis for a growing number of intensive, hands-on bootcamp programs that teach computer programming skills in twelve weeks. These programs eschew teaching the foundational knowledge of computer science and throw their students directly into applying it in useful (read: employable) ways.  While these programs certainly produce graduates who can write computer programs, these graduates are often unable to grasp important design and performance considerations because they lack a knowledge of the foundations.  In a sense, this is an example of how applied-only coursework produces technicians, not scientists and engineers.

Scientists are not always educators

The article also cites a number of educators and scientists (all in the life sciences, of course) who are critical of other researchers for not investing time (or alternatively, not being incentivized to invest time) into exploring more effective teaching methodologies.  While I agree that effective teaching is the responsibility of anyone whose job is to teach, the article carries an additional undertone that asserts that researchers should be effective teachers.  The problem is that this is not true; the entanglement of scientific research and scientific education is a result of necessity, and the fact of the matter is that there is a large group of science educators who simply teach because they are required to.

I cannot name a single scientist who went through the process of earning a doctorate in science or engineering because he or she wanted to teach.  Generally speaking, scientists become scientists because they want to do science, and teaching is often a byproduct of being one of the elite few who have the requisite knowledge to actually teach others how to be scientists or engineers.  This is not to say that there are no good researchers who also value education; this article's interviews are a testament to that.  Further, the hallmarks of great researchers and great educators overlap; dissemination of new discoveries is little more than being the first person to teach a new concept to other scientists.  However, the issue of science educators often being uninterested in effective teaching techniques can only be remedied by first acknowledging that teaching is not always most suitably performed by researchers.

The article does speak to some progress being made by institutions which include teaching as a criterion for tenure review.  However, the notion of tenure is, at its roots, tied to preserving the academic freedom to do research in controversial areas.  It has little to do with the educational component of being a professor, so to a large degree, it does make sense to base tenure decisions largely on the research productivity, not the pedagogical productivity, of individuals.  Thus, the fact that educators are being driven to focus on research over education is a failing of the university brought about by this entanglement of education and research.

Actually building a sustainable financial model that supports this disentangling of education from research is not something I can pretend to do.  Just as effective teaching takes time, it also costs money, and matching every full-time researcher with a full-time educator across every science and engineering department at a university would not be economical.  However, just as there are research professors whose income is derived solely from grants, perhaps there should be equivalent positions for distinguished educators who are fully supported by the university.  As it stands, there is little incentive (outside of financial necessity) for any scientist with a gift for teaching to become a full-time lecturer within the typical university system.

Whatever form progress may take though, as long as education remains entangled with research, the cadence of improvement will be set by the lowest common denominator.

Wednesday, April 29, 2015

More Conjecture on KNL's Near Memory

The Platform ran an interesting collection of conjectures on how KNL's on-package MCDRAM might be used this morning, and I recommend reading through it if you're following the race to exascale.  I was originally going to write this commentary as a Google+ post, but it got a little long, so pardon the lack of a proper lead-in here.

I appreciated Mr. Funk's detailed description of how processor caches interact with DRAM, and how this might translate into KNL's caching mode.  However, he underplays exactly why MCDRAM (and the GDDR on KNC) exists on these manycore architectures in his discussion on how MCDRAM may act as an L3 cache.  On-package memory is not simply another way to get better performance out of the manycore processor; rather, it is a hard requirement for keeping all 60+ cores (and their 120+ 512-bit vector units, 1.8+ MB of L1 data cache, etc.) loaded.  Without MCDRAM, it would be physically impossible for these KNL processors to achieve their peak performance due to memory starvation.  By extension, Mr. Funk's assumption that this MCDRAM will come with substantially lower latency than DRAM might not be true.

As a matter of fact, the massive parallelism game is not about latency at all; it came about as a result of latencies hitting a physical floor.  So, rather than drive clocks up to lower latency and increase performance, the industry has been throwing more but slower cores at a given problem to mask the latencies of data access for any given worker.  While one thread may be stalled due to a cache miss on a Xeon Phi core, the other three threads are keeping the FPU busy to achieve the high efficiency required for performance.  This is at the core of the Xeon Phi architecture (as well as every other massively parallel architecture including GPUs and Blue Gene), so it is unlikely that Intel has sacrificed their power envelope to actually give MCDRAM lower latency than the off-package DRAM on KNL nodes.
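
A toy model shows why more threads can stand in for lower latency.  The Python sketch below assumes each hardware thread alternates between a burst of computation and a memory stall, and that the core can switch to any ready thread at no cost.  The cycle counts are made-up illustrative numbers, not measurements of any Xeon Phi part, and the function name is hypothetical.

```python
# Toy latency-hiding model: each thread computes for `compute_cycles`, then
# stalls on memory for `stall_cycles`.  With free switching between threads,
# the FPU is busy whenever at least one thread has work ready.
# All cycle counts used below are ASSUMED values for illustration only.


def fpu_utilization(n_threads, compute_cycles, stall_cycles):
    """Fraction of cycles the FPU is busy, capped at 1.0."""
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, n_threads * per_thread)


if __name__ == "__main__":
    compute, stall = 50, 150  # assumed cycles of work per cache miss, and miss latency
    for threads in (1, 2, 4):
        busy = fpu_utilization(threads, compute, stall)
        print(f"{threads} thread(s): FPU busy {busy:.0%} of the time")
```

With these assumed numbers, one thread keeps the FPU busy 25% of the time and four threads keep it fully busy, even though the latency of each individual memory access has not changed at all.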

At an architectural level, accesses to MCDRAM still need to go through memory controllers, just like accesses to off-package DRAM.  Intel hasn't been marketing the MCDRAM controllers as "cache controllers," so it is likely that the latencies of memory access are on par with those of the off-package memory controllers.  There are simply more of these MCDRAM controllers (eight) operating in parallel than there are off-package DRAM controllers (two), again suggesting that bandwidth is the primary capability.

Judging by current trends in GPGPU and KNC programming, I think it is far more likely that this caching mode acts at a much higher level, and Intel is providing it as a convenience for (1) algorithmically simple workloads with highly predictable memory access patterns, and (2) problems that will fit entirely within MCDRAM.  Like with OpenACC, I'm sure there will be some problems where explicit on/off-package memory management (analogous to OpenACC's copyin, copyout, etc.) isn't necessary and cache mode will be fine.  Intel will also likely provide all of the necessary optimizations in their compiler collection and MKL to make many common operations (BLAS, FFTs, etc.) work well in cache mode as they did for KNC's offload mode.

However, to answer Mr. Funk's question of "Can pre-knowledge of our application’s data use--and, perhaps, even reorganization of that data--allow our application to run still faster if we instead use Flat Model mode," the answer is almost unequivocally "YES!"  Programming massively parallel architectures has never been easy, and magically transparent caches rarely deliver reliable, high performance.  Even the L1 and L2 caches do not work well without very deliberate application design to accommodate wide vectors; cache alignment and access patterns are at the core of why, in practice, it's difficult to get OpenMP codes working with high efficiency on current KNC processors.  As much as I'd like to believe otherwise, the caching mode on KNL will likely be even harder to effectively utilize, and explicitly managing the MCDRAM will be an absolute requirement for the majority of applications.