Monday, July 20, 2015

On "active learning" and teaching science

Nature ran an article last week by Dr. Mitchell Waldrop titled "Why we are teaching science wrong, and how to make it right" (or alternatively, "The science of teaching science") which really ground my gears.  The piece describes the growing trend of "active learning," where, rather than receiving traditional lecture-based instruction, students are put in a position where they must apply the subject matter to solve open-ended problems.  In turn, this process of applying knowledge leads students to walk away with a more meaningful understanding of the material and to retain it for much longer.

It bothers me that the article seems to conflate "life sciences" with "science."  The fact that students learn material more effectively when they are required to engage with it, rather than memorize and regurgitate it, is not new.  This "active learning" methodology may seem revolutionary to the life sciences (six of the eight advocates quoted are life scientists), but this method has been the foundation of physics and engineering education for literally thousands of years.  "Active learning," which seems to be a re-branding of the Socratic method, is how critical thinking skills are developed.  If this concept of education by application is truly new to the life sciences, then that is a shortcoming of the life sciences, not one endemic throughout the sciences as the article's title would suggest.

The article goes on to highlight a few reasons why adoption of the Socratic method in teaching "science" is slow going, but does so while failing to acknowledge two fundamental facts about education and science: effective education takes time, and scientists are not synonymous with educators.

I have had the benefit of studying under some of the best educators I have ever known.  The views I express below are no doubt colored by this, and perhaps all of science is truly filled with ineffective educators.  However as a former materials scientist now working in the biotech industry, I have an idea that the assumptions expressed in this article (which mirror the attitudes of the biologists with whom I work) are not as universal throughout science as Dr. Waldrop would have us think.  With that being said, I haven't taught anything other than workshops for the better part of a decade, so the usual caveats about my writing apply here--I don't know what I'm talking about, so take it all with a grain of salt.

Effective education takes time

The article opens with an anecdote about how Tammy Tobin, a biology professor at Susquehanna University, has her third- and fourth-year students work through a mock viral outbreak.  While this is an undoubtedly memorable exercise that gives students a chance to apply what they learned in class, the article fails to acknowledge that one cannot actually teach virology or epidemiology this way.  This exercise is only effective for third- and fourth-year students who have spent two or three years obtaining the foundational knowledge that allows them to translate the lessons learned from this mock outbreak to different scenarios--that is, to actually demonstrate higher-order cognitive understanding of the scientific material.

As I said above though, this is not a new or novel concept.  In fact, all engineering and applied sciences curricula accredited by ABET are required to include a course exactly like this Susquehanna University experience.  In this capstone design component, students spend their last year at university working in a collaborative setting with their peers to tackle an applied project like designing a concrete factory or executing an independent research program.  As a result, every single graduate of an accredited engineering undergraduate program in the United States has gone through an "active learning" project where they had to apply their coursework knowledge to solve a real-world problem.

In all fairness, the capstone project requirement is just a single course that represents a small fraction (typically less than 5%) of a student's overall credits towards graduation.   This reflects a larger fact that the article completely ignores--education takes time.  Professor Tobin's virus outbreak exercise had students looking at flight schedules to Chicago to ensure there were enough seats for a mock trip to ground zero, but consider that students were paying tuition to do this.  In the time it took students to book fake plane tickets, how much information about epidemiology could have been conveyed in lecture format?  When Prof. Tobin says her course "looked at the intersection of politics, sociology, biology, even some economics," is that really appropriate for a virology course?

This is not to say that the detail with which Prof. Tobin's exercise was executed was a waste of time, tuition dollars, or anything else; as the article rightly points out, the students who took this course likely walked away from it with a more meaningful grasp of applied virology and epidemiology than they would have otherwise.  However, the time it takes to execute these active learning projects at such a scale cuts deeply into the two or three years that most programs have to convey all of the required material for a four-year degree.  This is why "standard lectures" remain the prevailing way to teach scientific courses--lectures are informationally dense, and the "active learning" component comes in the form of homework and projects that are done outside of the classroom.

While the article implies that homework and exercises in this context are just "cookbook exercises," I get the impression that such is only true in the life sciences.  Rote memorization in physics and engineering is simply not valued, and this is why students are typically allowed to bring cheat sheets full of equations, constants, and notes with them into exams.  Rather than providing cookbook exercises, assignments and examinations require that students be able to apply the physical concepts learned in lecture to solve problems.  This is simply how physics and engineering are taught, and it is a direct result of the fact that there are not enough hours in a four-year program to forego lecturing and still effectively convey all of the required content.

And this is not to say that lecturing has to be completely one-way communication; the Socratic method can be extremely effective in lectures.  The article cites a great example of this in a question Dr. Sarah Leupen posed to her students:  What would happen if the sensory neurons in your legs stopped working as you were walking down the street?  Rather than delivering all of the relevant information before posing the question, posing the question first allows students to work out the material themselves through discussion.  The lecturer then guides the discussion towards the correct answer by carefully choosing follow-up questions to students' hypotheses that further stimulate critical thinking.

Of course, this Socratic approach in class can waste a tremendous amount of time if the lecturer is not able to effectively dial into each student's aptitudes when posing questions.  In addition, it only works for small class sizes; in practice, the discussion is often dominated by a minority of students while the majority remain unengaged.  Keeping all students engaged, even in a small-classroom setting, requires a great deal of skill in understanding people and how to motivate them.   Finding the right balance of one-sided lecturing and Socratic teaching is an exercise in careful time economics which can change every week.  As a result, it is often easier to simply forego the Socratic method and just lecture; however, this is not always a matter of stodginess or laziness as the article implies, but simply a matter of weighing the costs given a fixed amount of material and a fixed period of time.

"Active learning" can be applied in a time-conservative way; this is the basis for a growing number of intensive, hands-on bootcamp programs that teach computer programming skills in twelve weeks. These programs eschew teaching the foundational knowledge of computer science and throw their students directly into applying it in useful (read: employable) ways.  While these programs certainly produce graduates who can write computer programs, these graduates are often unable to grasp important design and performance considerations because they lack a knowledge of the foundations.  In a sense, this example of how applied-only coursework produces technicians, not scientists and engineers.

Scientists are not always educators

The article also cites a number of educators and scientists (all in the life sciences, of course) who are critical of other researchers for not investing time (or alternatively, not being incentivized to invest time) into exploring more effective teaching methodologies.  While I agree that effective teaching is the responsibility of anyone whose job is to teach, the article carries an additional undertone asserting that researchers should be effective teachers.  The problem is that this is not true; the entanglement of scientific research and scientific education is a result of necessity, and the fact of the matter is that there is a large group of science educators who teach simply because they are required to.

I cannot name a single scientist who went through the process of earning a doctorate in science or engineering because he or she wanted to teach.  Generally speaking, scientists become scientists because they want to do science, and teaching is often a byproduct of being one of the elite few who have the requisite knowledge to actually teach others how to be scientists or engineers.  This is not to say that there are no good researchers who also value education; this article's interviews are a testament to that.  Further, the hallmarks of great researchers and great educators overlap; dissemination of new discoveries is little more than being the first person to teach a new concept to other scientists.  However, the issue of science educators often being uninterested in effective teaching techniques can only be remedied by first acknowledging that teaching is not always best performed by researchers.

The article does speak to some progress being made by institutions which include teaching as a criterion in tenure review.  However, the notion of tenure is, at its root, tied to preserving the academic freedom to do research in controversial areas.  It has little to do with the educational component of being a professor, so to a large degree, it does make sense to base tenure decisions largely on the research productivity, not the pedagogical productivity, of individuals.  Thus, the fact that educators are being driven to focus on research over education is a failing of the university brought about by this entanglement of education and research.

Actually building a sustainable financial model that supports this disentangling of education from research is not something I can pretend to do.  Just as effective teaching takes time, it also costs money, and matching every full-time researcher with a full-time educator across every science and engineering department at a university would not be economical.  However just as there are research professors whose income is derived solely from grants, perhaps there should be equivalent positions for distinguished educators who are fully supported by the university.  As it stands, there is little incentive (outside of financial necessity) for any scientist with a gift for teaching to become a full-time lecturer within the typical university system.

Whatever form progress may take though, as long as education remains entangled with research, the cadence of improvement will be set by the lowest common denominator.

Wednesday, April 29, 2015

More Conjecture on KNL's Near Memory

The Platform ran an interesting collection of conjectures on how KNL's on-package MCDRAM might be used this morning, and I recommend reading through it if you're following the race to exascale.  I was originally going to write this commentary as a Google+ post, but it got a little long, so pardon the lack of a proper lead-in here.

I appreciated Mr. Funk's detailed description of how processor caches interact with DRAM, and how this might translate into KNL's caching mode.  However, in his discussion of how MCDRAM may act as an L3 cache, he underplays exactly why MCDRAM (and the GDDR on KNC) exists on these manycore architectures.  On-package memory is not simply another way to get better performance out of the manycore processor; rather, it is a hard requirement for keeping all 60+ cores (and their 120+ 512-bit vector registers, 1.8+ MB of L1 data cache, etc.) loaded.  Without MCDRAM, it would be physically impossible for these KNL processors to achieve their peak performance due to memory starvation.  By extension, Mr. Funk's assumption that this MCDRAM will come with substantially lower latency than DRAM might not be true.

As a matter of fact, the massive parallelism game is not about latency at all; it came about as a result of latencies hitting a physical floor.  So, rather than drive clocks up to lower latency and increase performance, the industry has been throwing more but slower clocks at a given problem to mask the latencies of data access for any given worker.  While one thread may be stalled due to a cache miss on a Xeon Phi core, the other three threads are keeping the FPU busy to achieve the high efficiency required for performance.  This is at the core of the Xeon Phi architecture (as well as every other massively parallel architecture including GPUs and Blue Gene), so it is unlikely that Intel has sacrificed their power envelope to actually give MCDRAM lower latency than the off-package DRAM on KNL nodes.

At an architectural level, accesses to MCDRAM still need to go through memory controllers, just like accesses to off-package DRAM.  Intel hasn't been marketing the MCDRAM controllers as "cache controllers," so it is likely that the latencies of MCDRAM access are on par with those of the off-package memory controllers.  There are simply more of these MCDRAM controllers operating in parallel (eight) relative to off-package DRAM controllers (two), again suggesting that bandwidth, not latency, is MCDRAM's primary capability.

Judging by current trends in GPGPU and KNC programming, I think it is far more likely that this caching mode acts at a much higher level, and Intel is providing it as a convenience for (1) algorithmically simple workloads with highly predictable memory access patterns, and (2) problems that will fit entirely within MCDRAM.  As with OpenACC, I'm sure there will be some problems where explicit on/off-package memory management (analogous to OpenACC's copyin, copyout, etc.) isn't necessary and cache mode will be fine.  Intel will also likely provide all of the necessary optimizations in their compiler collection and MKL to make many common operations (BLAS, FFTs, etc.) work well in cache mode, as they did for KNC's offload mode.
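
To make the OpenACC analogy concrete, here is a minimal sketch (in C, with purely illustrative array names and sizes) of what explicit data management looks like in that model.  Cache mode on KNL would presumably be the moral equivalent of dropping the data clauses below and letting the hardware decide what lives in the fast memory:

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* Initialize on the host. */
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* copyin: stage a[] into the device's (or fast) memory before the region;
     * copyout: move b[] back out when the region ends. */
    #pragma acc data copyin(a[0:N]) copyout(b[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];
    }

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```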

However, to answer Mr. Funk's question of "Can pre-knowledge of our application’s data use--and, perhaps, even reorganization of that data--allow our application to run still faster if we instead use Flat Model mode," the answer is almost unequivocally "YES!"  Programming massively parallel architectures has never been easy, and magically transparent caches rarely deliver reliable, high performance.  Even the L1 and L2 caches do not work well without very deliberate application design to accommodate wide vectors; cache alignment and access patterns are at the core of why, in practice, it's difficult to get OpenMP codes working with high efficiency on current KNC processors.  As much as I'd like to believe otherwise, the caching mode on KNL will likely be even harder to effectively utilize, and explicitly managing the MCDRAM will be an absolute requirement for the majority of applications.
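
If explicit management does become the norm, my guess is that flat-mode programming will look something like the memkind/hbwmalloc interface, where the programmer decides which allocations land in the on-package memory.  The following is only a sketch under that assumption--the hbw_* calls come from the open-source memkind library, and KNL's final toolchain may well differ:

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth memory allocator; link with -lmemkind */

#define N (1 << 24)

int main(void)
{
    /* Fall back to ordinary DRAM if no high-bandwidth memory is available. */
    int have_hbw = (hbw_check_available() == 0);
    double *a = have_hbw ? hbw_malloc(N * sizeof(double))
                         : malloc(N * sizeof(double));
    if (a == NULL)
        return 1;

    /* A bandwidth-bound sweep--exactly the sort of access pattern that
     * benefits from living in MCDRAM rather than off-package DRAM. */
    double sum = 0.0;
    for (size_t i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }
    printf("sum = %f\n", sum);

    /* Memory must be released by the allocator that created it. */
    if (have_hbw)
        hbw_free(a);
    else
        free(a);
    return 0;
}
```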

Wednesday, January 28, 2015

Thoughts on the NSF Future Directions Interim Report

The National Academies recently released an interim report entitled Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science and Engineering in 2017-2020 as a part of a $723,000 award commissioned to take a hard look at where the NSF's supercomputing program is going.  Since releasing the interim report, the committee has been soliciting feedback and input from the research community to consider as they draft their final report, and I felt compelled to put some of my thoughts into a response.

NSF's HPC programs are something I hold near and dear since I got my start in the industry by supporting two NSF-owned supercomputers.  I put a huge amount of myself into Trestles and Gordon, and I still maintain that job encompassed the most engaging and rewarding work I've ever done.  However, the NSF's lack of a future roadmap for its HPC program made my future feel perpetually uncertain, and this factored heavily in my decision to eventually pursue other opportunities.

Now that I am no longer affiliated with NSF, I wanted to delineate some of the problems I observed during my time on the inside with the hope that someone more important than me really thinks about how they can be addressed.  The report requested feedback in nine principal areas, so I've done my best to contextualize my thoughts with the committee's findings.

With that being said, I wrote this all up pretty hastily.  Some of it may be worded strongly, and although I don't mean to offend anybody, I stand by what I say.  That doesn't mean that my understanding of everything is correct though, so it's probably best to assume that I have no idea what I'm talking about here.

Finally, a glossary of terms may make this more understandable:

  • XD is the NSF program that funds XSEDE; it finances infrastructure and people, but it does not fund supercomputer procurements or operations
  • Track 1 is the program that funded Blue Waters, the NSF's leadership-class HPC resource
  • Track 2 is the program that funds most of the XSEDE supercomputers.  It funded systems like Ranger, Keeneland, Gordon, and Stampede

1. How to create advanced computing infrastructure that enables integrated discovery involving experiments, observations, analysis, theory, and simulation.

Answering this question involves a few key points:
  1. Stop treating NSF's cyberinfrastructure as a computer science research project and start treating it like research infrastructure operation.  Office of Cyberinfrastructure (OCI) does not belong in Computer & Information Science & Engineering (CISE).
  2. Stop funding cyberinfrastructure solely through capital acquisition solicitations and restore reliable core funding to NSF HPC centers.  This will restore a community that is conducive to retaining expert staff.
  3. Focus OCI/ACI and raise the bar for accountability and transparency.   Stop funding projects and centers that have no proven understanding of operational (rather than theoretical) HPC.
  4. Either put up or give up.  The present trends in funding lie on a road to death by attrition.  
  5. Don't waste time and funding by presuming that outsourcing responsibility and resources to commercial cloud or other federal agencies will effectively serve the needs of the NSF research community.
I elaborate on these points below.

2. Technical challenges to building future, more capable advanced computing systems and how NSF might best respond to them.

"Today’s approach of federating distributed compute- and data-intensive resources to meet the increasing demand for combined computing and data capabilities is technically challenging and expensive."
This is true.
"New approaches that co-locate computational and data resources might reduce costs and improve performance. Recent advances in cloud data center design may provide a viable integrated solution for a significant fraction of (but not all) data- and compute-intensive and combined workloads."
This strong statement is markedly unqualified and unsubstantiated.  If it is really recommending that the NSF start investing in the cloud, consider the following:
  • Cloud computing resources are designed for burst capabilities and are only economical when workloads are similarly uneven.  In stark contrast, most well-managed HPC resources see constant, high utilization, which is exactly where the cloud becomes economically intractable.
  • The suggestion that cloud solutions can "improve performance" is unfounded.  At a purely technological level, the cloud will never perform as well as unvirtualized HPC resources, period.  Data-intensive workloads and calculations that require modest inter-node communication will suffer substantially.

In fact, if any cost reduction or performance improvement can be gained by moving to the cloud, I can almost guarantee that incrementally more can be gained by simply addressing the non-technological aspects of the current approach of operating federated HPC.  Namely, the NSF must
  1. Stop propping up failing NSF centers who have been unable to demonstrate the ability to effectively design and operate supercomputers. 
  2. Stop spending money on purely experimental systems that domain scientists cannot or will not use.

The NSF needs to re-focus its priorities and stop treating the XD program like a research project and start treating it like a business.  Its principal function should be to deliver a product (computing resources) to customers (the research community).  Any component that is not helping domain scientists accelerate discovery should be strongly scrutinized.  Who are these investments truly satisfying?
"New knowledge and skills will be needed to effectively use these new advanced computing technologies."
This is a critical component of XD that is extremely undervalued and underfunded.  Nobody is born with the ability to know how to use HPC resources, and optimization should be performed on users in addition to code.  There is huge untapped potential in collaborative training between U.S. federal agencies (DOE, DOD) and European organizations (PRACE).  If there is bureaucratic red tape in the way, it needs to be dealt with at an official level or circumvented at the grassroots level.

3. The computing needs of individual research areas.

XDMoD shows this.  The principal workloads across XSEDE are from traditional domains like physics and chemistry, and the NSF needs to recognize that this is not going to change substantially over the lifetime of a program like XD.

(Figure: XSEDE usage by field of science, straight from XDMoD for 2014.  MPS = math and physical sciences, BIO = biological sciences, GEO = geosciences.  NSF directorate is not a perfect alignment; for example, I found many projects in BIO were actually chemistry and materials science.)

While I wholeheartedly agree that new communities should be engaged by lowering the barriers to entry, these activities cannot be done at a great expense of undercutting the resources required by the majority of XD users.

The cost per CPU cycle should not deviate wildly between Track 2 awards, because the ROI on very expensive cycles will be extremely poor.  If the NSF wants to fund experimental systems, it needs to do so as an activity that is separate from the production resources.  Alternatively, only a small fraction of each award should be earmarked for new technologies that represent a high risk; the Stampede award was a fantastic model of how a conservative fraction of the award (10%) can fund an innovative and high-risk technology.

4. How to balance resources and demand for the full spectrum of systems, for both compute- and data-intensive applications, and the impacts on the research community if NSF can no longer provide state-of-the-art computing for its research community.

"But it is unclear, given their likely cost, whether NSF will be able to invest in future highest-tier systems in the same class as those being pursued by the Department of Energy, Department of Defense, and other federal mission agencies and overseas."
The NSF does not have the budget to support leadership computing.  This is clear even from a bird's eye view: DOE ASCR's budget for FY2012 was $428 million and, by comparison, NSF ACI's budget was only $211 million.  Worse yet, despite having half the funding of its DOE counterpart, the NSF owned HPC resources at seven universities in FY2012 compared to ASCR's three centers.

Even if given the proper funding, the NSF's practice of spreading Track 2 awards across many universities to operate its HPC assets is not conducive to operating leadership computing.  The unpredictable nature of Track 2 awards has resulted in very uneven funding for NSF centers which, quite frankly, is a terrible way to attract and retain the highly knowledgeable world-class staff that is necessary to operate world-class supercomputers.

5. The role of private industry and other federal agencies in providing advanced computing infrastructure.

The report makes some very troubling statements in reference to this question.
"Options for providing highest-tier capabilities that merit further exploration include purchasing computing services from federal agencies…"
This sounds dirty.  Aren't there regulations in place that restrict the way in which money can flow between the NSF and DOE?  I'm also a little put off by the fact that this option is being put forth in a report that was crafted by a number of US DOE folks whose DOE affiliations are masked by university affiliations in the introductory material.
"…or by making arrangements with commercial services (rather than more expensive purchases by individual researchers)."
Providing advanced cyberinfrastructure for the open science community is not a profitable venture.  There is no money in HPC operations.  I do not see any "leadership" commercial cloud providers offering the NSF a deal on spare cycles, and the going rate for commercial cloud time is known to be far more expensive than deploying HPC resources in-house at the national scale.

6. The challenges facing researchers in obtaining allocations of advanced computing resources and suggestions for improving the allocation and review processes.

"Given the “double jeopardy” that arises when researchers must clear two hurdles—first, to obtain funding for their research proposal and, second, to be allocated the necessary computing resources—the chances that a researcher with a good idea can carry out the proposed work under such conditions is diminished."
XD needs to be more tightly integrated with other award processes to mitigate the double jeopardy issue.  I have a difficult time envisioning the form which this integration would take, but the NSF GRF's approach of prominently featuring NSF HPC resources as a part of the award might be a good start.  As an adaptive proposal reviewer within XSEDE and a front-line interface with first-time users, I found that having the NSF GRF bundle XSEDE time greatly reduced the entry barrier for new users and made it easier for us reviewers to stratify the proposals.  Another idea may be to invite NSF center staff to NSF contractors' meetings (if such things exist; I know they do for DOE BES) to show a greater amount of integration across NSF divisions.

In addition, the current XSEDE allocation proposal process is extremely onerous.  The document that describes the process is ridiculously long and contains obscure requirements that serve absolutely no purpose.  For example, all XSEDE proposals require a separate document detailing the scaling performance of their scientific software.  Demonstrating an awareness of the true costs of performing certain calculations has its merits, but a detailed analysis of scaling is not even relevant for the majority of users who run modest-scale jobs or use off-the-shelf black-box software like Gaussian.  The only thing these obscure requirements do is prevent new users, who are generally less familiar with all of the scaling requirements nonsense, from getting any time.  If massive scalability is truly required by an application, the PI needs to be moved over to the Track 1 system (Blue Waters) or referred to INCITE.

As a personal anecdote, many of us center staff found ourselves simply short-circuiting the aforementioned allocations guide and providing potential new users with a guide to the guide.  It was often sufficient to provide a checklist of the minutiae whose absence would result in an immediate proposal rejection and allow the PIs to do what they do best—write scientific proposals for their work.  Quite frankly, the fact that we had to provide a guide to understanding the guide to the allocations process suggests that the allocations process itself is grossly over-engineered.

7. Whether wider and more frequent collection of requirements for advanced computing could be used to inform strategic planning and resource allocation; how these requirements might be used; and how they might best be collected and analyzed.

The XD program has already established a solid foundation for reporting the popularity and usability of NSF HPC resources in XDMoD.  The requirements of the majority are evolving more slowly than computer scientists would have everyone believe.

Having been personally invested in two Track 2 proposals, I have gotten the impression that the review panels who select the destiny of the NSF's future HPC portfolio are more impressed by cutting-edge, albeit untested and under-demanded, proposals.  Consequently, taking a "functional rather than a technology-focused or structural approach" to future planning will result in further loss of focus.  Instead of delivering conservatively designed architectures that will enjoy guaranteed high utilization, functional approaches will give way to computer scientists on review panels dictating what resources domain scientists should be using to solve their problems.  The cart will be before the horse.

Instead, it would be far more valuable to include more operational staff in strategic planning.  The people on the ground know how users interact with systems and what will and won't work.  As with the case of leadership computing, the NSF does not have the financial commitment to be leading the design of novel computing architectures at large scales.  Exotic and high-risk technologies should be simply left out of the NSF's Track 2 program, incorporated peripherally but funded through other means (e.g., MRIs), or incorporated in the form of a small fraction of a larger, lower-risk resource investment.

A perspective on the greater context of this has been eloquently written by Dr. Steven Gottlieb.  Given his description of the OCI conversion to ACI, it seems like taking away the Office of Cyberinfrastructure's (OCI's) autonomy and placing it under Computer & Information Science & Engineering (CISE) exemplifies an ongoing and significant loss of focus within NSF.  This change reflected the misconception that architecting and operating HPC resources for domain sciences is a computer science discipline.

This is wrong.

Computer scientists have a nasty habit of creating tools that are intellectually interesting but impractical for domain scientists.  These tools get "thrown over the wall," never to be picked up, and represent an overall waste of effort in the context of operating HPC services for non-computer scientists.  Rather, operating HPC resources for the research community requires experienced technical engineers with a pragmatic approach to HPC.  Such people are most often not computer scientists, but former domain scientists who know what does and doesn't work for their respective communities.

8. The tension between the benefits of competition and the need for continuity as well as alternative models that might more clearly delineate the distinction between performance review and accountability and organizational continuity and service capabilities.

"Although NSF’s use of frequent open competitions has stimulated intellectual competition and increased NSF’s financial leverage, it has also impeded collaboration among frequent competitors, made it more difficult to recruit and retain talented staff, and inhibited longer-term planning."
Speaking from firsthand experience, I can say that working for an NSF center is a life of a perpetually uncertain future and dicing up FTEs into frustratingly tiny pieces.  While some people are driven by competition and fundraising (I am one of them), an entire organization built up to support multi-million dollar cyberinfrastructure cannot be sustained this way.

At the time I left my job at an NSF center, my salary was covered by six different funding sources at levels ranging from 0.05 to 0.30 FTEs.  Although this officially meant that I was only 30% committed to directly supporting the operation of one of our NSF supercomputers, the reality was that I (and many of my colleagues) simply had to put in more than 100% of my time into the job.  This is a very high-risk way to operate because committed individuals get noticed and almost invariably receive offers of stable salaries elsewhere.  Retaining talent is extremely difficult when you have the least to offer, and the current NSF funding structure makes it very difficult for centers to do much more than continually hire entry-level people to replace the rising stars who find greener pastures.

Restoring reliable, core funding to the NSF centers would allow them to re-establish a strong foundation that can be an anchor point for other sites wishing to participate in XD.  This will effectively cut off some of the current sites operating Track 2 machines, but frankly, the NSF has spread its HPC resources over too many sites at present and is diluting its investments in people and infrastructure.  The basis for issuing this core funding could follow a pattern similar to that of XD where long-term (10-year) funding is provisioned with a critical 5-year review.

If the NSF cannot find a way to re-establish reliable funding, it needs to accept defeat and stop trying to provide advanced cyberinfrastructure.  The current method of only funding centers indirectly through HPC acquisitions and associated operations costs is unsustainable for two reasons:
  • The length of these Track 2 awards (typically 3 years of operations) makes future planning impossible.  Thus, this current approach forces centers to follow high-risk and inadequately planned roadmaps.
  • All of the costs associated with maintaining world-class expertise and facilities have to come from someone else's coffers.  Competitive proposals for HPC acquisitions simply cannot afford to request budgets that include strong education, training, and outreach programs, so these efforts wind up suffering.

9. How NSF might best set overall strategy for advanced computing-related activities and investments as well as the relative merits of both formal, top-down coordination and enhanced, bottom-up process.

Regarding the top-down coordination, the NSF should drop the Track 2 program's current solicitation model where proposers must have a vendor partner to get in the door.  This is unnecessarily restrictive and fosters an unhealthy ecosystem where vendors and NSF centers are both scrambling to pair up, resulting in high-risk proposals.  Consider the implications:
  1. Vendors are forced to make promises that they may not be able to fulfill (e.g., Track 2C and Blue Waters).  Given that these two (of nine) solicitations resulted in substantial wastes of time and money (over a 20% vendor failure rate!), I find it shocking that the NSF continues to operate this way.
  2. NSF centers are only capable of choosing the subset of vendors who are willing to play ball with them, resulting in a high risk of sub-optimal pricing and configurations for the end users of the system.

I would recommend a model, similar to that used by many European nations, where a solicitation is issued for a vendor-neutral proposal to deploy and support a program that is built around a resource.  A winning proposal is selected based not only on the system features, its architecture, and the science it will support, but also on the plan for training, education, collaboration, and outreach.  Following this award, the bidding process for a specific hardware solution begins.

This addresses the two high-risk processes mentioned above and simultaneously eliminates the current qualification in Track 2 solicitations that no external funding can be included in the proposal.  By leaving the capital expenses out of the selection process, the NSF stands to get the best deal from all vendors and other external entities independent of the winning institution.

Bottom-up coordination is much more labor-intensive because it requires highly motivated people at the grassroots to participate.  Given the NSF's current inability to provide stable funding for highly qualified technical staff, I cannot envision how this would actually come together.