CIOReview
May 2016

...ing re-purposed supercomputer hardware for research purposes. Gary Grider, Division Leader for High Performance Computing at LANL, came up with the idea for PRObE in 2006 after concluding that many of their computer systems are decommissioned and subsequently destroyed despite still having quite a bit of useful life left in them. Many facilities deal with their decommissioned systems by trucking them to a secure facility, where the components are fed into an industrial metal shredder that chops them into tiny pieces, which are then melted down to recover precious metals. But does something that might have cost $30M just three or four years prior really possess only scrap value today? Neither Grider nor Jacobson thought so, and they co-wrote the NSF proposal together with other collaborators from Carnegie Mellon University and the University of Utah. In October 2010 the NMC was awarded $10M from the NSF to build PRObE.

From a pure profitability standpoint, the answer to the scrap value question is probably yes. Based on historical trends, upgrading systems that are four years into production usually yields about double the performance in a tenth of the floor footprint at one half of the power consumption. As we will see, the operational expenses (OPEX) for running an outdated computer system quickly exceed the capital expense (CAPEX) investment, together with the accompanying reduced OPEX, for a new, more efficient system.

Many universities that begin deploying cluster-style research computing often resort to using discarded desktop computers. However, these cobbled-together systems are simply not adequate to meet the needs of researchers who require very large computer systems to perform their research.
This means the value of a decommissioned supercomputer might be significantly higher than the scrap value to the average person or researcher at a university, because these older systems can provide more plentiful and more powerful computational capabilities than would otherwise be available.

A Different Approach

PRObE is an answer to getting these decommissioned systems into the hands of people who can use them, but setting up and maintaining large clusters containing more than 1000 nodes requires overcoming several obstacles:

1) Sheer volume: Decommissioning, moving, inspecting, troubleshooting, and bringing thousands of old computers back online takes significant time and effort. Also, unlike when a system is slated for destruction, care must be taken throughout the decommissioning process so that parts are not damaged.

2) Space: A computer system with 1000 or more nodes and appropriate interconnect networks will likely require about 40-50 whole racks of computer equipment. PRObE has capacity for 1MW of compute power, about 280 tons of cooling, and 3000 sq ft of server room space to house these large machines. This is sufficient for housing two large and a few smaller clusters.

3) Electricity cost: 1MW costs around $1M per year in New Mexico. It is a required OPEX and, in PRObE's case, is provided by NSF funding. This is not a typical setup, but since there is no procurement cost for the computers, the electricity is covered instead. This allows PRObE to provide the compute services to the community at no cost to the individual users.

4) Lack of spare parts: Vendors do not necessarily keep old spare parts around once a product has reached end-of-life, and sometimes the vendor of an old system might have vanished. In such cases, the only outlet is the gray market, such as eBay and other vendors specializing in reused computer equipment.
In PRObE's case, LANL's systems are usually larger than what PRObE can house, so a sufficient number of spares (typically about 20 percent) can accompany each system. Machines can also be cannibalized to keep the system running once the spares run out.

5) Staff to operate: PRObE is successful primarily because of the workforce we use to build the clusters and to maintain them. In particular, our staff is creative: they can both assemble and maintain the hardware even with limited funds. Instead of hiring consultants or full-time staff members to perform this work, PRObE relies on local high-school and early college talent, which is also a wonderful way to train young people. Over the past 6 years we have employed close to 40 high school students who spend a couple of hours with us each week. During summers and winter break, these students often work full time. To PRObE this is an affordable solution, and the students get hands-on experience building large computer systems.

The Future

PRObE is fortunate that the NSF sees the value in what we do, the training that we provide, and the scientific value these older systems can contribute to the academic and scientific communities. Without NSF support, PRObE would not be possible. While the operation of PRObE requires both skill and creativity, the work is rewarding and the scientific benefits are real, as exemplified by the many research citations PRObE regularly receives in the scientific literature.

Andree Jacobson
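
As a back-of-the-envelope check, the facility figures quoted in the article (1MW of compute power, about 280 tons of cooling, roughly $1M per year for electricity) and the upgrade rule of thumb (double the performance, a tenth of the footprint, half the power) can be related with a few lines of Python. This is only a rough sketch: the function names are hypothetical, the load is assumed to run at the full 1MW around the clock, and the conversion of one ton of refrigeration to about 3.517 kW is a standard unit definition rather than a figure from the article.

```python
HOURS_PER_YEAR = 24 * 365          # 8760 hours, ignoring leap years
KW_PER_TON_OF_COOLING = 3.517      # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def implied_price_per_kwh(annual_cost_usd, load_kw):
    """Electricity price implied by an annual bill, ASSUMING constant full load."""
    return annual_cost_usd / (load_kw * HOURS_PER_YEAR)


def cooling_capacity_kw(tons):
    """Heat-removal capacity in kW for a cooling plant rated in tons."""
    return tons * KW_PER_TON_OF_COOLING


def upgrade_ratios(perf_factor, footprint_factor, power_factor):
    """Efficiency gains of a new system relative to the old one.

    Each factor is new/old, e.g. perf_factor=2 means double the performance.
    """
    return {
        "perf_per_watt": perf_factor / power_factor,
        "perf_per_area": perf_factor / footprint_factor,
    }


if __name__ == "__main__":
    # Figures taken from the article; the constant-load assumption is ours.
    price = implied_price_per_kwh(annual_cost_usd=1_000_000, load_kw=1000)
    print(f"Implied electricity price: ${price:.3f} per kWh")

    cooling = cooling_capacity_kw(280)
    print(f"280 tons of cooling removes about {cooling:.0f} kW of heat")

    ratios = upgrade_ratios(perf_factor=2, footprint_factor=0.1, power_factor=0.5)
    print(f"Upgrade gain: {ratios['perf_per_watt']:.0f}x performance per watt, "
          f"{ratios['perf_per_area']:.0f}x performance per unit of floor space")
```

Under these assumptions the quoted numbers are mutually consistent: $1M per year for a constant 1MW load works out to roughly 11 cents per kWh, and 280 tons of cooling corresponds to just under 1MW of heat removal, which matches the stated compute power.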